A Survey on Benchmarks for Big Data and Some More Considerations

Xiongpai Qin (1) and Xiaoyun Zhou (2)

(1) School of Information, Renmin University of China, Beijing, 100872, China
(2) Computer Science Department, Jiangsu Normal University, Xuzhou, Jiangsu, 221116, China
[email protected], [email protected], [email protected]

Abstract. A big data benchmark suite is eagerly needed by customers, industry, and academia. This paper reviews a number of prominent works from the last several years, introduces their characteristics, and analyzes their shortcomings. The authors also provide some suggestions on building the expected benchmark, including: component-based benchmarks and end-to-end benchmarks should be used together to test distinct tools as well as the system as a whole; workloads should be enriched with complex analytics to encompass different application scenarios; and metrics other than performance should also be considered.

Keywords: Big Data, Benchmark, Performance, Scalability, Fault Tolerance, Energy Saving, Security.

1   Introduction

With the falling price of storage devices, the development of e-commerce and internet applications such as Twitter and Facebook, and the requirements of scientific research, people have collected data sets of volumes that have never been seen before. The data comes in different forms (structured and unstructured) and at different velocities (some data is generated automatically by sensors and needs to be processed in a timely manner). RDBMSs (relational database management systems) handle structured relational data well, and some unstructured data can be stored in an RDBMS for later processing; nevertheless, an RDBMS cannot address all data management and processing challenges because of its scalability and other limitations. Various noSQL technologies have been proposed and implemented in recent years for big data management and processing.

A benchmark is important in that it not only helps customers evaluate similar products from different vendors, but is also needed by vendors and researchers. Vendors can continuously improve their products through benchmarking. Researchers can try new techniques and use the benchmark to evaluate them, which facilitates innovation. To select among new technologies used in conjunction with RDBMSs for efficient processing of big data, customers, industry, and academia need a new benchmark [1] [2] [3].

For decades, relational databases have been the standard choice for data storage and data analytics in OLTP and OLAP applications. Industry, together with academia,


has developed benchmark suites to evaluate products from vendors on a fair basis. The most prominent ones are the TPC-C benchmark and the TPC-H benchmark, for OLTP and OLAP scenarios respectively. Taking TPC-H as an example, it is a benchmark of the TPC (Transaction Processing Performance Council) for decision support systems. It consists of industry-related data organized in 8 tables and 22 business-oriented queries. TPC-H performance is measured by QphH@Size (queries per hour at a given data size), so the performance of different systems can be evaluated on data sets of the same size. Many database systems running on various hardware platforms have been compared using the metrics of QphH@Size and Price/QphH. Besides TPC-C and TPC-H, TPC-W is a benchmark for evaluating OLTP-style web applications, while TPC-DS and SSB (Star Schema Benchmark) are analytic benchmarks for OLAP and decision support systems. TPC-style benchmarks are kept static, which allows results to remain comparable across decades. At the same time, this has drawn criticism that the benchmarks are stale and insufficient for benchmarking recent big data applications, since big data is likely to evolve over time.
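For reference, the TPC-H composite metric mentioned above combines a single-stream "power" run and a multi-stream "throughput" run. As the authors recall the specification (the normative definitions, including the refresh-function and timing rules, are in the TPC-H standard itself), the relationship is approximately

\[
\text{QphH@Size} = \sqrt{\text{Power@Size} \times \text{Throughput@Size}},
\qquad
\text{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF,
\]

where Power@Size is derived from the geometric mean of the 22 query times and 2 refresh times of the power test, S is the number of concurrent query streams, T_s is the elapsed time of the throughput test in seconds, and SF is the scale factor.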

2   Some Recent Efforts

2.1   Survey of Recent Efforts

MapReduce has risen to become the de facto tool for big data processing. Kim et al. designed MRBench [4] for evaluating the MapReduce framework using TPC-H workloads. They implemented the needed relational operations, including selection, projection, cross product, join, sort, grouping, and aggregation, and conducted experiments varying data sizes, the number of nodes, and the number of map tasks. MRBench can give people some insight into whether to use MapReduce for data warehouse style queries. However, some big data applications have gone beyond simple SQL queries to deeper analytics.
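To make the kind of mapping MRBench performs more concrete, the sketch below expresses a selection followed by a reduce-side join as plain map and reduce functions. This is not MRBench's actual code (which is not reproduced in [4]); it is only an illustration, assuming two hypothetical inputs, `orders` and `customers`, keyed by customer id:

```python
from collections import defaultdict

# Hypothetical records: ("orders", cust_id, amount) and ("customers", cust_id, nation)

def map_join(record):
    """Selection + tagging: keep only orders above a threshold,
    emit (join_key, tagged_value) pairs for a reduce-side join."""
    table, cust_id, payload = record
    if table == "orders" and payload <= 100.0:   # selection predicate
        return []                                # filtered out
    return [(cust_id, (table, payload))]

def reduce_join(cust_id, tagged_values):
    """Cross-match the two tagged record sets sharing the same key."""
    orders = [v for t, v in tagged_values if t == "orders"]
    custs = [v for t, v in tagged_values if t == "customers"]
    return [(cust_id, c, o) for c in custs for o in orders]

def run_mapreduce(records, mapper, reducer):
    """A tiny in-memory stand-in for the MapReduce runtime (shuffle by key)."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

if __name__ == "__main__":
    data = [("orders", 1, 250.0), ("orders", 1, 40.0),
            ("orders", 2, 500.0), ("customers", 1, "CHINA"),
            ("customers", 2, "FRANCE")]
    print(run_mapreduce(data, map_join, reduce_join))
```

On a real cluster the shuffle is performed by the framework rather than by an in-memory dictionary, but the map and reduce signatures follow the same pattern.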


A use case comprising five representative queries over massive astrophysical simulations is proposed in [5]. The authors implemented the use case both in a distributed DBMS and in a Pig/Hadoop system and compared the performance of the two platforms. They found that both systems provide competitive performance and improved scalability relative to IDL-based methods. However, the use case cannot be a fully functional benchmark by itself, because it only considers astrophysical applications, and its queries cannot represent workloads from other domains.

PUMA [7] is a benchmark suite proposed specifically for MapReduce. Its workload includes term-vector, inverted-index, self-join, adjacency-list, k-means, classification, histogram-movies, histogram-ratings, sequence-count, ranked inverted index, Tera-sort, GREP, and word-count.

MapReduce-specific benchmarks have some shortcomings. While the application domains currently dominated by MapReduce should be part of a big data benchmark, the existing MapReduce benchmarks are not complete representatives of all big data systems. They are inherently limited to measuring only a stripe of the full range of system behavior, since a real-life cluster hardly ever runs only one job at a time or just a handful of specific jobs. Benchmarks such as [6] and [8] are also designed specifically for MapReduce and share the shortcomings mentioned above.

In the work of [9], the authors discuss many aspects of building a new benchmark for big data. In their opinion, a unified model should be used that describes several components of the targeted workloads: the functions that the system must provide, the representative data access patterns, the scheduling and load variations over time, and the computation (basic operations, queries, analytic algorithms, etc.) required to achieve the functional goal. They lay down some requirements for the expected benchmark: (1) representative: the benchmark should measure performance using metrics that are relevant to real-life application domains, under conditions similar to real-life computing scenarios; (2) portable: the benchmark should be portable to different kinds of systems that can serve the computing needs of the same application domain; (3) scalable: the benchmark should measure performance for both large and small systems, and for both large and small data sizes; (4) simple: the benchmark should be easy to understand; (5) system diversity: big data systems store and manipulate many kinds of data, and sometimes diversity translates into mutually exclusive variations in system design; (6) rapid data evolution: analytic needs evolve constantly and rapidly, so the system should be able to scale at multiple layers, and the data should be able to scale across multiple sources and formats. The authors believe that identifying the functions of big data application domains is the first step toward building a truly representative big data benchmark. Big data encompasses many application domains; OLTP is only one of them, so further application domains should also be included: (1) flexible-latency analytics, for which MapReduce was originally designed, (2) interactive analytics with low latency, like traditional OLAP, and (3) semi-streaming analytics, which describes continuous computing processes.

The Yahoo! Cloud Serving Benchmark (YCSB) [10] is proposed with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. Examples of such systems include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many other noSQL databases. Considering the scale-out architecture, elasticity, and high-availability characteristics of a cloud, the authors discuss some trade-offs made by cloud data systems, including read performance vs. write performance, latency vs. durability, synchronous vs. asynchronous replication, and data partitioning. The YCSB benchmark evaluates the target system at three tiers: the performance tier, the scaling tier, and the availability tier. A core set of functions is defined, and experimental results for four widely used systems, i.e. Cassandra, HBase, Yahoo's PNUTS, and a simple sharded MySQL, are reported. The benchmark addresses the problem of benchmarking "cloud based OLTP" systems, which typically do not support ACID transactions.
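The sketch below illustrates the kind of workload a cloud-serving benchmark such as YCSB drives against a key-value store: a fixed mix of reads and updates over keys drawn from a skewed distribution, with per-operation latencies recorded. It is only an illustrative stand-in, not YCSB's actual client or configuration format; the store interface, key space, and operation mix are assumptions made for the example.

```python
import random
import time

def skewed_key(num_keys, skew=1.1):
    """Pick a key with a simple power-law bias toward small ids
    (a rough stand-in for YCSB's skewed request distributions)."""
    r = random.random()
    return int(num_keys * (r ** skew)) % num_keys

def run_workload(store, num_keys=10_000, ops=100_000, read_fraction=0.95):
    """Drive a read/update mix against `store` (any dict-like client)
    and collect per-operation latencies in milliseconds."""
    latencies = {"read": [], "update": []}
    for _ in range(ops):
        key = f"user{skewed_key(num_keys)}"
        start = time.perf_counter()
        if random.random() < read_fraction:
            store.get(key)              # read path
            op = "read"
        else:
            store[key] = b"x" * 100     # update path (100-byte value)
            op = "update"
        latencies[op].append((time.perf_counter() - start) * 1000.0)
    return latencies

if __name__ == "__main__":
    fake_store = {f"user{i}": b"x" * 100 for i in range(10_000)}
    results = run_workload(fake_store)
    for op, vals in results.items():
        if vals:
            print(op, "avg ms:", sum(vals) / len(vals))
```

A real run would point the driver at a remote store client and report throughput and latency percentiles per tier, as YCSB does.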
Smullen et al. [11] believe that a large fraction of the data to be stored and processed in future systems will be unstructured, in the form of text, images, audio, etc. They presented an unstructured data processing benchmark suite with detailed descriptions of the workloads, including edge detection, proximity search, data scanning, and data fusion. These workloads capture a larger space of application characteristics. Although the benchmark aims only at unstructured data processing, it could serve as a component in a big data benchmark for evaluating that specific aspect of the target system.


In the research of Huang and his colleagues [12], HiBench, a benchmark specifically for evaluating the MapReduce framework, is introduced. The authors analyzed several previous benchmarks. Sorting benchmarks [13] such as TeraSort [14] are used by Google and Yahoo [15] to test MapReduce optimizations; however, sorting, although fundamentally important, is only one aspect of data processing. GridMix [16] is a synthetic benchmark in the Hadoop distribution that models the data-access patterns of a Hadoop cluster by running a mix of Hadoop jobs, including a 3-stage chained MapReduce job, large data sort, reference select, indirect read, and text sort. GridMix is not a full-blown benchmark for big data, since its workloads are not representative enough to include other important characteristics of real-world applications. The Hive performance benchmark [17] is used to compare the performance of Hadoop and parallel analytical databases; it consists of five programs, with GREP as the first one, while the other four are analytical queries designed for traditional structured data analysis [18], including selection, aggregation, join, and UDF aggregation. DFSIO [19] is a file-system-level benchmark for Hadoop that tests how HDFS handles a large number of tasks performing writes or reads simultaneously. Based on this analysis, the authors propose HiBench [12], which accommodates several components that set it apart from other benchmarks: micro benchmarks - sort, word count, and TeraSort; Web search - Nutch indexing and PageRank; machine learning - Bayesian classification and K-means clustering; and an HDFS benchmark at the file system level. Extensive experimental results using the benchmark are provided. Its most prominent characteristic is that it incorporates machine learning algorithms into the whole framework, which coincides with the trend toward complex analysis over big data.

Zhan et al., who have a strong background in HPC (high performance computing), proposed CloudRank-D [20] for benchmarking cloud computing systems for data processing. The benchmark suite includes basic operations for data analysis, classification, clustering, recommendation, sequence learning, association rule mining, and data warehouse queries, as well as a representative application, ProfSearch (http://prof.ict.ac.cn/). Two simple, complementary metrics are used for evaluating cloud computing systems: data processed per second and data processed per Joule. The latter is defined as the total amount of input data of all jobs divided by the total energy consumed from the submission time of the first job to the finish time of the last job. The inclusion of energy consumption metrics sets the benchmark apart from others.
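A minimal sketch of how the two CloudRank-D metrics just described could be computed from a job log, following the definition given above (total input data divided by the wall-clock span of the run, and by the energy consumed over that span). The job-record fields and the energy-measurement source are assumptions for the example, not part of the CloudRank-D tool itself:

```python
from dataclasses import dataclass

@dataclass
class Job:
    input_bytes: int       # size of the job's input data
    submit_time: float     # seconds (epoch)
    finish_time: float     # seconds (epoch)

def cloudrank_metrics(jobs, total_energy_joules):
    """Data processed per second and per Joule: total input data of all jobs
    divided by the span from first submission to last completion, and by the
    energy consumed over that same span (energy measured externally)."""
    total_input = sum(j.input_bytes for j in jobs)
    span = max(j.finish_time for j in jobs) - min(j.submit_time for j in jobs)
    dps = total_input / span                    # bytes per second
    dpj = total_input / total_energy_joules     # bytes per Joule
    return dps, dpj

if __name__ == "__main__":
    jobs = [Job(50 * 2**30, 0.0, 1200.0), Job(80 * 2**30, 100.0, 1500.0)]
    dps, dpj = cloudrank_metrics(jobs, total_energy_joules=2.7e6)
    print(f"{dps / 2**20:.1f} MB/s, {dpj / 2**20:.2f} MB/J")
```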
Over the last two years, researchers from industry and academia have gathered frequently [21] to consider building a truly new and comprehensive benchmark for big data. They have discussed almost every aspect of a big data benchmark, including representative workloads, data sets, new metrics, implementation options, tool sets, data generation, benchmark execution rules, and the specification with its scaling factors. The SIGMOD paper [22] presents BigBench, an end-to-end big data benchmark proposal. The proposal covers data models of structured, semi-structured, and unstructured data, and addresses the variety, velocity, and volume aspects of big data systems. The structured part of the BigBench data model is borrowed from the TPC-DS benchmark and extended with semi-structured and unstructured data components.


The semi-structured part captures registered and guest user clicks on the retailer's website, while the unstructured data captures product reviews submitted online. The workload is designed around a set of queries against the data model. It covers the categories of big data analytics proposed by McKinsey and spans three dimensions: data sources, query processing types, and analytic techniques. The authors report results of their experiments on the Aster Data database using SQL-MR.

Some benchmarks, such as GRAPH 500 [23] and LinkBench [24], are only for graph database evaluation. There are only two kernels in the GRAPH 500 benchmark: a kernel that constructs the graph from the input tuple list, and a computational kernel that performs breadth-first search on the graph; it is far from a general benchmark for graph databases. LinkBench was released by Facebook for social graph database benchmarking. However, LinkBench is a graph-serving benchmark, not a graph-processing benchmark: the former simulates the transactional workload of an interactive social network service, while the latter simulates an analytics workload. In our opinion, ingredients of these benchmarks, especially graph analytic workloads, should be incorporated into the big data benchmark to evaluate the graph data processing component of the whole system.
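For readers unfamiliar with the two GRAPH 500 kernels mentioned above, the sketch below shows their shape in plain Python: building an adjacency list from an edge (tuple) list, then running breadth-first search from a root vertex. It is only an illustration of the idea; the reference implementations use generated Kronecker graphs and report traversed edges per second (TEPS), none of which is reproduced here.

```python
from collections import defaultdict, deque

def build_graph(edge_list):
    """Kernel 1 (simplified): build an undirected adjacency list from tuples."""
    adj = defaultdict(list)
    for u, v in edge_list:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def bfs(adj, root):
    """Kernel 2 (simplified): breadth-first search returning a parent map."""
    parent = {root: root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    print(bfs(build_graph(edges), root=0))
```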

2.2   Conclusion of the Survey

Having reviewed these recent works on big data benchmarking, we summarize them in Table 1 for the reader's convenience.

Table 1. Comparison of existing works on big data benchmarks (* for reference)

work | target | characteristics | comment
*TPC-C | RDBMS | transaction processing, simple queries and updates | OLTP
*TPC-H | RDBMS, Hadoop Hive | reporting, decision support | OLAP
*TPC-W | RDBMS, noSQL | web applications | Web OLTP
*SSB | RDBMS, Hadoop Hive | reporting, decision support | OLAP
*TPC-DS | RDBMS, Hadoop Hive | reporting queries, ad hoc queries, iterative queries, data mining queries | OLAP
TeraSort | RDBMS, Hadoop | data sorting | sorting only
YCSB | noSQL databases | cloud-based data serving | Web OLTP
REF [11] | unstructured data management systems | unstructured data only: edge detection, proximity search, data scanning, data fusion | not representative enough
GRAPH 500 | graph noSQL databases | graph data processing only | not representative enough
LinkBench | RDBMS, graph noSQL databases | models Facebook's real-life application; graph data processing only | not representative enough
DFSIO | Hadoop | file-system-level benchmark | not representative enough
Hive performance benchmark | Hadoop Hive | GREP, selection, aggregation, join and UDF aggregation only | not representative enough
GridMix | Hadoop | mix of Hadoop jobs | not representative enough
PUMA | MapReduce | term-vector, inverted-index, self-join, adjacency-list, k-means, classification, histogram-movies, histogram-ratings, sequence-count, ranked inverted index, Tera-sort, GREP, word-count | comprehensive workload
MRBench | MapReduce | TPC-H queries | OLAP
HiBench | MapReduce | micro benchmarks (sort, word count, TeraSort); web search (Nutch indexing, PageRank); machine learning (Bayesian classification, K-means clustering); HDFS benchmark (file-system level) | comprehensive workload
CloudRank-D | RDBMS, Hadoop | basic operations for data analysis, classification, clustering, recommendation, sequence learning, association rule mining, data warehouse queries | comprehensive workload
BigBench | RDBMS, Hadoop | covers structured, semi-structured and unstructured data models; addresses the variety, velocity and volume aspects of big data systems | comprehensive workload

3   Additional Considerations

3.1   Representative Applications, Component or End-to-End Benchmarks

To make the big data benchmark useful for customers evaluating target systems, and to help vendors and researchers continuously improve their technologies, the workloads of the benchmark should be derived from real-life applications. One possible application prototype would be the internet-scale applications and data management systems of large internet companies such as Facebook and Google, although there is a tension between privacy protection and openness when building a benchmark from the real applications of these companies. Can a single prototype represent the thousands of applications in the big data field? A prototype could be built by combining a bottom-up method and a top-down method, integrating characteristics of real-life applications while abstracting basic operations from them.

Since no single tool can fit all user requirements, a big data processing platform should contain more than one component. In a big data system, various data, usually multi-structured, is used in combination to extract useful information for decision making. The following components are needed (and, depending on real-life application requirements, may be used in combination): (1) a streaming computing engine for stream data processing, (2) a structured data processing engine, which could be built upon a traditional OLAP-oriented RDBMS, and (3) an unstructured data processing engine, which could be based on Hadoop and noSQL databases. In certain applications, graph analytics is indispensable.

In such a setting, both component-based benchmarking and end-to-end benchmarking are required. Customers can select a single benchmark from the suite to evaluate a specific tool of the platform, such as a stream computing engine or a graph database. On the other hand, big data analysis is basically a workflow of analytic tasks rather than a single task, and these analytical processes can run across multiple tools in a big data platform. For example [25], social network interaction data can be loaded into Hadoop, and sentiment data and social 'handles' are extracted by text mining techniques. The sentiment is then connected to a customer and scored, and the sentiment scores are added to the data warehouse. Analysis is done to identify unhappy


customers so that actions can be taken to retain them. Moreover, the extracted social handles can be loaded into a graph database for further social network analysis to determine important relationships for possible cross-selling. End-to-end benchmarking is therefore also needed for such cross-tool analytics.

3.2   Workloads – From Simple SQL Queries to Complex Analytics

The workloads should be chosen carefully. On the whole, the workloads for a big data benchmark should not contain only simple SQL queries; machine learning, data mining, and information retrieval algorithms should also be included. In our opinion, graph analytics is becoming more and more important and should be an indispensable part of the workloads. In some application scenarios, such as multimedia data processing and scientific data processing, image manipulation algorithms and array/matrix algorithms should also be considered. The workloads could be organized by a two-level aggregation, as depicted in Figure 1 and sketched below: various workflows are composed from the basic analytic algorithms and SQL queries, and on that basis several typical big data application scenarios are built by combining the workflows. Customers could use some or all of the application scenarios to test the target system against their business objectives.
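A minimal sketch of the two-level aggregation just described: basic operations are composed into workflows, and workflows into application scenarios that a benchmark driver can execute against a platform. The operation names and the scenario are illustrative assumptions, not part of any existing benchmark suite.

```python
from typing import Callable, Dict, List

# Basic building blocks: each operation takes a platform handle and runs one task.
Operation = Callable[[object], None]

def sql_query(platform): platform.run("simple SQL query")
def kmeans(platform): platform.run("k-means clustering")
def pagerank(platform): platform.run("PageRank")
def sentiment(platform): platform.run("sentiment scoring")

# Level 1: workflows are ordered lists of basic operations.
WORKFLOWS: Dict[str, List[Operation]] = {
    "reporting":        [sql_query],
    "social_analytics": [sentiment, pagerank],
    "segmentation":     [sql_query, kmeans],
}

# Level 2: application scenarios combine workflows.
SCENARIOS: Dict[str, List[str]] = {
    "retail_360": ["reporting", "segmentation", "social_analytics"],
}

def run_scenario(name, platform):
    """Execute every workflow of an application scenario end to end."""
    for wf in SCENARIOS[name]:
        for op in WORKFLOWS[wf]:
            op(platform)

class LoggingPlatform:
    """Stand-in for a real data platform; just logs what would be executed."""
    def run(self, task):
        print("running:", task)

if __name__ == "__main__":
    run_scenario("retail_360", LoggingPlatform())
```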

[Figure 1 depicts the building blocks of a big data benchmark: users (customers, vendors, researchers); metrics (performance, scalability, fault tolerance, energy consumption, security); a workload model (application scenarios built from simple SQL, complex analytics, and workflows); and data engines (stream computing, RDBMS, Hadoop and noSQL/graph stores).]

Fig. 1. Building blocks of a big data benchmark

Note: graph storage and processing is an indispensable part of a big data platform.

3.3   Not Only Performance Metrics, But Also Other Metrics

Traditional benchmarks focus on the performance metrics of target systems. For big data benchmarking that is not enough: more metrics should be considered, including scalability, fault tolerance, energy saving, and security. To tackle big data problems, the processing software needs to run on a large cluster, often deployed on virtual nodes in a cloud environment, so the scalability of the big data system should be tested. A cloud platform is elastic: it can not only scale out but also scale in. This dynamic scalability requires testing the target big data system while the size of the underlying cluster changes dynamically according to cost and performance requirements.


Fault tolerance is also an important factor when node failures are common. During benchmarking, failures should be purposely injected into the target system to see whether the big data processing platform can handle them while still meeting the service-level requirements. In the big data era, using a large number of machines to process big data may consume a huge amount of energy; energy consumption is therefore a critical metric that both vendors and researchers should take seriously. Lastly, security is also an important issue, and some work should be done to incorporate security tests into the benchmark framework.
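As an illustration of the fault-injection idea mentioned above, the sketch below kills a randomly chosen worker process partway through a benchmark run and checks whether the run still completes within a service-level target. The worker handles (assumed to be subprocess.Popen-like objects exposing pid and terminate()), the timing, and the threshold are all assumptions for the example, not part of any existing benchmark harness.

```python
import random
import threading
import time

def inject_failure(workers, delay_s):
    """After delay_s seconds, terminate one randomly chosen worker process
    to emulate a node failure during the benchmark run."""
    def _kill():
        time.sleep(delay_s)
        victim = random.choice(workers)
        print("injecting failure into worker", victim.pid)
        victim.terminate()
    threading.Thread(target=_kill, daemon=True).start()

def run_with_fault_injection(start_benchmark, workers, slo_seconds):
    """Run the benchmark while a failure is injected and report whether the
    platform still met the service-level objective despite the lost node."""
    inject_failure(workers, delay_s=slo_seconds * 0.3)
    start = time.perf_counter()
    start_benchmark()            # blocks until the workload finishes
    elapsed = time.perf_counter() - start
    met = elapsed <= slo_seconds
    print(f"elapsed {elapsed:.1f}s, SLO {'met' if met else 'missed'}")
    return met
```

A fuller harness would inject several failure types (process kill, disk full, network partition) at scheduled points and record recovery time alongside the usual throughput metrics.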

4   Conclusions

This paper reviews recent work on big data benchmarking. Industry and academia are on the road toward a comprehensive big data benchmark suite. Big data benchmarking differs from traditional transaction processing and analytic processing benchmarking in several aspects, such as data volume and data variety. We offer several considerations for this ongoing effort: component-based benchmarking should be used together with application-oriented end-to-end benchmarking; complex data mining and machine learning algorithms, beyond simple SQL queries, should be run against the system to test its deep analytic capability; and other critical metrics, such as energy consumption, should be measured besides response time and throughput.

Acknowledgements. This work is funded by the NSF of China under Grant No. 61170013 and the NSF of Jiangsu Province under Grant No. BK2012578.

References

1. Ventana Research: Hadoop and Information Management: Benchmarking the Challenge of Enormous Volumes of Data (2013), http://www.ventanaresearch.com/research/benchmarkDetail.aspx?id=1663
2. Big Data Top 100: An open, community-based effort for benchmarking big data systems (2013), http://bigdatatop100.org/benchmarks
3. Hemsoth, N.: A New Benchmark for Big Data (2013), http://www.datanami.com/datanami/2013-0306/a_new_benchmark_for_big_data.html
4. Kim, K., Jeon, K., Han, H., Kim, S.G., Jung, H., Yeom, H.Y.: MRBench: A Benchmark for MapReduce Framework. In: Proceedings of ICPADS, pp. 11–18. IEEE Press, Melbourne (2008)
5. Loebman, S., Nunley, D., Kwon, Y., Howe, B., Balazinska, M., Gardner, J.P.: Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In: Proceedings of CLUSTER, pp. 1–10. IEEE Press, New Orleans (2009)
6. Moussa, R.: TPC-H Benchmark Analytics Scenarios and Performances on Hadoop Data Clouds. In: Benlamri, R. (ed.) NDT 2012, Part I. CCIS, vol. 293, pp. 220–234. Springer, Heidelberg (2012)
7. Ahmad, F., Lee, S., Thottethodi, M., Vijaykumar, T.N.: PUMA: Purdue MapReduce Benchmarks Suite. Purdue Technical Report TR-ECE-12-11 (2012)


8. Chen, Y., Alspaugh, S., Ganapathi, A., Griffith, R., Katz, R.: SWIM - Statistical Workload Injector for MapReduce (2013), https://github.com/SWIMProjectUCB/SWIM/wiki
9. Chen, Y.P., Raab, F., Katz, R.H.: From TPC-C to Big Data Benchmarks: A Functional Workload Model. UC Berkeley Technical Report UCB/EECS-2012-174 (2012)
10. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking Cloud Serving Systems with YCSB. In: SoCC, pp. 143–154. ACM Press, Indianapolis (2010)
11. Smullen, C.W., Shahrukh, I.V., Tarapore, R., Gurumurthi, S.: A Benchmark Suite for Unstructured Data Processing. In: Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os (in Conjunction with MSST), pp. 79–83. IEEE Press, San Diego (2007)
12. Huang, S.S., Huang, J., Dai, J.Q., Xie, T., Huang, B.: The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. In: ICDE Workshops on Information & Software as Services, pp. 41–51. IEEE Press, Long Beach (2010)
13. Nyberg, C., Shah, M.: Sort benchmark (2012), http://sortbenchmark.org/
14. TeraSort: TeraSort Benchmark (2012), http://sortbenchmark.org/
15. O'Malley, O., Murthy, A.C.: Winning a 60 Second Dash with a Yellow Elephant (2009), http://sortbenchmark.org/Yahoo2009.pdf
16. GridMix: GridMix Benchmark (2012), http://hadoop.apache.org/docs/r1.1.1/gridmix.html
17. Jia, Y., Shao, Z.: A Benchmark for Hive, PIG and Hadoop (2012), http://issues.apache.org/jira/browse/HIVE-396
18. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178. ACM Press, Rhode Island (2009)
19. DFSIO program: DFSIO of Hadoop source distribution (2012), src/test/org/apache/hadoop/fs/TestDFSIO
20. Luo, C.J., Zhan, J.F., Jia, Z., Wang, L., Lu, G., Zhang, L.X., Xu, C.Z., Sun, N.H.: CloudRank-D: benchmarking and ranking cloud computing systems for data processing applications. Frontiers of Computer Science 6(4), 347–362 (2012)
21. UCSD Center for Large Scale Data Systems Research: Big Data Benchmarking Workshops (2013), http://clds.ucsd.edu/bdbc/workshops
22. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.A.: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics. In: SIGMOD. ACM Press, New York (2013)
23. Graph 500: Graph 500 Benchmark 1 (2013), http://www.graph500.org/specifications
24. King, R.: Facebook releasing new Social Graph database benchmark: LinkBench (2013), http://www.zdnet.com/facebook-releasing-new-social-graphdatabase-benchmark-linkbench-7000013356/
25. Ferguson, M.: Architecting a Big Data Platform for Analytics. A Whitepaper Prepared for IBM (2012)