Evaluating Cluster Configurations for Big Data Processing: An Exploratory Study

Roni Sandel, Mark Shtern, Marios Fokaefs and Marin Litoiu
Electrical Engineering and Computer Science, York University
[email protected], [email protected], [email protected], [email protected]

Roni Sandel is currently with IBM Canada, but the work was conducted while at York University.

Abstract—As data continues to grow rapidly, NoSQL clusters have been increasingly adopted to address the storage and processing demands of these large amounts of data. In parallel, cloud computing is increasingly being adopted for its flexibility, cost efficiency and scalability. However, evaluating and modelling NoSQL clusters presents many challenges. In this work, we explore these challenges by performing a series of experiments with various configurations. The intuition is that this process is laborious and expensive; the goal of our experiments is to confirm this intuition and to identify the factors that impact the performance of a Big Data cluster. Our experiments focus mostly on three factors: data compression, data schema and cluster topology. We performed a number of experiments based on these factors and measured and compared the response times of the resulting configurations. Eventually, the outcomes of our study are encapsulated in a performance model that predicts the cluster's response time as a function of the incoming workload and makes evaluating the cluster's performance cheaper and faster. This systematic and low-effort evaluation method facilitates the selection of, and migration to, a better cluster as the performance and budget goals change. We use HBase as the large data processing cluster and we conduct our experiments on traffic data from a large city and on a distributed community cloud infrastructure.

Index Terms—cloud computing; big data; design; performance modelling and evaluation; migration

I. INTRODUCTION

Enterprises are increasingly adopting cloud computing because of its economic advantages and its ability to scale [1], [2], [3]. By eliminating up-front costs, the cloud allows companies to dynamically scale hardware and software resources based on their demands [1], [4]. These benefits have also allowed for improved management of Big Data. Today, Big Data is a popular term to describe the exponential growth and availability of data, both structured and unstructured [5]. Traditional database and data management systems are falling short of handling the large amount of data and the associated computational requirements, which has led to the birth of a new, widely adopted class of systems referred to as NoSQL [6], [7], [8], [9]. NoSQL removes some of the support found in traditional relational databases, such as the SQL language and transactions, in exchange for faster reading, faster writing, larger storage, ease of expansion, and low cost [10].

Given that NoSQL technologies are relatively new, several challenges arise. One such challenge, which we point out in this work, is how to objectively evaluate the performance of a data processing cluster, namely the response time to data queries and analytics. The underlying problem is that, thanks to the flexibility provided by cloud computing, there is a large number of possible configurations for a Big Data cluster. The question is how to efficiently and objectively evaluate all these configurations to eventually pick the optimal one according to the end-user's requirements. This leads to the second challenge of finding which factors have an impact on the performance of a cluster configuration. For instance, does the number of virtual machines affect performance, how much and in what way? Does the data schema have a similar impact, and how does one choose the optimal schema? Finally, performing real experiments to evaluate all the possible configurations can be inefficient and potentially costly; for example, measuring response times over a large configuration space can take an extensive amount of time. This leads to the third and final challenge, which is how to avoid these expensive experiments, possibly by using a model to predict response times in a larger space. A predictive model would allow response times to be approximated from experiments on a smaller space, shortening the period of experimentation.

In this work, we conduct a series of experiments to study the performance of several configurations for Big Data clusters and confirm how expensive and challenging this evaluation process can be. During our experimental process, we formulated two research questions:

RQ1: How do data compression, data schema and cluster topology affect the performance of the data processing cluster?
RQ2: What is the relationship between the performance of a cluster and the incoming workload?

With respect to RQ1, we chose to focus our experiments on these three factors, namely data compression, the NoSQL data schema and the topology of the cluster as it is deployed in a cloud infrastructure, since we consider them significant for configuring a cluster and for its performance. Additionally, these factors are in the immediate control of the engineer who configures the cluster, as they are on the application and VM level and not on the network, volume or physical machine level, where the cloud provider gives no access. Moreover, these factors are independent of the application's domain and apply to all HBase clusters.

For the first factor, data compression, we considered two options, compressed and non-compressed data. For the data schema, we explored three alternatives. Finally, for the topology, we considered 8 scenarios with varying numbers and types of virtual machines, where all topologies summed up to the same total cost, in order to conform to a predefined budget. We combined the two data compression options with all data schema options (while keeping the topology constant) and with all topologies (while keeping the data schema constant). This resulted in a total of 22 experiments. The experiments ran with real data from the Connected Vehicles and Smart Transportation (CVST) project1 and the HBase clusters were deployed on the SAVI (Smart Applications on Virtual Infrastructure) testbed2. As our data processing tool we used Cloudera3, which uses Apache Hadoop4. The workload for the clusters was created from synthetic but realistic HBase queries representing a number of users.

The experiments showed that data compression was the factor with the least impact on response time, compared to data schema or topology. Concerning the schema, we found that adding an MD5 cryptographic hash [11] to the key of the HBase rows significantly improves the performance of the cluster. With respect to the cluster topologies, an interesting finding was that a large cluster of small virtual machines could perform almost twice as fast as a small cluster of a few large virtual machines of equal cost, regardless of data compression. This was due to sub-optimal out-of-the-box configurations of the cluster. Concerning RQ2, our experiments showed that the relationship between the performance of a cluster, measured by its response time, and the incoming workload, in terms of number of users, is linear. This finding is consistent across all of our experiments, regardless of the specific configuration (data compression, data schema and topology). Based on this second outcome, we were able to construct a linear predictive model to estimate the performance of a cluster given a workload. We show that this model is highly accurate, and its simplicity, in combination with the fact that the linear relationship is retained in every configuration, enables designers to evaluate all potential configurations and find the one that optimally satisfies the defined performance goals and budget. This, in turn, enables the systematic and low-effort migration of data and its processing to a better cluster, as the performance and budget goals evolve.

Given the exploratory nature of our study, this work does not aim to suggest methods for selecting the optimal topology or data schema, since these will most probably depend on the nature of the given data, the processing method, the performance goals and the budget. Rather, our goal is to set forth a systematic experimental process to enable the quick and efficient evaluation

of data cluster alternatives and facilitate the decision-making process.

The rest of this paper is organized as follows. In Section II, we provide a broad overview of relevant research works. Section III outlines the details of the experimental process we followed to evaluate the various aspects of HBase cluster configurations, in terms of their impact on the cluster's performance. Section IV presents the results of our experiments and our findings with respect to the posed research questions. Section V describes the derived model for the cluster's response time and, finally, Section VI concludes this work.

II. RELATED WORK

Our work is related to research concerned with cluster configuration for Big Data processing, corresponding performance models for such clusters, and the construction of optimal data schemas for NoSQL databases.

The cloud environment allows for heterogeneous hardware and resource demands. Lee et al. [12] have found that it is important to exploit these features to make data analytics in the cloud efficient. They present a system architecture that allocates resources for a Hadoop data cluster in a cost-effective manner. In this architecture, nodes are grouped into one of two pools: (1) long-living core nodes that host both data and computations, and (2) accelerator nodes that are added temporarily to the cluster when more computing power is needed for workloads. A cloud driver manages these nodes and makes decisions on adding or removing nodes based on hints provided by the users when they submit a job. Hints include memory requirements, the ability to use special features like GPUs, and the deadline. They experimented with two queries and found that certain configurations had higher performance per cost than others, because some machines had faster CPUs at lower prices than "larger" machines. However, the machines with the lower price point had less memory, which may be of no use for jobs requiring a large amount of memory per machine. They also found that using more accelerators can cost less while delivering faster performance, because the instances are not used for as long. The number of users who would use the data was not addressed, although it can make a significant difference in how the topology should be created.

Zaharia et al. [13] found that MapReduce does not perform well in heterogeneous Hadoop clusters. They state that heterogeneity of machines (mixed instances of various sizes) seriously impacts Hadoop's scheduler. The scheduler uses a fixed threshold for selecting tasks to speculate (that is, if a node happens to be slow, its tasks are copied to a faster node to finish the computation sooner) and, therefore, too many speculative tasks may be launched, taking resources away from useful tasks. Additionally, the wrong tasks may be chosen for speculation first, because the scheduler ranks candidates by locality. The authors designed the Longest Approximate Time to End (LATE) scheduler, a new speculative task scheduler that addresses this issue by adding features to the Hadoop task scheduler.

1 http://cvstproject.com/ Last accessed: 08-Jun-2015
2 http://www.savinetwork.ca/ Last accessed: 08-Jun-2015
3 http://www.cloudera.com/ Last accessed: 08-Jun-2015
4 http://hadoop.apache.org/ Last accessed: 08-Jun-2015

The primary feature behind this algorithm is that it always speculatively executes the task that the system estimates will finish farthest into the future, because this task provides the greatest opportunity for a speculative copy to overtake the original and reduce the job's response time. This is in contrast to the original heuristic, which compared each task's progress to the average progress and worked well in homogeneous environments where poorly performing nodes (stragglers) were obvious. LATE is robust to node heterogeneity, as it relaunches only the slowest tasks, and only a small number of them. It also takes node heterogeneity into account when deciding where to run speculative tasks. Lastly, LATE focuses on the estimated time left rather than on the progress rate: it speculatively executes the tasks that will improve the job's response time rather than individual slow tasks' response times. According to Zaharia et al., LATE can improve Hadoop response times by a factor of 2 in clusters with 200 virtual machines on Amazon EC2.

With respect to modelling performance in Hadoop clusters, Song et al. [14] proposed a simple framework to predict the performance of Hadoop jobs. They found that the execution times of map and reduce tasks have a linear relationship with the amount of data (64 MB to 8 GB for four different kinds of jobs). They also compared predictions obtained from smaller samples, for both map and reduce tasks, to actual values from larger samples in order to measure the error rate. The error rate was minimal, meaning that they can approximately predict the execution time for both map and reduce tasks.

Bortnikov et al. [15] explored performance bottlenecks in MapReduce tasks. According to the authors, extremely slow tasks are a major performance bottleneck in MapReduce systems. They proposed the slowdown predictor model, a "machine-learned oracle for MapReduce systems forecasting execution bottlenecks". The predictor takes profiles of the tasks and the hardware, and then estimates the task's deceleration; it can be applied during the assignment of a task or during its execution. The predictor employs a popular gradient-boosted decision tree algorithm [16], which is an "additive regression model comprised of an ensemble of binary decision trees" [15]. In this model, each binary tree splits on some feature at a specific value, with a branch for each of the possible outcomes. Each leaf node contains a score, which corresponds to the decision path, and the resulting prediction is the sum of the scores returned by the individual decision trees. They evaluated their model on real-time data sets on a production Hadoop cluster at Yahoo! and found that the predictions for mappers were more accurate than for reducers.

Finally, as far as the data schema's impact on performance is concerned, Han and Stroulia [17] have studied the performance of data schemas by running workloads on two different datasets. The first dataset was a cosmology dataset and consisted of 321,065,547 particles from 9 snapshots, with a total size of approximately 14 GB in binary format. Another dataset they used was Bixi, a

public dataset collected by a bicycle renting service in Montreal, Quebec, Canada, which totalled 12 GB and contained 96,842 data points for all the stations. Three schemas were used to test the performance of queries on the data sets, where the latter two schemas were three-dimensional, with the timestamp/version acting as the third dimension. Along with the other two dimensions, the version dimension specifies a cell and, by default, HBase keeps a maximum of 3 versions per cell. If data is imported with the same row-key and column as existing data, the older data is not replaced; rather, it is "versioned". In the case of the Bixi data, if they wanted to store values by day, they would use the date and station id as their row-key (with no time/hours/minutes). All 1440 records for one day would then be stored in the same cell through "versions" (hence there would be 1440 versions for each cell). The authors found that using the third dimension of HBase improves performance and that the distribution of data across cluster nodes highly impacts performance.

III. EXPERIMENTAL PROCESS

The purpose of our study is to explore the various characteristics of Big Data clusters, evaluate their performance and identify the key properties in a configuration among data compression, data schema and topology. Our data cluster of choice consists of HBase as the NoSQL database and Cloudera as the Big Data processing environment, an open-source implementation of the MapReduce algorithm. To deploy the HBase clusters in a cloud environment, we chose the SAVI testbed, an experimental research platform provided by the "Smart Applications on Virtual Infrastructure" strategic project of the Natural Sciences and Engineering Research Council (NSERC) in Canada. The SAVI testbed is equipped with an OpenStack installation, which, among other things, enables us to boot VMs on demand, deploy topologies and measure their performance.

The process we follow is represented graphically in Figure 1. The first step is to load the experimental data into a relational (MySQL) database. The data is real traffic data from the CVST project and contains measurements and events from a variety of sensors (traffic lights, speed traps, vehicles etc.) that arrive at regular intervals (hourly, daily or otherwise). The data comes in the form of XML files, and a MySQL database makes it easy to bulk load the data into persistent storage. Additionally, Sqoop [18] enables us to easily transport the data from the MySQL database to the HBase database, which is the target storage for our experiments. The seamless and automatic transfer enabled by MySQL and Sqoop allows us to run a multitude of experiments with various configurations quickly and with minimal effort. Each experiment begins by defining whether the data will be compressed or not, the HBase schema, and the cluster topology on the cloud. Then, the topology is deployed on the cloud platform, the data is transferred from MySQL to HBase, the workload generator queries the cluster and its performance is measured. Next, we discuss in detail the definition of


each of the three dimensions of the cluster configuration.

Fig. 1. Iterative process

For each experiment, two workloads are executed on each of the individual clusters by issuing a scan query on both compressed and non-compressed data. These workloads are executed by a Java application, developed in-house, that takes as input a maximum number of users and a user increment. For our experiments, we used a maximum of 5000 users with an increment of 500. It is important to note that the cloud may exhibit different behaviours at different times (a phenomenon known as "cloud variability"), depending on the number of real users on the same physical machine and whether or not they are running I/O-intensive tasks. In order to mitigate this, the experiments are executed five times for each table (the compressed and the non-compressed table) and the average over the five runs is taken. This makes a total of ten experiments for each cluster. The queries are executed in alternating order between the compressed and non-compressed tables.
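The in-house workload driver is not published with the paper, so the following is only a minimal sketch, under assumption, of how such a driver could look against the standard HBase 1.x Java client. The table name "traffic" and the thread-per-user design are illustrative choices of ours, not details taken from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanWorkload {

    // Runs `users` concurrent full scans against `tableName` and returns
    // the mean response time in milliseconds.
    static double measure(Connection conn, String tableName, int users) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        List<Future<Long>> times = new ArrayList<>();
        for (int u = 0; u < users; u++) {
            Callable<Long> oneUser = () -> {
                long start = System.currentTimeMillis();
                try (Table table = conn.getTable(TableName.valueOf(tableName));
                     ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result r : scanner) { /* drain the scanner */ }
                }
                return System.currentTimeMillis() - start;
            };
            times.add(pool.submit(oneUser));
        }
        long total = 0;
        for (Future<Long> f : times) total += f.get();
        pool.shutdown();
        return (double) total / users;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String tableName = args.length > 0 ? args[0] : "traffic"; // hypothetical default
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Step from 500 to 5000 simulated users, as in the experiments.
            for (int users = 500; users <= 5000; users += 500) {
                System.out.printf("%d users: %.1f ms mean response time%n",
                        users, measure(conn, tableName, users));
            }
        }
    }
}
```

At 5000 users this spawns one thread per simulated user, which is only workable as a sketch; a production driver would cap the pool size and reuse Table instances.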

A. Data Compression

On each topology, we compare the performance of using compression for the data versus not using any compression. The compression used in all experiments is GZIP, as it compresses the data to the smallest size and uses more CPU when decompressing files than other compression codecs, which may affect behaviour [16].
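The paper does not show how GZIP compression was enabled on the HBase tables. One plausible way, sketched below with the HBase 1.x Java admin API, is to set the GZ algorithm on the column family at table-creation time; the table and column family names here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Hypothetical table holding the compressed copy of the data.
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("traffic_gz"));
            HColumnDescriptor family = new HColumnDescriptor("m");
            family.setCompressionType(Compression.Algorithm.GZ); // GZIP for on-disk blocks
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}
```

The non-compressed table would be created the same way, simply without the setCompressionType call.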

B. Data Schema

Following guidelines from related research works [17], we chose as the first schema of our experiments a 2-dimensional schema, due to its simplicity and the support of built-in tools for quickly validating that all the data is present. HBase's documentation currently advises against using hundreds of versions or more5, which is why we do not use versioning for our application; the nature of our data implies that versioning would result in a much greater number of versions than recommended. We call this first schema the "default" schema. The second schema, called "switchid", is simply a rearrangement of two elements of the key (the date and the sensor ID). Given the nature of the CVST data, these two elements have different granularity (there may be more time intervals for a measurement than there are sensors). Therefore, by choosing the date as the first key element, the assumption is that time-oriented scans will be optimized. For the first two schemas, an MD5 [11] cryptographic hash is added to the key in order to avoid region hot spotting6. Hot spotting occurs when there are too many keys on one RegionServer and users continually query keys on that same RegionServer. This phenomenon happens because HBase stores everything in lexicographical order; when the key is not randomized, the keys can be distributed unequally across RegionServers, leaving some RegionServers underutilized. We apply MD5 to the keys so that the data is distributed equally across RegionServers, avoiding this situation. For the third schema, called "noMD5", we remove the MD5 hash from the key.
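The paper does not spell out the exact key layout, so the sketch below assumes one plausible interpretation: the natural key is prefixed with its own MD5 digest, so that lexicographically adjacent natural keys scatter across RegionServers. The sensor-ID and timestamp formats are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {

    // Prefixes the natural key with its MD5 digest so that rows spread
    // evenly across RegionServers instead of clustering lexicographically
    // on one region.
    static String md5Key(String sensorId, String timestamp) throws Exception {
        String natural = sensorId + "-" + timestamp;
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(natural.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex + "-" + natural;
    }

    public static void main(String[] args) throws Exception {
        // Two consecutive readings of the same sensor land on very
        // different parts of the key space once hashed.
        System.out.println(md5Key("sensor-042", "2015-06-08T10:00"));
        System.out.println(md5Key("sensor-042", "2015-06-08T10:01"));
    }
}
```

In this interpretation, the "noMD5" schema would simply return the natural key, and the "switchid" schema would swap the order of the date and sensor-ID components.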

C. Cluster Topology on the Cloud

For our experiments, we considered 8 topologies with different numbers and types of virtual machines. Table I gives a summary of the eight topologies. We impose a budget restriction of $0.84 per hour on every topology. Although we deploy the topologies on the free research platform of the SAVI testbed, we used the Amazon EC2 prices for similarly sized virtual machines, as shown in Table II. This predefined budget applies only to the worker nodes of the cluster, since this is the variable part of the cluster; the master nodes remain the same. SAVI offers three sizes of VMs (m1.medium, m1.large, m1.xlarge) and we enumerated all the possible combinations of these three types that fit our budget. The imposed budget resulted in 8 possible combinations of the three VM sizes for the worker nodes.

In addition to the worker nodes, each topology has two large virtual machines: one machine that hosts the Cloudera Hadoop Manager (CDH) and another machine that hosts the HDFS NameNode, the HBase Master node, Sqoop and ZooKeeper7. All worker nodes in a topology host the HDFS DataNodes and the HBase RegionServers. While constructing our topologies, we ensure that the machines running the Master nodes and NameNodes have the same capacity across experiments, but the capacity of the RegionServers and DataNodes, as well as the number of machines, can change from one experiment to the next. This defines a scope that ensures that the comparisons are objective. An application is also deployed on an extra-large instance external to the cluster and used to execute representative workloads (workloads are combinations of numbers of users and query types). After the workloads have finished, the application generates a data file, which shows the response times for each workload. Each experiment is executed several times to reduce the effect of cloud variability, in which performance can change from time to time depending on the amount of traffic on the cloud, how many users are using the same physical machine, or any other factors that may influence performance.

TABLE I
SUMMARY OF EXPERIMENTAL TOPOLOGIES

Experiment   m1.medium   m1.large   m1.xlarge
e1           0           4          0
e2           8           0          0
e3           0           2          1
e4           2           1          1
e5           4           0          1
e6           2           3          0
e7           4           2          0
e8           6           1          0

5 http://hbase.apache.org/book.html#schema.versions Last accessed: 08-Jun-2015
6 http://hbase.apache.org/book.html#rowkey.design Last accessed: 08-Jun-2015
7 http://zookeeper.apache.org Last accessed: 08-Jun-2015


TABLE II
AMAZON EC2 PRICING FOR INSTANCES COMPARABLE TO SAVI

Instance Type (SAVI)   Instance Type (Amazon)   VCPU   RAM (GB)   SSD (GB)   Price per hour
m1.medium              c3.large                 2      3.75       32         $0.105
m1.large               c3.xlarge                4      7.5        80         $0.210
m1.xlarge              c3.2xlarge               8      15         160        $0.420

IV. EXPERIMENTAL RESULTS

Having defined the parameters and the nature of the intended experiments, we can now use the deployed application to execute the queries against each deployed configuration (one at a time), with a varying number of users representing the workload. First, we execute the topology experiments and then we use the fastest topology to execute the data schema experiments. Both the topology and the data schema experiments are conducted on both compressed and non-compressed data. The configurations are evaluated and compared with each other based on the corresponding cluster's response time to the queries. After executing all the queries, the average response time is computed for each experiment in each cluster, and then the total average for each cluster is calculated.

A. Topology Experiments

Figure 2 shows the resulting graph for the queries executed on the table without compression for all topologies (e1 to e8, as encoded in Table I). We can see in this figure that the clusters with the largest number of instances tend to have the fastest response times, whereas the clusters with the lowest number of instances have the slowest response times in most cases, with the exception of e6 (5 instances) versus e1 and e4 (which both have 4 instances). This is explained by Cloudera's out-of-the-box configuration, as the total Java heap size ends up being larger on the clusters with more instances than on those with fewer. For example, the cluster with the largest number of instances totals 4.25 GB of Java heap, versus 2.44 GB in total for the cluster with the smallest number of instances. This means that clusters are underutilized when the default Cloudera configurations are used, and there are currently no guidelines for setting these heap size configurations. Overall, we see that the cluster with the largest number of instances, e2, has the fastest response time.

Fig. 2. Response times for queries on non-compressed tables for all topologies

Figure 3 shows the graph of response times for workloads executed against the compressed tables. There is an overall slowdown of about two times the non-compressed average response time across all experiments, but no other interesting pattern seems to emerge. Overall, e2 is the fastest cluster all-around, as shown in both graphs. Even though e7 performs better in Figure 2, it performs slower on average than e2 in Figure 3. Therefore, e2 is the ideal cluster when assessing the trade-offs.


Fig. 3. Response times for queries on compressed tables for all topologies

B. Data Schema Experiments

In the next step, we take the overall fastest cluster (e2) and execute queries on the different schemas. The results are illustrated


in Figure 4 for non-compression and Figure 5 for compression.

Fig. 4. Non-compression graph illustrating response times for the different schemas

Fig. 5. Compression graph illustrating response times for the different schemas

Removing the MD5 hash confirms that region hot spotting occurs: at both compression levels, the line begins to spike around 5000 users, because one of the RegionServers receives too many requests. When the MD5 hash is kept, the keys are distributed equally across the RegionServers, which eliminates region hot spotting and lets all RegionServers receive the same load. The graphs also demonstrate that the MD5 schemas have much faster response times than the noMD5 schema.

C. Threats to Validity and Open Issues

We consider our experiments to be limited with respect to two assumptions that prevent the direct generalization of the results. First, our results are meaningful mainly within the domain of scan queries. The choice of this type of query was conscious, since scan queries are the ones that may stress data processing to its limits and, as such, they helped us identify the performance limitations of a given configuration. Additionally, the correlation between (bad) performance and the schema is more easily shown with scan queries. The second limitation concerns the input data. However, the use of real data from an independent project increases the credibility of our results. Further experiments and statistical analysis may be required in order to generalize these results to other datasets as well.

An issue we identified but left open in this work is the suboptimality of out-of-the-box data cluster settings. In the future, we would like to extend our methodology to facilitate finding best-practice configurations and settings for different HBase clusters, in order to enable maximum utilization. Once this is successful, we plan to run similar experiments with fully utilized clusters and observe how they behave. In addition, we would like to see whether there are other HBase functionalities that can be modelled in order to extend our existing methodology.

V. PERFORMANCE MODEL FOR BIG DATA CLUSTER

Our experiments have shown that a configuration's response time is linear in the cluster's workload, as can be clearly seen in all figures. In this section, we propose a linear model to quantify this relationship. With this model, one can measure the performance of a configuration with a small number of experiments and then extrapolate to greater workloads for a more complete picture. This minimizes the effort and time required to evaluate possible alternatives for the configuration of a data cluster. In turn, this facilitates the decision-making process and the migration of a data cluster to a better configuration. We assume the model is linear because we use scan queries, which, as we found, return results on a first-in-first-out (FIFO) basis due to the query's sequential nature [16]. This means that there is a notion of queuing at each RegionServer of the HBase cluster. The proposed model for the response time of a data cluster, given the workload, is:

R(C) = αC · x + βC    (1)

where x is the number of users, αC and βC are the model's coefficients for a configuration C, and R(C) is the predicted response time. In order to estimate the model's coefficients, we use the method of least squares [19]. If xi is the random variable representing the number of users and yi are the observed response times for a given number of users, then the error between the observed and the calculated response time is:

yi − R(C) = yi − (αC · xi + βC)    (2)

Therefore, the sum of squared errors (SSE) of the y-values about their predicted values over all n points is defined as:

SSE = Σ [yi − (αC · xi + βC)]²    (3)

The values of αC and βC that minimize the SSE are the coefficients of the final model:

αC = SSxy / SSxx    (4)

βC = ȳ − αC x̄    (5)


where SSxy is the sum of the cross-deviations of xi and yi from their respective means and SSxx is the sum of the squared deviations of xi from its mean:

SSxy = Σ (xi − x̄)(yi − ȳ)    (6)

SSxx = Σ (xi − x̄)²    (7)

In order to evaluate the proposed model and its ability to extrapolate the performance of a configuration, we apply it to three of the clusters used in our experiments. Two of the clusters use non-compressed data (e1 and e8) and the third uses compressed data (e6). We first calculate the slope and intercept from the first three values of the workload (500, 1000 and 1500 users). With these two coefficients, we create an approximate linear model and then calculate the predicted values for each given number of users. To assess the accuracy of the model, we calculate the difference between the measured and the predicted values. As illustrated in Figure 6, using e1 as an example, the model closely approximates the measured values. More specifically, the mean error percentage over the seven predicted values is below 12.8%, while the minimum percentage difference is below 5% and the maximum is approximately 16% across all three clusters, which adds evidence that the error rates are relatively low. Overall, for all of our experiments, the respective models had a fitness of over 99.5% (R²).


Fig. 6. Measured vs. predicted response time for e1
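As a concrete illustration of equations (4) and (5), the sketch below fits the model from three low-workload measurements and extrapolates up to 5000 users, mirroring the procedure described above. The response-time numbers are made up for the example; they are not the paper's measured values.

```java
public class LinearModel {

    // Ordinary least squares fit of R(C) = a*x + b (equations (4)-(7)).
    static double[] fit(double[] x, double[] y) {
        double xBar = 0, yBar = 0;
        for (int i = 0; i < x.length; i++) { xBar += x[i]; yBar += y[i]; }
        xBar /= x.length;
        yBar /= y.length;
        double ssxy = 0, ssxx = 0;
        for (int i = 0; i < x.length; i++) {
            ssxy += (x[i] - xBar) * (y[i] - yBar);  // equation (6)
            ssxx += (x[i] - xBar) * (x[i] - xBar);  // equation (7)
        }
        double a = ssxy / ssxx;      // slope, equation (4)
        double b = yBar - a * xBar;  // intercept, equation (5)
        return new double[] { a, b };
    }

    public static void main(String[] args) {
        // Illustrative measurements at 500, 1000 and 1500 users.
        double[] users = { 500, 1000, 1500 };
        double[] resp  = { 1.9, 3.8, 5.6 };  // seconds, made up for the example
        double[] ab = fit(users, resp);
        for (int u = 2000; u <= 5000; u += 500) {
            System.out.printf("predicted R at %d users: %.2f s%n",
                    u, ab[0] * u + ab[1]);
        }
    }
}
```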

VI. CONCLUSION

This paper presented a series of experiments whose goal was to explore and reason about the impact of specific configuration properties, including data compression, data schema and topology, on the performance of HBase clusters. All experiments were executed on the SAVI platform, acting as the cloud, with real data provided by the CVST project.

Our experiments have shown that, with respect to the HBase data schema, row keys with an MD5 cryptographic hash were significantly faster than row keys without it. This was due to RegionServer hot spotting when the keys were not properly distributed across machines8. Regarding the topology of the cluster, we have shown that clusters with a higher number of small instances performed consistently faster, because of cluster underutilization under the out-of-the-box configuration. Finally, we found that data compression, although it had an impact on the performance of the cluster, was the least significant of the three factors we explored. In addition to these findings, our experiments showed that the response time is linearly related to the number of users issuing requests to the cluster. We took advantage of this relationship and proposed a linear model to estimate the response time for a variable number of users. Given only a small number of data points, the model can extrapolate the response time of a particular configuration to more and larger workloads. This enables the evaluation of a cluster configuration faster and with less effort.

8 http://hbase.apache.org/book.html#rowkey.design Last accessed: 08-Jun-2015


ACKNOWLEDGMENT

This research was supported by the IBM Centres for Advanced Studies (CAS), the Natural Sciences and Engineering Research Council of Canada (NSERC) under the Smart Applications on Virtual Infrastructure (SAVI) Research Network, and the Ontario Research Fund for Research Excellence under the Connected Vehicles and Smart Transportation (CVST) project.

REFERENCES

[1] M. Litoiu, M. Woodside, J. Wong, J. Ng, and G. Iszlai, "A business driven cloud optimization architecture," in Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010, pp. 380-385.
[2] M. Shtern, B. Simmons, M. Smit, and M. Litoiu, "An architecture for overlaying private clouds on public providers," in Proceedings of the 8th International Conference on Network and Service Management. International Federation for Information Processing, 2012, pp. 371-377.
[3] D. J. Abadi, "Data management in the cloud: Limitations and opportunities," IEEE Data Eng. Bull., vol. 32, no. 1, pp. 3-12, 2009.
[4] M. A. Babar and M. A. Chauhan, "A tale of migration to cloud computing for sharing experiences and observations," in Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing. ACM, 2011, pp. 50-56.
[5] SAS, "What is Big Data?" http://www.sas.com/en_us/insights/big-data/what-is-big-data.html, Last accessed: 08-Jun-2015.
[6] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277-1288, 2008.
[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205-220.
[8] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash et al., "Apache Hadoop goes realtime at Facebook," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011, pp. 1071-1080.
[9] J. Huang, X. Ouyang, J. Jose, M. Wasi-ur-Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda, "High-performance design of HBase with RDMA over InfiniBand," in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012, pp. 774-785.
[10] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on. IEEE, 2011, pp. 363-366.
[11] R. Rivest, "The MD5 message-digest algorithm," 1992.
[12] G. Lee, B.-G. Chun, and R. H. Katz, "Heterogeneity-aware resource allocation and scheduling in the cloud," Proceedings of HotCloud, pp. 1-5, 2011.
[13] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in OSDI, vol. 8, no. 4, 2008, p. 7.

[14] G. Song, Z. Meng, F. Huet, F. Magoules, L. Yu, and X. Lin, "A Hadoop MapReduce performance prediction method," in High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on. IEEE, 2013, pp. 820-825.
[15] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 2012, pp. 18-18.
[16] L. George, HBase: The Definitive Guide. O'Reilly Media, Inc., 2011.
[17] D. Han and E. Stroulia, "A three-dimensional data model in HBase for large time-series dataset analysis," in Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), 2012 IEEE 6th International Workshop on the. IEEE, 2012, pp. 47-56.
[18] K. Ting and J. J. Cecho, Apache Sqoop Cookbook. O'Reilly Media, Inc., 2013.
[19] J. T. McClave, P. G. Benson, and T. Sincich, Statistics for Business and Economics. Pearson/Prentice Hall, 2005.
