Provisioning and Evaluating Multi-domain Networked Clouds for Hadoop-based Applications

Anirban Mandal, Yufeng Xin, Ilia Baldine, Paul Ruth, Chris Heerman

Jeff Chase, Victor Orlikowski, Aydan Yumerefendi

Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Computer Science, Duke University, Durham, NC, USA

Abstract—In this work, we have designed and implemented new algorithms and mechanisms that allow Hadoop-based applications to request and provision Hadoop clusters across multiple cloud domains and link them via bandwidth-provisioned network pipes – "on-demand" provisioning of Hadoop clusters on multi-domain networked clouds. Our prototype implementation used an existing control framework that orchestrates leasing and acquiring of heterogeneous resources from multiple, independent cloud and network resource providers. We have experimented with various provisioning configurations based on varying bandwidth constraints and have done a thorough performance evaluation of representative Hadoop benchmarks and applications on the provisioned resource configurations. We have evaluated under what conditions multi-cloud Hadoop deployments pose significant advantages or disadvantages and carefully measured and analyzed Map/Reduce/Shuffle performance under those conditions. We have compared multi-cloud Hadoop deployments with single-cloud deployments and investigated Hadoop Distributed File System (HDFS) performance under varying network configurations. The results of our experiments show that networked clouds make cross-cloud Hadoop deployment feasible when we have high bandwidth network links connecting the clouds. Performance degrades greatly when there is poor inter-cloud bandwidth, and the degradation in bandwidth-starved scenarios can be attributed to poor performance in the shuffle and reduce stages of MapReduce computations. We have also shown that performance of the Hadoop Distributed File System (HDFS) is extremely sensitive to available network bandwidth and Hadoop's topology-awareness feature can be leveraged to optimize performance in hybrid bandwidth scenarios. We also observed that multi-core resource contention (I/O, memory contention) needs to be taken into consideration when Hadoop applications are run on clouds built using multi-core blades.

Index Terms—Cloud computing, Hadoop, network provisioning, performance analysis

I. INTRODUCTION

Pervasive virtualization at the edge and in the network core drives the evolution of the IT infrastructure towards a service-oriented model [1]. It permits a move from static arrangements of resources that persist over long periods of time to highly dynamic arrangements that respond to the needs of customers by dynamically provisioning the necessary network and edge resources with some notion of optimality. Clouds are one type of virtualized service offered as a unified hosting substrate for diverse applications, using various technologies to virtualize servers and orchestrate their operation. Emerging cloud infrastructure-as-a-service efforts include Amazon EC2, Eucalyptus, Nimbus, Tashi, OpenCirrus, and IBM's Blue Cloud. Extending cloud hosting into the network is a crucial step to enable on-demand allocation of complete networked IT environments. Other types of "substrate", such as storage and networks, are also developing control approaches that allow them to be subdivided and/or virtualized, and are offering those capabilities through various control interfaces. In networks, virtualization mechanisms (circuits, wavelengths, etc.) are exposed via control planes such as MPLS, GMPLS, DOE IDC/OSCARS, NLR's Sherpa, Internet2's DRAGON, and the emerging work on a standardized network interface through OGF NSI. The next frontier in this work is enabling the creation of networked clouds: orchestrated arrangements of heterogeneous resources (compute, storage, networks, content, scientific instruments, etc.) acquired through a single interface. There are several efforts in that direction [2], [3], [4]. In our previous work [5], we used an alternative meta-control architecture designed independently of any a priori substrate assumptions and capable of driving multiple heterogeneous substrates using their native control/management interfaces, creating orchestrated arrangements of these resources acquired from multiple independent providers. This has made it possible to "stitch" together compute and network resources to build custom resource configurations for users, which can be built and torn down on demand.

Cloud computing has gained significant traction both in industry and in academia. At the same time, the Hadoop/MapReduce paradigm is being used extensively for distributed processing of large data sets on clusters of computers. Hadoop is used heavily by large corporations like Yahoo, Facebook and Amazon. The combination of the two technologies, cloud computing and Hadoop/MapReduce, has opened new frontiers. Being able to set up a Hadoop cluster on demand on a compute cloud and run MapReduce computations on that cluster with low effort and cost is an extremely attractive proposition. In the last few years, we have seen widespread use of Hadoop on compute clouds like Amazon EC2. Note the well-publicized feat where the New York Times used Amazon's EC2 compute cloud and a Hadoop application to crunch through four terabytes of

scanned archives from the paper, converting them to PDFs for the Web [?]. Amazon has also launched the Amazon Elastic MapReduce [13] service, which enables running Hadoop on EC2 as a separate service. The combination of Hadoop and cloud platforms has democratized large-volume distributed data processing for individual users, developers, scientists, researchers, and small and large corporations.

Networked clouds open up another dimension for Hadoop applications running on clouds. Hadoop is relatively insensitive to transit latencies and requires high bandwidth when operating on large datasets, which may allow it to operate on a number of widely distributed clouds interconnected by high-bandwidth networks. In practice, however, Hadoop applications are rarely run across multiple clouds because of bandwidth limitations in the commodity Internet connecting the clouds. But if bandwidth-provisioned network pipes connect compute clouds, as in the case of networked clouds, deploying Hadoop clusters across clouds and running MapReduce computations across clouds becomes feasible. Beyond the obvious advantage of being able to leverage excess capacity from other clouds, this capability would also be useful when a data-set resides on one cloud site and another cloud needs it, when data needs to be shared between clouds, or when clouds need to be provisioned on sites separate from where data is generated (telescopes or other instruments). However, the scenario of provisioning Hadoop clusters across multiple cloud sites has rarely been studied in the literature. Because provisioning multi-cloud Hadoop clusters is a relatively new capability, there has to date been no comprehensive study of multi-cloud Hadoop deployments and of the performance of Hadoop distributed applications on the provisioned resource configurations. In this work, we have made the following contributions:
• We have designed and implemented new algorithms and mechanisms that allow Hadoop-based applications to request and provision Hadoop clusters across multiple cloud domains and link them via network transit domains satisfying guaranteed end-to-end bandwidth – "on-demand" provisioning of Hadoop clusters on networked clouds.
• Our "proof-of-concept" prototype implementation used an existing control framework that orchestrates leasing and acquiring of heterogeneous resources from multiple, independent cloud and network resource providers.
• We have experimented with various provisioning configurations and have done a thorough performance evaluation of representative Hadoop benchmarks and applications on the provisioned configurations.
• We have evaluated under what conditions multi-cloud Hadoop deployments pose significant advantages or disadvantages and carefully measured and analyzed Map/Reduce/Shuffle performance under those conditions.
• We have compared multi-cloud deployments with single-cloud deployments and investigated Hadoop Distributed File System (HDFS) performance and Hadoop topology-awareness under various networking configurations.

The rest of the paper is organized as follows. Section II describes our algorithms for embedding Hadoop master-worker topologies on networked clouds. In section III, we present details of our prototype implementation of "on-demand" provisioning of Hadoop clusters on multiple networked clouds. Section IV presents our evaluation of multi-domain Hadoop deployments using representative Hadoop applications and benchmarks and details our experimental results. Related work is described in section V, and section VI concludes the paper.

II. EMBEDDING TOPOLOGIES FOR HADOOP-STYLE APPLICATIONS

For provisioning multi-cloud Hadoop deployments, we designed an algorithm for embedding a Hadoop master-worker topology request. Requests for embedding resource topologies are called virtual topology (VT) requests. VT requests for a virtual cluster in a networked cloud environment pose some extra constraints beyond general VT requests: (1) All the nodes need to be able to communicate with every other node in the cluster. This implies that they should be put in the same virtual network, either in a layer-2 Ethernet domain or a layer-3 routed network. We take the former approach in this work, which builds the cluster in the form of a broadcast tree over the physical network infrastructure; the latter approach requires designing a virtual topology first and will be a topic of our future study. (2) The cluster consists of master node(s) and worker nodes that may run different software, or the same software in different modes, and therefore may have different requirements for CPU and memory resources. Ideally, the master node(s) should run first, and the worker nodes need to know the IP address(es) of the master node(s). Building on our previous work on solving the general virtual topology embedding (VTE) problem [6], we address the Hadoop cluster embedding problem in this section. The request representation and algorithm can be easily extended to other types of clusters, such as a Condor cluster.

A. Virtual Cluster Request Representation

In a cloud environment, cluster nodes are explicitly provisioned in the form of VMs that may have different resource requirements. We classify the requests as bound or unbound. Bound requests specify the cloud sites for the VMs, and the system determines the internal network connectivity within that particular site for the provisioned VMs. Unbound requests only describe the virtual cluster; the system either selects a suitable cloud site to embed the cluster or, when the requested cluster is too big, partitions the virtual cluster into multiple pieces to embed into multiple cloud sites. In the latter case, the system automatically sets up the inter-cloud network connectivity with guaranteed bandwidth. Because the partitioning and embedding decisions are made by the system, the request representation schema is designed to be simple and flexible, so users can form their requests without needing to know the underlying infrastructure.

In a request, the cluster nodes are first grouped into node groups, e.g. a master node group and a worker node group. Each group can then specify the following requirements: (1) VM type and image; customized VM types and images can be defined in the system. (2) Number of nodes. (3) IP address range; normally each group specifies a starting IP address and network mask, and the system automatically generates IP addresses for all the nodes sequentially. (4) Node group dependency; e.g., a slave group can specify its dependency on a master group, which will then be provisioned first by the system. (5) Post-boot script identifier; the system has built-in customized post-boot scripts for Hadoop and Condor nodes to run after the VMs are provisioned. This "one-shot" capability allows automatic configuration and operation of the cluster (e.g., master or slave mode, cluster topology awareness, etc.) after it is provisioned, and essentially gives the user an operational cluster ready for running applications.
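To make the schema concrete, the sketch below shows what an unbound virtual cluster request of this form might look like, together with the sequential IP assignment rule. The field names and values are illustrative assumptions, not the exact request syntax accepted by our system.

```python
import ipaddress

# A hypothetical unbound virtual-cluster request; field names are illustrative.
request = {
    "node_groups": [
        {
            "name": "hadoop-master",
            "vm_type": "m1.large",            # (1) VM type and image
            "image": "hadoop-0.20.2-image",
            "count": 1,                        # (2) number of nodes
            "ip_start": "172.16.1.1/24",       # (3) starting IP address + netmask
            "postboot_script": "hadoop-master-setup",   # (5) post-boot script id
        },
        {
            "name": "hadoop-workers",
            "vm_type": "m1.small",
            "image": "hadoop-0.20.2-image",
            "count": 7,
            "ip_start": "172.16.1.10/24",
            "depends_on": "hadoop-master",     # (4) node-group dependency
            "postboot_script": "hadoop-worker-setup",
        },
    ],
    "bandwidth_mbps": 1000,   # requested end-to-end bandwidth between the groups
}

def assign_ips(group):
    """Expand a group's starting IP/netmask into sequential addresses,
    mirroring the schema's rule that node addresses are generated in order."""
    iface = ipaddress.ip_interface(group["ip_start"])
    base = int(iface.ip)
    return [str(ipaddress.ip_address(base + i)) for i in range(group["count"])]

for group in request["node_groups"]:
    print(group["name"], assign_ips(group))
```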

B. Virtual Cluster Embedding and Stitching

For the networked cloud environment, we distinguish between cloud providers and transit network providers, which provide inter-cloud virtual network service via on-demand QoS-guaranteed inter-domain paths. Our approach relies on a broker service to compute the inter-domain level connections, leaving the individual resource providers to complete the intra-domain segment computation locally. For an unbound VT request that may need partitioning over multiple cloud sites, the algorithm relies on extensions to two key mechanisms we developed in [6]:

1) VTE Embedding: The inter-cloud VTE algorithm runs an efficient heuristic that integrates a minimum k-cut algorithm to partition the virtual topology and subgraph isomorphism detection to embed the partitioned topology into the clouds. Partitions mapped to different cloud sites are connected by inter-domain connections. The partition decision (the value of k) is made so that it optimizes a load-balance cost function based on the available resources at each cloud site. In particular, for a virtual cluster (VC) request, the system first computes whether partitioning of the cluster is needed. Iterating over the possible partition solutions, it ultimately obtains a k-partition solution in the form of a tree with k nodes, whose embedding is our cluster embedding solution.

2) Resource Stitching from Multiple Providers: We have developed a coordinated stitching framework to configure the requested virtual system after the resource computation (path computation or embedding) is done. The framework starts by building a directed dependency graph that reflects the sequence of provisioning the different pieces (slivers) from different providers and the parameters a sliver needs from its predecessors before its configuration can be completed. More details can be found in [5]. The VC request adds another dependency relationship, derived from the node group dependency in the request. This new relationship has been included in the dependency tree build process so that the master node is always provisioned first and its IP address is then passed to the worker nodes. We note that slave nodes may be in the same cloud site as the master node or in a different one, and the two cases need to be treated differently.
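The toy sketch below illustrates these two ideas at a very small scale: splitting a master-plus-workers cluster across sites according to a simple notion of available capacity, and emitting a provisioning order in which the master and the inter-domain links precede the workers that depend on the master's IP. The real policy uses the minimum k-cut and subgraph isomorphism machinery of [6]; this greedy version and its site names are only illustrative.

```python
# Toy partitioner and provisioning-order generator (illustrative only).
def partition_workers(num_workers, site_capacity):
    """site_capacity: available VM slots per cloud site, e.g. {"renci": 5, "unc": 3}."""
    placement = {site: 0 for site in site_capacity}
    # Put the master at the site with the most available capacity.
    master_site = max(site_capacity, key=site_capacity.get)
    remaining = dict(site_capacity)
    remaining[master_site] -= 1          # master consumes one slot
    for _ in range(num_workers):
        # Load-balance proxy: next worker goes to the site with the most room left.
        site = max(remaining, key=remaining.get)
        if remaining[site] <= 0:
            raise RuntimeError("not enough aggregate capacity for the request")
        remaining[site] -= 1
        placement[site] += 1
    return master_site, placement

def provisioning_order(master_site, placement):
    """Dependency-driven order: master VM, then inter-domain circuits, then workers."""
    steps = [f"vm:master@{master_site}"]
    steps += [f"vlan:{master_site}<->{s}"
              for s in placement if s != master_site and placement[s] > 0]
    steps += [f"vm:{placement[s]}-workers@{s}" for s in placement if placement[s] > 0]
    return steps

master_site, placement = partition_workers(7, {"renci": 5, "unc": 3})
print(placement)                          # e.g. {'renci': 4, 'unc': 3}
print(provisioning_order(master_site, placement))
```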

III. PROTOTYPE IMPLEMENTATION

We implemented on-demand provisioning and deployment of a Hadoop cluster on networked clouds using a prototype provisioning system called ORCA (Open Resource Control Architecture) [?]. ORCA is an extensible platform for dynamic leasing of heterogeneous resources in a shared infrastructure. ORCA is one of the candidate control frameworks for the GENI project [5], [7], [8], [9] and has been deployed in an infrastructure testbed with multiple transit networks and Eucalyptus cloud clusters. ORCA uses a resource lease contract abstraction: resource providers advertise or delegate their resources to broker intermediaries, and users request resources from the brokers. For example, a cloud provider offers its resources in terms of unit counts and types of VMs and the available virtual network connections between them; a transit network provider delegates its resources as a number of available virtual connections between its border interfaces, with bandwidth constraints.

We have implemented the embedding and stitching algorithms described in section II as policy modules within ORCA. Based on a user's request for a Hadoop master-worker topology of a particular size with bandwidth constraints, the algorithm selects a good partitioning of resource requests between multiple providers (transit and cloud) and determines the ordering and details of requests to those providers to assemble the end-to-end topology. Using existing ORCA mechanisms, VMs for the Hadoop master and workers are instantiated across cloud sites and linked together via bandwidth-provisioned VLAN circuits.

For deploying cloud sites, we have used Eucalyptus [10], an open-source cloud software stack that mimics the behavior of Amazon EC2 [11]. In order to provision virtual networks within a cloud site, we have developed NEuca [7], a Eucalyptus extension that allows guest VM configurations that enable virtual topology embedding within a cloud site. NEuca consists of a set of patches for Eucalyptus and additional guest configuration scripts installed onto the virtual machine image that enhance the functionality of a private Eucalyptus cloud without interfering with its normal operations. It allows VMs instantiated via Eucalyptus to have additional network interfaces not controlled by Eucalyptus. Using the --user-data-file command line option of the euca-run-instances (or ec2-run-instances) command, a user passes additional guest configuration parameters to NEuca without changing the Eucalyptus/EC2 API. The parameters are passed as a well-structured .INI-formatted file that contains configuration information for the instance, which NEuca uses to configure additional networking interfaces (eth1, eth2, etc.) of the virtual machine. In the context of this effort, NEuca enables us to stitch together secondary interfaces of VMs provisioned by multiple providers using private VLANs, while leaving the primary interface publicly exposed for administrative (SSH) access by the user. The private VLANs form the data plane of the provisioned slice, which has the guaranteed quality of service and repeatable performance critical both for network testbeds and for running distributed applications. In order to complete the stitching process connecting the VMs together, ORCA configures the same VLANs on the switch to which the Eucalyptus cluster is attached. NEuca also allows VMs to execute custom scripts that are passed through topology requests at boot time and are not built into the VM image being instantiated. We use this feature to start the Hadoop daemons on the instantiated VMs at boot time.
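As a rough illustration of this mechanism, the snippet below writes the kind of INI-style user-data file that could be handed to euca-run-instances via --user-data-file. The section and key names are assumptions made for illustration, not NEuca's actual schema.

```python
import configparser

# Illustrative guest configuration: one secondary (data-plane) interface on a
# provisioned VLAN, plus the post-boot script identifier from the cluster request.
guest_config = configparser.ConfigParser()
guest_config["interfaces"] = {
    # eth0 stays under Eucalyptus control for public/management access;
    # eth1 is the NEuca-configured data-plane interface (values are made up).
    "eth1": "vlan:1012,ip:172.16.1.10/24",
}
guest_config["scripts"] = {
    "postboot": "hadoop-worker-setup",
}

with open("neuca-user-data.ini", "w") as f:
    guest_config.write(f)
```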

A. Hadoop Scripts for Eucalyptus Clouds

We have heavily modified the set of scripts that ship with the Hadoop source code under contrib/ec2. We have created a Eucalyptus image containing the Hadoop software, and these scripts are built into that virtual machine image. The scripts generate the Hadoop configuration files dynamically, based on the IP address of the Hadoop master, before starting the Hadoop daemons. With the help of NEuca, these scripts are launched automatically at virtual machine boot time. The "topology-awareness" information [?] is also passed through the NEuca scripts and is picked up by the Hadoop master daemons. When Hadoop workers boot up, they can communicate with and send heartbeats to the Hadoop master using the data-plane network. The Hadoop master and workers within a Hadoop cluster are thus tied together by a bandwidth-provisioned data plane that spans multiple clouds. When all the VMs have finished booting, we have a multi-cloud Hadoop cluster ready for applications.
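A minimal sketch of what such boot-time configuration amounts to is shown below, assuming Hadoop 0.20-era configuration files and the conventional NameNode/JobTracker ports; the actual templates our scripts emit differ in detail.

```python
# Sketch: given the master's data-plane IP (delivered at provisioning time),
# write the two core Hadoop 0.20-style config files before starting the daemons.
MASTER_IP = "172.16.1.1"   # illustrative address passed to the workers

core_site = f"""<?xml version="1.0"?>
<configuration>
  <property><name>fs.default.name</name><value>hdfs://{MASTER_IP}:9000</value></property>
</configuration>
"""

mapred_site = f"""<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>{MASTER_IP}:9001</value></property>
</configuration>
"""

for path, body in [("core-site.xml", core_site), ("mapred-site.xml", mapred_site)]:
    with open(path, "w") as f:
        f.write(body)
```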

IV. EXPERIMENTAL EVALUATION

Hadoop/MapReduce occupies one extreme of the spectrum of distributed applications due to its high degree of parallelization. It is relatively insensitive to transit latencies and requires high bandwidth when operating on large datasets, which allows it to operate on a number of widely distributed clusters interconnected by high-bandwidth networks. In this section, we present our evaluation of on-demand, multi-cloud, networked Hadoop cluster deployments using several Hadoop benchmarks and applications. We experiment with various topologies based on the bandwidths provisioned between and within the clouds and study their effect on the performance of Hadoop-based applications and benchmarks. We present several experiments to 1) evaluate under what conditions multi-cloud Hadoop deployments are feasible and pose significant advantages or disadvantages, 2) carefully measure and analyze Map/Shuffle/Reduce performance under those conditions, 3) compare multi-cloud deployments with single-cloud deployments, 4) study Hadoop Distributed File System (HDFS) performance under various networking configurations, and 5) investigate the usefulness of the topology-awareness feature [?] in Hadoop cluster deployments, which tries to mitigate low inter-datacenter/cloud bandwidth by carefully choosing "DataNodes" for HDFS block replica placement and for task assignment. The following sections describe the networked cloud infrastructure used for the experiments, the experimental design of the various provisioning scenarios, and our findings.

A. Infrastructure

Our experimental infrastructure consists of two Eucalyptus cloud sites: one located at the University of North Carolina at Chapel Hill ("UNC Cloud") and another located at the Renaissance Computing Institute at Chapel Hill ("RENCI Cloud"). The UNC Cloud is built from one single-core Eucalyptus head node and eight single-core Eucalyptus worker nodes, each a Dell 860 blade. Each worker node has one Intel Celeron (2.8GHz) processor with 1GB of memory. The UNC Cloud runs the Xen hypervisor and is capable of running up to 8 m1.large or 12 m1.small virtual machine instances. The RENCI Cloud is built from one Eucalyptus head node and two Eucalyptus worker nodes, each a Dell 2950 blade. One of the worker nodes is an 8-core Intel Xeon (2.66GHz) with 8GB of memory and the other is a 4-core Intel Xeon (2.66GHz) with 4GB of memory. The RENCI Cloud runs the KVM hypervisor and is capable of running 12 m1.large or 48 m1.small virtual machine instances. The blades in each Eucalyptus cloud are tied together with Juniper EX3200 switches, which are connected to the BEN (Breakable Experimental Network) testbed. BEN is a metro-scale dark fiber facility with several PoPs (Points of Presence) in North Carolina's Research Triangle. The RENCI and UNC PoPs in the BEN network are equipped with a fiber switch, DWDM gear from Infinera supporting OTN, and an L2/L3 switch/router from Cisco. ORCA uses the native TL1 and CLI interfaces of all of these network elements to create the appropriate cross-connects or circuits to support the multi-layered topologies needed to create bandwidth-provisioned layer-2 VLAN connections across BEN. The RENCI/UNC Eucalyptus clouds at the edge expose Amazon EC2 interfaces for creating VMs on specific VLANs.

B. Experimental Design

Using the infrastructure described above, we leveraged the ORCA control framework to request, provision and instantiate different resource configurations (slices) with varying inter- and intra-cloud network bandwidths and distributions of VM counts. We used three levels of bandwidth in our requests: 10 Mbits/sec ("Low Bandwidth"), 100 Mbits/sec ("Medium Bandwidth"), and 1000 Mbits/sec ("High Bandwidth"). For each experiment, we requested a Hadoop cluster of size 8 (1 Hadoop master and 7 Hadoop slaves). The Hadoop cluster could be instantiated on a single cloud site ("UNC Cloud" or "RENCI Cloud") or on multiple cloud sites ("Multi-site Cloud"). To test the topology-awareness feature of Hadoop, we set different bandwidth values for the inter-cloud and intra-cloud cases. All experiments used Hadoop topology awareness except for one case where it was turned off. For the "Multi-site Cloud" case, the split between the two cloud sites and the mapping of the Hadoop master and slaves to the cloud sites were determined by the topology embedding algorithm; based on the available resources at the two sites, 5 VMs were instantiated at the RENCI Cloud and 3 VMs at the UNC Cloud. Based on these parameters, we experimented with 9 scenarios: (1) Multi-site Cloud with High inter- and intra-cloud bandwidth, (2) Multi-site Cloud with Medium inter- and intra-cloud bandwidth, (3) Multi-site Cloud with Low inter- and intra-cloud bandwidth, (4) Multi-site Cloud with Medium intra-cloud bandwidth and Low inter-cloud bandwidth (Hybrid BW), (5) Multi-site Cloud with Medium intra-cloud bandwidth, Low inter-cloud bandwidth (Hybrid BW) and topology-awareness turned off, (6) single-site RENCI Cloud with High intra-cloud bandwidth, (7) single-site RENCI Cloud with Medium intra-cloud bandwidth, (8) single-site UNC Cloud with High intra-cloud bandwidth, and (9) single-site UNC Cloud with Medium intra-cloud bandwidth.
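For reference, these nine configurations can be written down as a simple scenario table for a test driver; the layout below is just our own bookkeeping for this paper's experiments, not a format consumed by the provisioning system.

```python
# Requested bandwidth levels (Mbits/sec) and the nine experimental scenarios.
BW = {"low": 10, "medium": 100, "high": 1000}

SCENARIOS = [
    {"deploy": "multi-site", "intra": "high",   "inter": "high",   "topo_aware": True},
    {"deploy": "multi-site", "intra": "medium", "inter": "medium", "topo_aware": True},
    {"deploy": "multi-site", "intra": "low",    "inter": "low",    "topo_aware": True},
    {"deploy": "multi-site", "intra": "medium", "inter": "low",    "topo_aware": True},   # Hybrid BW
    {"deploy": "multi-site", "intra": "medium", "inter": "low",    "topo_aware": False},  # Hybrid BW, no topology
    {"deploy": "renci",      "intra": "high",   "inter": None,     "topo_aware": True},
    {"deploy": "renci",      "intra": "medium", "inter": None,     "topo_aware": True},
    {"deploy": "unc",        "intra": "high",   "inter": None,     "topo_aware": True},
    {"deploy": "unc",        "intra": "medium", "inter": None,     "topo_aware": True},
]
```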

C. Hadoop Benchmarks and Applications

We used the following benchmarks and applications for our experiments.

1) Hadoop Sort: We experimented with the Hadoop Sort benchmark included in the Hadoop source code distribution (we used Hadoop version 0.20.2). This benchmark is very useful for testing a Hadoop deployment. It is all the more useful for evaluating network and HDFS performance because the entire data-set goes through the shuffle stage, which exercises the network between the slaves, and the sort result, which is the same amount of data, is pushed to HDFS in the reduce step. The performance of writes to HDFS also depends on the network characteristics, because block replicas are written on other "DataNodes" and need to be transferred from other "DataNodes". The map outputs are written to the disk on the machine doing the map, and the entire data-set is written in the shuffle step onto the machines doing the reduce. So, twice the size of the data-set is potentially written to disk when the data-set is sufficiently large. In our experiments, we observed that the entire data-set is "spilled" to disk on both the map and reduce sides. This means the Sort benchmark also warrants reasonable disk performance. We ran the Sort benchmark for input data-set sizes ranging from 128MB up to 2048MB on all the resource configuration scenarios described above.

2) TestDFSIO: We also experimented with the TestDFSIO benchmark, which is likewise included in the Hadoop source distribution and is used to test HDFS I/O performance. The benchmark takes as input the number and size of files to be pushed to HDFS. For each file, it runs a map job that writes a file of the given size into HDFS; the writes by the different map jobs happen in parallel. As above, the performance of the TestDFSIO benchmark also depends on the network characteristics. We ran the TestDFSIO benchmark with 10 files of varying sizes (10MB to 2500MB total).

3) copyToFromHDFS: We wrote a simple micro-benchmark called "copyToFromHDFS", which uses "hadoop fs -copyFromLocal" to write files of various sizes to HDFS. There are no parallel map tasks in this case. This benchmark also exercises the network. We ran this benchmark with file sizes ranging from 128MB up to 2048MB.

4) NCBI BLAST: We also experimented with a Hadoop-based scientific application called Hadoop-BLAST [?], obtained from NCBI and Indiana University. This application runs the "blastx" program from the BLAST suite [?]. It is a purely compute-intensive application. We varied the number of input files from 7 to 56 in steps of 7. One map task is executed for each input file and the result is written to HDFS; there is no reduce step. The volume of HDFS I/O is minimal, and hence this application is not sensitive to network characteristics.
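The driver below sketches how these workloads can be launched against the provisioned cluster using the stock Hadoop 0.20.2 examples and test jars. The installation paths are assumptions, and the Sort inputs are assumed to have been generated into HDFS by a separate job beforehand, as described above.

```python
import subprocess

# Assumed install locations; adjust to the image's actual layout.
HADOOP = "/usr/local/hadoop/bin/hadoop"
EXAMPLES_JAR = "/usr/local/hadoop/hadoop-0.20.2-examples.jar"
TEST_JAR = "/usr/local/hadoop/hadoop-0.20.2-test.jar"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1) Hadoop Sort over pre-generated inputs of each size.
for size_mb in (128, 256, 512, 1024, 2048):
    run([HADOOP, "jar", EXAMPLES_JAR, "sort",
         f"/bench/sort-in-{size_mb}MB", f"/bench/sort-out-{size_mb}MB"])

# 2) TestDFSIO: 10 parallel writers, per-file size chosen to hit the listed totals.
for file_mb in (1, 10, 100, 250):       # totals of 10MB, 100MB, 1000MB, 2500MB
    run([HADOOP, "jar", TEST_JAR, "TestDFSIO", "-write",
         "-nrFiles", "10", "-fileSize", str(file_mb)])

# 3) copyToFromHDFS: a single client copying one local file of each size into HDFS.
for size_mb in (128, 256, 512, 1024, 2048):
    run([HADOOP, "fs", "-copyFromLocal",
         f"local-{size_mb}MB.dat", f"/bench/copy-{size_mb}MB.dat"])
```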

D. Experimental Results

Fig. 1. Multi-site Cloud with different provisioned bandwidths (Sort): execution time vs. sort data size for the High, Medium, Hybrid (topology-aware and topology-unaware) and Low BW networks.

Fig. 2. Multi-site Cloud with different bandwidths (copyToFromHDFS): execution time vs. file size for the same five bandwidth scenarios.

Fig. 3. Reduce/Shuffle/Map average task times for different bandwidth scenarios: 2GB and 1GB Sort, from multi-cloud Low BW to single-cloud High BW.

Figure 1 shows the results of running the Hadoop Sort benchmark on a Hadoop cluster provisioned on a Multi-site Cloud with 5 VMs provisioned at the RENCI Cloud, including the Hadoop master, and 3 VMs provisioned at the UNC Cloud. The x-axis denotes the sort data size and the overall execution time is presented on the y-axis. The execution time does not include the time to write the input data into HDFS, which was done by a separate MapReduce job before sorting began. For each sort data size, we show the execution times for the different bandwidth scenarios: high bandwidth between all the VMs, medium bandwidth between all the VMs, medium bandwidth between VMs in the same cloud with low bandwidth between the clouds (Hybrid BW), with and without topology awareness, and low bandwidth between all the VMs. The results show that the Hadoop Sort benchmark is extremely sensitive to network bandwidth: high to medium bandwidth between all nodes in a Hadoop cluster is essential for reasonable Sort performance. The Hybrid BW cases roughly mimic running a multi-cloud Hadoop cluster with the commodity Internet as the network between the clouds. The results show a loss of performance of at least a factor of 4 when transitioning from uniform Medium Bandwidth to Hybrid Bandwidth. They also indicate that when different clouds are connected through bandwidth-provisioned high-speed network pipes, deployment of a Hadoop cluster across clouds/datacenters becomes feasible. This capability enables leveraging excess capacity from other connected clouds. Comparing the execution times of the topology-aware and topology-unaware cases in the Hybrid BW scenario, we observe that topology-awareness in Hadoop gives a performance boost of about 11-17%, which implies that this feature should be used whenever there are multiple racks or datacenters in a Hadoop deployment.
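Hadoop's topology awareness is driven by a user-supplied script (configured through topology.script.file.name in 0.20-era Hadoop) that maps DataNode addresses to "rack" locations; in our deployments, each cloud site plays the role of a rack. The sketch below shows such a script, with a subnet-to-site mapping that is purely illustrative.

```python
#!/usr/bin/env python
# Sketch of a rack-awareness (topology) script: Hadoop invokes it with one or
# more DataNode IPs/hostnames and expects one rack path per argument on stdout.
import sys

# Illustrative mapping: each data-plane subnet corresponds to one cloud site.
SITE_OF_SUBNET = {
    "172.16.1.": "/renci-cloud",
    "172.16.2.": "/unc-cloud",
}

def rack_of(ip):
    for prefix, rack in SITE_OF_SUBNET.items():
        if ip.startswith(prefix):
            return rack
    return "/default-rack"

print(" ".join(rack_of(ip) for ip in sys.argv[1:]))
```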

Figure 2 shows the results of running the "copyToFromHDFS" benchmark on a Hadoop cluster provisioned on a Multi-site Cloud using the five bandwidth scenarios described for the Hadoop Sort results. The x-axis shows the different file sizes and the overall execution time is shown on the y-axis. As with Hadoop Sort, we observe that writing a file into HDFS is very sensitive to network bandwidth characteristics. The HDFS replication factor was set to the default value of 3, so each HDFS file block has to be replicated and written on three DataNodes. When block replicas are written, blocks are transferred over the network from one DataNode to the other, exercising the network between the Hadoop DataNodes. At low bandwidths, this block replication process suffers a great deal, resulting in poor performance. We observe a factor of 3 to 8 loss of performance when transitioning from the Medium Bandwidth scenario to the topology-aware Hybrid Bandwidth scenario. We also observe that topology awareness helps, because this feature enables Hadoop to do a better job of placing block replicas on DataNodes.

We wanted to investigate why the Hadoop Sort benchmark performs poorly in bandwidth-starved scenarios. In figure 3, we plot the average execution times of the map, shuffle and reduce tasks in a Sort job (2GB and 1GB input sets) across different bandwidth scenarios, ranging from multi-cloud Low Bandwidth to single-cloud High Bandwidth. We observe that in the poor-bandwidth scenarios the reduce and shuffle tasks take a long time, which contributes to the overall poor performance of Hadoop Sort. The average execution time of the map tasks is insensitive to the provisioned bandwidth because map tasks operate on local data. However, since the shuffle step involves moving the entire data set between the Hadoop DataNodes/slaves over the network, it is sensitive to the network bandwidth. The reduce step also involves writing the entire sorted data-set to HDFS, which again exercises the network; it takes even longer because multiple block replicas traverse the network.
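A back-of-envelope calculation makes this sensitivity plausible. Treating the provisioned rate as a single bottleneck link and ignoring overlap and parallelism (a deliberate oversimplification for illustration, not a model of the measured numbers), the data volumes alone impose the following lower bounds on the shuffle and on the off-node replica writes of the reduce output:

```python
# Illustrative lower bounds only: assumes the whole data set crosses one link of
# the provisioned rate once in the shuffle, and (replication - 1) times for the
# off-node replicas written in the reduce step. Real clusters move data in parallel.
def lower_bound_seconds(data_mb, link_mbit_per_s, replication=3):
    shuffle = (data_mb * 8) / link_mbit_per_s
    reduce_write = (data_mb * 8 * (replication - 1)) / link_mbit_per_s
    return shuffle, reduce_write

for bw in (10, 100, 1000):   # the Low/Medium/High provisioned rates (Mbits/sec)
    s, r = lower_bound_seconds(2048, bw)
    print(f"{bw:>5} Mbit/s: shuffle >= {s:.0f}s, replica writes >= {r:.0f}s")
```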

Fig. 4. Hadoop Sort Benchmark: High and Medium Bandwidth (single-site UNC Cloud, single-site RENCI Cloud and Multi-site Cloud): execution time vs. sort data size.

Fig. 5. TestDFSIO (write): High and Medium Bandwidth (single-site UNC Cloud, single-site RENCI Cloud and Multi-site Cloud): execution time vs. total file size written.

Figure 4 shows a performance comparison between Hadoop deployments on a single-site cloud and on the multi-site cloud for the Sort application. The left graph represents the High Bandwidth scenario and the right graph the Medium Bandwidth scenario. Our first observation is that the performance of Sort on multi-cloud deployments with Medium/High bandwidth is comparable to that on single-site cloud deployments. In most cases, the performance of the multi-site cloud lies between that of the "UNC Cloud" and the "RENCI Cloud". For the High Bandwidth case, we see an anomaly: the performance of Sort on the RENCI Cloud degrades greatly at and beyond the 1024MB sort data size. To explain this, we need to look at the High Bandwidth case for the TestDFSIO benchmark results in figure 5. The TestDFSIO results show that for total file sizes beyond 1000MB there is a rapid performance degradation on the RENCI Cloud. In the reduce phase of the Sort benchmark, there are multiple reduce tasks writing data into HDFS, which is similar to the parallel writes by different map tasks in the TestDFSIO benchmark. So, the poor Sort performance on the RENCI Cloud in the High Bandwidth case is a result of poor HDFS write performance. But why don't we see a similar rapid degradation on the UNC Cloud?

The explanation lies in the fact that the backend worker nodes of the RENCI Cloud (on which the VMs are instantiated) are two multi-core machines (one with 8 cores and the other with 4 cores), while the backend worker nodes of the UNC Cloud are 8 single-core nodes. Although they have faster processors and more memory, the RENCI multi-core worker nodes suffer from severe resource contention, in this case I/O contention and thrashing of I/O buffers, because the simultaneous writes from the VMs on multiple cores, together with block replica writes from VMs on the other node, are multiplexed onto a single host OS per node. The UNC Cloud does not have the multi-core resource contention problem, but its performance is generally sluggish because of older CPUs and slower disks. If we look at the performance of the TestDFSIO benchmark for the Medium Bandwidth case in figure 5, we notice that there is no performance anomaly on the RENCI Cloud. This is because the I/O contention bottleneck is not triggered in the Medium Bandwidth case: block replica writes do not arrive fast enough from the VMs on the other node.

Fig. 6. copyToFromHDFS (write): High and Medium Bandwidth (single-site UNC Cloud, single-site RENCI Cloud and Multi-site Cloud): execution time vs. file size.

So, a complex set of factors must exist simultaneously to trigger I/O contention and poor HDFS I/O performance on clouds built from multi-core nodes: (a) a large number of VMs simultaneously doing HDFS writes on a multi-core blade, and (b) high bandwidth between the VMs participating in HDFS block replica writes. If either factor is absent, we do not observe degraded performance. We never observe anomalous performance in the Multi-site Cloud case because there are not enough VMs simultaneously doing HDFS writes on a multi-core blade at the RENCI Cloud; five VMs distributed between two backend nodes are not enough to trigger the contention. Coming back to the Sort benchmark, in the Medium Bandwidth case in figure 4 we observe the expected performance up to a sort data size of 1024MB, with the RENCI Cloud outperforming the UNC Cloud and the Multi-site Cloud performance lying between the two. However, we see a jump in execution time for the RENCI Cloud at the 2048MB data size. We know from figure 5 that, for Medium Bandwidth, the HDFS write degradation does not manifest on the RENCI Cloud, so the increase in execution time is not due to I/O contention. We believe this anomaly is caused by memory bandwidth contention due to the memory-intensive merge operations in the reduce phase (we verified that the reduce phase took longer), with several VMs simultaneously accessing memory on the same multi-core node and saturating the memory bandwidth.

Figure 6 shows the results of running the "copyToFromHDFS" benchmark for the High and Medium Bandwidth scenarios to compare the performance of single-site and multi-site clouds. We observe that the execution time for the multi-site cloud case is comparable to that of the single-site cases. There is a small degradation in the High Bandwidth case for the RENCI Cloud at file sizes of 1024MB and beyond. The reason is the same as described for the TestDFSIO benchmark. The effect is not as pronounced as in the case of TestDFSIO because the "copyToFromHDFS" benchmark does not have concurrent writes from map/reduce tasks on multiple VMs. So, it does not

meet the first factor for triggering resource contention, but it suffers minor performance degradation due to fast and simultaneous block replica writes from VMs on the other nodes. From the results of the Sort, TestDFSIO and copyToFromHDFS benchmarks, we can infer that resource contention is a serious issue to consider when deploying clouds on multi-core blades, and smart decisions need to be made about the maximum number of VMs instantiated on a single blade. For the Sort application, we hit the multi-core I/O bottleneck at High Bandwidth and the multi-core memory bandwidth bottleneck at Medium Bandwidth (see [28], [29] for work on multi-core I/O and memory contention).

In figure 7, we show the results of running the Hadoop version of the BLAST application on single- and multi-site clouds for the High Bandwidth case. The results for all the other bandwidth scenarios are very similar to the ones presented in this figure. This application is purely computational and has very minimal I/O. It also runs only map tasks. So, it rarely stresses the network and hence is insensitive to the network characteristics between the Hadoop slaves. We observe the expected performance with an increasing number of input files. The multi-site cloud performance lies between that of the UNC Cloud and the RENCI Cloud because the RENCI Cloud is about 2.5 to 3 times faster than the UNC Cloud at running "blastx", and the multi-site cloud case uses 4 worker VMs from the RENCI Cloud and 3 from the UNC Cloud. The staggered increase in execution times with an increasing number of input files is due to the way Hadoop assigns Map tasks to DataNodes. Since the maximum number of Map tasks per DataNode was set to 3 and we had 7 slaves in total, there is a jump in total execution time for the RENCI and UNC Clouds after 21 and after 42 input files. The minimum total execution time depends on the number of Map waves required to process all the input files; it can be worse than that because of Map task failures.
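The step positions follow directly from the slot arithmetic (the per-node limit is the mapred.tasktracker.map.tasks.maximum setting in 0.20-era Hadoop); the small calculation below just restates that arithmetic:

```python
import math

SLAVES = 7
MAP_SLOTS_PER_NODE = 3   # maximum concurrent map tasks per DataNode in our setup

def map_waves(num_input_files):
    """Number of 'waves' of maps needed when each file is one map task."""
    concurrent = SLAVES * MAP_SLOTS_PER_NODE   # 21 simultaneous map tasks
    return math.ceil(num_input_files / concurrent)

for n in range(7, 57, 7):
    print(n, "files ->", map_waves(n), "wave(s) of maps")
```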

For the multi-cloud case, outstanding Map tasks can be placed on either of the two clouds, depending on when free Map slots become available on them; hence its performance lies between that of the single-site clouds. So, we can conclude that if the workload is purely computational, and if the time required to write the input data set to HDFS is insignificant compared to the total execution time of the application, a multi-site cloud can be leveraged for excess capacity even when there is poor network connectivity between and inside the clouds.

Fig. 7. NCBI BLAST Application Execution Times (single-site UNC Cloud, single-site RENCI Cloud and Multi-site Cloud, High BW network): execution time vs. number of input files.

V. RELATED WORK

There has been considerable work on provisioning and running Hadoop/MapReduce computations on public clouds like Amazon EC2. White [12], a prominent Apache Hadoop developer, describes his work on running Hadoop MapReduce on Amazon EC2. Amazon has released the Amazon Elastic MapReduce [13] service, which allows users to request Hadoop clusters of various sizes to run MapReduce computations on Amazon EC2. Several researchers [14], [15], [16], [17] have worked on Hadoop provisioning for optimizing MapReduce performance in the cloud. All of the above research has been done in the context of clouds belonging to a single domain. Hadoop On Demand (HOD) [18], another Apache initiative, is used to provision private MapReduce clusters over a large physical cluster using batch queue systems like Torque. HOD lets users share a common filesystem (running on all nodes) while owning private MapReduce clusters on their allocated nodes. Other researchers [19], [20], [21] have extended and improved provisioning of Hadoop clusters on traditional HPC resources. Borthakur [22] describes the architecture of HDFS, and several researchers [23], [24] have investigated the performance of HDFS to determine its scalability and limitations. There are several pieces of existing work [25], [26], [27] on performance evaluation of I/O virtualization bottlenecks for clouds and on evaluating MapReduce performance on virtualized infrastructure, which support our findings in section IV. There is also considerable work on addressing resource contention

and improving I/O performance [28], [29] on multi-core processors.

Most recent cloud infrastructure-as-a-service provisioning systems have focused on virtualized compute and storage systems within a datacenter or interconnected via the public Internet. There is also some recent work [30], [31] on inter-datacenter network traffic characterization. However, the network in these settings is either the best-effort Internet or a shared private network. To ensure deterministic and predictable network performance, using on-demand dynamic circuits (virtual networks) to connect specific substrate types, such as Grids, has attracted a good amount of effort in the past decade [2], [3], [4]. Unfortunately, dynamic inter-domain connection provisioning remains a technical challenge, given the heterogeneous multi-layer nature and lack of programmability of providers' networks. Recent work around the NSF GENI effort [32] has promoted the ideas of resource federation and programmable networks to support on-demand isolated virtual network slices. However, the effort to explicitly make the network an allocatable resource, alongside edge compute and storage resources, has only just started. Virtual topology embedding is very important for network science experiments, yet only a few preliminary works exist for a multi-provider, multi-cloud environment [6]. As far as we know, there is no existing "application-aware" system that can provision Hadoop or Condor clusters in a multi-cloud environment. This is mainly due to their sensitivity to network performance, which cannot be guaranteed by existing systems. As we showed in this paper, emerging provisioning systems capable of allocating QoS-guaranteed high-speed virtual network connections along with cloud resources will likely change this status quo.

VI. CONCLUSION

We have described our approach to provisioning multi-cloud Hadoop clusters on networked cloud platforms, which allows Hadoop-based distributed applications to run on several clouds as part of a single run. We implemented the algorithms using an existing control framework that provides the capability to lease and acquire heterogeneous resources from multiple, independent cloud and network resource providers. We have described our experimentation with multi-cloud Hadoop clusters using representative benchmarks and applications, and have presented a thorough evaluation of the performance of these benchmarks on multi-cloud Hadoop clusters connected via network links with varying bandwidth. We have shown that multi-cloud Hadoop deployment is feasible when the inter- and intra-cloud bandwidths are high. There is substantial performance degradation at low bandwidths because of the poor performance of the shuffle and reduce steps. We have also shown that HDFS performance is sensitive to network bandwidth characteristics and that Hadoop's topology awareness is useful to mitigate bandwidth differences. Finally, we have noted that multi-core resource contention plays a major role in determining the performance of Hadoop applications when the underlying cloud platform is built from multi-core machines.

ACKNOWLEDGMENT

This work is supported by the Department of Energy award #: DE-FG02-10ER26016/DE-SC0005286, the National Science Foundation award #: OCI-1032573, and the National Science Foundation GENI Initiative.

REFERENCES

[1] M. A. Vouk, "Cloud computing - issues, research and implementations," Journal of Computing and Information Technology, vol. 16, no. 4, Dec. 2008.
[2] Y. Wu, M. C. Tugurlan, and G. Allen, "Advance reservations: a theoretical and practical comparison of GUR & HARC," in MG '08: Proceedings of the 15th ACM Mardi Gras Conference. New York, NY, USA: ACM, 2008, pp. 1-1.
[3] G. Zervas, "Phosphorus Grid-Enabled GMPLS Control Plane (G2MPLS): Architectures, Services, and Interfaces," in IEEE Communications Magazine, Aug. 2008.
[4] S. Thorpe, L. Battestilli, G. Karmous-Edwards, A. Hutanu, J. MacLaren, J. Mambretti, J. Moore, S. Sundar, Y. Xin, A. Takefusa, M. Hayashi, A. Hirano, S. Okamoto, T. Kudoh, T. Miyamoto, Y. Tsukishima, T. Otani, H. Nakada, H. Tanaka, A. Taniguchi, Y. Sameshima, and M. Jinno, "G-lambda and EnLIGHTened: wrapped in middleware co-allocating compute and network resources across Japan and the US," in GridNets '07: Proceedings of the First International Conference on Networks for Grid Applications, 2007.
[5] I. Baldine, Y. Xin, A. Mandal, C. Heermann, J. Chase, V. Marupadi, A. Yumerefendi, and D. Irwin, "Autonomic Cloud Network Orchestration: A GENI Perspective," in 2nd International Workshop on Management of Emerging Networks and Services (IEEE MENS '10), in conjunction with GLOBECOM '10, Dec. 2010.
[6] Y. Xin, I. Baldine, A. Mandal, C. Heermann, J. Chase, and A. Yumerefendi, "Embedding virtual topologies in networked clouds," in 6th ACM International Conference on Future Internet Technologies, ser. ACM CFI '11, June 2011.
[7] "GENI-ORCA Control Framework," http://geni-orca.renci.org.
[8] J. Chase, L. Grit, D. Irwin, V. Marupadi, P. Shivam, and A. Yumerefendi, "Beyond virtual data centers: Toward an open resource control architecture," in Selected Papers from the International Conference on the Virtual Computing Initiative (ACM Digital Library), May 2007.
[9] D. Irwin, J. S. Chase, L. Grit, A. Yumerefendi, D. Becker, and K. G. Yocum, "Sharing Networked Resources with Brokered Leases," in Proceedings of the USENIX Technical Conference, June 2006.
[10] "Eucalyptus systems," http://www.eucalyptus.com/.
[11] Amazon.com, Inc., "Amazon Elastic Compute Cloud (Amazon EC2)," http://www.amazon.com/ec2.
[12] T. White, "Running Hadoop MapReduce on Amazon EC2 and Amazon S3," 2007. [Online]. Available: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112
[13] "Amazon Elastic MapReduce," http://aws.amazon.com/documentation/elasticmapreduce/.
[14] K. Kambatla, A. Pathak, and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud," in Workshop on Hot Topics in Cloud Computing (HotCloud '09), 2009.
[15] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR '11, 2011, pp. 261-272.
[16] F. Tian and K. Chen, "Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds," in IEEE Conference on Cloud Computing, 2011.
[17] A. Verma, L. Cherkasova, and R. H. Campbell, "SLO-Driven Right-Sizing and Resource Provisioning of MapReduce Jobs," HP Laboratories, Tech. Rep. HPL-2011-126, 2011.
[18] "Hadoop On Demand Documentation," http://hadoop.apache.org/core/docs/r0.17.2/hod.html.
[19] S. Krishnan, M. Tatineni, and C. Baru, "myHadoop - Hadoop-on-Demand on Traditional HPC Resources," San Diego Supercomputer Center (SDSC), Tech. Rep. SDSC-TR-2011-2, 2011.
[20] C. Zhang and H. De Sterck, "CloudBATCH: A batch job queuing system on clouds with Hadoop and HBase," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, Nov. 30 - Dec. 3, 2010, pp. 368-375.
[21] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," University of California at Berkeley, Tech. Rep. UCB/EECS-2009-55, 2009.
[22] D. Borthakur, "The Hadoop distributed file system: Architecture and design," 2007. [Online]. Available: http://hadoop.apache.org/
[23] K. V. Shvachko, "HDFS Scalability: The Limits to Growth," ;login: The Magazine of USENIX, vol. 35, no. 2, April 2010.
[24] J. Shafer, S. Rixner, and A. Cox, "The Hadoop distributed filesystem: Balancing portability and performance," in Performance Analysis of Systems and Software (ISPASS), 2010 IEEE International Symposium on, March 2010, pp. 122-133.
[25] M. Rehman and M. Sakr, "Initial findings for provisioning variation in cloud computing," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, Nov. 30 - Dec. 3, 2010, pp. 473-479.
[26] S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, and X. Shi, "Evaluating MapReduce on virtual machines: The Hadoop case," in Proceedings of the 1st International Conference on Cloud Computing, ser. CloudCom '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 519-528.
[27] J. Shafer, "I/O virtualization bottlenecks in cloud computing today," in Proceedings of the 2nd Conference on I/O Virtualization, ser. WIOV '10. Berkeley, CA, USA: USENIX Association, 2010, pp. 5-5. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863181.1863186
[28] G. Liao, D. Guo, L. Bhuyan, and S. R. King, "Software techniques to improve virtualized I/O performance on multi-core systems," in Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ser. ANCS '08. New York, NY, USA: ACM, 2008, pp. 161-170. [Online]. Available: http://doi.acm.org/10.1145/1477942.1477971
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "Addressing shared resource contention in multicore processors via scheduling," in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '10. New York, NY, USA: ACM, 2010, pp. 129-142. [Online]. Available: http://doi.acm.org/10.1145/1736020.1736036
[30] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, "Inter-datacenter bulk transfers with NetStitcher," in SIGCOMM, 2011, pp. 74-85.
[31] Y. Chen, S. Jain, V. K. Adhikari, Z.-L. Zhang, and K. Xu, "A first look at inter-data center traffic characteristics via Yahoo! datasets," in INFOCOM, 2011, pp. 1620-1628.
[32] GENI: Global Environment for Network Innovations, 2007, http://www.geni.net/.