2014 IEEE International Conference on Cloud Computing
Data Farming on Heterogeneous Clouds

Dariusz Król, Renata Słota, Jacek Kitowski
AGH University of Science and Technology, Department of Computer Science and Academic Computer Centre Cyfronet AGH, Krakow, Poland
Email: [email protected]

Łukasz Dutka, Jakub Liput
AGH University of Science and Technology, Academic Computer Centre Cyfronet AGH, Krakow, Poland
Abstract—Using multiple Clouds as a single environment to conduct simulation-based virtual experiments at a large scale is a challenging problem. This paper describes how this can be achieved with the Scalarm platform in the context of data farming. In particular, a use case with a private Cloud combined with public, commercial Clouds is studied. We discuss the current architecture and implementation of Scalarm in terms of supporting different infrastructures, and propose how it can be extended in order to unify the usage of different Clouds. We discuss different aspects of unifying Cloud usage, including: scheduling virtual machines, authentication, and virtual machine state monitoring. An experimental evaluation of the presented solution is conducted with a genetic algorithm solving the well-known Traveling Salesman Problem. The evaluation uses three different resource configurations: using only a public Cloud, using only a private Cloud, and using both public and private Clouds.
I. INTRODUCTION

Cloud computing is currently near the bottom of the 'Trough of Disillusionment' phase of the hype cycle, according to a recent Gartner report [1]. This position reflects the overly inflated expectations of early Cloud adopters regarding the business value that Cloud computing should have provided. This situation is an expected part of Gartner's adoption process for emerging technologies and it indicates the increasing maturity of Cloud computing. The maturity can also be noticed in the growing number of applications of Cloud computing in scientific studies, which were traditionally associated with High Performance Computing (HPC) infrastructures, e.g. Grids or clusters. Studying complex processes, as a part of scientific research, often involves executing numerical simulations many times with different input parameters, and exploring results to develop a better understanding of the studied processes [2]. New methodologies have been proposed in recent years to facilitate this common form of scientific research. Data farming [3] is an example of such a methodology, which combines data exploration and analysis methods with efficient exploitation of modern computational infrastructures. Its main goal is to grow data in virtual experiments, which is then used to gain knowledge about the studied processes in a systematic manner.

Scientific research which follows the data farming methodology is often referred to as a data farming experiment. Although data farming can increase the efficiency of the research, it often imposes a need to use a substantial amount of computational and storage resources. This is especially true when dealing with a large parameter space, e.g. thousands of possible simulation inputs, each of which takes minutes or hours to execute. Another requirement of data farming concerns software to coordinate experiments. A scientist often has only a numerical simulation, which takes input parameters and produces output, but does not have, and does not want to develop, software for coordinating simulation scheduling and data collection. Hence, in many real-life situations, scientists manually run simulations and collect output data, which can take a lot of precious time. The requirement of using dedicated software, which coordinates data farming experiments, is closely related to the requirement of having access to the necessary computational and storage infrastructure, because the software has to be compatible with the infrastructure in terms of being able to schedule simulations onto it and retrieve results from it. Commonly, a scientist has access to a small infrastructure locally, e.g. an institutional cluster exposed as a private Cloud, which is sufficient for small-scale data farming experiments. However, to conduct large-scale experiments, the scientist needs to use another pool of resources. More and more often, a public Cloud, e.g. Amazon Elastic Compute Cloud (EC2) [4], is used as this other, virtually unlimited pool of resources. Hence, it is desirable to use in scientific experiments either a single Cloud or a combination of resources from multiple Clouds (both private and public) as a single environment. In addition, as the offer of available public Clouds is still expanding, the software coordinating data farming experiments should be easily extensible to embrace different Clouds in the future with minimal modifications required.

This paper describes how data farming experiments can be conducted with heterogeneous Cloud-based infrastructures. We argue that Cloud computing offers elasticity at the infrastructure level, required to conduct such experiments at different scales, especially when avoiding an upfront investment in infrastructure is an important factor. We analyze existing software for data farming and present our own solution, which extends an available platform to provide uniform access to different Clouds with minimal effort required.
The rest of the paper is organized as follows: Section II presents a survey of systems supporting heterogeneous Cloud-based infrastructures for coordinating data farming experiments. In Section III we describe the Scalarm platform and our extension of the platform in the context of supporting heterogeneous Clouds. Section IV evaluates the extension in various test cases. Section V concludes the paper and outlines future work.
II. BACKGROUND SURVEY
Although data farming is becoming quite a widespread methodology, software which fully supports it remains limited in scope. One of the first and most popular examples of such software is OldMcData - the Data Farmer (OMD) [5]. It is a small-scale platform supporting parameter space generation with Design of Experiment (DoE) algorithms and simulation execution. Currently, OMD supports the following DoE algorithms: full factorial, Cartesian product, values specified in a CSV file, and evolutionary programming. Moreover, several parameters can be grouped to assume the same values for selected simulations. OMD schedules simulations to run either on a standalone computer or on distributed computational resources using the Condor software [6], whose goal is to provide mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributed resources. The user of Condor (OMD in our case) submits jobs to an agent which is responsible for finding suitable resources. Once a job is completed, its output can be moved from the resource to a designated point. Unfortunately, no data analysis methods are provided by OMD. Moreover, simulation execution is a batch-like process, i.e. a complete parameter space is submitted to the scheduler as a single package. There is no information about partial results and the user cannot modify the parameter space after the submission. Although Condor can be integrated with heterogeneous infrastructures, it lacks self-scaling features, which means that the infrastructure used to run the simulations has to be set up beforehand and cannot change at runtime.

Hardly any system is comprehensive enough to support all phases of a data farming experiment, hence in our survey we include systems which facilitate different phases of data farming. An important part of each data farming experiment is simulation execution, which involves scheduling computations onto different resources along with results aggregation.

In [7], the author analyzes a distributed Cloud infrastructure for executing clusters of jobs with queueing theory. The infrastructure is modeled as an M/G/1/K queueing system, with Poisson arrival rate α, generally distributed service times, and a system capacity K. The author provides an analytical model to determine, at a given instant of time and under given workload conditions, the minimal number of Cloud resources needed to satisfy a defined response time. The model is validated with experimental measurements for Cloud servers reported in the literature. However, the assumed architecture of the Cloud infrastructure is overly simplified in our opinion, as it includes only one central job scheduler and assumes that worker Virtual Machines (VMs) are connected to a distributed storage system (in practice this assumption limits worker VMs to a single region of a single Cloud).

Executing a large number of independent simulations can be treated as a form of task farming, also referred to as a bag of tasks, which is a popular approach to enabling parameter sweeps over large parameter spaces while using distributed infrastructures such as Grids and Clouds. Systems supporting this approach often follow the master-worker pattern, where the master part is responsible for coordinating and scheduling computations at the worker part (a minimal sketch of this pattern is given at the end of this section). In [8], the authors describe how task farming can be applied to run scientific workflows in a "Workflow as a Service" way. Moreover, the proposed system intends to support running workers in multi-cloud infrastructures, while preserving communication capabilities between workers (representing subsequent steps of workflows) with centralized message queues. The system can estimate the execution time of different workflow phases by monitoring activities within the queues. This information can be used to provide autoscaling capabilities. The system was evaluated by running 30 workers within the context of a biomedical study regarding sensitivity analysis of blood pressure wave propagation in arteries. The system provides an environment for workflow-based task farming studies, though its scalability seems to be limited due to using a central server for the master part with message queues as a communication hub between workers. The evaluation also suggests that the system prefers coarse-grained partitioning, i.e. running many workflows by a small number of workers, rather than a fine-grained one.

Although task farming can be conducted with any computational environment, using Cloud-based environments is especially interesting due to the availability of various types of resources at a relatively small cost, and it has been a popular research topic recently. Contrail is an EU-funded project which intends to design, implement, evaluate and promote an open-source system for Cloud Federations [9]. As a part of the Contrail project, the ConPaaS software stack is developed [10], which addresses the problem of porting existing applications to the Cloud. The project aims at supporting familiar programming models so that existing applications can be easily migrated to the Cloud. To achieve this goal, ConPaaS provides services which act as replacements for commonly-used runtime environments, e.g. MySQL databases or PHP runtimes.

Another product of ConPaaS is a budget-constrained scheduler for task farming called BaTS [11]. The main objective of BaTS is to minimize the cost of running a bag of tasks using Cloud resources by allocating the most efficient resources for a given task. Each task represents a single execution of a simulation. BaTS estimates the required budget for a given bag of tasks by evaluating the efficiency of executing tasks using different resource types. Experiments conducted in BaTS are divided into two phases: the sampling phase, which estimates the execution budget, and the execution phase, which involves execution of the remaining tasks. During this phase, online monitoring is used to refine the initial execution plan. Unfortunately, BaTS does not provide other functionality related to data farming, e.g. parameter space generation or data exploration. Also, the scalability of the system seems to be limited due to a centralized scheduler component.

There are a few software solutions available which enable the use of multiple Cloud services with a unified API. One of them is the Apache jclouds project [21], which provides a Java library with Java and Clojure APIs. Another solution is the Apache Deltacloud project [22], which consists of a server with a REST API and additional client libraries for a few programming languages (officially supported are Ruby and C, as of 05/01/2014). Both solutions eliminate the need to become familiar with various Cloud interfaces, but they impose the integration of new technical solutions – using Java classes in the case of jclouds, and running an additional REST server in the case of Deltacloud.
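To make the master-worker pattern referenced above concrete, the following is a minimal, self-contained Ruby sketch of a bag-of-tasks master with several workers. It is illustrative only and does not reproduce the code of any of the surveyed systems; run_simulation is a stand-in for an actual simulation.

```ruby
require 'thread'

def run_simulation(input)
  input * input # stand-in for an expensive numerical simulation
end

task_queue = Queue.new   # the "bag": a thread-safe queue of simulation inputs
results    = Queue.new

# Master: publish the whole parameter space as independent tasks.
(1..100).each { |input| task_queue << input }

# Workers: repeatedly pull a task, execute it and report the result.
workers = Array.new(4) do
  Thread.new do
    loop do
      input = begin
                task_queue.pop(true) # non-blocking pop
              rescue ThreadError
                break                # the bag is empty - this worker finishes
              end
      results << [input, run_simulation(input)]
    end
  end
end
workers.each(&:join)
puts "collected #{results.size} results"
```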
III. THE SCALARM PLATFORM
Data farming requires dedicated tools and computational resources in order to be applied effectively. As described in the previous section, existing software provides only partial support for conducting data farming experiments on heterogeneous Clouds. In previous work [12], the authors developed a platform called Scalarm, whose main goal is to support all data farming experiment phases: from experiment design, through simulation execution on heterogeneous computational infrastructures, to statistical analysis of results.
Scalarm was started as a module for data farming within the EDA EUSAS project [13], [14]. Within the project, Scalarm was used to enhance the training process of security forces through evaluating strategies used during missions. A sample simulation scenario involved controlling the access of civilians to a military base camp during elections in a mission abroad. In this scenario, civilians were waiting in front of a camp entrance with an intention to start a skirmish. The main objective of the scenario was to prevent the escalation of aggression through effective negotiations with civilian leaders. However, civilians might act differently depending on their personality traits, represented by simulation input parameter values, hence actions performed by security forces should be adjusted to the dynamically observed behaviour. The goal of a data farming experiment which used this simulation scenario was to find out how to minimize the number of injuries in such a scenario, regardless of the civilians' behaviour.
Although the first version of Scalarm supported different computational infrastructures, including Amazon EC2, we found that each supported infrastructure was handled by separate, custom-implemented code. Thus, adding support for a new type of infrastructure, e.g. another Cloud, imposed multiple modifications in the code base and required time-consuming manual tests.
Due to its comprehensive support for data farming and its service-oriented design, Scalarm was selected as a foundation for the work presented in this paper. In the following subsections, we describe the Scalarm model for supporting computational infrastructures, our extension of Scalarm to support heterogeneous Clouds, and its usage in conducting data farming experiments.

A. Scalarm model supporting computational infrastructures

The current model of Scalarm with regard to infrastructure support is depicted in Figure 1. Each infrastructure, e.g. a concrete Grid or Cloud, is represented in the platform by a separate facade class, through which the platform interacts with the infrastructure in the context of the following activities:
• scheduling workers that execute simulations – Simulation Managers, which provide an abstraction of computations,
• monitoring the scheduled Simulation Managers and reacting to undesirable events, e.g. hardware failures, to provide fault tolerance,
• credentials and access rights management, to enable the previous two activities on the user's behalf.

Fig. 1. The current model of infrastructure support in Scalarm.

In this model, computations executed on different infrastructures are represented by separate, infrastructure-dependent entities, e.g. in the case of Amazon EC2, information about a VM is stored. This information is used only by concrete infrastructure facades, which implies that the rest of Scalarm knows nothing about the executed computations. Moreover, this approach requires that each supported infrastructure handles this information from scratch, i.e. no code reuse is possible. In order to provide support for another infrastructure with this model, e.g. a new Cloud, one has to implement:
• a facade class, which enables interaction with the infrastructure with regard to computation scheduling and monitoring in an infrastructure-dependent way,
• a class representing Simulation Managers scheduled onto the infrastructure as a part of managing computations.

This approach is appropriate for representing fundamentally different types of infrastructures, i.e. ones which share no concepts regarding computations, like Grids and Clouds. Each infrastructure is treated separately and does not need to use already developed code. However, it can be inefficient when dealing with multiple similar instances of a concrete infrastructure type, e.g. different Clouds, which share basic concepts regarding computations, e.g. virtual machines. In such a case, the Scalarm model does not promote code reuse and requires substantial development effort, as illustrated by the sketch below. To support different Clouds, one has to provide a representation of each Cloud in a dedicated way. To mitigate this issue, we developed a replacement for this model, which promotes code reuse and minimizes the effort required to provide support for different Clouds. Although this new model was developed with Cloud-like infrastructures in mind, it is applicable to other types of infrastructures which share common concepts, e.g. different scheduling systems within Grid environments.
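The shape of the original model can be sketched as follows. Class and method names are taken from Figure 1; the bodies are illustrative stubs, not Scalarm's actual code.

```ruby
# Old model (cf. Fig. 1): one hand-written facade per infrastructure, each
# re-implementing scheduling, monitoring and credentials handling on its own.
class AmazonFacade
  def start_simulation_managers(user, instances_count, experiment)
    # EC2-specific VM scheduling logic, written from scratch
  end

  def start_monitoring
    # EC2-specific monitoring loop, duplicated in every facade
  end

  def get_running_simulation_managers(user, experiment); end
  def add_credentials(user, params, session); end
end

class PLGridFacade
  def start_simulation_managers(user, instances_count, experiment)
    # largely parallel logic, re-implemented for PL-Grid
  end

  def start_monitoring; end
  def get_running_simulation_managers(user, experiment); end
  def add_credentials(user, params, session); end
  def create_scheduler_facade(type); end
end
```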
B. Improved Scalarm infrastructure access model

Based on an in-depth analysis of the Scalarm implementation, we replaced its internal infrastructure access model to make it easily extensible with regard to supporting other Clouds. An overview of this new infrastructure access model, in the form of a UML diagram, is depicted in Figure 2. There are two important changes: a uniform facade for different Clouds, and the separation of concerns regarding the computations abstraction and the actual infrastructure access. The InfrastructureFacade class was modified to implement operations common to all infrastructures. For example, the computations monitoring capability, instead of being delegated to a concrete infrastructure every time, is now a template method, which provides a uniform algorithm from the Scalarm point of view but delegates infrastructure-dependent operations, e.g. scheduling Simulation Managers, to concrete infrastructures.

The abstraction of computations has been extracted to the new SimulationManager entity, which now provides a uniform interface to functionality concerning computations management and monitoring. It encompasses algorithms utilized by Scalarm to manage computations, e.g. status monitoring, but delegates infrastructure-dependent operations to concrete infrastructures, e.g. VM state checking. As in the former infrastructure access model, information about Simulation Managers is persisted in various MongoActiveRecord classes, which now implement the SimulationManagerRecord interface. A sketch of the template-method idea is shown below.

Fig. 2. An overview of the improved model of infrastructure access in Scalarm.
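The following minimal sketch illustrates the template-method design just described. The hook names (monitored_records, status_of, reinitialize) and the polling interval are our own illustration, not necessarily Scalarm's exact identifiers.

```ruby
# Improved model: InfrastructureFacade owns the uniform monitoring algorithm
# (a template method) and delegates infrastructure-dependent steps to
# concrete facades.
class InfrastructureFacade
  # Uniform algorithm, identical for every infrastructure.
  def start_monitoring
    loop do
      monitored_records.each do |record|
        SimulationManager.new(record, self).monitor
      end
      sleep 30 # polling interval - an assumption made for this sketch
    end
  end

  # Infrastructure-dependent hooks supplied by concrete facades:
  def monitored_records;    raise NotImplementedError; end
  def status_of(record);    raise NotImplementedError; end
  def reinitialize(record); raise NotImplementedError; end
  def schedule_simulation_managers(user, count); raise NotImplementedError; end
end

# Uniform computations abstraction; VM state checking stays in the facade.
class SimulationManager
  def initialize(record, facade)
    @record, @facade = record, facade
  end

  def monitor
    @facade.reinitialize(@record) if @facade.status_of(@record) == :error
  end
end
```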
As an example of Cloud-like infrastructures, the first version of Scalarm supported only Amazon EC2 – a commercial, public Cloud. Following the introduction of the new infrastructure access model, we provided support for two other Clouds: PL-Grid Cloud – a national, noncommercial Cloud for scientists, and Google Compute Engine [23] – an Infrastructure as a Service (IaaS) solution provided by Google.

PL-Grid Cloud is a service running under the PLGrid PLUS Project [16] on top of the OpenNebula [17] software. It provides an environment for Polish scientists to execute their codes in a fully customizable environment, e.g. with libraries or applications installed that are not available in the PL-Grid infrastructure, which constitutes a national Grid. Administrative tasks, e.g. managing images, templates and instances of VMs, can be performed with either a command-line client or a REST-based programming interface. Scheduled VM instances can be accessed via SSH within the PL-Grid VPN or with a provided mechanism for redirecting ports from private IP addresses of VMs to public ones.

Google Compute Engine is an example of a commercial, public Cloud built on top of the Google infrastructure. Administrative tasks can be done with the Google Cloud Console (a web application), gcutil (command-line tools), or a RESTful API implemented by libraries in various programming languages. Scheduled VM instances can be accessed via SSH.

C. Heterogeneous Clouds support with the new Scalarm model

An overview of the Cloud support in Scalarm, specifically of the IaaS type, is depicted in Figure 3. Scalarm consists of loosely coupled services, each of which provides well-defined capabilities. Experiment Manager has a Cloud-supporting module, which implements an interface for managing VMs in a Cloud-specific manner, e.g. using a dedicated API in the case of Amazon EC2 or a REST-based API in the case of PL-Grid Cloud. Experiment Manager uses the module to manage Simulation Managers on different infrastructures. In the case of Clouds, Experiment Manager injects a Simulation Manager with the location of Information Service after scheduling a VM. Then, the Simulation Manager communicates with Information Service to retrieve a list of all registered Experiment and Storage Managers; thereby it can retrieve input parameters for subsequent simulations, and can send simulation results and binary output back to Scalarm when a simulation is completed. Experiment Manager oversees VMs and can remove them when the experiment comes to an end.

A basic requirement here is that, to use each supported Cloud, a Scalarm user must have valid credentials for the Cloud and a prepared VM image with:
• the software needed to run the user's simulation installed – the user is independent from Cloud software and can fully define dependencies for the simulation execution environment; the actual simulation is retrieved by Simulation Manager dynamically,
• access via SSH enabled,
• the dependencies needed to launch Simulation Manager installed.

Fig. 3. An overview of Cloud support in Scalarm. SM stands for "Simulation Manager".
Based on our experience with different Cloud solutions and the observed similarities in accessing different public Clouds, we unified the Cloud-handling part of Scalarm in order to facilitate embracing new Clouds in the future. The existing software solutions described briefly in Section II, which provide a unified Cloud API, impose the integration of extraneous technical solutions, e.g. the use of an additional REST server. In this paper, we propose a more lightweight and flexible solution, which is based on the integration of dedicated APIs to different Clouds and is facilitated by the refactored Scalarm infrastructure access model. Previously, adding support for an extra Cloud in Scalarm required the implementation of a separate facade class, which could cause code redundancy. The proposed model replacement uses polymorphism to promote code reuse and to enable the separation of Scalarm's logic regarding computations management from infrastructure interaction, i.e. Scalarm internally uses a uniform interface to every Cloud, and adding support for a new Cloud requires implementing only the Cloud-dependent operations utilized by the interface. The general diagram of our Cloud unification model is presented in Figure 4. Each class supporting a computational infrastructure, e.g. Grid, Cloud, private clusters, inherits from InfrastructureFacade just as before. However, instead of creating a separate facade for each new Cloud, we created CloudFacade, which uses the Dependency Injection design pattern to manage different Clouds, i.e. each facade to an actual Cloud has a dedicated CloudClient instance, which implements Cloud-dependent operations. CloudFacadeFactory instantiates CloudFacades at Scalarm startup for each supported Cloud by combining a generic CloudFacade instance with a specific CloudClient instance (see the sketch after the list below). The CloudFacade class provides by itself the following functionality common to all Clouds:
• preparation of VMs to run Simulation Manager, which means VM creation, Simulation Manager injection and other necessary actions leading to providing access to the VM for the Scalarm Experiment Manager,
• VM instance monitoring – each Cloud-specific facade asynchronously monitors the state of scheduled VMs; if necessary, e.g. due to the stop of a data farming experiment, it shuts down, restarts or deletes a VM instance in order to release unnecessary resources and consequently reduce the Cloud usage cost,
• management of user credentials required to access the Cloud infrastructure and schedule VMs,
• provisioning information about scheduled VM life-cycle states, to display in the Scalarm GUI.
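The sketch below shows how such a combination could look in Ruby. The registry content and the stub client are our own illustration of the dependency-injection idea, under the class names used above; it is not Scalarm's verbatim implementation.

```ruby
# One generic CloudFacade combined at startup with a Cloud-specific
# CloudClient. (In Scalarm, CloudFacade inherits from InfrastructureFacade.)
class CloudFacade
  def initialize(client)
    @client = client # injected Cloud-dependent part
  end

  # Common functionality implemented once, e.g. releasing resources when a
  # data farming experiment is stopped:
  def stop_experiment(vm_ids)
    vm_ids.each { |id| @client.terminate(id) if @client.exists(id) }
  end
end

# A Cloud-specific client implements only Cloud-dependent operations.
class DummyCloudClient
  def exists(vm_id); true; end
  def terminate(vm_id); puts "terminating #{vm_id}"; end
end

class CloudFacadeFactory
  # A real registry would map 'amazon', 'pl_grid' and 'google' to their
  # CloudClient classes; a stub keeps the sketch runnable.
  CLIENTS = { 'dummy' => DummyCloudClient }

  # Called once at Scalarm startup for each supported Cloud.
  def self.facades
    CLIENTS.each_with_object({}) do |(name, klass), acc|
      acc[name] = CloudFacade.new(klass.new)
    end
  end
end

CloudFacadeFactory.facades['dummy'].stop_experiment(%w[vm-1 vm-2])
```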
Fig. 4. An overview of the Clouds support model in Scalarm.

Cloud-dependent operations are delegated to concrete CloudClient classes. For example, each Cloud checks the life-cycle state of a VM in a different way. In addition, the human-readable descriptions of possible state values differ among Clouds, e.g. Amazon EC2 and Google Compute Engine VM instances have 6 distinct states, while PL-Grid Cloud instances have 11 possible states. Hence, each CloudClient transforms the Cloud-specific VM state to the Scalarm-specific one, as sketched below.

To provide support for a new Cloud, one needs to use the specific Cloud-dependent API (e.g. aws-sdk [15] for Amazon EC2) to create a new client class implementing the CloudClient interface. In the case of providing a unified VM state, the VM states provided by the Cloud-dependent API should be mapped to the few generic ones supported by CloudFacade.
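As an example, Amazon EC2's six instance states can be collapsed onto Scalarm's four generic states as follows. The exact mapping below is our assumption for the sketch, not Scalarm's verbatim translation table.

```ruby
# Translating Cloud-specific VM states to Scalarm's generic ones.
module AmazonCloud
  STATE_MAP = {
    'pending'       => :initializing,
    'running'       => :running,
    'shutting-down' => :deactivating,
    'stopping'      => :deactivating,
    'stopped'       => :deactivating,
    'terminated'    => :deactivating
  }
  STATE_MAP.default = :error # anything unrecognized is reported as an error

  def self.scalarm_state(ec2_state)
    STATE_MAP[ec2_state]
  end
end

AmazonCloud.scalarm_state('pending') # => :initializing
```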
Another aspect of Cloud access unification is credentials management. CloudClient uses a CloudSecrets entity to store Cloud credentials in a Cloud-dependent way. The methods that need to be implemented in CloudClient in the context of a concrete Cloud user are listed below (a skeletal implementation sketch follows the list):
• instance_types() – fetch the supported VM sizes (in terms of provided resources) which the user can instantiate, e.g. "f1-micro" on Google Compute Engine or "m3.medium" on Amazon EC2,
• all_images_info() – fetch the list of operating system images supported by the Cloud,
• all_vm_ids() – fetch the list of IDs of VM instances currently running for the Cloud user,
• state(vm_id) – read a VM instance state and convert it to one of: initializing, running, deactivating, error,
• exists(vm_id) – check if a VM instance with the given ID exists,
• terminate(vm_id) – terminate a VM instance,
• reinitialize(vm_id) – force a VM instance to restart and inject Simulation Manager once again,
• public_ssh_address(vm_id) – fetch the pair of public host and public port needed to set up an SSH connection.
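As an illustration, a skeletal CloudClient for Amazon EC2 could wrap the Ruby aws-sdk [15] as follows. Only part of the interface is shown, the calls reflect the v1 style of that SDK, and the secrets accessors are assumptions; Scalarm's actual code may differ in details.

```ruby
require 'aws-sdk' # Ruby SDK for AWS [15] (v1-style API assumed here)

module AmazonCloud
  # Skeletal CloudClient sketch implementing part of the interface above.
  class CloudClient
    def initialize(secrets)
      @ec2 = AWS::EC2.new(access_key_id:     secrets.access_key,
                          secret_access_key: secrets.secret_key)
    end

    def all_vm_ids
      @ec2.instances.map(&:id) # IDs of instances visible to this user
    end

    def exists(vm_id)
      @ec2.instances[vm_id].exists?
    end

    def terminate(vm_id)
      @ec2.instances[vm_id].terminate
    end

    def public_ssh_address(vm_id)
      # EC2 exposes SSH on the standard port of the instance's public name.
      { host: @ec2.instances[vm_id].public_dns_name, port: 22 }
    end
  end
end
```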
Besides the API-unification wrapper classes, CloudFacade persists various information about Cloud resources:
• CloudSecrets – stores Cloud credentials in a Cloud-dependent way, e.g. login and password in the case of PL-Grid Cloud,
• CloudImageSecrets – stores the data needed to access scheduled VMs (image ID, SSH login and password),
• CloudVmRecord – stores information about VM execution time constraints and Simulation Manager's initialization status.
All sensitive information, e.g. passwords and secret tokens, is encrypted by Scalarm. In addition, we prepared an automatic end-to-end test, which validates the operations implemented by a CloudClient in real-life scenarios. With such a test, each new Cloud support module can easily be checked for compatibility with Scalarm. The prerequisites of this test include providing credentials to the Cloud and information about the size of the VM instance which should be used during the test run. The test automatically uses the credentials to create a VM instance, establishes an SSH connection, validates the execution of essential operations and eventually terminates the machine, as sketched below.
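The flow of this test can be sketched as follows. schedule_vm is an assumed helper standing in for the VM-creation step (which in Scalarm belongs to the facade layer), and the harness itself is our illustration of the described procedure.

```ruby
require 'net/ssh'

# End-to-end test sketch: create a VM, wait until it runs, validate SSH
# access, and always terminate the instance afterwards.
def end_to_end_test(client, image_secrets, instance_type)
  vm_id = schedule_vm(client, image_secrets.image_id, instance_type)
  raise 'VM was not created' unless client.exists(vm_id)

  sleep 10 until client.state(vm_id) == :running

  addr = client.public_ssh_address(vm_id)
  Net::SSH.start(addr[:host], image_secrets.login,
                 password: image_secrets.password, port: addr[:port]) do |ssh|
    raise 'SSH validation failed' unless ssh.exec!('echo ok').strip == 'ok'
  end
ensure
  client.terminate(vm_id) if vm_id # release the instance in every case
end
```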
D. Exploiting Scalarm’s Elasticity in Data Farming
Our extended version of Scalarm enables users to conduct data farming experiments with resources gathered from multiple Clouds. This is a common use case: an organization utilizes a limited private Cloud to run small, in-house experiments, but wants to use a public Cloud to boost large-scale experiments. An important element in such a use case is an appropriate level of platform elasticity.
In comparison with other, batch-like processing tools, Scalarm enables users to dynamically adjust the amount of resources dedicated to running an experiment. The adjustment can be performed with the GUI or the API at any moment of the experiment's execution. The user configures access to different Clouds only once, as described in the previous subsection, and then chooses the number of Simulation Manager instances that should be started in the Cloud.
This functionality can be used in a number of different cases:
• to boost a data farming experiment with additional computational resources, e.g. due to deadlines,
• to migrate computations from one Cloud to another, e.g. to release resources from a private Cloud and use resources from a public Cloud,
• to pause or slow down an experiment, e.g. to temporarily release private Cloud resources.
IV. EXPERIMENTAL EVALUATION

To evaluate the implemented Scalarm extension, we conducted two types of tests: providing support for a new Cloud, and conducting data farming experiments with different resource configurations. The former test intends to evaluate the necessary cost, in terms of development effort, of providing support for a Cloud infrastructure within Scalarm. The latter test demonstrates Scalarm functionality in the context of boosting computations by using heterogeneous Clouds.

A. Providing support for a new Cloud

This evaluation intends to measure and compare the development effort required to provide support for a new Cloud in the first version of Scalarm and in the version containing the new infrastructure access model. We started with adding PL-Grid Cloud support using the old infrastructure access model. This was a relatively long effort of becoming familiar with the Cloud's REST API and developing the facade code based on the previously developed Amazon EC2 support. Implementing it this way caused code redundancy, but it was also the starting point for recognizing which parts of the support implementations for different Clouds are similar. The whole process of becoming familiar with PL-Grid Cloud, implementing all the necessary code, and testing lasted slightly more than 30 person-hours.

In accordance with the description in Section III-C, adding support for a new Cloud with the replaced model is reduced to wrapping a Cloud API and providing additional GUI forms related to that service's authentication methods. Hence, the next step was to add support for another Cloud, namely Google Compute Engine, using the new infrastructure access model. It was done under the previous assumptions in the following stages:
1) familiarizing with the Google Compute Engine API, which allows using the Google REST API directly from our programming language of choice,
2) implementing a GoogleCloud::CloudClient class which implements the CloudClient interface,
3) creating graphical forms to support Google Compute Engine credentials management within Scalarm; other parts of the Scalarm GUI, like forms for scheduling computations or managing virtual machine images, are generated automatically,
4) end-to-end test configuration and execution, which proved the validity of the provided support for Google Compute Engine.

In summary, the total time spent on adding Google Compute Engine support with the new infrastructure access model was measured at about 15 person-hours. In addition, the correctness of the implementation was proven by real end-to-end tests in which VMs were instantiated and accessed via SSH. The whole process took less than half of the time needed to add PL-Grid Cloud support using the previous code base. Moreover, the task was easier, because there was no need to copy existing functionality, which previously caused many small but hard-to-detect errors.
B. Data farming experiments with heterogeneous Clouds

To evaluate Scalarm in the context of supporting heterogeneous Clouds, we conducted several data farming experiments with different resource configurations, as depicted in Table I.

TABLE I. RESOURCE CONFIGURATIONS FOR THE EXPERIMENTAL EVALUATION

Experiment no. | PL-Grid Cloud resources | Amazon EC2 resources | Google Compute Engine resources
1 | 20 | 0 | 0
2 | 0 | 20 (Small instances) | 0
3 | 10 | 10 (Small instances) | 0
4 | 50 | 0 | 0
5 | 0 | 50 (Medium instances) | 0
6 | 20 | 50 (Medium instances) | 23

The main goal of the evaluation was twofold: to prove Scalarm's capability to support heterogeneous Cloud-based infrastructures, and to demonstrate different use cases of using heterogeneous Clouds in a single data farming experiment. Before starting experiments in Scalarm, we manually prepared VM images for each utilized Cloud, with all the necessary dependencies to run our simulation of choice. Then, we used only Scalarm to manage resources within each conducted experiment.

The main criterion used to compare subsequent experiments was the execution time of all simulations from the generated parameter space. The time of scheduling VMs was negligible in comparison with the simulation execution time – in all cases the VM scheduling took about a dozen seconds.

1) The Problem Statement: As a problem to solve, we chose the Traveling Salesman Problem (TSP) with a genetic algorithm-based solution. TSP is a classical, NP-complete, graph-theory problem, which asks for the shortest possible route through a given list of cities such that each city is visited only once and the last visited city is also the starting point of the travel. In our case, we prepared a list of 500 random settlements (including cities, villages and scientific stations) from all over the world, based on the actual location data available at [18]. The travel cost between cities is specified by an approximation of the great-circle distance [19] on Earth (treating Earth as a ball), computed as sketched below.
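A sketch of such a cost function, under the stated treating-Earth-as-a-ball assumption (radius 6371 km), using the spherical law of cosines:

```ruby
include Math
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two points given in degrees (cf. [19]).
def great_circle_km(lat1, lon1, lat2, lon2)
  rlat1 = lat1 * PI / 180
  rlat2 = lat2 * PI / 180
  dlon  = (lon2 - lon1) * PI / 180
  # clamp guards against floating-point error for (nearly) identical points
  EARTH_RADIUS_KM * acos((sin(rlat1) * sin(rlat2) +
                          cos(rlat1) * cos(rlat2) * cos(dlon)).clamp(-1.0, 1.0))
end

great_circle_km(50.06, 19.94, 52.23, 21.01) # Krakow -> Warsaw, ~252 km
```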
To solve this problem, we selected an existing genetic algorithm provided by [20]. Thus, a simulation in our experiments is a single run of the genetic algorithm with the initial population size and the iteration count as input parameters (a usage sketch follows Table II). Simulation results include three attributes: the calculated summary distance between all cities, the execution time of the simulation, and the route in the form of a city list. Each conducted data farming experiment included 1 330 simulations, generated based on the parametrization from Table II. After starting an experiment, Simulation Managers were scheduled manually.

TABLE II. PARAMETRIZATION OF DATA FARMING EXPERIMENTS

Parameter | DoE method | Min value | Max value | Step
Initial population size | Full factorial | 100 | 800 | 20
Iteration count | Full factorial | 20 | 400 | 10
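A single simulation then reduces to a call into the ai4r library [20]. The snippet below follows the library's documented TSP interface, with a toy cost matrix standing in for the 500-city great-circle matrix used in the actual experiments.

```ruby
require 'ai4r'

# Tiny illustrative cost matrix (the real experiment uses a 500 x 500
# matrix of great-circle distances between the settlements).
costs = [[0, 10, 15], [10, 0, 20], [15, 20, 0]]
Ai4r::GeneticAlgorithm::Chromosome.set_cost_matrix(costs)

population_size = 100 # swept from 100 to 800 in steps of 20 (Table II)
iterations      = 20  # swept from 20 to 400 in steps of 10 (Table II)

search = Ai4r::GeneticAlgorithm::GeneticSearch.new(population_size, iterations)
best   = search.run
puts "route: #{best.data.inspect}"
puts "fitness: #{best.fitness}" # the library encodes fitness as negative cost
```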
2) Testing Environment: Resources for simulation execution were provided by different Clouds, namely: the PL-Grid Cloud based on OpenNebula, the public Amazon EC2, and a hybrid environment including both private and public Clouds in two variants: PL-Grid Cloud with Amazon EC2, and PL-Grid Cloud with Amazon EC2 and Google Compute Engine.

VMs within the utilized Clouds were instantiated from similar images in one size, with computational resources as follows:
• on PL-Grid Cloud – 1 vCPU (one virtual/logical core of an HT-enabled ~2.5 GHz 2008 Xeon processor) and 512 MB RAM,
• on Amazon EC2 – a Small instance (1 EC2 compute unit, approximately a 1.2 GHz 2007 Opteron or 2007 Xeon processor) with 1.7 GB RAM, and a 3rd generation Medium instance (3 EC2 compute units and 3.75 GB RAM),
• on Google Compute Engine – the "n1-standard-1" instance (1 vCPU compute unit, approximately 1 hyperthread of a 2.6 GHz Sandy Bridge CPU) with 3.75 GB RAM.

The master part of Scalarm was located on a private server. Subsequent data farming experiments differed from each other in the location of Simulation Managers, i.e. it was either a concrete Cloud or several Clouds together.

3) Results: Evaluation results regarding the experiment execution time, its cost and the calculated distance are collected in Table III. By analyzing the execution time and cost of each experiment, we can deduce when commercial Clouds are worth using. The cost of each experiment was calculated based on public Cloud pricing only, since we did not pay for using the private Cloud.

TABLE III. DATA FARMING EXPERIMENTS EXECUTION EVALUATION RESULTS

Experiment no. | Experiment execution time | Avg. simulation execution time | Shortest distance [km] | Cost [$]
1 | 18 [h] 33 [m] | 15 [m] 56 [s] | 2 666 008 | 0
2 | 38 [h] 11 [m] | 32 [m] 34 [s] | 2 674 448 | 46.8
3 | 23 [h] 16 [m] | 20 [m] 28 [s] | 2 665 448 | 14.4
4 | 6 [h] 57 [m] | 15 [m] 25 [s] | 2 677 661 | 0
5 | 9 [h] 01 [m] | 20 [m] 13 [s] | 2 657 141 | 31.5
6 | 3 [h] 44 [m] | 14 [m] 39 [s] | 2 684 101 | 20.44

Let us suppose that we have a small private Cloud whose capacity is limited to 50 VMs, but in most cases we can use only up to 20 VMs due to sharing the Cloud with other users. We can use only the private Cloud to conduct the described problem in 18 [h] 33 [min] (Experiment 1). Replacing the private Cloud with the same number of the cheapest VMs from a public Cloud is not a good choice, because it generates cost and the experiment takes much more time to execute (Experiment 2). Choosing a hybrid Cloud with 10 VMs taken from the private and public Clouds respectively (Experiment 3) generates less cost, but it still takes more time to execute in comparison with the private Cloud case.

In general, public Clouds provide much more elasticity in terms of VM resource configurations and overall capacity than private Clouds. Private Clouds are rather limited installations, most often shared between many users. In the case of saturating such a private Cloud, the only option to conduct an experiment with time constraints is to use public Clouds, which have much more capacity. Hence, we can schedule more VMs on a public Cloud, each with more resources (Experiment 5). Even though this incurs cost, we attained a decrease of the experiment execution time by half. It is worth noting that, even though we used instances with 3 ECUs for 1 vCPU, this was still slower than the corresponding number of VMs from our private Cloud (Experiment 4).

Public Clouds also have some quotas regarding the VM instances available to a single user. Hence, to boost our computations even further, we can use more than one public Cloud in combination with resources from the private Cloud (Experiment 6). This hybrid Cloud, with much more resources than used in the previous experiments, turned out to be both the most efficient one in terms of execution time and also very cost-effective.

The calculated shortest distances varied slightly in each experiment, but were very similar in general. This was expected, since we used a genetic algorithm, which is a special case of stochastic optimization methods.
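As a remark on the cost column: the figures are consistent with per-started-hour billing at on-demand rates contemporary to the experiments, e.g. Experiment 2 corresponds to 20 Small instances × 39 started hours × $0.06/h = $46.80, and Experiment 3 to 10 × 24 [h] × $0.06/h = $14.40; the hourly rate is our assumption, as it is not stated explicitly in the pricing data above.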
V. CONCLUSIONS AND FUTURE WORK

The requirement of supporting heterogeneous Clouds is an important aspect of conducting virtual experiments such as data farming. More and more often, in-house resources are limited to a necessary minimum, thus public Clouds are the only pool of easily accessible computing power.

In this article, we described an extended version of the Scalarm platform, which supports a combination of public and private Clouds to conduct data farming experiments. We described how the execution flow of scheduling Simulation Managers was unified and how different Clouds can be utilized almost transparently in a single experiment.

The conducted experiments evaluated the implemented extension in terms of facilitating support for new Cloud infrastructures and conducting data farming experiments with heterogeneous Clouds. Though different Clouds were used in each conducted experiment, the rest of Scalarm's functionality, e.g. progress monitoring and data exploration, did not require any modifications.

The presented support for heterogeneous Clouds can be further enriched with additional features regarding cost planning for data farming experiments, in order to predict the cost of conducting a particular experiment based on Cloud pricing information, the number of scheduled Simulation Managers, and the average simulation execution time.

ACKNOWLEDGMENT

The research is partially supported by the VirtROLL project - EU Research Fund for Coal and Steel (RFCS), RFSC-CT2013-00007. RS and JK are also grateful for support to the AGH-UST grant no. 11.11.230.015.

REFERENCES

[1] Gartner's 2013 Hype Cycle for Emerging Technologies Maps Out Evolving Relationship Between Humans and Machines, http://www.gartner.com/newsroom/id/2575515, accessed: 02/01/2014.
[2] J. Marco, I. Campos, I. Coterillo, et al., "The Interactive European GRID: Project Objectives and Achievements", Computing and Informatics, 27(2), 2008, pp. 161–171.
[3] G. E. Horne and K.-P. Schwierz, "Data farming around the world overview", in Proceedings of the 40th Conference on Winter Simulation, WSC 08, pp. 1442–1447, Winter Simulation Conference, 2008.
[4] Amazon Elastic Compute Cloud website, http://aws.amazon.com/ec2/, accessed: 10/01/2014.
[5] S. Upton, Users Guide: OldMcData, the Data Farmer, Version 1.1., http://harvest.nps.edu/software.html, accessed: 02/01/2014.
[6] R. P. Bruin, T. O. H. White, A. M. Walker, K. F. Austen, M. T. Dove, R. P. Tyer, P. A. Couch, I. T. Todorov, and M. O. Blanchard, "Job submission to grid computing environments", in Proceedings of the UK e-Science All Hands Meeting 2006, Nottingham, UK, pp. 426–432, 2006.
[7] K. Salah, "A Queueing Model to Achieve Proper Elasticity for Cloud Cluster Jobs", in 2013 IEEE Sixth International Conference on Cloud Computing (CLOUD), pp. 755–761, 2013.
[8] R. Cushing, S. Koulouzis, A. Belloum and M. Bubak, "Applying workflow as a service paradigm to application farming", Concurrency and Computation: Practice and Experience, 2013, doi: 10.1002/cpe.3073.
[9] E. Carlini, M. Coppola, P. Dazzi, L. Ricci, and G. Righetti, "Cloud Federations in Contrail", in M. Alexander et al. (Eds.), Euro-Par 2011: Parallel Processing Workshops, Lecture Notes in Computer Science, vol. 7155, pp. 159–168, Springer, 2012.
[10] G. Pierre and C. Stratan, "ConPaaS: A Platform for Hosting Elastic Cloud Applications", IEEE Internet Computing, vol. 16, no. 5, pp. 88–92, 2012.
[11] A. Oprescu and T. Kielmann, "Bag-of-Tasks Scheduling under Budget Constraints", in 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pp. 351–359, 2010.
[12] D. Krol, M. Wrzeszcz, B. Kryza, L. Dutka, and J. Kitowski, "Massively Scalable Platform for Data Farming Supporting Heterogeneous Infrastructure", in The Fourth International Conference on Cloud Computing, GRIDs, and Virtualization, IARIA Cloud Computing 2013, Valencia, Spain, pp. 144–149, 2013.
[13] S. Dlugolinsky, M. Kvassay, L. Hluchy, M. Wrzeszcz, D. Krol, and J. Kitowski, "Using parallelization for simulation of human behaviour", in GCCP 2011: 7th International Workshop on Grid Computing for Complex Problems, Bratislava: Institute of Informatics SAS, 2011, pp. 258–265, ISBN: 978-80-970145-5-1.
[14] M. Kvassay, L. Hluchy, S. Dlugolinsky, M. Laclavik, B. Schneider, H. Bracker, A. Tavcar, M. Gams, D. Krol, M. Wrzeszcz, and J. Kitowski, "An integrated approach to mission analysis and mission rehearsal", in Proceedings of the Winter Simulation Conference, WSC 12, pp. 1–2, Winter Simulation Conference, 2012.
[15] Amazon SDK for Ruby website, http://aws.amazon.com/sdkforruby, accessed: 02/01/2014.
[16] PLGrid PLUS Project website, http://www.plgrid.pl/en/projects/plus, accessed: 02/01/2014.
[17] S. Saini, S. Heistand, H. Jin, J. Chang, R. Hood, P. Mehrotra, and R. Biswas, "An Application-based Performance Evaluation of NASA's Nebula Cloud Computing Platform", in G. Min et al. (Eds.), HPCC-ICESS, pp. 336–343, IEEE Computer Society, 2012.
[18] World Traveling Salesman Problem website, http://www.math.uwaterloo.ca/tsp/world/index.html, accessed: 02/01/2014.
[19] Great circle distance definition, http://mathworld.wolfram.com/GreatCircle.html, accessed: 02/01/2014.
[20] Genetic Algorithms in Ruby, http://www.ai4r.org/geneticAlgorithms.html, accessed: 02/01/2014.
[21] Apache jclouds website, http://jclouds.apache.org/, accessed: 05/01/2014.
[22] Apache Deltacloud website, http://deltacloud.apache.org/, accessed: 05/01/2014.
[23] Google Compute Engine website, https://cloud.google.com/products/compute-engine/, accessed: 05/04/2014.