Journal of Grid Computing manuscript No. (will be inserted by the editor)
An SLA-based Broker for Cloud Infrastructures

Antonio Cuomo · Giuseppe Di Modica · Salvatore Distefano · Antonio Puliafito · Massimiliano Rak · Orazio Tomarchio · Salvatore Venticinque · Umberto Villano
Abstract The breakthrough of the Cloud comes from its service-oriented perspective, where everything, including the infrastructure, is provided "as a service". This model is attractive and convenient for both providers and consumers; as a consequence, the Cloud paradigm is quickly growing and spreading widely, also in non-commercial contexts. In such a scenario, we propose to incorporate some elements of volunteer computing into the Cloud paradigm through the Cloud@Home solution, bringing into the mix nodes and devices provided by potentially any owner or administrator, giving contributors access to large computational resources and also maximizing resource utilization. This paper presents and discusses the first step towards Cloud@Home: providing quality of service and service level agreement facilities on top of unreliable, intermittent Cloud providers. Some of the main issues and challenges of Cloud@Home, such as the monitoring, management and brokering of resources according to service level requirements, are addressed through the design of a framework core architecture. All the tasks committed to the architecture's modules and components, as well as the most relevant component interactions, are identified and discussed from both the structural and the behavioural viewpoints. Some encouraging experiments on an early implementation prototype deployed in a real testing environment are also documented in the paper.

Keywords Cloud Computing, Cloud@Home, SLA, QoS, Resource Brokering.

Antonio Cuomo, Umberto Villano: Dipartimento di Ingegneria, Università degli Studi del Sannio, Italy. E-mail: antonio.cuomo, [email protected]
Giuseppe Di Modica, Orazio Tomarchio: Dipartimento di Ingegneria Elettrica, Elettronica ed Informatica, Università di Catania, Italy. E-mail: giuseppe.dimodica, [email protected]
Salvatore Distefano: Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy. E-mail: [email protected]
Antonio Puliafito: Dipartimento di Matematica, Università di Messina, Italy. E-mail: [email protected]
Massimiliano Rak, Salvatore Venticinque: Dipartimento di Ingegneria dell'Informazione, Seconda Università di Napoli, Italy. E-mail: massimiliano.rak, [email protected]
1 Introduction

Among the several definitions of "Cloud computing" available in the literature, one of the most authoritative is that provided by NIST [?]: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This Cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models." It is important to remark that this definition identifies availability as a key concept of the Cloud paradigm. This concept falls into the broader class of quality of service (QoS), service level agreement (SLA) and related issues, topics of primary and strategic importance in the Cloud.

In this context, the focus of this paper is on the Infrastructure as a Service (IaaS) provisioning model. IaaS Clouds are built up to provide infrastructures such as computing, storage and communication systems, for free or for a charge, with or without QoS/SLA guarantees. There are multiple frameworks able to provide computing IaaS services: Eucalyptus [?], OpenNebula [?], Nimbus [?], PerfCloud [?], Clever [?], OpenStack [?], to name a few. All of them, as well as the existing proprietary solutions (e.g., Amazon EC2, Rackspace), aggregate and manage powerful and reliable underlying computing resources (usually single or multiple interconnected datacenters) to build up the Cloud IaaS infrastructure.

A different approach is instead proposed by Cloud@Home, a project funded by the Italian Ministry for Education and Research [?]. Cloud@Home (briefly, C@H) aims at building an IaaS Cloud provider using computing, storage and sensing resources also acquired from volunteer contributors. The basic assumption on which C@H relies is that the resources offered on a volunteer basis are not reliable and cannot provide levels of QoS comparable to those offered by commercial public Clouds.

We believe the C@H volunteer approach can provide benefits in both business and open contexts. In business environments, one of the main sources of cost and complexity for companies is expanding, maintaining, tuning and optimizing the hardware resources needed to effectively run highly demanding, domain-specific software and to ensure adequate productivity levels. The C@H technology will enable companies to organize their computing resources, which are sometimes distributed over several sites, in order to meet the demands of such software. Indeed, C@H allows a company to aggregate its sites into a federation, to which each site can contribute with its available and underexploited hardware resources to provide added-value, guaranteed services.

In open contexts, a possible scenario that magnifies the C@H features and approach is the academic one. Let us imagine that several universities or, in general, research institutions worldwide need to collaborate on a scientific project that requires a huge amount of hardware resources. Moreover, let us assume that each institution owns a private datacenter, made up of heterogeneous computing resources, each having a different level of utilization (depending on their geographic coordinates and time zones, some datacenters may turn out to be underexploited with respect to others).
C@H will provide the institutions with tools to build up a federation of datacenters acting as a Cloud broker, to which each partner can contribute with its own (i.e., unutilized or underexploited) resources according to its scheduled availability. Pushing the approach to its limits, one can also imagine a scenario where private users aggregate into a federation and share their resources for each other's needs.

In order to implement such an ambitious idea, a three-phase roadmap was scheduled in the C@H project: i) development of quality of service (QoS) and service level agreement (SLA) brokering/federation mechanisms for private Cloud providers; ii) development of
billing and reward mechanisms for merging both private and public Clouds; iii) development of tools for involving single-resource (desktop, laptop, cluster) volunteer contributors. The focus here is on the first step of the roadmap. The paper discusses the implementation of a framework for federating and brokering private Clouds, able to provide QoS guarantees on top of best-effort providers. The strength of the C@H proposal is that it offers a solution that is efficient from a cost perspective (when needed, resources can be "borrowed" from the system, thus avoiding the need to purchase them outside) and, at the same time, sufficiently reliable, as mechanisms for guaranteeing QoS are also provided. Looking at the state of the art in this research field, the management of the service quality of leased resources is sometimes partially covered by commercial Clouds (availability, reliability and high-level performance are natively supported by very powerful and high-quality datacenters) and often neglected in current Cloud frameworks.

The objectives of the C@H project and its rationale, limits and application contexts have been critically discussed in [?], which is a preliminary version of this work mainly focusing on the concepts and ideas behind C@H. The present work further develops and implements those ideas, more specifically by: i) identifying and characterizing the system actors; ii) detailing the architectural design and its modules and components; iii) providing details on the interactions among the components implementing the C@H functionalities; iv) reporting on the current prototype implementation and the testbed.

The remainder of the paper is organized as follows. A brief overview of the state of the art is first reported in Section 2. Then, Section 3 describes the C@H core architecture. Sections 4, 5, 6 and 7 delve into the details of the C@H architectural modules and components. Section 8 presents the system from a dynamic perspective, while Section 9 deals with a real testbed on which tests have been carried out. Conclusions and future developments are discussed in Section 10.
2 Related work

C@H aims at implementing a brokering-based Cloud provider starting from resources shared by different providers, addressing QoS- and SLA-related issues, as well as resource management, federation and brokering problems. Since these issues span different topics, in the following we identify some of them and provide an overview of the current state of the art.

Volunteer and Cloud Computing. The idea of volunteer Clouds has recently emerged as one of the most interesting topics in Cloud computing. Some work is available in the literature, also inspired by C@H, which is one of the first attempts in this direction [?]. In [?] the authors present the idea of leveraging volunteer resources to build a form of dispersed Clouds, or "nebulas", as they call them. Those nebulas are not intended to be general purpose, but to complement the offering of traditional homogeneous Clouds in some areas where a more flexible, less guaranteed approach can be beneficial, like testing environments or applications where data are intrinsically dispersed and centralising them would be costly. Some requirements and possible solutions are presented. BoincVM [?] is an integrated Cloud computing platform that can be used to harness volunteer computing resources, such as laptops, desktops and server farms, for running CPU-intensive scientific applications. It leverages existing technologies (the BOINC platform and VirtualBox) along with some projects currently under development: VMWrapper, VMController and CernVM. Thus, it is a kind
of volunteer-on-Cloud approach, whereas C@H can be classified as a Cloud-on-volunteer model. In [?] the authors investigate how a mixture of dedicated (and so highly available) and non-dedicated (and so highly volatile) hosts can be used to provision a processing tier of a large-scale Web service. They propose an operational model that guarantees long-term availability despite host churn, by ranking non-dedicated hosts according to their availability behavior. Through experimental simulation results they demonstrate that the technique is effective in finding a suitable balance between costs and service quality. Although the technique is interesting and the results are encouraging, the paper gives no evidence of either a possible implementation or an architectural design of the overall infrastructure framework that should implement the idea. An approach that can be categorized as volunteer Cloud is the P2P Cloud. It has been proposed in several papers, such as the ones cited above, and particularly in storage Cloud contexts. An interesting implementation of this idea is proposed in [?]. In particular, that work specifically focuses on peer reliability, proposing a distributed mechanism that enables churn-resistant reliable services: it allows resources provided by the unreliable P2P system to be reserved, monitored and used, and maintains long-term resource reservations through controlled redundant resource provision. The evaluation results obtained through simulation show that using KAD measurements to predict the lifetime of peers allows for 100% successful reservations under churn with very low traffic overhead. As in the above case, there is no real implementation of the proposed solution. Nevertheless, the monitoring and prediction tools developed can be of interest for C@H.

Federation, InterCloud and resource provisioning from multiple Clouds. Management of resources in the Cloud is a complex topic. In [?,?] examples of resource management techniques are discussed, while in [?] a policy-based technique addressing resource management in Cloud environments is proposed. Even though Cloud computing is an emerging field, the need to move beyond the limitations of provisioning from a single provider is gaining interest in both academic and commercial research. In [?], the authors move from a datacenter model (in which clusters of machines are dedicated to running Cloud infrastructure software) to an ad-hoc model for building Clouds. The proposed architecture aims at providing management components to harvest resources from non-dedicated machines already in existence within an enterprise. The need for intermediary components (Cloud coordinators, brokers, exchanges) is explained in [?], where the authors outline an architecture for a federated network of Clouds (the InterCloud). The evaluation is conducted on a simulated environment modeled through the CloudSim framework, showing significant improvements in average turnaround time and makespan in some test scenarios. Federation issues in Cloud environments have been considered, and some research projects focusing on this specific topic actively investigate possible solutions, such as RESERVOIR [?] and, more recently, mOSAIC [?] and OPTIMIS [?]. With specific regard to [?,?], the approach they propose is to implement a brokering system that acquires resources from different Cloud providers and offers them in a custom way to their users.
As regards the brokering solution for federation, C@H pioneered this modality as early as 2009 [?].

SLA Management in Clouds. In service-oriented environments several proposals addressing the negotiation of dynamic and flexible SLAs have appeared [?]. However, to the best of our knowledge, none of the main commercial IaaS providers (Amazon, Rackspace, GoGRID, ...) offers negotiable SLAs. What they usually propose is an SLA contract that specifies simple guarantees on uptime percentage or network availability. Moreover, most of the providers offer additional services (for example, Amazon CloudWatch) which monitor the state of target resources (e.g., CPU utilization and bandwidth).
Open Cloud engines like Eucalyptus, Nimbus and OpenNebula also implement monitoring services for the private Cloud provider, but do not provide solutions for SLA negotiation and enforcement. A survey of the SLAs offered by commercial Cloud providers can be found in [?]. In [?] the authors describe a system able to combine SLA-based resource negotiations with virtualized resources, pointing out that no approach in the current literature takes into account both these aspects. A global infrastructure aiming at offering SLAs on any kind of Service Oriented Infrastructure (SOI) is the objective of the SLA@SOI project [?], which proposes a general architecture that can be integrated into many existing solutions. However, this interesting solution is hard to fully adopt in an infrastructure composed of unreliable resources such as the ones targeted by the C@H project. Recently, in the context of the mOSAIC project, some offerings have appeared that provide user-oriented SLA services to final users [?,?].

A critical analysis. Given the large number of computational resources often available in different environments (like scientific labs, office terminals, academic clusters), there is a clear need for solutions able to reuse such resources in a Cloud computing fashion. The state of the art described above shows that a few attempts have been made to apply volunteer computing approaches in this direction. The main limits of such solutions relate to three aspects: (1) the definition of the real use cases where the volunteer computing approach can be applied, (2) the integration with resources that come from commercial providers and (3) a clear evaluation of the quality of service obtained with such resources.

The first open issue requires a clear identification of the type of resources that can be shared with the volunteer approach and of their possible usage. To the best of the authors' knowledge, such an analysis is not yet available, apart from [?], a work-in-progress paper which partly anticipates some of the ideas presented here. The second problem, instead, is strictly related to what is called Cloud federation, i.e., the idea of integrating resources from different Cloud providers using different techniques (like brokering or Cloud bursting). As outlined above, there is a lot of interest in federation, but few results take into consideration the effect of integrating commercial-based and volunteer-based resources, and the different issues that arise in such a context. In any case, even though brokering solutions are now available, few of them are stable; furthermore, the techniques to be adopted are still open and the way in which the brokering functionalities should be offered is, at the current state of the art, not clearly defined. The third problem is a well-known one in the literature: how to guarantee service level agreements over Cloud providers? Despite considerable effort in this direction, no stable proposal has emerged. Commercial Cloud providers use natural language to describe the functionalities, the terms of use and the service levels of their offers. Research projects (like SLA@SOI, Contrail and OPTIMIS, mentioned above) try to offer frameworks that can be integrated into Cloud providers, but these are usually heavy to maintain and hard to customize. The only available standard (WS-Agreement) has only one stable framework implementing its features (WSAG4J). Moreover, all such proposals focus on the integration of SLA management systems from the Cloud provider perspective.
Implementing SLA mechanisms on top of federated resources is still an open question. Existing solutions (e.g., SLA@SOI) focus on how to integrate an SLA management framework into complex datacenters; the problem is translated into a resource optimization problem. Some results focus on how to optimize the resources at the brokering level, i.e., after the resources are obtained and without control over the physical infrastructure. More recently, solutions aiming at offering SLA functionalities on top of brokering systems have been proposed [?,?,?,?]. To the best of the authors' knowledge, none of them takes into account volunteer and time-bounded availability scenarios.
3 System Overview

From a wider perspective, C@H aims at merging the Cloud and the volunteer computing paradigms. C@H collects infrastructure resources from different providers and offers them to the end users through a uniform interface, in an IaaS fashion. As depicted in Fig. 1, C@H resources are gathered from heterogeneous providers, potentially ranging from commercial Cloud providers, offering highly reliable services, to single PCs, voluntarily shared by their owners, which, by their nature, are not able to provide guarantees on the QoS.
Fig. 1: Resource aggregation from different Cloud providers.
The main goal of C@H is to provide a set of tools for building up a new, enhanced provider of resources (namely, a C@H provider) that is not yet another classic Cloud provider, but instead acts as an aggregator of resources offered by third-party providers. A C@H provider collects heterogeneous resources from different Cloud providers adopting diverse resource management policies, and offers such resources to the users in a uniform way. The result is an added-value infrastructure that also provides mechanisms and tools for implementing, managing and achieving QoS requirements defined through, and managed by, a specific SLA process. Indeed, in order to deal with the heterogeneity and churn of resources, a C@H provider can use a set of tools and services dedicated to SLA management, monitoring and enforcement.

A goal of C@H is to release the tools discussed above to the Cloud community. Any interested organization may use such tools to build a C@H provider. Nothing prevents the instantiation of multiple C@H providers, each one collecting and aggregating resources from different resource providers. Furthermore, a resource provider is allowed to join any C@H system it wishes. As explained in Section 5, the availability of resources for a given request is assessed by the C@H provider at run-time, and the management of the resource status (free, busy, etc.) is up to the resource provider itself.

To implement the wide and ambitious Cloud@Home vision, it was necessary to adequately design and organize the work into phases according to the project aims and goals. In this paper we focus on the first phase towards Cloud@Home, which aims at identifying,
specifying and implementing the C@H building blocks and its core architecture, restricting the scope to private Cloud providers.
3.1 Actors

On the backend side, the C@H provider interfaces with Cloud providers and performs the brokering of their services. It has to deal with the different levels of service quality they are natively able to deliver. On the frontend side, the C@H provider has to allow the final users to access the resources in a uniform way, providing them with the required, sustainable QoS specified through the SLA process. In this context, it is possible to identify three main actors involved in the C@H management process: Users, Admins and Resource Owners.

A C@H User interacts with the C@H provider in order to request resources along with the desired quality of service. C@H Users are also provided with tools to negotiate the desired QoS and, at service provision time, to check that the promised QoS is actually being delivered.

A C@H Admin builds up and manages the C@H provider. The C@H Admin is the manager of the C@H infrastructure and, in particular, is in charge of the infrastructure activation, configuration and management. The C@H Admin decides which services provided by the infrastructure must be activated/deactivated. Furthermore, in the case of QoS/SLA-enabled infrastructures and services, the C@H Admin specifies the policies that have to be adopted to carry out the SLA negotiation process and the QoS enforcement.

A Resource Owner shares its resources with the C@H system. Besides private sharers, the category of Resource Owners also encompasses commercial offerers (e.g., mainstream IaaS Cloud providers). In other terms, a Resource Owner is a potential Cloud provider for C@H, even if some Resource Owners are not able to provide any standalone Cloud service. The role of Resource Owners can be classified and specialized into public contributors (i.e., well-known public Clouds automatically enrolled by the system) and volunteer contributors. Volunteer contributors are Resource Owners that voluntarily share their resources to build up a C@H provider. We can further categorize volunteer contributors as:

– Private Clouds: standalone administrative domains, which may have their own QoS/SLA management and other resource facilities and services. They can be voluntarily involved in C@H according to their administrators' needs and wishes, thus becoming contributors.
– Individuals: anyone who wants to voluntarily share their own desktop, laptop, cluster or generic resource/device with a C@H community.

In this paper we specifically focus on private Clouds, narrowing the issues and related solutions to this class of volunteer contributors. In other words, here we restrict the concept of volunteer to just private Cloud contributors.
3.2 A Modular Architecture

According to the scenario discussed above, in the following we identify the main blocks and modules of the C@H architecture, considering just private Clouds as Resource Owners. This includes the basic functionalities and mechanisms for implementing a C@H provider on top of private Cloud contributors. It can also be considered as the core architecture, the starting
point to extend and generalize when public Clouds and/or individuals are involved in C@H as Resource Owners.

Fig. 2: The C@H system architecture.

The main goal of the architecture is to address some of the issues raised above. The architecture offers a set of components that any organization can easily use to build up its own Cloud brokering solution. Fig. 2 depicts the C@H core architecture: it is organized into modules, each composed of units providing specific functionalities, named C@H components. C@H components are themselves delivered as resources hosted on Cloud providers. Following and applying the separation-of-concerns principle, four main modules have been logically identified, grouping the main C@H functionalities: the Resource Abstraction module, the SLA Management module, the Resource Management module and the Frontend module. As shown above, some of the main components are designed to deal with SLA issues.

The Resource Abstraction module hides the heterogeneity of resources (computing, storage and sensor elements) collected from Resource Owners and offers the C@H User a uniform way to access them. It provides a layer of abstraction adopting a standard, implementation-agnostic representation. It also implements drivers in order to convert requests expressed in the intermediate representation into actual invocations on the interface of the Resource Owner.

On top of the Resource Abstraction module, C@H provides tools for the management of the SLAs that have to be negotiated with the C@H Users. The definition of formal guarantees on the performance that the resources must deliver is achieved through the SLA Management module. C@H Users can negotiate the quality level of the requested resources. The negotiation process relies on performance prediction and on statistics of providers' historical availability in order to assess the sustainability of C@H Users' requests. Statistics are built from information collected on the actual performance and QoS recorded for the supplied resources. A mobile agent-based monitoring service is responsible for gathering those data.

The Resource Management module is the core of the system. It is in charge of the provision of resources and of the SLA enforcement. The most important functionalities provided by this module are related to resource management. In particular, this mod-
ule is responsible for resource enrolment, discovery, allocation/re-allocation and activation/deactivation. These activities are carried out in accordance with the SLA goals, applying the procedures defined in the SLA Management module.

Finally, the Frontend module acts as an interface to the C@H IaaS for both the C@H Admin and the C@H Users. It simply collects C@H Admin and User requests and dispatches them to the appropriate system module.

In terms of implementation, C@H components are able to interact with each other through standardized, service-oriented interfaces. Such a choice enables flexible component deployment schemes. Pushing the virtualization paradigm to its limits, a single C@H component can even be offered as a customized virtual machine hosted by any Cloud provider.
4 Resource Abstraction

C@H mainly acts as an intermediary, acquiring different types of infrastructural resources from different Resource Owners and delivering them to C@H Users. To cope with resource and provider heterogeneity, a resource abstraction scheme has been devised. The Resource Abstraction module encompasses the components providing the abstraction and the necessary logic to map resources onto real implementations. Heterogeneous resources lack uniformity in some specific characteristics, properties and aspects, as they consist of dissimilar or diverse elements. Such differences can be broadly categorized according to three resource aspects:

Type - a possible classification could be based on the resource's intended function (i.e., the resource type), distinguishing among computing, sensor and storage resources. This specific aspect of resource heterogeneity has been investigated in [?]. In the present work, we specifically focus on computing resources, since we are mainly interested in describing the high-level mechanisms and solutions to deal with QoS, SLA and resource management in C@H. However, the proposed solutions do not depend on the type of resources and can be easily adapted to sensor and storage resources.

Hardware - resources can physically differ in their characteristics, ranging from the internal architecture (CPU, symmetric multiprocessor, shared memory, etc.) to the devices (number of cores, controllers, buses, disks, network interfaces, memories, etc.) and so on. Hardware heterogeneity issues of computing resources can be mainly addressed by virtualization.

Software - software environments can differ in operating systems, compilers, libraries, protocols, applications, etc. In the Cloud context, as above, computing resource heterogeneity is overcome through virtualization.

To cope with this complexity, we decided to adopt an implementation-agnostic representation and access interface for every class of resources, embracing current standardization efforts whenever possible. Another issue to adequately take into account is the delivery of resources from providers, for which the specific acquisition modality must be defined and reflected in the interface. Based on the abstractions discussed above, a computing resource interface has been designed to enable access to the provisioned infrastructure. First of all, computing resources are defined in terms of basic blocks like Virtual Machines and Virtual Clusters. A reference standard has been chosen as the native interface to be supported for computing resource management: the OGF Open Cloud Computing Interface (OCCI, [?]). As for the delivery of resources, C@H currently defines two acquisition modalities for computing resources:
Charged - resources obtained from public Cloud providers at a certain cost or fee and providing guaranteed QoS levels;

Volunteer - resources obtained from Resource Owners that voluntarily support C@H. These might range from Cloud-enabled academic clusters, which deliver their resources for free (possibly subject to internal administration policies and restrictions), to laboratories providing single machines outside their working hours, to single desktops willing to contribute to the infrastructure. Such resources are intermittent but, in the case of private Cloud volunteer contributors, the availability is slotted, i.e., the private Cloud provider defines a time window, an interval, or more specifically a series of intervals, during which it is available as a C@H Resource Owner. In the case of individuals, instead, no guarantees are provided on the shared resources. In this paper, restricting the scope to just private Cloud volunteer contributors, we assume resources are provided specifying the availability time windows.

Besides the computing resource interface just described, the Resource Abstraction module contains components to support the practical implementation of such an interface. These are the Provider Drivers, which implement tools for the acquisition of resources by enabling the interaction with the Resource Owners. They receive OCCI-compliant resource requests from other components, convert them into the target provider's interface, perform the actual invocation on the Resource Owner and convert the results back into OCCI format. In this way, it is possible to interact with several open and commercial Cloud platforms, like Amazon EC2 and Rackspace. A generic OCCI driver is provided too, so that Resource Owners whose infrastructure implements this interface are automatically able to interact with the higher-level services by directly exchanging OCCI-compliant messages and requests. In the current implementation, mainly focusing on private Cloud contributors, we have implemented drivers for the PerfCloud [?] and Clever [?] providers, two Cloud frameworks adopted in the context of the C@H project. PerfCloud is a solution for integrating the Cloud and Grid paradigms, an open problem that is attracting growing research interest [?,?]. The PerfCloud approach aims at building an IaaS (Infrastructure as a Service) Cloud environment upon a Grid infrastructure, leasing virtual computing resources that usually (but not necessarily) are organized in virtual clusters. Clever builds a Cloud provider out of independent hosts through a set of peer-to-peer based services. The choice of a decentralized P2P infrastructure allows the framework to provide fault tolerance with respect to issues like host volatility. In case a new Cloud framework wants to join C@H, it ought to provide an OCCI-compliant interface or implement a driver for its own infrastructure.
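To fix ideas, the driver abstraction described above can be sketched as a small Java interface. This is a minimal, hypothetical sketch: all type and method names (ProviderDriver, OcciRequest, OcciResponse) are illustrative assumptions, not the actual C@H code.

public class ProviderDriverSketch {

    // Minimal stand-ins for OCCI-level messages (illustrative only).
    public record OcciRequest(String kind, String body) {}
    public record OcciResponse(int status, String body) {}

    public static class DriverException extends Exception {
        public DriverException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    public interface ProviderDriver {

        // Identifier matched against the CloudEngine section of the
        // Resource Provider Descriptor (see Section 5.1), e.g.
        // "PerfCloud", "Clever" or "OCCI" for the generic driver.
        String engineName();

        // Translate an OCCI-compliant request into the target provider's
        // native interface, perform the actual invocation on the Resource
        // Owner, and convert the answer back into an OCCI-compliant response.
        OcciResponse invoke(OcciRequest request) throws DriverException;
    }
}

A registry of such drivers, keyed by the CloudEngine value of the Resource Provider Descriptor, is then enough to dispatch each OCCI request to the right backend.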
5 Resource management

From a high-level point of view, a C@H provider is an intermediary (or broker) for the acquisition of resources from different Resource Owners. In this way, it delegates the low-level management of the infrastructure to the Resource Owners it interfaces with. Two important tasks of a C@H provider are therefore the search for Resource Owners (providers) and the acquisition of resources that will eventually be delivered to the final C@H Users. This section describes the C@H components implementing such tasks: the Registry and the Resource & QoS Manager.
5.1 Registry

The Registry component collects information on resource providers and on the way their offered resources can be accessed. It provides a simple interface through which resource providers can subscribe to C@H (i.e., decide to share their resources with C@H). At subscription time, resource providers must supply a Resource Provider Descriptor (RPD) file. As briefly shown in Listing 1, the file contains the following sections:

CloudEngine - identifies the Cloud solution adopted by the provider. This information is needed by C@H to set up the correct drivers. As discussed in Section 4, drivers for PerfCloud and Clever have been fully developed. Other engines must use the generic OCCI driver in order to interact with C@H.

WindowShare - contains the schedule of the time windows during which the resource provider is willing to share resources with C@H. This information is used by C@H when new user requests arrive, to filter out providers that are not willing to share their resources at the time requested by the C@H User.

Security - contains the subsection(s) dedicated to the kind of credentials accepted by the provider. In the example proposed, the provider adopts the PerfCloud engine, which makes use of GSI credentials from the Grid Security Infrastructure of the Globus Toolkit [?]. In this case, a subset of the information contained in the PKI certificate is reported. In order to gain access, the user needs the credentials for the target Virtual Organization.

AccessPoint - contains the information needed to access the target provider (IP address and qualified names, in the example).

Listing 1: Resource Provider Descriptor (excerpt; XML structure reconstructed, element names illustrative)

[..]
<CloudEngine>PerfCloud</CloudEngine>
<WindowShare>
  <Window name="WorkingDay">
    <Start>6:00 PM CEST</Start>
    <End>6:00 AM CEST</End>
  </Window>
  <Window name="HolyDay">
    <Start>12:00 AM CEST</Start>
    <End>12:00 PM CEST</End>
  </Window>
</WindowShare>
<Security>
  <GSI>
    <VirtualOrganization>Cloud@Home</VirtualOrganization>
    <Issuer>O=Grid, OU=GlobusTest, OU=simpleCA-CloudAtHome, CN=Globus Simple CA</Issuer>
    <Subject>O=Grid, OU=GlobusTest, OU=simpleCA-CloudAtHome, CN=Globus Simple CA</Subject>
  </GSI>
</Security>
<AccessPoint>
  <IPAddress>193.206.100.159</IPAddress>
  <HostName>Antares.ing.unina2.it</HostName>
  <Alias>Antares</Alias>
</AccessPoint>
[..]
At discovery time, i.e. when resources must be enrolled to satisfy a C@H User’s request, the Registry can be queried to retrieve the list of providers that are eligible to serve the request according to their WindowShare and Security settings.
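As a concrete illustration, the following Java sketch shows the kind of WindowShare filtering the Registry can perform at discovery time. It is a minimal sketch under simplifying assumptions (absolute, non-recurring windows and illustrative type names), whereas the actual RPD schedule describes recurring daily windows.

import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;

public class RegistrySketch {

    // Stand-in for one entry of the WindowShare section of an RPD.
    public record Window(ZonedDateTime start, ZonedDateTime end) {
        boolean covers(ZonedDateTime from, ZonedDateTime to) {
            return !from.isBefore(start) && !to.isAfter(end);
        }
    }

    public record Provider(String name, List<Window> windows) {}

    // A provider is eligible if at least one declared window covers the
    // whole requested provision interval [from, to].
    public static List<Provider> findEligible(List<Provider> subscribed,
                                              ZonedDateTime from,
                                              ZonedDateTime to) {
        List<Provider> eligible = new ArrayList<>();
        for (Provider p : subscribed) {
            if (p.windows().stream().anyMatch(w -> w.covers(from, to))) {
                eligible.add(p);
            }
        }
        return eligible;
    }
}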
5.2 Resource & QoS Manager The Resource & QoS Manager (RQM) is a crucial component in the C@H architecture, as it is responsible of acquiring the virtual resources from the providers and ensuring that the negotiated QoS is being delivered. As shown in Fig. 3, the crosscutting tasks of the RQM require it to be able to interface with all other subsystems. To this end, the RQM has been designed as an asynchronous event-based system, denoted as RQMCore, which reacts to requests coming from other components.
Fig. 3: The Resource & QoS Manager
The RQM core tasks are Request Management and SLA Enforcement. Request Management consists in the translation of C@H User requests into actual resource selection and allocation. Resource requests that do not involve SLA negotiations are directly forwarded to the RQM by the Frontend. SLA-based resource requests are handed to the RQM by the SLA Manager (described in Section 6.1): the associated policies and procedures that must be used in SLA Enforcement are defined through the SLA Manager and stored in the RQM Core databases. To fulfill these tasks, the RQM performs many activities. A workflow view of the activities and the interactions they entail is described in Section 8.2.2, while an overview from the RQM perspective is provided here. Activities related to Request Management include:

Provider Selection. The RQM can query the Registry to obtain the list of subscribed providers. The query can include filtering criteria on parameters specified in the Resource Provider Descriptor of Listing 1.

Resource acquisition. Once a suitable provider has been found, the Provider Drivers are set up to carry out resource acquisition.
Logging. The RQM logs all its operations and their status for further inspection and bookkeeping.

More complex activities involve cooperation between multiple modules and are oriented to the management of SLA-based requests and SLA Enforcement:

Availability guarantees. To provide availability guarantees, the RQM needs a forecast of the availability level of the resources shared by a provider. The availability of a provider resource can be described with the well-known equation:

ProviderAvailability = MTBF / (MTBF + MTTR)    (1)

where MTBF is the mean time between failures and MTTR is the mean time to repair of a single resource provided by the corresponding Cloud provider. To estimate the MTBF of a provider resource, the RQM first retrieves historical heartbeat provider data available from the monitoring subsystem, which is based on the Mobile Agents based GriD Architecture (MAGDA) described in Section 6.3. The historical data can be used to obtain a forecast of the provider MTBF by invoking the forecast service provided by the Cloud@Home Autonomic Service Engine (CHASE), a component described in Section 6.2. As for the MTTR, it can be defined as:

MTTR = Tfd + Tboot    (2)
where:

– Tfd is the time required to detect a resource failure. This is related to the rate at which the monitoring subsystem performs checks, which is bounded by the MAGDA timeout, a parameter that specifies how often MAGDA agents have to report on the resource status.
– Tboot is the time required for the system to boot up another virtual machine in substitution of the failed one. This time depends both on the computing power of the virtual machine and on the complexity of the VM image. Again, the RQM can obtain a forecast for this value by feeding the forecast service in CHASE with historical boot time data (logs of virtual machine boot times) obtained from MAGDA.

Once the MTBF and the MTTR are evaluated, the RQM is able to compute the provider availability through Equation 1.

Alert reaction. The RQM uses alerts generated by the monitoring subsystem to activate the SLA Enforcement process. The policies to be activated are expressed through simple triples [<parameter>, <condition>, <procedure>], which formalize the procedures that have to be triggered when a given parameter satisfies a certain condition. The policies offered to the C@H administrator are represented in terms of a simple template, which enables the administrator to configure them: the responsibility for the correctness of the policies lies with the administrator. The current implementation uses a format akin to JSON.

Let us describe an example use of policies. Through the MAGDA monitoring component, the C@H system uses heartbeat messages to verify whether a node is alive. It stores the heartbeat information in two variables: HBfail, representing the number of failed heartbeats, and HBsuccess, representing the number of consecutive successful heartbeats. We wish to specify a policy that uses the heartbeat results to detect a machine crash and perform a restart. The policy states that if the number of failed heartbeats hits a certain threshold X, the machine must be restarted. However, isolated failed heartbeats may happen, for example because of a very short network unavailability, that are not symptoms of a crashed machine. To avoid the accumulation of such randomly failed heartbeats, we reset the HBfail counter whenever there are at least Y consecutive successful heartbeats. The described policy can be specified as follows:
{Policy: [[Heartbeat, HBfail>X, restart],
          [Heartbeat, HBsuccess>Y, resetHBfail],
          [Heartbeat, HBsuccess>Y, resetHBsuccess]]}

where restart, resetHBfail and resetHBsuccess are identifiers of procedures that respectively restart the virtual machine (the first) and reset the heartbeat counters (the other two). The procedure definitions are collected in a database local to the RQM. When multiple policies are applicable to the current situation (i.e., all their conditions evaluate to true), all the policies are applied without any ordering. Again, the responsibility for verifying that this does not lead to the application of conflicting policies lies with the C@H administrator.

Performance guarantees. When the SLA involves application-level performance parameters, the RQM can provide guarantees through predictions of the performance of a resource configuration, provided by the CHASE simulation-based performance prediction service. To enable the predictions, the RQM must provide CHASE with a description of the application (included in the user request), the user QoS requirements (part of the SLA) and benchmark data of the provider machines (obtained through the monitoring subsystem).
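Returning to the alert-reaction policies, the triple-based mechanism can be made concrete with a minimal, hypothetical sketch of a policy evaluator; the types and the in-memory counters are illustrative assumptions, not the RQM implementation, which keeps policies and procedure definitions in its local databases.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class PolicyEngineSketch {

    // One [parameter, condition, procedure] triple.
    public record Policy(String parameter,
                         Predicate<Map<String, Integer>> condition,
                         Runnable procedure) {}

    // All applicable policies fire, with no ordering guarantees; checking
    // that they do not conflict is up to the C@H administrator.
    public static void evaluate(List<Policy> policies,
                                Map<String, Integer> counters) {
        for (Policy p : policies) {
            if (p.condition().test(counters)) {
                p.procedure().run();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counters =
                new HashMap<>(Map.of("HBfail", 4, "HBsuccess", 0));
        int x = 3; // threshold X of the heartbeat policy above
        List<Policy> policies = List.of(
                new Policy("Heartbeat", c -> c.get("HBfail") > x,
                        () -> System.out.println("restart VM")),
                new Policy("Heartbeat", c -> c.get("HBsuccess") > 5,
                        () -> counters.put("HBfail", 0)));
        evaluate(policies, counters); // prints "restart VM"
    }
}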
6 Service Level Agreement management

One of the most critical Cloud-related issues is the management of the Service Level Agreement. This is particularly true for C@H, since it tries to mix the Cloud and the volunteer paradigms. The volunteer contribution in C@H dramatically complicates the SLA management task: the "volatility" of resources (resource providers can asynchronously join or leave the system without any message) has to be taken into account when enforcing QoS requirements. In C@H, the SLA Management module is in charge of the negotiation and the monitoring of SLAs, and collaborates with the RQM component for the enforcement of the QoS. The SLA Management module is composed of three components: the SLA Manager, the Cloud@Home Autonomic Service Engine (CHASE) [?] and the Mobile Agent Based Grid Architecture (MAGDA) [?]. The features of these components are briefly discussed in the following.
6.1 SLA Manager

The SLA Manager is in charge of managing the SLA templates to be offered to C@H Users. A C@H User can refer to these templates to start the negotiation procedure with the SLA Manager, which will eventually produce an SLA. In this case, the resources (which are the negotiation object of the SLA) are virtualized and, in general, can be provided by several different Cloud providers enforcing different management policies: SLAs are then crucial to guarantee the quality of virtualized resources and services in such a heterogeneous environment.

We recall that the resource context we are addressing is heterogeneous from different points of view. As discussed in Section 4, C@H aggregates resource providers that are by their nature heterogeneous: on the one hand the commercial providers, seeking to maximize their profit, and on the other the volunteer providers, which just share their underutilized resources. Secondly, the provided resources themselves are heterogeneous in terms of the computational power they are able to supply. The SLA management must take into account this
heterogeneity, and enforce the appropriate strategy according to the nature of the providers and the resources involved. Given that resource availability is the only QoS parameter that C@H is currently addressing, when commercial providers are involved in the provision of resources the C@H SLA strategy will guarantee no more than what the providers' proprietary SLAs claim to guarantee, and will apply the very same penalties for unattended QoS. When, instead, providers voluntarily share resources, the SLA produced for a specific provision aims at just forecasting the minimum service level (again, in terms of resource availability) that C@H will likely be able to sustain. The client of volunteered resources is aware that the resources are volatile, and that C@H strives to guarantee the provision; should the provisioned service perform worse than the agreed minimum level, no penalty would be applied. For volunteer scenarios, we are planning to develop an incentive mechanism (supported by the SLA framework itself) that rewards the providers that best guarantee the "promised" QoS. The basic principle is that the better the SLA is honoured, the more credits the provider gains. Credits can then be used by providers to "acquire" new resources within the federation.

The SLA Manager adopts the WS-Agreement protocol [?] for the user-C@H interactions. WS-Agreement compliant templates are used by the C@H Users to specify the required quality level. An example of a template filled in with the C@H User's required functional and non-functional parameters is reported in the following:

Listing 2: Resource and Availability request in WS-Agreement (excerpt; structure reconstructed, element names illustrative)

[..]
<wsag:ServiceDescriptionTerm wsag:Name="compute1">
  <occi:compute>
    <occi:architecture>x86</occi:architecture>
    <occi:cores>4</occi:cores>
    [...]
  </occi:compute>
</wsag:ServiceDescriptionTerm>
<wsag:ServiceDescriptionTerm wsag:Name="compute2">
  <occi:compute>
    <occi:architecture>x86</occi:architecture>
    [...]
  </occi:compute>
</wsag:ServiceDescriptionTerm>
<wsag:GuaranteeTerm>
  [...]
  <Duration>8</Duration>
  <wsag:ServiceLevelObjective>
    <NodeAvailability>97.0</NodeAvailability>
  </wsag:ServiceLevelObjective>
</wsag:GuaranteeTerm>
[..]
In the ServiceDescriptionTerm sections the features of the needed resources are expressed (in this case the C@H User is asking for a cluster of two nodes). It is important to remark that the resource feature request complies with the OCCI specification format. At acquisition time, this information will be extracted from the SLA and used to make explicit requests to OCCI-compliant providers. As for the Duration parameter in the GuaranteeTerm section, it will be used at discovery time to filter out providers that do not share resources at the time, and for the duration, requested by the C@H User. Finally, through the NodeAvailability parameter the C@H User specifies the required QoS (non-functional parameters), which in this specific case is targeted at 97.0%.
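As an illustration of how such an agreement can be consumed at acquisition time, the following hedged sketch extracts the availability target and the duration from an accepted agreement document using the standard Java XML APIs; the element names mirror the reconstructed Listing 2 and, like the file and class names, are assumptions.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class SlaReaderSketch {
    public static void main(String[] args) throws Exception {
        Document sla = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("agreement.xml"));
        var xpath = XPathFactory.newInstance().newXPath();

        // Target availability (e.g. "97.0"), the SLO that the RQM will
        // enforce through monitoring and recovery.
        double availability = Double.parseDouble(
                xpath.evaluate("//*[local-name()='NodeAvailability']", sla));

        // Provision duration, used at discovery time to filter out
        // providers whose WindowShare cannot cover the request.
        int durationHours = Integer.parseInt(
                xpath.evaluate("//*[local-name()='Duration']", sla));

        System.out.printf("SLO: %.1f%% availability for %d hours%n",
                availability, durationHours);
    }
}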
6.2 CHASE

CHASE (Cloud@Home Autonomic Service Engine) is a framework that adds self-optimization capabilities to Grid and Cloud systems. It evolves from an existing framework for the autonomic performance management of service-oriented architectures [?]. The engine allows the identification of the best set of resources to be acquired and of the best way to use them from the application point of view. CHASE is a modular framework that follows the autonomic computing paradigm, providing components to fully manage a Grid/Cloud computing element. For the operation of CHASE as a stand-alone autonomic manager, the interested reader is invited to consult the work in [?]. The focus here is on the design of the services provided by the framework to support the operation of C@H, namely the forecast service and the performance prediction service.

The forecast service provides a forecast of future values from historical data. It takes as input a time series of values and produces forecasts based on autoregressive methods. The forecast service is used when the RQM needs to evaluate the provider availability, for which estimates of MTBF and MTTR are required. For the MTBF, historical heartbeat failure data are used to produce the forecast. For the MTTR, historical boot times of the specific virtual machine image on a specific provider are used. The collection of historical data is made through the MAGDA platform, described in the next subsection.

The performance prediction service is a simulation-based estimator of application performance parameters, like execution time and resource usage. In particular, the CHASE simulator, fed with a) an application description, b) information regarding the current state of the system in terms of resource availability and load, and c) the user's requested QoS, builds a parameterized objective function to be optimized. The optimization engine drives the simulator to explore the space of possible configurations in order to find a configuration that meets the demands. The performance prediction service is used during the negotiation to evaluate the sustainability of performance guarantees. It can also be invoked when the monitored QoS agreed in the SLA is at risk of violation. New simulations are run with up-to-date settings in order to search for alternative scheduling decisions (like migrating or adding more VMs) that can solve the QoS problem.
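Putting the forecast service together with Equations 1 and 2 of Section 5.2, the availability estimate computed by the RQM reduces to a few lines. The sketch below uses illustrative names and example figures; in the real system the MTBF and boot-time values would come from the CHASE forecast service fed with MAGDA logs.

public class AvailabilitySketch {

    public static double providerAvailability(double mtbf,   // forecast MTBF
                                              double tFd,    // failure-detection time
                                              double tBoot) {// forecast VM boot time
        double mttr = tFd + tBoot;   // Equation 2
        return mtbf / (mtbf + mttr); // Equation 1
    }

    public static void main(String[] args) {
        // e.g. MTBF = 72 h, detection bounded by the MAGDA timeout (0.05 h),
        // forecast boot time 0.1 h -> availability of about 0.998
        System.out.println(providerAvailability(72.0, 0.05, 0.1));
    }
}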
6.3 Mobile Agent based Application Monitoring

The MAGDA (Mobile Agent Based Grid Architecture) component constantly carries out the monitoring of the QoS level provided by the leased resources. MAGDA [?] is a mobile agent platform implemented as an extension of JADE, an agent platform compliant with the FIPA standard [?] and developed by TILAB [?]. The MAGDA toolset allows the creation of an agent-enabled Cloud in which mobile agents are deployed on different virtual machines connected by a real or a virtual network. The details of how the MAGDA platform can interact with Cloud environments are discussed in past work [?]. The emphasis here will be on the description of the MAGDA-based monitoring service. This has been designed as a multi-agent system that distributes tasks among specialized agents. It contemplates both static and mobile agents: the former are responsible for performing complex reasoning on the knowledge base, so they are statically executed where the data reside; the latter usually need to move to the target resources in order to perform local measurements or to get system information.

The Archiver is a static agent that configures the monitoring infrastructure, collects and stores measurements, and computes statistics. According to the parameters to be monitored, the kind of measurement and the provider technology, the Archiver starts different Meters, which are implemented as mobile agents that the Archiver can dispatch wherever it needs. Observers periodically check a set of rules to detect critical situations. They query the Archiver to know about the statistics and notify applications if some checks have failed. Applications can use an Agent-bus service to subscribe for alerts about each detected event. They can also invoke MAGDA services to start, stop or reconfigure the monitoring infrastructure. Finally, applications can access the complete knowledge base to retrieve information about the Cloud configuration, the monitoring configuration, statistics and the history of past failed checks.

In the current prototype, the Meters are used to collect three kinds of metrics:

– Image boot times: a Meter is configured to start up as soon as the MAGDA platform is loaded. This provides an estimate of the time required to boot the virtual machine.
– Heartbeats: heartbeats are sent by the Meters to verify the liveness of the resource on which they reside.
– Benchmark figures: Meters are able to execute different kinds of benchmarks, which vary from simple local data sampling (actual CPU utilization, available memory, etc.) to distributed benchmarking (evaluating distributed data collection, or evaluating the global state with snapshot algorithms).

The MAGDA component poses a number of issues in terms of deployment. The agent platform, the Archiver and the bus service can be deployed as a dedicated virtual machine which coordinates all the available agents. MAGDA Meters, instead, must be installed inside the virtual machines of the user. A dedicated Java-based agent execution environment is installed and configured as a start-up service in the user VM. From the portal hosted in the C@H Frontend, it is possible to manage (start, stop, migrate, etc.) the mobile agents in order to control monitoring or other functionalities on top of all the computational resources hosting a MAGDA container.
In the current implementation, if the user does not accept the installation of the agent platform and the roaming of mobile agents in their virtual machine, the system cannot provide the required monitoring and the associated services, like health check and performance prediction.
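For concreteness, a heartbeat Meter can be sketched in plain JADE, the platform that MAGDA extends; the agent name, the reporting period and the message content below are illustrative assumptions, not the actual MAGDA code.

import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.TickerBehaviour;
import jade.lang.acl.ACLMessage;

public class HeartbeatMeter extends Agent {

    private static final long PERIOD_MS = 30_000; // stand-in for the MAGDA timeout

    @Override
    protected void setup() {
        addBehaviour(new TickerBehaviour(this, PERIOD_MS) {
            @Override
            protected void onTick() {
                // Each tick proves that the hosting VM is alive; on the
                // Archiver side, missing ticks feed the HBfail counter.
                ACLMessage beat = new ACLMessage(ACLMessage.INFORM);
                beat.addReceiver(new AID("archiver", AID.ISLOCALNAME));
                beat.setContent("heartbeat " + System.currentTimeMillis());
                send(beat);
            }
        });
    }
}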
7 Frontend

The main role of the C@H Frontend within the C@H architecture is to provide an access point to the system functionalities. The Frontend provides the reference links to all the C@H components, thus operating as a glue entity for the C@H architecture. It serves incoming requests by triggering appropriate processes on the specific C@H components in charge of serving them. In order to provide a user-friendly and comprehensive interface to the system, the Frontend was implemented as an extensible and customizable Web portal. Furthermore, in accordance with the "everything-as-a-service" philosophy, the developed Frontend component, like the other C@H components, can be deployed as a virtual machine image, enabling the setup of an infrastructure access point with a modest amount of work (as shown in Section 8). More specifically, the Frontend component has to manage both the C@H User and the C@H Admin incoming requests. For this reason it has been split into two parts, as discussed in the following.
7.1 User Frontend

The main goal of the C@H User Frontend is to provide end users with access to the tools and services implemented by the system. Such tools rely on and involve lower-level functionalities and operations that are hidden from the users, masking the C@H internal organization and resource provisioning model in a Cloud fashion. Moreover, ubiquity and fault tolerance have to be guaranteed, following a service-oriented provisioning model. For this reason, the C@H User Frontend is implemented as a Web service, providing suitable SOAP and/or REST interfaces.

The User Frontend exposes the main services allowing end users to access C@H resources. In case the C@H provider implements guarantees on the resource provisioning, adequate services for negotiating the SLA and for monitoring the agreed QoS level of the resources are required. In such cases the Frontend pre-processes, splits and forwards the incoming requests to the corresponding SLA and QoS management components according to the specified requirements. More specifically, the C@H User Frontend exposes the following functionalities (a possible REST mapping is sketched after the list):

– resource negotiation - through which the C@H User can trigger the resource negotiation process with the SLA Management module (in particular, downloading SLA templates, issuing SLA proposals, proposing SLA counter-offers);
– resource management - allowing the C@H User to manage (activate/deactivate) the leased resources obtained from the C@H IaaS through the Resource Management module;
– SLA monitoring - providing the C@H User with tools for monitoring the status of the SLA regulating the resource provision, through a specific interface to the SLA Manager;
– settings and preference management - allowing the C@H User to customize the user interface with specific user settings and configurations.
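The sketch below illustrates how these four functionalities could map onto a REST interface using standard JAX-RS annotations; all paths, payloads and class names are hypothetical, as the actual portal API is not documented here.

import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/user")
public class UserFrontendSketch {

    @GET
    @Path("/sla/templates")
    @Produces(MediaType.APPLICATION_XML)
    public String downloadTemplates() {
        return "<templates/>"; // delegated to the SLA Manager (negotiation)
    }

    @POST
    @Path("/sla/proposals")
    @Consumes(MediaType.APPLICATION_XML)
    public String submitProposal(String wsAgreementOffer) {
        // forwarded to the SLA Manager, which may answer with a counter-offer
        return "accepted";
    }

    @POST
    @Path("/resources/{id}/activate")
    public void activate(@PathParam("id") String resourceId) {
        // forwarded to the Resource Management module
    }

    @GET
    @Path("/sla/{id}/status")
    @Produces(MediaType.APPLICATION_JSON)
    public String slaStatus(@PathParam("id") String slaId) {
        // SLA monitoring, backed by the SLA Manager and MAGDA statistics
        return "{\"status\":\"fulfilled\"}";
    }
}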
7.2 Admin Frontend

From the provider perspective, one of the most interesting characteristics of the C@H system is that it provides a service for building up a C@H provider, and the corresponding Infrastructure-as-a-Service, by selecting the basic components and services it has to provide. In this way
a C@H Admin can customize its service (i.e., the IaaS Cloud provisioning), can decide to either include SLA/QoS management or provide the infrastructure without any guarantee (best effort), and finally can specify where the components have to be deployed. The Admin Frontend implements the services to set up, customize and implement such choices, driving the Admin through all the steps required for a correct C@H service establishment. More specifically, such services are:

– system management - allows the administrator to set up, configure and deploy a C@H infrastructure and to customize all the services it provides, and therefore also the Frontend interface;
– negotiation management - through which negotiation policies can be defined, fine-tuned and deployed;
– QoS management - allows the administrator to specify the QoS policies that must be put in force to sustain the SLAs.

The basic components that need to be included in any C@H provider configuration are those of the Resource Abstraction module. A minimal C@H system configuration, building an IaaS with no QoS guarantees, must provide such tools and services, which just implement the basic system management mechanisms. In case the Admin needs to set up a C@H infrastructure providing QoS on top of resources, it is necessary to enhance the minimal configuration with the Resource and SLA Management modules, thus deploying the whole configuration shown in Fig. 2. Different configurations can be identified to best fit and meet the Admin requirements, not necessarily involving all the C@H components identified above. It is important to remark that the component selection is not directly performed by the C@H Admin, who just customizes the C@H provisioning model they want to implement. The component selection is therefore automatically performed by the Frontend tools according to the requirements specified by the Admin. The deployment, instead, is mainly in charge of the Admin.
8 C@H in action

In the previous sections the C@H system has been described from a static perspective. The emphasis has been placed on the C@H architectural modules and components, and on the functionality that each analyzed entity is able to offer. The current section, instead, intends to give the reader a dynamic view of the C@H system, focusing on the activities that are triggered within the system and on the interactions taking place among the components of the architecture.
8.1 Process View

Some activities must be carried out in order to create a new C@H system from scratch and to get it ready to work, as shown in Fig. 4. In the first stage (image retrieval) the overall C@H system image, made up of specific software components, must be retrieved. Several versions of the C@H system are available, each providing a different flavor of services, ranging from the very basic to the full-fledged. As discussed in Section 7, the basic version is equipped with just the Resource Abstraction module. The C@H User is aware that they will not get any guarantee on the performance of the resources provided, and cannot claim support in case of QoS degradation; the service will thus be provided on a best-effort basis.
Fig. 4: C@H infrastructure set-up process
The full-fledged version of the C@H system, instead, features the SLA/QoS management service. In the following we discuss the activities to be carried out in the case that a full-fledged C@H system must be set up. Once the image is retrieved, the system profiling activity is performed, during which the C@H Admin specifies the policies and the strategies that will have to be adopted for SLA negotiation and enforcement, respectively. After that, the components can be deployed (system deployment), configured in order to properly work with each other (system configuration), and run (system boot-up). The C@H system is then up and ready to accept C@H User requests.
The overall request management process (shown in Fig. 5) involves many activities, some of which are optional, in the sense that they may or may not be triggered, depending on the C@H User's specific request. The process starts with the SLA negotiation of the resource functional parameters and, optionally, of the QoS level (non-functional parameters) that the C@H system must support at provision time. In the latter case, upon a successful termination of the negotiation activity, the C@H User receives a formal guarantee (in the form of an SLA) that the requested resources will be provisioned and, if required, the QoS levels will be sustained. Should the negotiation fail, the request is simply discarded and the C@H User has to issue a new request. Otherwise, upon a successful negotiation, the requested resources are activated and assigned to the C@H User (resource delivery). From this point onwards, the C@H User can use the resources. If the SLA also requested QoS, a monitoring infrastructure is set up to keep the performance of the delivered resources under constant control (resource monitoring): in this activity, the non-functional parameters contributing to the overall QoS (namely, the availability) are monitored to ensure that none of the SLA terms is being violated. Whenever any QoS parameter is about to be violated, a recovery action is triggered (QoS recovery): in this step countermeasures are taken in order to bring the QoS back to "safe" levels. Upon a successful recovery, the originally requested QoS is restored and the monitoring process, which had been temporarily put on stand-by, is resumed in order to detect new possible faults. Should the recovery fail, the resource provision is stopped and the SLA terminated. Finally, unless the C@H User or the C@H system decides (for any reason) to prematurely force the termination, the resource provision ends at the termination time specified in the SLA, and the resources are released (termination).
Fig. 5: Request management process
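The request management process of Fig. 5 can be condensed into the following state-machine sketch; the state and event names are ours, derived from the process description above, and are not part of the C@H interfaces.

from enum import Enum, auto

class State(Enum):
    NEGOTIATION = auto()
    DELIVERY = auto()
    MONITORING = auto()
    RECOVERY = auto()
    TERMINATED = auto()
    DISCARDED = auto()

# Transitions of the request management process (Fig. 5): a failed
# negotiation discards the request; a QoS violation during monitoring
# triggers recovery; a failed recovery terminates the SLA.
TRANSITIONS = {
    (State.NEGOTIATION, "agreed"):       State.DELIVERY,
    (State.NEGOTIATION, "failed"):       State.DISCARDED,
    (State.DELIVERY,    "qos_required"): State.MONITORING,
    (State.DELIVERY,    "sla_expired"):  State.TERMINATED,
    (State.MONITORING,  "violation"):    State.RECOVERY,
    (State.MONITORING,  "sla_expired"):  State.TERMINATED,
    (State.RECOVERY,    "restored"):     State.MONITORING,
    (State.RECOVERY,    "failed"):       State.TERMINATED,
}

def step(state: State, event: str) -> State:
    return TRANSITIONS[(state, event)]

# Example: a request whose QoS is violated once and then restored.
s = State.NEGOTIATION
for e in ("agreed", "qos_required", "violation", "restored", "sla_expired"):
    s = step(s, e)
print(s)  # State.TERMINATED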
8.2 Interaction view

This subsection describes in detail, through a practical approach, how to set up a C@H provider and how the latter negotiates and enforces QoS on top of resources voluntarily shared by Cloud providers. In subsection 8.2.1 we first show how a C@H Admin can build and set up the C@H components that implement the C@H infrastructure, according to the IaaS service they want to provide (with or without fault tolerance, QoS/SLA management, etc.). The QoS negotiation and enforcement dynamics are described in subsections 8.2.2 and 8.2.3, respectively.
As regards the described scenarios, the target resources taken into account are academic clusters hosting a Cloud middleware (e.g., Eucalyptus [?], Nimbus [?], PerfCloud [?], Clever [?], OpenStack [?]) able to provide Virtual Clusters (VCs). We also assume that such clusters have a reliable frontend during the provider availability time windows, whereas computing nodes are unreliable since, for example, they might crash due to power outages or they might be periodically unavailable for a while to perform reconfiguration or maintenance.
The scenario we consider implements a C@H infrastructure able to satisfy C@H User requests for VCs that include specific QoS requirements. In particular, the QoS is expressed in terms of availability, i.e., the fraction of time (percentage) the node must be up and reachable with respect to a 24-hour base time, as specified in Section 5.2. Although C@H is able to pick resources from different providers, in this specific scenario we assume all the resources are acquired from the same Cloud provider. The VC images host the MAGDA agent-based monitoring service.

8.2.1 System Setup

The C@H Admin is the actor in charge of the deployment and the setup of the C@H infrastructure. As already pointed out in Section 3, C@H components are provided with service-oriented interfaces. Such a choice enables flexible deployment schemes. Inspired by the "as-a-service" paradigm, individual components (or groups of components) can be packaged together and run within virtual machines offered by any (even commercial) Cloud provider.
The C@H Admin interacts with the C@H frontend in order to specify the services they want to enable in the C@H provider and to configure them with QoS/SLA policies and strategies. This operation is performed by just selecting the required functionalities and specifying the policies, the strategies and the provider on which such services will be deployed.
Fig. 6: Cloud@Home system setup (sequence diagram among the C@H Admin, the C@H_Frontend and the Cloud Provider, whose repository stores the pre-packaged C@H component images; at the end of the workflow the system is ready to accept requests from C@H Users)
The Admin Frontend therefore automatically selects the components of the C@H IaaS the Admin wants to provide. In order to implement the considered scenario, the Admin must set up a C@H infrastructure able to provide SLA/QoS services. Thus, it is necessary to build up a C@H IaaS with all the modules and components shown in Fig. 2, and specifically with the SLA management module.
The C@H project repository stores ready-to-run VMs (in the formats suitable for the different Cloud providers) pre-packaged with the C@H components for QoS management. Before deploying the system, such images must be retrieved from the repository. In the specific scenario taken into account, according to the C@H Admin requirements, the Frontend needs to get the following images:
– SLA CH: packaged with the SLA Manager, CHASE and MAGDA components;
– Manage CH: packaged with the Frontend, Registry and RQM components;
– VCFE and VCnode, i.e., the images of the virtual cluster frontend (VCFE) and virtual node (VCnode), respectively.
The sequence diagram in Fig. 6 shows the setup workflow. Only one instance of each of the first two images (SLA CH and Manage CH) has to be deployed. The C@H Admin initially interacts with the Admin Frontend, sending their requirements, policies, parameters and configurations, as well as the Cloud provider on which to deploy the C@H IaaS service (step 1). The Frontend then invokes the activateImages primitive on the target Cloud provider (step 2), which retrieves and launches the SLA CH image (steps 3-4). Afterwards, the Cloud provider retrieves and runs the Manage CH image (step 5), obtaining the corresponding URL (step 6), which is forwarded to the Frontend along with that of the SLA CH VM (step 7). In steps 8 and 9 the C@H software components running on the images just bootstrapped are configured. Once the Frontend completes such configuration, it forwards the image URLs to the Admin, who can directly interact with them (step 10). The C@H system is now up and ready to accept requests from C@H Users.
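The setup workflow can be condensed into the following sketch, where the Cloud provider client and the configuration calls are hypothetical stand-ins for the actual service interfaces.

class StubCloudProvider:
    """Stand-in for a real Cloud provider client (illustrative only)."""
    def activate_image(self, image_name: str) -> str:
        # Steps 3-6 of Fig. 6: retrieve and boot the image, return its URL.
        return f"https://cloud.example.org/vm/{image_name.lower()}"

class StubAdminFrontend:
    """Stand-in for the Admin Frontend configuration service."""
    def configure(self, vm_url: str, settings: dict) -> None:
        print(f"configuring components at {vm_url} with {settings}")

def build_cah_provider(frontend, provider, sla_policies, mgmt_settings):
    """Condensed rendering of the setup workflow of Fig. 6 (steps 2-10)."""
    sla_url = provider.activate_image("SLA_CH")        # SLA Manager, CHASE, MAGDA
    manage_url = provider.activate_image("Manage_CH")  # Frontend, Registry, RQM
    frontend.configure(sla_url, sla_policies)          # step 8
    frontend.configure(manage_url, mgmt_settings)      # step 9
    return sla_url, manage_url                         # step 10

urls = build_cah_provider(StubAdminFrontend(), StubCloudProvider(),
                          {"negotiation": "one-shot"}, {"registry": "default"})
print(urls)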
Fig. 7: Resource negotiation and acquisition in C@H (sequence diagram among the C@H User, C@H_Frontend, C@H_SlaManager, C@H_RQM, C@H_Registry, CHASE, MAGDA, the Cloud Provider and the Cluster)
8.2.2 Negotiation

In Fig. 7, the (one-shot) negotiation and cluster activation process is shown, assuming that the VC images are already available on the target Cloud provider. The C@H User initially requests the C@H SLA templates (steps 1–4). WS-Agreement compliant templates are used to guide the C@H User in the specification of the required quality level, as reported in Listing 2. The one-shot negotiation is performed in steps 5–27. The RQM, which coordinates the negotiation process in collaboration with the Registry and SLA Manager services, has to assess the sustainability of the SLA proposal. First, the Registry is queried to retrieve the candidate providers that can supply resources for the specific request (steps 8–9). For each candidate provider, availability and performance are estimated, as discussed in Section 5.2. In the selection process, candidates are iteratively checked one by one: the first candidate provider matching the required QoS is chosen to serve the request. With regard to node availability, the RQM retrieves monitoring data from the MAGDA platform (steps 10–11), which are used to obtain a prediction of the MTBF and the MTTR from CHASE (step 12). The same interaction is used for performance guarantees: benchmark values are retrieved from MAGDA and sent to CHASE
to obtain performance predictions. If CHASE finds a suitable configuration of resources, it sends back an acknowledgement to the RQM (step 13). In case a candidate matches the QoS requirements (SustainabilityOK), the RQM proceeds with the acquisition of the resources (steps 14–17). Should the selected provider have run out of resources, the next provider in the candidate list is checked against the QoS requirements and, if suitable, is asked to supply the resources. The process iterates until a candidate provider passes the sustainability test and is able to provision the requested resources. In that case (ResourcesAcquired), the resources are acquired and activated (step 18) and the URL is provided back to the C@H User through the Frontend (step 19). An agent-based monitoring infrastructure, instructed with the QoS target to be accomplished, is then set up (steps 20–23), and the outcome of the whole process is notified to the SLA Manager (step 24). Finally, the SLA is created (step 25) and its endpoint reference (EPR) is notified to the C@H User (steps 26–27). If no suitable provider is found, the C@H User will have to fill out a new template to renegotiate the service level, and the overall process will restart from step 5.
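The candidate-selection loop coordinated by the RQM (steps 8–17) can be sketched as follows; registry, magda, chase and providers_api stand for clients of the corresponding services, with assumed method names.

def select_and_acquire(request, registry, magda, chase, providers_api):
    """Sketch of the RQM candidate-selection loop of Fig. 7: iterate over
    the candidate providers returned by the Registry, test each against
    the requested QoS using CHASE predictions built on MAGDA monitoring
    data, and acquire resources from the first sustainable candidate."""
    for provider in registry.get_available_providers(request):        # steps 8-9
        history = magda.get_historical_data(provider)                 # steps 10-11
        if not chase.evaluate_sustainability(request, history):       # steps 12-13
            continue                      # candidate cannot sustain the QoS
        url = providers_api.acquire_resources(provider, request)      # steps 14-17
        if url is not None:               # provider may have run out of resources
            return provider, url
    return None, None                     # no suitable provider: renegotiate (step 5)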
8.2.3 Fault recovery and SLA enforcement

Assuming the virtual cluster is up and the MAGDA agents are running on the virtual nodes, reporting the observed status to the C@H MAGDA component, the use case considered here, described in Fig. 8, concerns a scenario in which a virtual node of the VC becomes unreachable. When a VM fails, the fault is detected by the C@H MAGDA component, since the MAGDA agent on the faulty VM stops sending status reports. The failure could be due to temporary node unavailability, for example because of network problems, or to node reconfigurations or reboots. Otherwise, it could be a permanent fault due to hardware or software outages. Therefore, a recovery procedure is started by the MAGDA agent on the cluster frontend in order to ascertain what is actually going on. If the MAGDA agent does not reply within a given time (the timeout shown in step 1, tailored to the availability requested in the SLA), the C@H MAGDA sends a critical recovery message to the RQM (step 2). According to the fault alert and to the QoS requirements established by the SLA policy to be enforced, the RQM decides if and when to begin the fault recovery process (step 3). The internal recovery procedure activates a new VM on the Cloud provider and integrates it into the existing cluster (steps 4–9). Finally, a new MAGDA agent is sent and launched into the new node (steps 10–14), and the successful QoS adjustment is notified to the C@H Frontend (step 15).
The above described process is able to guarantee the required availability if the SLA is correctly negotiated and the system is correctly monitored. In such a process we evaluate the steady-state availability, defined over a long, ideally infinite, interval. Thus we do not need fast recovery procedures; rather, we have to negotiate values that can likely be granted in the future. The monitoring system stores the history of the infrastructure and, from the statistics thus collected, it evaluates the detection and reaction times. In this way the system is able (thanks to MAGDA monitoring and CHASE predictions) to predict how many failures will occur. Furthermore, the system is tuned to deal with the worst cases in order to respect the SLA, thus providing a tolerance margin above the required availability threshold. However, the SLA enforcement is triggered whenever a failure event is detected, and not only when the SLA level (e.g., the system availability value) falls below the threshold. It is up to the C@H administrator to correctly identify the SLA to grant and to define the penalties, depending on the chosen business model.
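A minimal sketch of the detection and recovery logic just described is given below; the fault criterion of three consecutive missed heartbeats is the one adopted in the experiments of Section 9, while the component clients and method names are assumptions of ours.

import time

MISSED_HEARTBEATS_FOR_FAULT = 3   # fault criterion used in Section 9

def watch_node(node, timer_s, magda, rqm):
    """Sketch of the MAGDA-side fault detection of Fig. 8: a node is
    declared faulty after 3 consecutive missed heartbeats, after which
    the RQM is alerted and may start the recovery procedure."""
    missed = 0
    while True:
        if magda.heartbeat_ok(node):        # agent replied in time
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_HEARTBEATS_FOR_FAULT:
                rqm.notify_fault(node)      # step 2: critical recovery message
                return
        time.sleep(timer_s)                 # Timer: provider-defined interval

def recover(rqm, cloud_provider, cluster, magda):
    """Sketch of the recovery procedure of Fig. 8 (steps 3-15)."""
    vm = cloud_provider.create_resource()   # steps 4-6: activate a new VM
    cluster.add_node(vm)                    # steps 7-9: integrate it into the cluster
    magda.upload_agent(vm)                  # steps 10-14: launch a new MAGDA agent
    rqm.notify_frontend("QoS restored")     # step 15: notify the C@H Frontend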
Fig. 8: Enforcing the QoS in C@H (sequence diagram: when the MAGDA Agent on a node becomes unreachable, a connection recovery is attempted and, after a time-out, a node fault is signalled to the RQM, which creates a new VM, adds it to the cluster and uploads a new MAGDA Agent before notifying the Frontend)
9 Experimental results

Preliminary experiments have been conducted on the C@H implementation in order to show the feasibility of the proposed approach and to demonstrate its effectiveness. A distributed C@H testbed composed of clusters located at the different universities involved in the project (University of Sannio - Unisannio, Second University of Naples - Unina2, University of Messina - Unime, and University of Catania - Unict) has been set up. At every site, a PerfCloud installation was performed, exposing a Cloud interface that allows the instantiation of virtual machines on the resources of the local cluster, thus actually implementing a Cloud provider. The MAGDA monitoring system has been launched on each cluster in order to regularly monitor the physical nodes. As described in Section 6.3, it sends heartbeat messages to each node of the clusters in order to check their resource availability.
Assume now that a user of the C@H system thus established wants to instantiate a virtual cluster. They therefore send a new resource allocation request, which is notified to the C@H provider. Table 1 summarizes the user request, expressed in terms of a Service Level Agreement by instantiating the template of Listing 2 (see Section 6). The virtual machine image requested by the user is an HPC compute appliance whose features are reported in Table 2.
Table 1: Sample Resource Request

Number of virtual machines   5
Cores per VM                 4
Architecture                 x86
VM Availability              99.5%
Duration                     10h
Request timestamp            19.00.00 CET

Table 2: Sample VM configuration

Name        HPCNode
OS          Linux CentOS 5.1
Services    OpenMPI, OpenMP, Hadoop
Disk size   40 GB
When the C@H RQM is handed the request from the SLA Manager, it queries the Registry for a list of providers available in the time span of the request. The Registry, whose status is reported in Table 3, returns a list that comprises all the providers except the Unime cluster, since its sharing window is not compatible with the request duration. The RQM then loops over every listed provider to select one with the required availability. As discussed in Section 5.2, this process starts by querying MAGDA for three sets of provider-specific values:
– Timer is the interval of time between two MAGDA heartbeats (suitably defined by the provider and stored in the agent platform);
– MTBF_data is a time series of heartbeat results for the provider;
– Tboot_data is a time series of boot times, on the provider infrastructure, for the virtual machine of the request.
These low-level data can be used to calculate all the required components of the provider availability, as described by Equation (1) of Section 5.2. The first component is T_fd, the time required for fault detection, which is related to the Timer value defined by the provider. In the experiment, the Cloud@Home system considers a virtual machine faulty when 3 consecutive heartbeats have failed. Therefore, the value 3 * Timer can be used as a worst-case estimate of T_fd. The RQM sends the MTBF_data time series to the CHASE engine to obtain MTBF_fc, a forecast of the MTBF for the provider. In the same way, CHASE provides a boot time forecast Tboot_fc from Tboot_data. With these values, the RQM is able to calculate

MTTR_fc = T_fd + Tboot_fc

and finally the ProviderAvailability. Table 4 reports all such parameters as collected in the experiments. In practice, as discussed in Section 8.2.2, the RQM does not need to calculate all the provider availabilities, but can halt as soon as it finds a suitable provider, which for this request is the Unisannio cluster.

Table 3: Cloud@Home: registered providers

Cluster     WindowShare
Unisannio   06 PM CET - 06 AM CET
Unina2      06 PM CET - 08 AM CET
Unime       05 PM CET - 10 PM CET
Unict       07 PM CET - 07 AM CET
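The availability assessment can be reproduced numerically. The sketch below computes T_fd = 3 * Timer and MTTR_fc = T_fd + Tboot_fc as described above; for Equation (1) we assume the standard steady-state form ProviderAvailability = MTBF_fc / (MTBF_fc + MTTR_fc), which reproduces the percentages of Table 4. The early exit mirrors the RQM halting at the first provider able to sustain the requested 99.5% availability.

# Forecast data from Table 4 (MTBF_fc in hours; Timer and Tboot_fc in seconds).
providers = [
    ("Unisannio", 50, 60,  188),
    ("Unina2",    20, 120, 175),
    ("Unict",     20, 300, 150),
]

REQUESTED_AVAILABILITY = 99.5   # percent, from the sample request of Table 1

def provider_availability(mtbf_h, timer_s, tboot_s):
    """Steady-state availability, assuming Equation (1) of Section 5.2
    has the form A = MTBF / (MTBF + MTTR)."""
    t_fd = 3 * timer_s              # worst-case fault detection time
    mttr = t_fd + tboot_s           # MTTR_fc = T_fd + Tboot_fc
    mtbf = mtbf_h * 3600            # hours -> seconds
    return 100 * mtbf / (mtbf + mttr)

for name, mtbf_h, timer_s, tboot_s in providers:
    a = provider_availability(mtbf_h, timer_s, tboot_s)
    print(f"{name}: {a:.2f} %")                # 99.80 % for Unisannio
    if a >= REQUESTED_AVAILABILITY:            # RQM halts at the first suitable provider
        print(f"selected provider: {name}")
        break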
10 Conclusions and Future Work

This paper presented Cloud@Home, a novel architecture capable of building a Cloud infrastructure from volunteer resources, i.e., resources that are voluntarily shared by their owners.
Table 4: Cloud@Home: availability assessment

Cluster     MTBF_fc    Timer (sec)  T_fd (sec)  Tboot_fc (sec)  MTTR_fc (sec)  ProviderAvailability
Unisannio   50 hours   60           180         188             368            99.80 %
Unina2      20 hours   120          360         175             535            99.26 %
Unict       20 hours   300          900         150             1050           98.57 %
One of the main problems to be faced in volunteer environments is to provide reliable services to consumers, as the availability of the resources cannot be guaranteed (resources unpredictably join and leave) and resources show a high degree of heterogeneity. We proposed the design and implementation of a system that aggregates volunteer resources and supplies on top of them an extra layer that takes care of QoS requirements. The current implementation of the system supports scenarios of private Cloud providers, i.e., Cloud providers that decide to federate and voluntarily share their resources. The system is provided with services to negotiate the desired level of QoS, to monitor the actual service provision level and to enforce actions to guarantee the contracted QoS level against unsatisfactory behaviour.
The whole Cloud@Home architecture has been analyzed and discussed throughout the paper, from both the structural and the dynamic points of view. Details on experimental results obtained on a distributed test-bed have also been presented. A use case has been fully implemented, where a Cloud@Home provider has been deployed on top of a single volunteer resource provider (e.g., a university data center) willing to offer its computational resources in the form of cluster services. The Cloud@Home provider has proved able to add the management of a basic service level agreement (specifically, the availability of cluster nodes) on top of the unmanaged infrastructure.
In our future work, we will investigate how to provide guarantees on a broader set of quality parameters (other than resource availability). Furthermore, we are planning to give evidence of the effectiveness and the suitability of the proposed approach also in scenarios where resources are gathered from many volunteer providers. As regards volunteer scenarios, we are working on the integration of a credit-based mechanism into the Service Level Agreement framework, with the aim of stimulating providers to guarantee the "promised" QoS as best they can.
References

1. Alba Amato, Loredana Liccardo, Massimiliano Rak, and Salvatore Venticinque. SLA negotiation and brokering for sky computing. In Frank Leymann, Ivan Ivanov, Marten van Sinderen, and Tony Shan, editors, CLOSER, pages 611–620. SciTePress, 2012.
2. A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Ludwig, J. Pruyne, J. Rofrano, S. Tuecke, and M. Xu. Web Services Agreement Specification (WS-Agreement). In Global Grid Forum, pages 1–47. The Global Grid Forum (GGF), 2004.
3. A. Andrzejak, D. Kondo, and D.P. Anderson. Exploiting non-dedicated resources for cloud computing. In Network Operations and Management Symposium (NOMS), 2010 IEEE, pages 341–348, April 2010.
4. Rajni Aron and Inderveer Chana. Formal QoS policy based grid resource provisioning framework. Journal of Grid Computing, 10:249–264, 2012.
5. R. Aversa, M. Avvenuti, A. Cuomo, B. Di Martino, G. Di Modica, S. Distefano, A. Puliafito, M. Rak, O. Tomarchio, A. Vecchio, S. Venticinque, and U. Villano. The Cloud@Home project: Towards a new enhanced computing paradigm. In Workshop on Cloud Computing: Projects and Initiatives (in conjunction with EUROPAR), volume 6586 of Lecture Notes in Computer Science, pages 555–562. Springer Verlag, 2011.
6. R. Aversa, B. Di Martino, N. Mazzocca, and S. Venticinque. MAGDA: A mobile agent based grid architecture. Journal of Grid Computing, 4(4):395–412, 2006.
7. R. Aversa, B. Di Martino, M. Rak, and S. Venticinque. Cloud Agency: A mobile agent based cloud system, pages 132–137, 2010.
8. P. Balakrishnan and T.S. Somasundaram. SLA enabled CARE resource broker. Future Generation Computer Systems, 27(3):265–279, 2011.
9. Fabio Bellifemine, Agostino Poggi, and Giovanni Rimassa. JADE: a FIPA2000 compliant agent development environment. In Agents, pages 216–217, 2001.
10. R. Buyya, S.K. Garg, and R.N. Calheiros. SLA-oriented resource provisioning for cloud computing: Challenges, architecture, and solutions, pages 1–10, 2011.
11. R. Buyya, R. Ranjan, and R. Calheiros. InterCloud: Utility-oriented federation of cloud computing environments for scaling of application services. Algorithms and Architectures for Parallel Processing, pages 13–31, 2010.
12. Valentina Casola, Antonio Cuomo, Massimiliano Rak, and Umberto Villano. The CloudGrid approach: Security analysis and performance evaluation. Future Generation Computer Systems, 2011. In press.
13. Valentina Casola, Emilio Pasquale Mancini, Nicola Mazzocca, Massimiliano Rak, and Umberto Villano. Building autonomic and secure service oriented architectures with MAWeS. In Autonomic and Trusted Computing, volume 4610 of Lecture Notes in Computer Science, pages 82–93, Berlin (DE), 2007. Springer-Verlag.
14. A. Chandra and J. Weissman. Nebulas: using distributed voluntary resources to build clouds. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, pages 2–2. USENIX Association, 2009.
15. Cloud@Home project, 2010. https://cloudathome.unime.it/.
16. V.D. Cunsolo, S. Distefano, A. Puliafito, and M. Scarpa. Volunteer computing and desktop cloud: The Cloud@Home paradigm. In Network Computing and Applications, 2009. NCA 2009. Eighth IEEE International Symposium on, pages 134–139, July 2009.
17. A. Cuomo, G. Di Modica, S. Distefano, M. Rak, and A. Vecchio. The Cloud@Home architecture: Building a cloud infrastructure from volunteered resources. In Proceedings of the 1st International Conference on Cloud Computing and Services Science (CLOSER 2011), pages 424–430, 7–9 May 2011.
18. A. Cuomo, M. Rak, S. Venticinque, and U. Villano. Enhancing an autonomic cloud architecture with mobile agents. In Euro-Par 2011: Parallel Processing Workshops, pages 94–103. Springer, 2012.
19. G. Di Modica, O. Tomarchio, and L. Vita. Dynamic SLAs management in service oriented environments. Journal of Systems and Software, 82(5):759–771, 2009.
20. Distributed Systems Architecture Research Group. OpenNebula Project. Universidad Complutense de Madrid, 2011. http://www.opennebula.org/.
21. Ana Juan Ferrer, Francisco Hernández, Johan Tordsson, Erik Elmroth, Ahmed Ali-Eldin, Csilla Zsigri, Raül Sirvent, Jordi Guitart, Rosa M. Badia, Karim Djemame, Wolfgang Ziegler, Theo Dimitrakos, Srijith K. Nair, George Kousiouris, Kleopatra Konstanteli, Theodora Varvarigou, Benoit Hudzia, Alexander Kipp, Stefan Wesner, Marcelo Corrales, Nikolaus Forgó, Tabassum Sharif, and Craig Sheridan. OPTIMIS: A holistic approach to cloud service provisioning. Future Generation Computer Systems, 28(1):66–77, 2012.
22. K. Graffi, D. Stingl, C. Gross, Hoang Nguyen, A. Kovacevic, and R. Steinmetz. Towards a P2P cloud: Reliable resource reservations in unreliable P2P systems. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 27–34, December 2010.
23. A. Grama, V. Kumar, and A. Sameh. Foundation for Intelligent Physical Agents, 2000.
24. A. Kertesz, G. Kecskemeti, and I. Brandic. An SLA-based resource virtualization approach for on-demand service provision. In Proceedings of the 3rd International Workshop on Virtualization Technologies in Distributed Computing, pages 27–34. ACM, 2009.
25. G. Kirby, A. Dearle, A. Macdonald, and A. Fernandes. An Approach to Ad hoc Cloud Computing. Technical report, University of St Andrews, February 2010.
26. Charles Loboz. Cloud resource usage: heavy tailed distributions invalidating traditional capacity planning models. Journal of Grid Computing, 10:85–108, 2012.
27. Michael A. Murphy and Sebastien Goasguen. Virtual organization clusters: Self-provisioned clouds on the grid. Future Generation Computer Systems, 26(8):1271–1281, 2010.
28. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 124–131. IEEE Computer Society, 2009.
29. University of Chicago. Nimbus project, 2011. http://workspace.globus.org/clouds/nimbus.html.
30. Open Cloud Computing Interface. http://occi-wg.org/.
31. Dana Petcu, Ciprian Craciun, and Massimiliano Rak. Towards a cross platform cloud API: Components for cloud federation. In Frank Leymann, Ivan Ivanov, Marten van Sinderen, and Boris Shishkov, editors, CLOSER, pages 166–169. SciTePress, 2011.
32. Peter Mell and Tim Grance. The NIST Definition of Cloud Computing. NIST Special Publication 800-145, January 2011.
33. Rajkumar Rajavel and T. Mala. Achieving Service Level Agreement in Cloud Environment Using Job Prioritization in Hierarchical Scheduling, volume 132 of Advances in Intelligent and Soft Computing, pages 547–554. Springer Berlin / Heidelberg, 2012.
34. M. Rak, A. Cuomo, and U. Villano. CHASE: an autonomic service engine for cloud environments. In 2011 20th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pages 116–121. IEEE, 2011.
35. Massimiliano Rak, Rocco Aversa, Salvatore Venticinque, and Beniamino Di Martino. User centric service level management in mOSAIC applications. In Michael Alexander, Pasqua D'Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, and Josef Weidendorfer, editors, Euro-Par Workshops (2), volume 7156 of Lecture Notes in Computer Science, pages 106–115. Springer, 2011.
36. Reservoir Consortium. Reservoir Project, 2011. http://62.149.240.97/.
37. Bhaskar Rimal, Admela Jukan, Dimitrios Katsaros, and Yves Goeleven. Architectural requirements for cloud computing systems: An enterprise cloud approach. Journal of Grid Computing, 9:3–26, 2011.
38. B.M. Segal, P. Buncic, D.G. Quintas, D.L. Gonzalez, A. Harutyunyan, J. Rantala, and D. Weir. Building a volunteer cloud. In Conferencia Latinoamericana de Computación de Alto Rendimiento, September 2009.
39. SLA@SOI project: IST-216556; empowering the service economy with SLA-aware infrastructures, 2011. http://www.sla-at-soi.eu/.
40. Borja Sotomayor and Lisa Childers. GSI: Grid Security Infrastructure. In Globus Toolkit 4, pages 271–282. Morgan Kaufmann, San Francisco, 2006.
41. The OpenStack Community. OpenStack Cloud Software: open source software for building private and public clouds, November 2011. http://www.openstack.org/.
42. F. Tusa, M. Paone, M. Villari, and A. Puliafito. CLEVER: A cloud-enabled virtual environment. In Computers and Communications (ISCC), 2010 IEEE Symposium on, pages 477–482, June 2010.
43. Constantino Vázquez, Eduardo Huedo, Rubén S. Montero, and Ignacio M. Llorente. On the use of clouds for grid resource provisioning. Future Generation Computer Systems, 27(5):600–605, 2011.
44. L. Wu and R. Buyya. Service Level Agreement (SLA) in Utility Computing Systems. In V. Cardellini et al., editors, Performance and Dependability in Service Computing: Concepts, Techniques and Research Directions. IGI Global, USA, 2011.