Computing DOI 10.1007/s00607-015-0447-8

Eucalyptus-based private clouds: availability modeling and comparison to the cost of a public cloud

Jamilson Dantas · Rubens Matos · Jean Araujo · Paulo Maciel

Received: 21 November 2014 / Accepted: 17 February 2015 © Springer-Verlag Wien 2015

Abstract High availability in cloud computing services is essential for maintaining customer confidence and avoiding revenue losses due to SLA violation penalties. Since the software and hardware components of cloud infrastructures may have limited reliability, the use of redundant components and multiple clusters may be required to achieve the expected level of dependability while also increasing the computational capacity. A drawback of such improvements is the corresponding increase in acquisition and operational costs. This paper presents availability models for private cloud architectures based on the Eucalyptus platform and compares the costs of these architectures to the cost of a similar infrastructure rented from a public cloud provider. Metrics of capacity-oriented availability and system steady-state availability are used to compare architectures with distinct numbers of clusters. A heterogeneous hierarchical modeling approach is employed to represent the systems considering both hardware and software failures. The results highlight that improvements in availability are not significant when increasing the system to more than two clusters. The analysis also shows that the average available capacity is close to the maximum possible capacity in all architectures, and that it takes 18 months, on average, for these private cloud architectures to pay off the cost equivalent to the computational capacity rented from a public cloud.

J. Dantas (B) · R. Matos · J. Araujo · P. Maciel
Informatics Center, Federal University of Pernambuco, Recife, Brazil



Keywords Cloud computing · Availability · Capacity oriented availability · Analytical models

Mathematics Subject Classification 60J20 · 68M15 · 68M01

1 Introduction

A large proportion of worldwide IT companies have adopted cloud computing for various purposes. Software platforms such as Eucalyptus [15] provide ways to construct private and hybrid clouds in the infrastructure as a service (IaaS) style [4] on top of common on-premise hardware equipment. Cloud services need to maintain users' confidence and avoid revenue losses but, as with any system, availability in the cloud may be affected by events such as hardware failures, planned maintenance, equipment replacement, software bugs, or updates. Fault tolerance mechanisms, such as replication techniques, are prominent approaches to coping with the limited reliability of software and hardware. Since cloud-based applications aim to be accessible anywhere and anytime, dependability becomes even more important, and yet more difficult to achieve, in such environments [32]. Companies that provide this type of service seek to meet customers' demands at the lowest cost.

Hierarchical hybrid models are used in [33] to evaluate cloud-based data centers, combining reliability block diagrams (RBDs) and generalized stochastic Petri nets (GSPNs) to analyze availability and reliability measures for distinct workloads and server consolidation ratios. In [10], the authors propose a hierarchical model to analyze the availability of a Eucalyptus-based architecture tailored for e-government purposes. In [11], a hierarchical heterogeneous model, based on RBDs and a Markov reward model (MRM), was used to describe non-redundant and redundant Eucalyptus architectures composed of a single cluster.

This paper develops hierarchical heterogeneous availability models for several Eucalyptus architectures, using metrics of capacity-oriented availability and system steady-state availability to compare those distinct architectures. Both hardware and software failures are considered in the proposed analytical models. The models are also used to obtain closed-form equations, enabling efficient computation of the intended metrics. Moreover, the cost of a private cloud is compared to the cost of renting similar processing power on Amazon; the private cloud cost considers the energy consumed over a given period and the acquisition cost of the computers.

The remainder of the paper is organized as follows. Section 2 describes some related works. Section 3 describes the main concepts regarding dependability, high availability techniques, and the Eucalyptus cloud computing infrastructure. Section 4 describes the private cloud architectures which are the focus of this study, and the analytical models proposed to represent these infrastructures. Section 5 presents the evaluation of the proposed models and highlights the obtained results. Section 6 draws some conclusions and points to possible future works.



2 Related works

Most studies aimed at assessing the dependability of cloud systems do not address software dependability, nor do they consider the influence of adding new pieces of equipment to provide redundancy for existing architectures. In [9,23], the authors present studies on dependability prediction for open source clusters of servers. They aim to enhance the high availability (HA) feature of open source cluster application resources (OSCAR), and predict system reliability and availability through Stochastic Reward Nets (SRNs). Hong et al. [18] adopted continuous time Markov chains (CTMCs) to analyze the availability of a cluster system with multiple nodes; their analysis considered the cluster nodes both with and without common mode failure (CMF). CTMCs are also used in [24] for modeling HA clusters. Callou et al. [7] proposed a set of models for the integrated quantification of sustainability impact, cost, and dependability of data center cooling infrastructures, and present a case study that analyzes the environmental impact, the dependability metrics, and the acquisition and operational costs of five real-world data center cooling architectures. Wei et al. [33] adopted a hierarchical method and proposed hybrid models combining RBDs and GSPNs to analyze the relation between reliability and the server consolidation ratio, as well as the relation between availability and the workload experienced by the cloud-based data center. In [10], the authors propose a private cloud environment suitable for e-government purposes and provide a hierarchical model to predict the availability of the proposed Eucalyptus-based architecture; that model uses distinct Markov chains for the cluster level and the node level. In [11] we adopted a hierarchical heterogeneous model, based on RBDs and an MRM, to describe non-redundant and redundant Eucalyptus architectures, and we provided closed-form equations to compute the availability of those systems. This paper extends the previous study by investigating the benefits of using multiple clusters in private cloud architectures and compares the architectures according to the capacity-oriented availability. Another original result is the definition of closed-form expressions for computing the steady-state availability of architectures that were not presented in [11]. The closed-form expressions are obtained from the hierarchical analytical models and enable the fast evaluation of various scenarios.

3 Background

This section explains the main concepts needed to understand the architectures analyzed here, as well as the measures used for evaluation.

3.1 Dependability measures and redundancy in high availability clusters

Systems dependability can be understood as the ability to deliver a specified functionality that can be justifiably trusted [5]. An alternate definition of dependability is "the ability of a system to avoid failures that are more frequent or more severe, and outage durations that are longer than is acceptable to the user" [5]. Dependability encompasses measures such as reliability, availability, and safety. Due to the ubiquitous provision of services on the Internet and on cloud systems, dependability has become an attribute of prime concern in hardware/software development, deployment, and operation [26], since such services require high availability, stability, fault tolerance, and dynamic extensibility.



Systems with stringent dependability requirements demand methods for detecting, correcting, avoiding, and tolerating faults and failures. A failure in a large-scale system can mean catastrophic losses. In the context of fault-tolerant systems, availability is a measure of high interest. The availability metric can be understood as the probability that the system is found operational during a given period of time, or has been restored after the occurrence of a failure event. Availability can be expressed as a percentage (e.g., 99.9876 %) or as a real number between 0 and 1 (e.g., 0.999876). Steady-state availability is the metric used when considering a long-term probability instead of a defined time interval; it may be considered the "average" availability of the system, regardless of the time interval. The steady-state availability may be computed from the mean time between failures (MTBF) and the mean time to repair (MTTR) of the system. Therefore, with a constant failure rate λ = 1/MTBF and a constant repair rate μ = 1/MTTR [28], the steady-state availability is:

A = MTBF / (MTBF + MTTR) = (1/λ) / (1/λ + 1/μ) = μ / (λ + μ)    (1)
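As a minimal numeric sketch of Eq. (1), and of the yearly downtime of Eq. (2) introduced next, the Python snippet below computes the steady-state availability from an MTBF and an MTTR. The particular values are illustrative assumptions, not measurements from the architectures studied in this paper.

# Illustrative sketch of Eqs. (1) and (2); the MTBF/MTTR values are assumptions.
MTBF = 8760.0   # mean time between failures, in hours
MTTR = 1.0      # mean time to repair, in hours

lam = 1.0 / MTBF            # failure rate (lambda)
mu = 1.0 / MTTR             # repair rate (mu)

A = MTBF / (MTBF + MTTR)    # Eq. (1), MTBF/MTTR form
A_rates = mu / (lam + mu)   # Eq. (1), rate form; same value

annual_downtime_min = (1.0 - A) * 8760 * 60   # Eq. (2): yearly downtime in minutes

print(f"A = {A:.6f} (rate form: {A_rates:.6f})")
print(f"Annual downtime: {annual_downtime_min:.1f} minutes")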

The downtime is another common dependability metric, defined as the total time of unavailability in a given period. For example, the annual downtime (in minutes) of a system is computed as expressed in Eq. (2), where D is the downtime and A is the availability of the system [31].

D = (1 − A) × 8760 × 60    (2)

The mean capacity available is another important measure. It may be expressed by the capacity-oriented availability (COA) [17,27]. This metric allows one to estimate how much service the system is capable of delivering when failure states are taken into account. COA may be calculated with Markov reward models (MRMs) by assigning to each model state a reward corresponding to the system capacity in that condition. In a private cloud system, COA may be associated with the number of available computing nodes, representing the amount of service the system is able to provide.

Many techniques have been proposed and adopted to build failover clusters [25], as well as to leverage virtualization and cloud systems for addressing service dependability issues [8,34]. Many of those techniques are based on redundancy, i.e., the replication of components so that they work for a common purpose, ensuring data security and availability even in the event of some component failure. Three replication techniques deserve special attention due to their extensive use in clustered server infrastructures [9,26]: Cold Standby, Hot Standby, and Warm Standby. In the Cold Standby technique, the backup nodes are turned off on standby and will only be activated if the primary node fails. The positive point of this technique is that the secondary node has low energy consumption and does not wear out. On the other hand, the secondary node needs significant time to be activated, incurring data loss or long delays in active user sessions, as well as rejection of new user requests.



The Hot Standby may be considered the most transparent of the replication modes. The replicated modules are synchronized with the operating module, so the active and standby cluster participants are seen by the end user as a single resource. The switchover is not noticed when the primary node breaks. The Warm Standby technique tries to balance the costs and the recovery time delay of the Cold and Hot Standby techniques. The secondary node is on standby, but not completely turned off, so it can be activated faster than in the Cold Standby technique. The replicated node is only partially synchronized with the operating node, so users may lose some information at the exact moment of the switchover.

3.2 Eucalyptus platform

Eucalyptus [15] enables the implementation of scalable IaaS-style private and hybrid clouds [14] and is interface-compatible with the commercial service Amazon EC2 [2]. This API compatibility enables one to run an application on Amazon and on Eucalyptus without modification. In general, the Eucalyptus cloud-computing platform uses the virtualization capabilities (hypervisor) of the underlying computer system to enable flexible allocation of resources decoupled from specific hardware. The Eucalyptus architecture is composed of five high-level components, each one with its own web service interface: Cloud Controller, Node Controller, Cluster Controller, Storage Controller, and Walrus [14]. Figure 1 shows an example of a Eucalyptus-based cloud computing environment considering two clusters (A and B).

Fig. 1 Example of Eucalyptus-based environment [20]



Each cluster has one Cluster Controller, one Storage Controller, and various Node Controllers. The components in each cluster communicate with the Cloud Controller and Walrus to serve the user requests. The Cloud Controller (CLC) is the front-end to the entire cloud infrastructure, and it is responsible for identifying and managing the underlying virtualized resources (servers, network, and storage) via the Amazon EC2 API [2]. The Node Controller (NC) runs on each compute node and controls the life cycle of the virtual machine (VM) instances running on the node; it queries the node's physical resources (e.g., number of CPU cores, size of main memory, available disk space) and probes the state of the VM instances on that node [14,20]. The Cluster Controller (CC) gathers information on a set of VMs and schedules the execution of the VMs on specific NCs. The CC has three primary functions: scheduling incoming requests for the execution of VM instances; controlling the virtual network overlay composed by a set of VMs; and gathering information about the set of Node Controllers and reporting their status to the CLC [14]. The Storage Controller (SC) provides persistent storage to be used by VM instances. It implements block-access network storage, similar to that provided by Amazon Elastic Block Storage (EBS) [1]. Walrus is a file-based data storage service compatible with Amazon's Simple Storage Service (S3) [14]. Walrus provides a storage service for VM images: the root filesystem, as well as the Linux kernel and ramdisk images used to instantiate VMs on the NCs, can be uploaded to Walrus and accessed from the nodes.

4 A basic private cloud architecture

This study analyzes several possible architectures for building Eucalyptus-based private cloud systems. These alternatives are built upon a basic architecture. This section presents a basic architecture of a Eucalyptus-based private cloud composed of one cluster. Each particular component of this system is described; then, models for estimating availability, downtime, and capacity-oriented availability are proposed for this particular system and further extended for larger and alternative architectures. Figure 2 shows the architecture of such a system composed of one cluster. A front-end computer is adopted as the "Cloud Subsystem" and configured with the Eucalyptus components known as Cloud Controller and Walrus. The cluster has one machine, called hereinafter the "Cluster Subsystem", which runs the Cluster Controller and Storage Controller components. The cluster also has three machines that run the Node Controllers, responsible for instantiating and managing the VMs and interacting directly with the hypervisor. The set of three nodes in the cluster is called the "Nodes Subsystem". The impact of implementing redundancy in the Cloud Subsystem (composed of the CLC and Walrus components) is considered for those systems. Results of a previous analysis [11] show that employing redundancy in other subsystems yields only small improvements in comparison to redundancy in the Cloud Subsystem. The basic architecture, shown in Fig. 2, requires the acquisition of five computers, plus one for the redundant Cloud Subsystem. The computers that compose each analyzed architecture have identical characteristics: same manufacturer, model, configuration, and cost [12], as shown in Table 1.


Fig. 2 Private cloud architecture with one cluster

Table 1 Computer description

Brand/model       Components         Description
Dell/PowerEdge    HD                 1 TB
                  Memory             16 GB
                  CPU                Intel Xeon E5-2420 1.9 GHz
                  Total cost (US$)   1424.88

Due to their simplicity and efficiency of computation, RBDs are used to analyze the steady-state availability of the architectures. However, due to the active redundancy mechanism (warm standby) and the need to assess the capacity-oriented availability, a hierarchical heterogeneous model was adopted, composed of an RBD and an MRM, representing the architectures with one, two, three, four, and five clusters. The RBD describes the high-level components, whereas the MRM represents the components involved in the evaluation of capacity-oriented availability. The MRM also enables obtaining a closed-form equation for the availability of the subsystems.

4.1 RBD models

The private cloud infrastructure depicted in Fig. 2 may be divided into three parts: the Cloud Subsystem, the Cluster Subsystem, and the Nodes Subsystem.



Fig. 3 RBD model of the Cloud Subsystem

Fig. 4 RBD model of the Cluster Subsystem

Table 2 Input parameters for the Cloud Subsystem and Cluster Subsystem

Component        MTBF (h)   MTTR
HW               8760       100 min
OS               2893       15 min
CLC and Walrus   788.4      1 h
CC and SC        788.4      1 h

Fig. 5 RBD model of one node in the Nodes Subsystem

The Cloud Subsystem is represented by a series RBD, as is the Cluster Subsystem. The Cloud Subsystem consists of hardware, operating system, and the Eucalyptus software components CLC and Walrus, as shown in Fig. 3. Figure 4 depicts an RBD model for the Cluster Subsystem, which is composed of hardware, operating system, CC, and SC. Table 2 presents the values of mean time between failures (MTBF) and mean time to repair (MTTR) used for the Cloud Subsystem and Cluster Subsystem models. Those values were obtained from [19,21], and were used to compute the dependability metrics for the subsystems and then for the whole system. Figure 5 shows the RBD model that represents one node of the Nodes Subsystem. Besides the hardware and operating system, which are also present in the Cloud and Cluster Subsystems, each node needs a hypervisor (e.g., KVM) and the Eucalyptus NC component in order to be available for the cloud. The Nodes Subsystem model assumes that the hardware and operating system of the nodes have the same dependability characteristics as in the Cloud and Cluster Subsystems, i.e., the same MTBF and MTTR. Therefore, Table 3 presents only the parameter values for the KVM and NC blocks [19,21].


Table 3 Input parameters for the nodes

Component   MTBF (h)   MTTR (h)
KVM         2990       1
NC          788.4      1
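To illustrate how the series RBDs of Figs. 3, 4, and 5 combine the parameters of Tables 2 and 3, the sketch below applies the usual series approximations: the equivalent failure rate is the sum of the block failure rates, the equivalent MTTR is the rate-weighted mean of the block MTTRs, and the availability is the product of the block availabilities. It assumes that the MTBF/MTTR listed for "CLC and Walrus" and for "CC and SC" applies to each of those components individually, as separate series blocks; under that assumption the resulting subsystem rates come close to those reported later in Table 4, but this is a simplified reading of the models, not the authors' exact derivation.

def series_equivalent(blocks):
    """Series RBD approximation: equivalent failure rate is the sum of the block
    failure rates, equivalent MTTR is the rate-weighted mean of the block MTTRs,
    and availability is the product of the block availabilities."""
    lam_eq = sum(1.0 / mtbf for mtbf, _ in blocks)
    mttr_eq = sum((1.0 / mtbf) * mttr for mtbf, mttr in blocks) / lam_eq
    avail = 1.0
    for mtbf, mttr in blocks:
        avail *= mtbf / (mtbf + mttr)
    return 1.0 / lam_eq, mttr_eq, avail

# Values from Tables 2 and 3, with MTTRs converted to hours.
HW, OS = (8760.0, 100.0 / 60.0), (2893.0, 15.0 / 60.0)
CLC, WALRUS = (788.4, 1.0), (788.4, 1.0)   # assumed per-component values
CC, SC = (788.4, 1.0), (788.4, 1.0)        # assumed per-component values
KVM, NC = (2990.0, 1.0), (788.4, 1.0)

for name, blocks in [("Cloud Subsystem (Fig. 3)", [HW, OS, CLC, WALRUS]),
                     ("Cluster Subsystem (Fig. 4)", [HW, OS, CC, SC]),
                     ("Node (Fig. 5)", [HW, OS, KVM, NC])]:
    mtbf, mttr, avail = series_equivalent(blocks)
    print(f"{name}: MTBF ~ {mtbf:.1f} h, MTTR ~ {mttr:.3f} h, A ~ {avail:.6f}")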

4.2 Markov reward models

We also propose Markov reward models (MRMs) that represent behaviors that pure RBDs are not able to capture. Our approach to computing COA is based on the number of processor cores available in the system. This measure is tightly related to the number of VMs that the cloud can run. We consider that the failure and repair events of the Cloud Subsystem do not significantly affect COA. This reasoning was verified by evaluating scenarios that considered Cloud Subsystem failures and scenarios without them; there were no significant differences between the results, therefore we decided to use MRMs without the Cloud Subsystem, which are more concise and equally accurate for computing the capacity-oriented metrics. Figure 6 depicts a cloud system with one cluster and three nodes in that cluster. Each node of this system has two processor cores, thus a single cluster has the capacity to run up to six VMs, if each core can be used by only one VM. An MRM is a labeled CTMC augmented with state reward and impulse reward structures. The state reward structure is a function ρ that assigns to each state s ∈ S a reward ρ(s) such that, if t time-units are spent in state s, a reward of ρ(s) × t is acquired. The rewards defined in the state reward structure can be interpreted in various ways: they can be regarded as the gain or benefit acquired by staying in some state, or as the cost incurred by staying in that state. The impulse reward structure, on the other hand, is a function ι that assigns to each transition from s to s′, where s, s′ ∈ S, a reward ι(s, s′) such that, if the transition from s to s′ occurs, a reward of ι(s, s′) is acquired. Similar to the state reward structure, the impulse reward structure can be interpreted in various ways: an impulse reward can be considered the cost of taking a transition or the gain acquired by taking it. Figure 7 depicts the capacity-oriented availability MRM that describes the Cluster and Nodes Subsystems.

Fig. 6 Capacity view of a cluster with 2 cores per node



Fig. 7 Capacity-oriented availability MRM model

The MRM has eight states: CCF1, CCF2, CCF3, CCF4, 6K, 4K, 2K, and 0K. The state 6K denotes that the system is up, with six processor cores available (two cores per node) and one Cluster Subsystem operating. When the Cluster Subsystem fails, the system enters the state CCF1 and stops providing service. In this state, as in the other states that represent the cluster failure (CCF2, CCF3, and CCF4), the system is not offering the service, but the nodes are running and may still fail. Therefore, if one node fails, there is a transition from CCF1 to CCF2; if a second node fails, the system reaches the state CCF3, and so on. When the system is in state 6K, the failure of one node triggers a transition to state 4K. If another node fails, the system transitions to state 2K, and if the remaining node fails the system reaches state 0K. When the system is in state CCF1, the Cluster Subsystem may be repaired, so state 6K is reached again. A similar process occurs when the system is in the 4K, 2K, or 0K states and the Cluster Subsystem fails. The rate λcc denotes the Cluster Subsystem failure rate. The corresponding repair rate is μcc, which is assigned to the transitions from states CCF1, CCF2, CCF3, and CCF4 to states 6K, 4K, 2K, and 0K, respectively. The failure rate of one node is λn, so it is the rate assigned to the transition from 2K to 0K. The transition from 4K to 2K has rate 2λn because either of the two remaining nodes may fail, and the transition from 6K to 4K has rate 3λn by a similar reasoning. The same holds for the transitions from CCF1 to CCF2, from CCF2 to CCF3, and from CCF3 to CCF4, which represent node failures while the Cluster Subsystem is down. It is worth mentioning that there is no transition representing the repair of nodes when the system is in states CCF1, CCF2, CCF3, and CCF4, because we assume priority for the repair of the Cluster Subsystem. This model considers a single repair team, so the repair rate of the Nodes Subsystem is the same for all states and is denoted by μn. Table 4 presents the values for all mentioned parameters of the MRM. The λcc and μcc values were obtained from the MTBF and MTTR computed using the RBD model of the Cluster Subsystem (see Fig. 4), and λn and μn are computed using the RBD model of the Nodes Subsystem (see Fig. 5). The state reward ρ(s) assigned to 6K, 4K, and 2K is equal to 1, 2/3, and 1/3, respectively, indicating the proportion of cores that are available depending on the number of active nodes. The state reward assigned to CCF1, CCF2, CCF3, CCF4, and 0K (shaded states) is equal to 0, since the system is down in those states.


Table 4 MRM parameters

Parameter   Description            Value (per hour)
λcc         Cluster failure rate   1/333.7114
λn          Node failure rate      1/481.8276
μn          Node repair rate       1/0.911756
μcc         Cluster repair rate    1/0.93888
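The sketch below assembles the generator matrix of the MRM in Fig. 7 from the description above, solves for the steady-state probabilities, and computes COA as the steady-state reward. The state ordering and the helper names are ours, and the code is only a sketch of the evaluation; with the rates of Table 4 it should land close to the COA of about 0.995 reported later in Sect. 5.

import numpy as np

# Rates taken from Table 4 (per hour).
lam_cc, mu_cc = 1 / 333.7114, 1 / 0.93888    # Cluster Subsystem failure / repair
lam_n,  mu_n  = 1 / 481.8276, 1 / 0.911756   # node failure / repair

# State order: 6K, 4K, 2K, 0K, CCF1, CCF2, CCF3, CCF4 (Fig. 7).
S6, S4, S2, S0, F1, F2, F3, F4 = range(8)
Q = np.zeros((8, 8))

def t(i, j, rate):
    Q[i, j] += rate

# Node failures while the Cluster Subsystem is up ...
t(S6, S4, 3 * lam_n); t(S4, S2, 2 * lam_n); t(S2, S0, lam_n)
# ... and node repairs (single repair team).
t(S4, S6, mu_n); t(S2, S4, mu_n); t(S0, S2, mu_n)
# Cluster Subsystem failure from each node-count state.
t(S6, F1, lam_cc); t(S4, F2, lam_cc); t(S2, F3, lam_cc); t(S0, F4, lam_cc)
# Cluster Subsystem repair (nodes are not repaired while the cluster is down).
t(F1, S6, mu_cc); t(F2, S4, mu_cc); t(F3, S2, mu_cc); t(F4, S0, mu_cc)
# Node failures while the Cluster Subsystem is down.
t(F1, F2, 3 * lam_n); t(F2, F3, 2 * lam_n); t(F3, F4, lam_n)

np.fill_diagonal(Q, -Q.sum(axis=1))

# Steady-state probabilities: solve pi * Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(8)])
b = np.zeros(9); b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

reward = np.array([1.0, 2/3, 1/3, 0, 0, 0, 0, 0])   # fraction of the 6 cores available
coa = float(pi @ reward)
print(f"COA of one cluster ~ {coa:.4f}, AAC ~ {6 * coa:.2f} cores")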

Fig. 8 Markov model for a warm-standby redundant Cloud Subsystem with two hosts

Another proposed MRM is depicted in Fig. 8. This model describes the Cloud Subsystem employing warm-standby replication. The redundant subsystem has a primary host (H1), which is active by default, and a secondary host (H2), which is the spare one. The MRM has five states: UW, UF, FF, FU, and FW. In the state UW, the primary host (H1) is up and the secondary host (H2) is in a waiting condition. When H1 fails, the system goes to state FW, where H2 has not yet detected the failure of H1. FU represents the state where H2 leaves the waiting condition and assumes the active role, whereas H1 is failed. If H2 fails before the repair of H1, the system goes to state FF. In order to prioritize the repair of the main server, there is only a single repair transition from FF, which goes to UF. If H2 fails when H1 is up, the system goes to state UF, returning to state UW with the repair of H2, or going to state FF in case H1 also fails. The failure rates of H1 and H2 are denoted by λs1 and λs2, respectively. The rate λis2 denotes the failure rate of H2 when it is inactive. The repair rate of H2 is μs2. The transition rate sas2 represents the switchover rate, i.e., the reciprocal of the mean time to activate H2 after a failure of H1. Table 5 presents the values for all the mentioned parameters of the MRM. The value of μs1 is equal to the value of μs2; the rates λs1 and λs2 also have equal values. These values were obtained from the MTBF and MTTR computed using the RBD models for the non-redundant Cloud Subsystem. The failure rate of H2 when it is inactive is assumed to be 20 % smaller than the failure rate of an active host. The value of sas2 comes from the default monitoring interval and activation times found in HA software such as Heartbeat [16].


Table 5 Parameter values for the Markov chain model

Parameter        Description                  Value (per hour)
λs1 = λs2 = λ    Active host failure rate     1/333.7114
λis2 = λi        Inactive host failure rate   1/400.4537
μs1 = μs2 = μ    Host repair rate             1/0.938883
sas2 = sa        Spare host activation rate   1/0.004166

The state reward ρ(s) assigned to UW, UF, and FU is equal to 1, since the Cloud Subsystem is available in those states. The state reward assigned to FF and FW (shaded states) is equal to 0, since the subsystem is down in those states. There are no impulse rewards in this model. Therefore, the steady-state availability of the subsystem can be computed as the steady-state reward of the MRM. Let ACLC be the steady-state availability of the Cloud Subsystem, so ACLC = Σ_{s∈S} πs × ρ(s), where πs is the steady-state probability of being in state s, and ρ(s) is the reward assigned to state s. The MRM of Fig. 8 enables obtaining a closed-form equation for the steady-state availability of the redundant CLC subsystem, ARCLC, given by Eq. (3).

ARCLC = [μ(λ λi + (λi + μ)² + sa(λ + λi + μ))] / [sa(λ² + λ(λi + μ) + μ(λi + μ)) + (λ + μ)(λ λi + (λi + μ)²)]    (3)
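A direct numeric evaluation of Eq. (3) with the Table 5 values is sketched below; the resulting availability, roughly 0.99997, is consistent with the redundant-variant results reported later in Table 6. The variable names are ours.

# Parameter values from Table 5 (rates per hour).
lam  = 1 / 333.7114    # active host failure rate (lambda_s1 = lambda_s2)
lami = 1 / 400.4537    # inactive (spare) host failure rate
mu   = 1 / 0.938883    # host repair rate
sa   = 1 / 0.004166    # spare host activation (switchover) rate

num = mu * (lam * lami + (lami + mu) ** 2 + sa * (lam + lami + mu))
den = (sa * (lam ** 2 + lam * (lami + mu) + mu * (lami + mu))
       + (lam + mu) * (lam * lami + (lami + mu) ** 2))
A_rclc = num / den        # Eq. (3): redundant Cloud Subsystem availability
print(f"A_RCLC ~ {A_rclc:.6f}")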

It is important to stress that a closed-form equation can also be used for parametric sensitivity analysis. The determination of the factors that are most relevant to the measures or outputs of a model can be of great assistance in establishing the critical components of a system.

5 Evaluation of architecture alternatives

Based on the models shown in Sect. 4.1, we propose hierarchical models to compute measures such as system steady-state availability, downtime, and capacity-oriented availability (COA) for some private cloud architectures. We evaluate five scenarios, with one, two, three, four, and five clusters. Figure 9 shows the top-level RBD model with five Cluster Subsystems. Each red block, labeled Cluster_j, represents a cluster composed of three nodes and has its parameters obtained through the model described in Fig. 7. Scenario I consists of one Cloud Subsystem, one Cluster Subsystem, and three hosts in the Nodes Subsystem. Scenario II is composed of one Cloud Subsystem, two Cluster Subsystems, and six hosts arranged equally in two Nodes Subsystems. Scenario III has one Cloud Subsystem, three Cluster Subsystems, and nine hosts divided into three Nodes Subsystems. Scenario IV has one Cloud Subsystem, four Cluster Subsystems, and twelve hosts divided into four Nodes Subsystems. Scenario V has one Cloud Subsystem, five Cluster Subsystems, and fifteen hosts divided into five Nodes Subsystems. For all scenarios, the system is available if the Cloud Subsystem is running and at least one Cluster Subsystem is available, with one or more nodes running in that cluster.


Fig. 9 RBD model for Scenario V: cloud system with five clusters

For each scenario, two variants were evaluated: (1) non-redundant Cloud Subsystem and (2) redundant Cloud Subsystem. The variants with a redundant Cloud Subsystem use the MRM presented in Fig. 8 to compute the steady-state availability of the CLC blocks in the corresponding top-level RBD. The COA of each analyzed architecture may be computed using Eq. (4):

COASystem = ( Σ_{i=1}^{n} COAcluster_i ) / n    (4)

where COAcluster_i represents the COA of each of the n clusters composing the architecture and is computed as the steady-state reward of the corresponding MRM (see Fig. 7), as expressed in Eq. (5):

COAcluster_i = Σ_{s∈S} πs × ρs    (5)

where πs is the steady-state probability of being in a given state s of the MRM, and ρs is the reward rate assigned to that state s, equivalent to the fraction of the total number of cores available in that state, as presented in Sect. 4.2. It is worth noticing that in this case study all clusters have the same COA because they are identical (i.e., same number of nodes and same number of processor cores per node).


Table 6 Measures for all architectures and their variants

Arch.   Availability (%)               Downtime (h)                 AAC (cores)
        Non-redundant   Redundant      Non-redundant   Redundant
I       99.439671       99.716768      49.08           24.81        5.97
II      99.718659       99.996533      24.64           0.30         11.94
III     99.719441       99.997318      24.58           0.23         17.91
IV      99.719443       99.997320      24.58           0.23         23.88
V       99.719443       99.997320      24.58           0.23         29.85

Therefore, the COA of the system is equal to the COA of the clusters. Despite this particular case, Eqs. (4) and (5) are valid for any configuration of clusters and nodes. The average available capacity (AAC), in number of cores, of a private cloud architecture is computed through Eq. (6):

AAC = COASystem × nc    (6)

where nc is the total number of processor cores in that architecture, i.e., the maximum possible capacity. It is also possible to obtain a closed-form equation for computing the availability of the whole cloud system (ASystem) from the corresponding RBD models. Equation (7) denotes how to compute the availability of the system, according to the rule of composition of series and parallel components [22]. This is a general equation for an architecture composed of k clusters. ACLC is the steady-state availability of the block representing the Cloud Subsystem; when dealing with a redundant Cloud Subsystem, the value of ACLC is computed through Eq. (3). Acluster_j is the steady-state availability of each Cluster Subsystem, evaluated using the model of Fig. 7.

ASystem = ACLC × ( 1 − ∏_{j=1}^{k} (1 − Acluster_j) )    (7)
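A minimal sketch of Eq. (7) follows. The cluster and Cloud Subsystem availabilities used here are assumed, rounded values chosen only to illustrate the series-parallel composition; the printed numbers should resemble, but are not, the exact figures of Table 6.

def system_availability(a_clc, a_cluster, k):
    """Eq. (7): the Cloud Subsystem in series with k identical clusters in parallel."""
    all_clusters_down = (1.0 - a_cluster) ** k
    return a_clc * (1.0 - all_clusters_down)

a_cluster = 0.9972   # assumed availability of one Cluster Subsystem block (Fig. 7 model)
for label, a_clc in [("non-redundant Cloud Subsystem", 0.99719),
                     ("redundant Cloud Subsystem", 0.99997)]:
    for k in range(1, 6):
        a_sys = system_availability(a_clc, a_cluster, k)
        print(f"{label}, k = {k}: A_System ~ {a_sys:.6f}")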

Table 6 shows the results of this study, considering the steady-state availability and annual downtime for each scenario, with both variants (non-redundant and redundant Cloud Subsystem). The value of capacity-oriented availability (COA), not presented in Table 6, is 0.995 for all architectures, because the failure of the Cloud Subsystem does not affect this metric and the clusters added from one architecture to another are identical. The average available capacity (AAC) varies according to the size of each architecture. The existence of a single point of failure explains the poor steady-state availability of the non-redundant variants, whereas the results for the redundant systems show that simple replication of the critical component can increase system availability, reducing downtime in a significant proportion (a decrease of more than 99 %). On the other hand, the addition of clusters did not produce large improvements.


Table 7 Monthly cost of equivalent infrastructure on Amazon

Scenario   Cost (US$)
I          509.76
II         1028.16
III        1546.56
IV         2056.32
V          2574.72

Notice that the availability values of Scenarios III, IV, and V are very close to one another. Increasing the number of clusters in this kind of private cloud system is only justified by the capability of accepting larger workloads, i.e., larger numbers of VM instantiations. The AAC results indicate that the eventual occurrence of failures has no significant impact on the available capacity, since in all scenarios the AAC is close to the maximum capacity. Scenario V has the best AAC, but it presents an annual downtime of about 24 h when there is no redundancy in the Cloud Subsystem to circumvent eventual failures. Therefore, the analysis of the proposed models shows that adding more than three clusters to a private cloud environment makes sense only if the aim is to increase capacity. If the aim is high availability, it is necessary and sufficient to invest in redundancy for the Cloud Subsystem, or to employ other fault tolerance mechanisms not studied in this paper.

We also use the AAC of each architecture to compute the cost of renting a similar capacity from a public cloud provider. The price of CPU time for a VM of type m1.medium on Amazon (US$0.120 per hour) [3] is the basis for the comparison. This price was multiplied by the AAC of each architecture and by the number of hours in a month to obtain the monthly costs presented in Table 7. In order to determine how long it takes to pay off the budget invested in each private architecture, we also need to compute the acquisition and operational costs of each redundant private cloud architecture. Non-redundant variants are not considered, since they would not be able to provide availability at levels similar to a public cloud provider such as Amazon. Besides the cost of equipment, the energy consumption cost needs to be considered for a private cloud. In our study, the total cost of a private cloud is the sum of these two costs. For this evaluation, the power consumption cost (PCC) is calculated from Eq. (8), adapted from [6]:

PCC = T × Cenergy × Σ_{i=1}^{M} Pinput_i × (Ai + α(1 − Ai))    (8)

where T is the observation time in hours; Cenergy is the average cost of electricity per kilowatt-hour; M is the number of machines in use in the environment; and Pinput_i is the input power (in kW) of each machine i. In our case, the input power is the same for all machines, and its value is 134 W [13], considering an environment with 220 V AC input and 25 °C. We adopted an average electricity cost of US$0.165 per kilowatt-hour [29]. Ai is the availability of each machine, and α is a factor that represents the fraction of energy that continues to be consumed after the component has failed, i.e., the probability of power consumption even in a failure state. Knowing that one machine in standby mode consumes an average of 21.13 W [30], we consider α = 0.2, reflecting an input power of 26.8 W when the machine is not fully operational due to a failure.


Table 8 Cost per architecture

Scenario   Computers   PC (kWh)   AC (US$)     PCC (US$) per month
I          6           0.804      8,549.28     96.62
II         10          1.34       14,248.80    161.40
III        14          1.876      19,948.32    225.96
IV         18          2.412      25,647.84    290.52
V          22          2.948      31,347.36    355.08

Table 8 shows the power consumption (PC) in kWh, as well as the acquisition cost (AC) and the monthly power consumption cost (PCC) for each architecture. The results depicted in Fig. 10 show that the cost of the public cloud becomes higher than that of the private cloud architectures at some point between 12 and 24 months. In most cases the investment in the private cloud pays off within 18 months (see Fig. 10b). It is also important to highlight that the larger the architecture, the sooner the private cloud begins to cost less than the public cloud. As seen in Fig. 10, Scenario I, which has only one cluster, takes about 24 months to reach lower costs with the private cloud than with the public one, whereas in Scenario V, which has five clusters, the costs of both solutions are almost the same at 12 months, and from that moment on the cumulative costs are lower for the private cloud. Note also that Fig. 10c-e show the public cloud costing about 50 % more than the private cloud after 24 months. Thus, despite the high costs involved in increasing capacity by adding clusters and nodes to a private infrastructure, the investment may be worthwhile in the long term when compared to renting the same computational capacity from a public cloud such as Amazon.
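As a rough illustration of the break-even analysis behind Fig. 10, the sketch below combines the acquisition and monthly power costs of Table 8 with the monthly Amazon costs of Table 7, assuming a simple linear cumulative-cost model (private = AC + months × PCC, public = months × Amazon). The break-even months it prints approximate, but do not exactly reproduce, the crossing points plotted in Fig. 10.

import math

# Acquisition cost (AC) and monthly power consumption cost (PCC) from Table 8,
# and the monthly cost of equivalent capacity on Amazon from Table 7 (US$).
scenarios = {
    "I":   (8549.28,  96.62,  509.76),
    "II":  (14248.80, 161.40, 1028.16),
    "III": (19948.32, 225.96, 1546.56),
    "IV":  (25647.84, 290.52, 2056.32),
    "V":   (31347.36, 355.08, 2574.72),
}

for name, (ac, pcc, amazon) in scenarios.items():
    # Cumulative costs after m months: private = ac + m * pcc, public = m * amazon.
    breakeven = ac / (amazon - pcc)
    print(f"Scenario {name}: break-even after ~{math.ceil(breakeven)} months")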

6 Final remarks

This paper presented models to evaluate steady-state availability and capacity-oriented availability (COA) for five scenarios of private cloud environments. In addition, we conducted a comparison of costs between private cloud architectures and equivalent capacity rented from a public cloud provider. The study uses the COA results to compute the average available capacity (AAC) of each private cloud architecture, and thereby determine the processing capacity that would be rented from a public cloud. A hierarchical heterogeneous approach was used to model the private cloud availability, combining Reliability Block Diagrams and Markov Reward Models (MRMs). One MRM enabled the evaluation of COA, whereas another MRM allowed the analysis of redundant subsystems. Closed-form equations were obtained from the analytical models and used to compare the availability of all proposed architectures.



Fig. 10 Cost comparison per architecture—private cloud vs. Amazon. a Comparison private cloud-I vs. Amazon-I. b Comparison private cloud-II vs. Amazon-II. c Comparison private cloud-III vs. Amazon-III. d Comparison private cloud-IV vs. Amazon-IV. e Comparison private cloud-V vs. Amazon-V




The results indicate that the aim of increasing steady-state availability is not achieved by adding more than three clusters. On the other hand, the increase in AAC is significant and is affected by system failures only to a small degree, so the average number of available cores is very close to the maximum possible for all architectures. The comparison of costs showed that it takes 18 months, on average, for the studied private cloud architectures to pay off the cost of an equivalent computational capacity on Amazon. The analysis also showed that, considering acquisition cost and power consumption, building private clouds is better than renting capacity in public clouds when high computational power is needed over a long period of time. In other situations, using public clouds is likely to cost less than building and running private ones, considering at least the acquisition and energy budgets. Further work might evaluate the data consistency between replicated servers of private clouds, verifying the reliability of the warm-standby redundancy mechanisms for such infrastructures. Comparisons including other public cloud providers and additional budget issues might also extend the work presented here.



References

1. Amazon (2012) Amazon Elastic Block Store (EBS). Amazon.com, Inc. http://aws.amazon.com/ebs
2. Amazon (2012) Amazon Elastic Compute Cloud (EC2). Amazon.com, Inc. http://aws.amazon.com/ec2
3. Amazon (2014) Amazon EC2 pricing. Amazon.com, Inc. http://aws.amazon.com/ec2/pricing/
4. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
5. Avizienis A, Laprie JC, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Dependable Secur Comput 1(1):11–33
6. Callou G, Maciel P, Tutsch D, Ferreira J, Araújo J, Souza R (2013) Estimating sustainability impact of high dependable data centers: a comparative study between Brazilian and US energy mixes. Computing 95(12):1137–1170
7. Callou G, Maciel P, Tutsch D, Araujo J (2012) Models for dependability and sustainability analysis of data center cooling architectures. In: 2012 IEEE/IFIP 42nd international conference on dependable systems and networks workshops (DSN-W). IEEE, pp 1–6
8. Chaudhary V, Cha M, Walters J, Guercio S, Gallo S (2008) A comparison of virtualization technologies for HPC. In: 22nd international conference on advanced information networking and applications (AINA 2008). IEEE, pp 861–868
9. Chen R, Bastani FB (1994) Warm standby in hierarchically structured process-control programs. IEEE Trans Softw Eng 20(8):658–663
10. Chuob S, Pokharel M, Park JS (2011) Modeling and analysis of cloud computing availability based on Eucalyptus platform for e-government data center. In: 2011 5th international conference on innovative mobile and internet services in ubiquitous computing (IMIS). IEEE, pp 289–296
11. Dantas J, Matos R, Araujo J, Maciel P (2012) An availability model for Eucalyptus platform: an analysis of warm-standby replication mechanism. In: 2012 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 1664–1669
12. Dell (2012) Dell computers. http://www.dell.com/. Accessed 10 March 2014
13. Dell (2014) Datacenter capacity planner configuration. http://www.dell.com/html/us/products/rack_advisor_new/. Accessed 10 March 2014
14. Eucalyptus (2009) Eucalyptus open-source cloud computing infrastructure—an overview. Eucalyptus Systems, Goleta
15. Eucalyptus (2014) Eucalyptus—the open source cloud platform. Eucalyptus Systems. http://open.eucalyptus.com/. Accessed 5 March 2014
16. Heartbeat (2012) Linux-HA project. http://www.linux-ha.org. Accessed 5 March 2014
17. Heimann D, Mittal N, Trivedi K (1991) Dependability modeling for computer systems. In: Proceedings of the annual reliability and maintainability symposium, 1991. IEEE, Orlando, pp 120–128
18. Hong Z, Wang Y, Shi M (2012) CTMC-based availability analysis of cluster system with multiple nodes. In: Advances in future computer and control systems. Springer, Berlin, pp 121–125
19. Hu T, Guo M, Guo S, Ozaki H, Zheng L, Ota K, Dong M (2010) MTTF of composite web services. In: 2010 international symposium on parallel and distributed processing with applications (ISPA). IEEE, pp 130–137
20. Johnson D, Murari K, Raju M, Suseendran RB, Girikumar Y (2010) Eucalyptus beginner's guide, UEC edn
21. Kim DS, Machida F, Trivedi KS (2009) Availability modeling and analysis of a virtualized system. In: 15th IEEE Pacific Rim international symposium on dependable computing (PRDC'09). IEEE, pp 365–371
22. Kuo W, Zuo MJ (2003) Optimal reliability modeling: principles and applications. Wiley, New York
23. Leangsuksun CB, Shen L, Liu T, Scott SL (2005) Achieving high availability and performance computing with an HA-OSCAR cluster. Future Gener Comput Syst 21(4):597–606
24. Leangsuksun C, Shen L, Song H, Scott SL, Haddad I (2003) The modeling and dependability analysis of high availability OSCAR cluster system. In: High performance computing systems and applications. NRC Research Press, p 285
25. Liu T, Song H (2003) Dependability prediction of high availability OSCAR cluster server. In: Proceedings of the 2003 international conference on parallel and distributed processing techniques and applications
26. Maciel P, Trivedi K, Matias R, Kim D (2011) Performance and dependability in service computing: concepts, techniques and research directions. Premier Reference Source series, IGI Global


27. Matos R, Maciel PRM, Machida F, Kim DS, Trivedi KS (2012) Sensitivity analysis of server virtualized system availability. IEEE Trans Reliab 61(4):994–1006
28. O'Connor P, Kleyner A (2011) Practical reliability engineering. Wiley, New York
29. US Department of Energy (2013) City of Palo Alto Utilities—Palo Alto CLEAN (clean local energy accessible now). http://energy.gov. Accessed 21 March 2014
30. Standby Power (2015) Lawrence Berkeley National Laboratory. http://standby.lbl.gov/. Accessed 3 Feb 2015
31. Sathaye A, Ramani S, Trivedi KS (2000) Availability models in practice. In: Proceedings of the international workshop on fault-tolerant control and computing (FTCC-1)
32. Sun D, Chang G, Guo Q, Wang C, Wang X (2010) A dependability model to enhance security of cloud environment using system-level virtualization techniques. In: 2010 1st international conference on pervasive computing, signal processing and applications (PCSPA). IEEE, pp 305–310
33. Wei B, Lin C, Kong X (2011) Dependability modeling and analysis for the virtual data center of cloud computing. In: 2011 IEEE 13th international conference on high performance computing and communications (HPCC). IEEE, pp 784–789
34. Yeow WL, Westphal C, Kozat UC (2010) A resilient architecture for automated fault tolerance in virtualized data centers. In: 2010 IEEE network operations and management symposium (NOMS). IEEE, pp 866–869
