Availability Study on Cloud Computing Environments: Live Migration as a Rejuvenation Mechanism

Matheus Melo, Paulo Maciel, Jean Araujo, Rubens Matos and Carlos Araújo
Informatics Center, Federal University of Pernambuco, Recife, Brazil
Email: {mdetm, prmm, jcta, rsmj, cjma}@cin.ufpe.br

Abstract—With the increasing adoption of cloud computing environments, studies about high availability in those systems became more and more significant. Software rejuvenation is an important mechanism to improve system availability. This paper presents a comprehensive availability model to evaluate the utilization of the live migration mechanism to enable VMM rejuvenation with minimum service interruption. Live migrations are performed observing a time-based trigger. We evaluate five different scenarios, with distinct time intervals for triggering the rejuvenation. The results show that the live migration can significantly reduce the system downtime.

Keywords—Cloud computing, software aging and rejuvenation, live migration, availability.
I. INTRODUCTION
Virtualization brings benefits such as better resource utilization and fault tolerance. Due to these capabilities, virtualization has become essential for various types of platforms and has leveraged the cloud computing paradigm [1]. An important feature of cloud systems is the ability to move virtual machines (VMs) from one physical host to another. This capability is called VM migration [2].

Many systems that rely on cloud computing platforms require nearly uninterrupted service. Therefore, system availability is an important concern for cloud platforms. In this context, software rejuvenation is a promising technique for achieving high availability [3].

In cloud environments, the VMs run on a Virtual Machine Monitor (VMM), also called a hypervisor. This component is liable to suffer failures or hangs due to software aging. When an application runs continuously, its performance may degrade and its failure rate may increase [4]. In this situation, a software rejuvenation mechanism can be applied as a fault prevention action.

In cloud computing environments, the applications are hosted on VMs. Rejuvenation actions directed at the VMM cause the VMs to be paused or terminated. Therefore, system availability decreases due to VMM rejuvenation, and the rejuvenation process needs to conform to the required availability levels. To achieve rejuvenation with minimum downtime, we can use virtual machine migration, which consists of moving the VM to another host before applying the rejuvenation actions. If the VM is moved using live migration [2], the downtime reaches minimum levels. However, if live migration is performed too frequently in a cloud environment, availability may degrade as well.
978-1-4799-0181-4/13/$31.00 ©2013 IEEE
This paper proposes a comprehensive availability model of a cloud computing environment with time-based rejuvenation supported by the live migration mechanism. The main objective is to evaluate the impact that different rejuvenation policies based on live migration have on the steady-state availability. For this purpose, we created five scenarios with different rejuvenation policies and evaluated them to find the steady-state availability and the expected annual downtime. The models also consider some non-aging-related failures. Sensitivity analysis shows that steady-state availability is substantially improved with the appropriate trigger interval for the rejuvenation actions.

The remaining sections are organized as follows. Section II explains the fundamental concepts of software aging and rejuvenation. Section III describes the architecture of the private cloud system analyzed in this paper, whereas Section IV presents the models developed to represent the cloud system. Section V presents the results obtained through model analysis, with a focus on the availability metrics for each scenario. Section VI concludes the paper and presents some possible future works.

II. SOFTWARE AGING AND REJUVENATION
Software aging can be defined as a growing degradation of software's internal state during its operational life [5]. Software aging has been attributed to the accumulated effect of software fault activations [6] during system runtime [4]. Aging in a software system is a cumulative process. In long-running execution, a system suffering from software aging increases its failure rate due to the accumulation of successive errors that degrade the integrity of the system's internal state. Problems such as data inconsistency, numerical errors, and exhaustion of operating system resources are examples of software aging consequences [5].

Since the notion of software aging was introduced [4], many studies have been conducted in order to characterize this phenomenon in many kinds of systems. The software aging effects in cloud computing environments were addressed in [7] and [8]. These papers demonstrated the occurrence of faults in a private cloud infrastructure due to the accumulation of memory leaks in some software components of the Eucalyptus platform. The work in [9] also shows that there is an increase of CPU utilization during consecutive attachments of remote block storage volumes by means of Eucalyptus commands. Such an aging effect degrades the performance of applications running on Eucalyptus-based clouds and may lead to service unavailability.

Once the aging effects are detected, mitigation mechanisms might be applied in order to reduce their impact on the applications or the operating system. The employment of such mitigation mechanisms is known as software rejuvenation [10]. Since the aging effects are typically caused by hard-to-track software faults, rejuvenation techniques aim to reduce the aging effects during the software runtime, until the aging causes (e.g., a software bug) are fixed permanently. Examples of rejuvenation approaches are software restart and system reboot, which are effective actions but may cause service downtime, since the application or system is unavailable during the execution of these actions. In [11], rejuvenation strategies have been proposed for mitigating the downtime caused by the aging effects in the Eucalyptus cloud computing framework.

III. SYSTEM ARCHITECTURE
This study considers a system with three main components: Main Node, Standby Node and Management Server. The Main Node represents the main host of the environment, containing a VMM which runs the VM with a desired application. The Standby Node is a spare host which assumes the Main Node role when a VM migrates to it. This mechanism is similar to a warm standby replication [12]. The Management Server is the component responsible for controlling the entire cloud environment, by means of a specific cloud management tool. The system's organization is shown in Figure 1. Besides the three main components mentioned, there is a remote storage volume which is accessed by the VM and managed by the Management Server. All components are interconnected in a private network.
Fig. 1. System Architecture

The system operational mode is described as follows. The Main Node and its VM need to be running and working properly. The Management Server needs to be up and running, because it controls the whole environment. If the Standby Node fails, the system does not stop; only the migration is disabled. It is worth highlighting that the roles of the Standby Node and the Main Node are swapped when the VM migrates; therefore, the availability of the receiving host becomes essential to system availability as soon as an incoming migration is completed.

The study considers that the VMM is affected by software aging [13]. The consequences of software aging range from hangs to total system failures. When the Main Node is up and running, the VMM is aging (which can lead to failure), and the VM or the other Main Node components (hardware and operating system) may also fail. The Standby Node can fail if its hardware or operating system goes down. If the Main Node or the Standby Node suffers a non-aging failure, all the aging effects are cleared, because all the repair mechanisms involve restarting the related components and, subsequently, restarting the VMM.

In order to clear aging effects on the VMM, a rejuvenation is periodically scheduled, supported by VM live migration to minimize the downtime. When a VM migration is requested, the Main Node moves the VM to the Standby Node. As the VM migration completes, the Standby Node assumes the role of the Main Node and a rejuvenation process is performed on the previous Main Node. When this process finishes, the original Main Node assumes the Standby Node role. The rejuvenation process clears the aging status, bringing the VMM on the node to a fresh state, ready to receive the VM again when needed.

IV. AVAILABILITY MODELS

The availability models are built using the extended Deterministic Stochastic Petri Nets (SPNs) [14] [15] and Reliability Block Diagrams (RBD) [16] formalisms. We consider the occurrence of both non-aging and aging-related failures in the system. This enables the analysis of the rejuvenation impact on the whole system. For the hosting nodes (MainNode and StandbyNode) and the ManagementServer, we obtained the mean time to failure (MTTF) and mean time to repair (MTTR) through the analysis of the RBD models described in Figures 2(a) and 2(b). These RBD models consider only the non-aging failures. Figure 2(a) shows that a node fails if either the hardware (HW) or the operating system (OS) fails. Figure 2(b) shows that, besides hardware and operating system, a failure in the management tool may also bring the Management Server down.

Fig. 2. RBDs MS and Node: (a) RBD Node; (b) RBD MS (Management Server)
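The series structure of the two RBDs can be illustrated with a short calculation. The following is a minimal sketch, not the SHARPE evaluation used in the paper: it assumes exponentially distributed times to failure and repair and independent blocks, and the function name `series_rbd` is hypothetical. With the Table I inputs, it yields values close to the node and Management Server parameters listed in Table II.

```python
# Series RBD sketch: the structure is up only if every block is up.
# Assumptions of this sketch (not stated in the paper): exponential
# times to failure/repair and statistically independent blocks.

def series_rbd(blocks):
    """Return (MTTF, MTTR, availability) of a series RBD.

    blocks: list of (mttf_hours, mttr_hours) tuples, one per block.
    """
    # Failure rates of blocks in series add up.
    failure_rate = sum(1.0 / mttf for mttf, _ in blocks)
    # Steady-state availability is the product of block availabilities.
    availability = 1.0
    for mttf, mttr in blocks:
        availability *= mttf / (mttf + mttr)
    eq_mttf = 1.0 / failure_rate
    # Equivalent MTTR chosen so that MTTF / (MTTF + MTTR) = availability.
    eq_mttr = eq_mttf * (1.0 - availability) / availability
    return eq_mttf, eq_mttr, availability


# Block parameters from Table I, converted to hours.
HW = (8760.0, 100.0 / 60.0)
OS = (1440.0, 1.0)
MANAGEMENT_TOOL = (788.4, 1.0)

node = series_rbd([HW, OS])                  # Figure 2(a)
ms = series_rbd([HW, OS, MANAGEMENT_TOOL])   # Figure 2(b)

print(f"Node: MTTF = {node[0]:.1f} h, MTTR = {node[1]:.2f} h")
print(f"MS:   MTTF = {ms[0]:.1f} h, MTTR = {ms[1]:.2f} h")
```

Under these assumptions the node MTTF comes out near 1236.7 h and the Management Server MTTF near 481.5 h; small differences in the last digit of the MTTR values may come from rounding or from the exact method used by the tool.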
Our SPN model is intended to assess the impact of the rejuvenation process on the steady-state availability of the system. In the evaluations performed, the model does not take into account failure detection times or the details of the VM live migration process. Failures of the Remote Storage Volume and network problems are also neglected. Figure 3 contains the SPN model for the system under study. The model is composed of three sub-models: a) ManagementServer Model, b) Clock Model and c) System Model.

The ManagementServer Model is a simple availability model which represents the behavior of the Management Server of the cloud. Times to failure and repair for this sub-model (MS_fail and MS_repair, respectively) are obtained from the RBD model depicted in Figure 2(b).

The Clock Model employs an idea similar to that presented in [3]. This model represents the rejuvenation schedule for the system. The transition Trigger fires after an established deterministic time and deposits a token in the place ReadyToMigrate. Finally, the ResetClock transition has a guard function that enables it only when the migration process is over.

The System Model represents the main events related to the Main Node, the Standby Node, and the VM. The place MN_UP indicates that the Main Node and its VM (see Figure 1) are running properly. From this point, several transitions can fire. MN_fail fires when an internal (non-aging) failure occurs on the Main Node. As the VM depends on the node, the system fails when this transition fires. Recovery is a two-step process. First, the Main Node is repaired. After that, the VM needs to be restarted for the system to become operational again. The time values for the transitions MN_fail and MN_repair are obtained from the RBD model depicted in Figure 2(a). Another possibility when the system is running properly is a VM failure.
In the model, this is represented by firing the transition VM_fail, which deposits one token in the VM_DW place. This place indicates that the VM has failed. At this point, two events are possible. Either the VM is repaired, so the system returns to the operating state, or the Main Node also fails. If the second possibility occurs, the problem on the Main Node is solved first, thus allowing the repair of the VM. In any case, the system only goes up when both the VM and the Main Node are working.

To model the VMM aging phenomenon we used a 3-phase Erlang sub-net. This kind of net is adopted because we are handling a process which has an increasing failure rate. This behavior can be seen in the bathtub curve, where the failure rate increases after a certain stable period of life [5]. The transitions Aging and Aging2 represent the phases of the sub-net. The transitions ClearAging, ClearAging2 and ClearAging3 represent events which clear the aging effects of the VMM. The model considers that these events occur when the Main Node or the VM fails, since the respective repair actions involve rejuvenation actions, and there is no reason to accumulate aging when the VM is not working. When a live migration is performed, the VM is moved to a fresh VMM environment, which justifies the removal of aging effects. If none of these events occurs, the VMM reaches a critical age that leads to a failure (transition FailureAging fires). After that, the node reaches an inactive state (MN_Dead) where it needs a recovery from aging effects (transition RecoverFromAging) so the system returns to an up state.

A token in SN_up means that the spare machine is up and can receive a VM by live migration. However, if SN_fail fires, the Standby Node reaches an inactive state and needs to be repaired to return to the active state (SN_repair fires). The time values for the transitions SN_fail and SN_repair are obtained from the RBD model depicted in Figure 2(a). It is important to mention that if the Standby Node fails, the system only fails if the Main Node or the Management Server also fails.

The behavior of the rejuvenation supported by live migration is modeled as follows. The transition DoLiveMigration represents the start of the live migration event, which can only occur if some conditions are met. MainNode and StandbyNode must be running properly, and the clock has to signal that it is time to migrate (token in place ReadyToMigrate of the Clock Model). If only one of these two conditions is satisfied, the migration cannot occur. When both conditions are satisfied, DoLiveMigration fires and puts a token in the LiveM place. Live migration consists of moving the VM to another host with a very small downtime [17], which is represented in the model by the MigrationTime transition. Thus, while a token remains in the LiveM place, the system is down. After the end of the live migration, the source node undergoes a rejuvenation process, the Standby Node takes the role of Main Node, and the system is up again.

Therefore, in this model the VM live migration is used to avoid a long downtime during the rejuvenation action. The main goal is to keep a spare machine with a clear aging status ready to receive the migrated VM. With an aging-cleared machine, the VM can continue running while the rejuvenation is applied to the node which suffered software aging. The rejuvenation process is represented by the RejuvenationNode transition. While the node is under the rejuvenation process, no migration is allowed.

V. MODEL ANALYSIS AND RESULTS
We conducted an availability study on the system described and modeled in the previous sections. The main objective is to evaluate the impact that different rejuvenation policies based on live migration have on the steady-state availability. For this purpose we created five scenarios with different rejuvenation policies and evaluated them to find the steady-state availability and the annual downtime. The appropriate trigger interval for the rejuvenation is also found for each scenario. Previous software aging studies on cloud platforms show that the time to aging-related failure (TTARF) depends on the workload submitted to the environment [13]. Therefore, we evaluated the model using different values of TTARF to show what happens when the rejuvenation supported by live migration is used on clouds with different loads.

A. Models parameters

The RBD models are built and evaluated using the SHARPE tool [18]. SPN models are built and evaluated using the TimeNET tool [15]. Parameters adopted in the RBD models can be seen in Table I, and SPN parameters are in Table II. Due to the difficulty of obtaining dependability rates through experimentation, in the scope of this paper all values are taken from consolidated studies [19] and [20].
Fig. 3. SPN model for a cloud environment with live migration rejuvenation mechanism

Fig. 4. Sensitivity analysis of rejuvenation policies on steady-state availability: (a) Scn #1 - 100 hrs; (b) Scn #2 - 250 hrs; (c) Scn #3 - 500 hrs; (d) Scn #4 - 750 hrs; (e) Scn #5 - 1000 hrs
TABLE I. RBDs PARAMETERS

Block RBD | MTTF | MTTR
HW | 8760 h | 100 min
OS | 1440 h | 1 h
Management tool | 788.4 h | 1 h

TABLE II. SPNs PARAMETERS

Transition name | Description | Mean time
MS_fail | MS Internal failure | 481.5 h
MS_repair | MS Repair | 1.03 h
MN_fail, MN_fail2 | MainNode Internal failure | 1236.7 h
MN_repair, MN_repair2 | MainNode Repair | 1.09 h
SN_fail | StandbyNode Internal failure | 1236.7 h
SN_repair | StandbyNode Repair | 1.09 h
VM_fail | VM failure | 2880 h
VM_repair | VM repair | 30 min
VM_Rb | VM reboot | 5 min
Aging, Aging2 | Times to aging (phases) | *
RecoverFromAging | Time to recover from aging failure | 1 h
MigrationTime | Time to live migrate a VM | 4 s
RejuvenationNode | Time to rejuvenate Node | 30 min

* Depends on the scenario

TABLE IV. RESULTS FROM MODEL ANALYSIS

# Scn | Rej. Policy | SS Avail. | Downtime (min/yr)
1 | 1 h | 0.9961765 | 2009.6
2 | 6 h | 0.9965244 | 1826.7
3 | 6 h | 0.9966707 | 1749.8
4 | 12 h | 0.996714 | 1727.1
5 | 12 h | 0.9967297 | 1718.8
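The downtime column of Table IV follows directly from the steady-state availability. As a minimal sketch, assuming a 365-day year (525,600 minutes) and using a hypothetical helper name `annual_downtime_minutes`, the relation is downtime = (1 - A) x minutes per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, assuming a 365-day year


def annual_downtime_minutes(availability):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Scenario -> (steady-state availability, reported downtime in min/yr),
# both copied from Table IV.
TABLE_IV = {
    1: (0.9961765, 2009.6),
    2: (0.9965244, 1826.7),
    3: (0.9966707, 1749.8),
    4: (0.996714, 1727.1),
    5: (0.9967297, 1718.8),
}

for scenario, (avail, reported) in TABLE_IV.items():
    computed = annual_downtime_minutes(avail)
    print(f"Scenario {scenario}: computed {computed:.1f} min/yr, "
          f"reported {reported} min/yr")
```

The computed values agree with the reported ones to within a fraction of a minute; the residual differences are consistent with the availabilities being printed with seven decimal places.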
B. Evaluation of scenarios

The evaluation of the scenarios is based on two major factors: rejuvenation policy and Time To Aging Related Failure (TTARF). We built five scenarios with different TTARFs. In each scenario, the time interval to trigger the rejuvenation varies from 1 hour to 720 hours, using sampling. It is important to highlight that these values correspond to the mean interval between migrations. The scenarios are specified in Table III, including the time values used for each phase of the Erlang-based sub-net that represents the aging phenomenon. The values are determined using a proportion shown in the aging phenomenon described in [13].

For each scenario, we made a sensitivity analysis considering the impact of rejuvenation policies on steady-state availability. Figure 4 presents the results of these scenarios. The plots also include a baseline availability obtained from a model without rejuvenation policies. Notice that, for nearly all scenarios and policies, the system which uses rejuvenation has higher availability than the system which does not use it. From the results it is possible to see that the availability decreases after reaching a specific maximum value.

The appropriate rejuvenation policy to achieve the maximum availability in each scenario is an important conclusion from this sensitivity analysis. Table IV presents the rejuvenation trigger interval which yields the highest availability for each scenario. Note that for scenarios with the lowest workload intensities (scenarios 4 and 5) the rejuvenation may be triggered with a larger interval than in scenarios with the highest workload intensity and, consequently, the smallest TTARFs.

In order to better show the impact of the proposed rejuvenation on cloud computing environments, we compute two other metrics. The first is the percentage gain of availability. This measure shows how much the steady-state availability is improved when the system uses the rejuvenation supported by live migration. It was computed by comparing the steady-state availability with and without migration. This gain appears on the Y-axis of Figure 5. We also calculated the minutes of downtime avoided in a year. The Y-axis of Figure 6 shows the difference in downtime between the system with and without rejuvenation. The X-axis of both plots corresponds to the time interval to trigger a rejuvenation on the system, which is limited to 720 hours.

Fig. 5. Availability improvement in each scenario

Fig. 6. Downtime reduction (min/yr)

TABLE III. SCENARIOS DEFINITION

# Scenario | TTARF | Aging 1st phase | Aging 2nd, 3rd phases
1 | 100 h | 66.667 h | 16.667 h
2 | 250 h | 166.667 h | 41.667 h
3 | 500 h | 333.333 h | 83.333 h
4 | 750 h | 500 h | 125 h
5 | 1000 h | 666.667 h | 166.667 h
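The phase times in Table III can be checked against two properties mentioned in the model description: the three phase means add up to the TTARF, and a multi-phase sub-net produces an increasing failure rate. The sketch below is illustrative only. It assumes each phase is exponentially distributed with the listed mean, and it integrates the phase probabilities with a simple forward Euler scheme; the function name `hazard_at` and the step size are choices of this sketch, not part of the paper's TimeNET model.

```python
def hazard_at(phase_means, t, dt=0.01):
    """Approximate hazard rate (per hour) of the aging failure at time t.

    phase_means: mean sojourn time (hours) of each sequential phase,
    assumed exponentially distributed in this sketch.
    """
    rates = [1.0 / m for m in phase_means]
    # p[i] = probability of being in phase i and not yet failed.
    p = [1.0] + [0.0] * (len(rates) - 1)
    for _ in range(int(round(t / dt))):
        nxt = []
        for i, rate in enumerate(rates):
            inflow = rates[i - 1] * p[i - 1] if i > 0 else 0.0
            # Forward Euler step of dp_i/dt = inflow - rate * p_i.
            nxt.append(p[i] + dt * (inflow - rate * p[i]))
        p = nxt
    # Hazard = failure flow out of the last phase / survival probability.
    return rates[-1] * p[-1] / sum(p)


# Scenario 1 of Table III: first phase, then second and third phases.
SCENARIO_1 = [66.667, 16.667, 16.667]

print(f"Sum of phase means: {sum(SCENARIO_1):.3f} h (TTARF = 100 h)")
for hours in (10, 50, 100, 300):
    print(f"t = {hours:>3} h: hazard ~ {hazard_at(SCENARIO_1, hours):.5f} per hour")
```

Under these assumptions the hazard starts near zero and grows toward the rate of the slowest phase (about 1/66.667 per hour), which is the increasing-failure-rate behavior that the Erlang sub-net is meant to capture. The same check applies to the other scenarios, where the first phase is likewise two thirds of the TTARF.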
The plots in Figures 5 and 6 enable us to see the direct impact of rejuvenation trigger intervals on the availability of the system. In Scenario 1, which has the smallest TTARF, the rejuvenation mechanism produces significant improvements: with the proper rejuvenation trigger interval it is possible to avoid about 78 hours of downtime in a year. When the rejuvenation trigger interval varies, the steady-state availability also changes, but in slightly different ways for each scenario. There is an appropriate schedule for each scenario. In some cases, overly frequent migrations can degrade system availability drastically. Conversely, the behavior depicted in the plots allows us to state that the steady-state availability tends to approach the baseline availability (no migration) as the trigger interval increases.

VI. CONCLUSIONS AND FUTURE WORKS
We presented a comprehensive availability model based on RBD and SPN for a cloud system where VMM rejuvenation supported by live migration is enabled. The models enable the choice of the appropriate rejuvenation policy for distinct scenarios, achieving significant improvement in the steady-state availability. The results show that the rejuvenation mechanism supported by live migration may be useful in various scenarios with different workloads. Moreover, for systems under heavy workloads, where aging is faster, the live migration rejuvenation can bring a significant improvement to availability when the correct rejuvenation schedule is employed. It is also important to note that when the rejuvenation trigger interval is large, the steady-state availability tends to return to the value of the system without rejuvenation. There are also cases where intense migration activity is not appropriate to improve availability.

In future works, we intend to study more scenarios with different types of workloads and to include in the availability models the aging characteristics of other components of cloud computing environments. Another future objective is to perform sensitivity analysis on other model parameters, not only the rejuvenation trigger.

ACKNOWLEDGMENTS

We would like to thank the Coordination of Improvement of Higher Education Personnel (CAPES), the Foundation for Support to Science and Technology of Pernambuco State (FACEPE), and the MoDCS Research Group for their support.

REFERENCES

[1] C. Gong, J. Liu, Q. Zhang, H. Chen, and Z. Gong, “The characteristics of cloud computing,” in Parallel Processing Workshops (ICPPW), 2010 39th Int. Conf. on. IEEE, 2010, pp. 275–279.
[2] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, “Live migration of virtual machines,” in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation - Volume 2. USENIX Association, 2005, pp. 273–286.
[3] F. Machida, D. S. Kim, and K. S. Trivedi, “Modeling and analysis of software rejuvenation in a server virtualized system,” in Software Aging and Rejuvenation (WoSAR), 2010 IEEE 2nd Int. Workshop on. IEEE, 2010, pp. 1–6.
[4] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software rejuvenation: Analysis, module and applications,” in Proc. of 25th Symp. on Fault Tolerant Computing, FTCS-25, Pasadena, 1995, pp. 381–390.
[5] M. Grottke, R. Matias, and K. Trivedi, “The fundamentals of software aging,” in Proc. of 1st Int. Workshop on Software Aging and Rejuvenation (WoSAR), in conjunction with 19th IEEE Int. Symp. on Software Reliability Engineering, Seattle, Nov. 2008.
[6] A. Avizienis, J. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, pp. 11–33, 2004.
[7] J. Araujo, R. Matos Junior, P. Maciel, and R. Matias, “Software aging issues on the eucalyptus cloud computing infrastructure,” in Proceedings of the IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC’11), Anchorage, 2011.
[8] J. Araujo, R. Matos Junior, P. Maciel, R. Matias, and I. Beicker, “Experimental evaluation of software aging effects on the eucalyptus cloud computing infrastructure,” in Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware’11), Lisbon, 2011.
[9] R. Matos Junior, J. Araujo, V. Alves, and P. Maciel, “Experimental evaluation of software aging effects in the eucalyptus elastic block storage,” in Proceedings of the IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC’12), Seoul, 2012.
[10] R. Matias and P. J. Freitas Filho, “An experimental study on software aging and rejuvenation in web servers,” in Proc. of 30th Annual Int. Computer Software and Applications Conference (COMPSAC’06), Chicago, Sep. 2006.
[11] J. Araujo, R. Matos Junior, P. Maciel, F. Vieira, R. Matias, and K. S. Trivedi, “Software rejuvenation in eucalyptus cloud computing infrastructure: a method based on time series forecasting and multiple thresholds,” in Proc. of the 3rd Int. Workshop on Software Aging and Rejuvenation (WoSAR’11) in conj. with the 22nd annual Int. Symp. on Software Reliability Engineering (ISSRE’11), Hiroshima, Japan, 2011.
[12] A. Guimaraes, H. Oliveira, R. Barros, and P. Maciel, “Availability analysis of redundant computer networks: A strategy based on reliability importance,” in Communication Software and Networks (ICCSN), 2011 IEEE 3rd Int. Conf. on, May, pp. 328–332.
[13] R. Matos, J. Araujo, V. Alves, and P. Maciel, “Characterization of software aging effects in elastic storage mechanisms for private clouds,” in IEEE 23rd Int. Symp. on Software Reliability Engineering Workshops (ISSREW), 2012, pp. 293–298.
[14] R. German, Performance Analysis of Communication Systems with Non-Markovian Stochastic Petri Nets. New York, NY, USA: John Wiley & Sons, Inc., 2000.
[15] R. German, C. Kelling, A. Zimmermann, G. Hommel, T. U. Berlin, and F. P. U. Robotik, “Timenet - a toolkit for evaluating non-markovian stochastic petri nets,” Performance Evaluation, vol. 24, pp. 69–87, 1995.
[16] K. D. Figiel and D. R. Sule, “A generalized reliability block diagram (rbd) simulation,” in Proc. of the 22th Winter Simulation Conference, New Orleans, Louisiana, USA, December 9-12, 1990, O. Balci, Ed. IEEE, 1990, pp. 551–556.
[17] Y. Wu and M. Zhao, “Performance modeling of virtual machine live migration,” in Cloud Computing (CLOUD), 2011 IEEE Int. Conf. on, July, pp. 492–499.
[18] K. S. Trivedi, “Sharpe 2002: Symbolic hierarchical automated reliability and performance evaluator,” in DSN, 2002, p. 544.
[19] J. Dantas, R. Matos, J. Araujo, and P. Maciel, “An availability model for eucalyptus platform: An analysis of warm-standy replication mechanism,” in Systems, Man, and Cybernetics (SMC), 2012 IEEE Int. Conf. on, Oct., pp. 1664–1669.
[20] D. S. Kim, F. Machida, and K. Trivedi, “Availability modeling and analysis of a virtualized system,” in Dependable Computing, 2009. PRDC ’09. 15th IEEE Pacific Rim Int. Symp. on, Nov., pp. 365–371.