
Software Aging in the Eucalyptus Cloud Computing Infrastructure: Characterization and Rejuvenation

JEAN ARAUJO, RUBENS MATOS, VANDI ALVES, and PAULO MACIEL, Federal University of Pernambuco

F. VIEIRA DE SOUZA, Federal University of Piauí
RIVALINO MATIAS JR., Federal University of Uberlândia
KISHOR S. TRIVEDI, Duke University

The need for high reliability, availability, and performance has significantly increased in modern applications that handle rapidly growing demands while providing uninterrupted services. Cloud computing systems fundamentally provide access to large pools of data and computational resources. Eucalyptus is a software framework widely used to implement private clouds and hybrid-style Infrastructure as a Service. It implements the Amazon Web Services (AWS) API, allowing interoperability with other AWS-based services. This article investigates software aging effects in the Eucalyptus framework, considering workloads composed of intensive requests for remote storage attachment and virtual machine instantiation. We found problems that may be harmful to system dependability and performance, specifically RAM and swap space exhaustion, as well as excessive CPU utilization by the virtual machines. We also present an approach that applies time series analysis to schedule rejuvenation, so as to reduce downtime by predicting the proper moment to perform the rejuvenation. We experimentally evaluate our approach using a Eucalyptus test bed. The results show that our approach achieves higher availability when compared to a threshold-triggered rejuvenation method based on continuous monitoring of resource utilization.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Performance attributes, reliability, availability and serviceability; D.4.8 [Operating Systems]: Performance—Measurements

General Terms: Measurement, Performance

Additional Key Words and Phrases: Software aging and rejuvenation, cloud computing, dependability and performance analysis, memory leak

ACM Reference Format: Araujo, J., Matos, R., Alves, V., Maciel, P., Vieira de Souza, F., Matias Jr., R., and Trivedi, K. S. 2014. Software aging in the Eucalyptus cloud computing infrastructure: Characterization and rejuvenation. ACM J. Emerg. Technol. Comput. Syst. 10, 1, Article 11 (January 2014), 22 pages. DOI: http://dx.doi.org/10.1145/2539122

1. INTRODUCTION

The deployment of cloud-based architectures has grown over recent years, mainly because they constitute a scalable, cost-effective, and robust service platform [Peng et al. 2009; McKinley et al. 2006]. Such features are made possible by the integration of various software components that enable reservation of, and access to, remote computational resources by means of standard interfaces and protocols, strongly based on web services [Eucalyptus 2011]. Virtualization is an essential requirement for building a typical cloud computing infrastructure [Armbrust et al. 2009]. Cloud-oriented data centers enable the success of massive user-centric applications, such as social networks, which have experienced a rapid increase in the number of concurrent accesses. These benefits are especially important to small enterprises, enabling the rapid provisioning of different levels of resource allocation while guaranteeing the performance and availability levels needed by present-day massive user-centric systems.

Although performance, availability, and reliability are major requirements for cloud-oriented infrastructures, an aspect usually neglected by many service providers is the software aging phenomenon [Grottke et al. 2008], which has been verified to play an important role in the reliability and performance degradation of many software systems [Grottke et al. 2008; Matias and Freitas Filho 2006; Bao et al. 2005]. While essential to elastic computing, the use of virtual machines and remote storage volumes requires memory- and disk-intensive operations, mainly during virtual machine allocation, reconfiguration, and destruction. Such operations may exhaust hardware and operating system resources [Araujo et al. 2011a] in the presence of software aging due to software faults or poor system design [Grottke et al. 2008].

The software aging effects in cloud computing environments were investigated in Araujo et al. [2011a, 2011b]. Those papers demonstrate the occurrence of aging effects in a Eucalyptus-based infrastructure due to the accumulation of memory leaks. In Araujo et al. [2011c] and Matos et al. [2011], rejuvenation strategies were proposed for reducing the downtime caused by aging effects in the Eucalyptus framework. In this article, we extend the study presented in Araujo et al. [2011c], including an evaluation of other aging effects, not described there, for the Eucalyptus cloud computing environment. We focus on memory and CPU utilization during consecutive attachments of remote block storage volumes. Such operations are essential for flexible allocation of virtual machines with minimum dependency on local storage devices or complex data replication mechanisms. In addition to the trend analysis of software aging-related data proposed in Araujo et al. [2011c], we analyze the aging effects on the Eucalyptus elastic block storage (EBS). EBS is a technology that provides flexible allocation of remote storage volumes to the virtual machines running in a cloud environment.

The remainder of this article is organized as follows. In Section 2, we present the fundamental concepts of the main topics discussed in this article. Section 3 presents related work, especially regarding cloud computing and software aging. Section 4 explains the test bed environment used in our experiments, including the definition of the adopted workloads. Section 5 describes the experimental studies, divided into three parts: the first experiment is performed using an EBS-based workload, whereas the second and third experiments use a workload based on the virtual machine lifecycle. Section 6 summarizes our conclusions and discusses possible topics for future research.

This research was supported in part by the NASA Office of Safety and Mission Assurance (OSMA) Software Assurance Research Program (SARP) under a JPL subcontract #1440119. Authors' addresses: J. Araujo, R. Matos, V. Alves, and P. Maciel, Informatics Center, Federal University of Pernambuco; F. Vieira de Souza, Statistics and Informatics Department, Federal University of Piauí; R. Matias Jr., School of Computer Science, Federal University of Uberlândia; K. S. Trivedi, Department of Electrical and Computer Engineering, Duke University. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2014 ACM 1550-4832/2014/01-ART11 $15.00. DOI: http://dx.doi.org/10.1145/2539122

ACM Journal on Emerging Technologies in Computing Systems, Vol. 10, No. 1, Article 11, Pub. date: January 2014.

2. BACKGROUND

The investigation of software aging in cloud computing requires a multidisciplinary approach that lies at the intersection of several distinct but related topics. This section highlights the main concepts that provide the basis for this work.

2.1. Software Aging and Rejuvenation

Software aging can be defined as a growing degradation of the software's internal state during its operational life [Grottke et al. 2008]. The causes of software aging have been verified as the accumulated effects of software fault activations [Avizienis et al. 2004] during the system runtime [Huang et al. 1995]. Aging in a software system, as in human beings, is a cumulative process. The accumulating effects of successive error occurrences directly influence the aging-related failure manifestation; these faults gradually lead the system towards an erroneous state [Huang et al. 1995]. This gradual shift, caused by the accumulation of aging effects, is the fundamental nature of the software aging phenomenon: a system fails due to the cumulative consequences of aging effects over time. For example, under a specific load demand, an application server may fail due to unavailable physical memory, which may be caused by cumulative memory leaks that, in turn, may be due to missing memory deallocations. In this case, the aging-related fault is the defect in the code that causes the memory leak; the memory leak is the observed effect of that aging-related fault. The aging factors [Grottke et al. 2008] are the input patterns that exercise the code region where the aging-related faults may reside. The aging-related effect is usually observable only after a long run of the system.

The time to aging-related failure (TTARF) is an important metric for reliability and availability studies of systems suffering from aging [Bao et al. 2005]. Previous studies on the aging-related failure phenomenon (e.g., Matias and Freitas Filho [2006], Bao et al. [2005], and Matias Jr. et al. [2010]) show that the TTARF probability distribution is strongly influenced by the intensity with which the system is exposed to aging factors, such as the system workload. Due to its cumulative property, software aging occurs more intensively in continuously running systems that execute over a long period of time, such as the software components of a cloud computing framework.
In long-running executions, a system suffering from software aging exhibits an increasing failure rate due to the accumulation of aging effects caused by successive errors, which degrades the integrity of the system's internal state. Problems such as data inconsistency, numerical errors, and exhaustion of operating system resources are examples of software aging consequences [Grottke et al. 2008]. Since the notion of software aging was introduced in Huang et al. [1995], many studies have been conducted to characterize and understand this important phenomenon. Monitoring the aging effects is essential to any aging characterization study. Many previous studies have implemented aging monitoring at different system levels; however, to the best of our knowledge, the discussion of aging effects in a cloud computing environment has not been sufficiently explored. Once aging effects are detected, mitigation mechanisms may be applied to reduce their impact on the applications or the operating system. The search for software aging mitigation approaches resulted in the so-called software rejuvenation techniques [Huang et al. 1995; Matias and Freitas Filho 2006; Vaidyanathan and Trivedi 2005]. Since aging effects are typically caused by hard-to-track (residual) software faults, rejuvenation techniques seek to reduce the aging effects during the software runtime until the aging causes (e.g., a software bug) are definitively fixed. Examples of rejuvenation approaches are application restart and system reboot. In the former, the aged application process is killed and a new process is created as a substitute. Replacing an aged process by a new one removes the aging effects accumulated during the replaced process's runtime. Other approaches focus on different system levels, such as Kourai and Chiba [2007], which presents a rejuvenation technique for virtualized environments.
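A restart-based rejuvenation policy of the kind described above can be sketched as follows. This is a hedged illustration: the threshold value and process name are hypothetical, and the paper's own proposal schedules rejuvenation from time series predictions rather than a fixed threshold.

```python
# Sketch of threshold-triggered, restart-based rejuvenation (illustrative).
# A monitor samples a process's resident memory; once the accumulated usage
# crosses a fixed threshold, the aged process is replaced by a fresh one.

def parse_vmrss_kb(proc_status_text: str) -> int:
    """Extract resident set size (kB) from a Linux /proc/<pid>/status dump."""
    for line in proc_status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")

def should_rejuvenate(rss_kb: int, threshold_kb: int = 512_000) -> bool:
    """Trigger rejuvenation once resident memory exceeds the threshold."""
    return rss_kb >= threshold_kb

# Hypothetical sample of a monitored service's /proc status:
status = "Name:\teucalyptus-cc\nVmRSS:\t  650000 kB\nThreads:\t12"
if should_rejuvenate(parse_vmrss_kb(status)):
    print("restart aged process")  # e.g., kill and respawn the service
```

The downtime cost of the restart itself is what motivates both zero-downtime techniques and prediction-based scheduling, discussed in the surrounding text.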
A common problem during software rejuvenation is the downtime overhead caused by the restart or reboot actions, since the application or operating system is unavailable while these rejuvenation mechanisms execute. Matias and Freitas Filho [2006] proposed a zero-downtime rejuvenation technique for the Apache web server, which was used by Matos et al. [2011] to address some of the aging effects observed in the Eucalyptus cloud computing environment.


2.2. Dependability in Cloud Computing

Cloud computing provides access to computers and their functionality via the Internet or a local area network [Eucalyptus 2011]. The US National Institute of Standards and Technology (NIST) [NIST 2011] defines cloud computing as follows: "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction". Cloud types (including public, private, and hybrid) refer to the nature of access and control with respect to the use and provisioning of virtual and physical resources. The most common cloud service styles are referred to by the acronyms IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service) [Sun Microsystems 2009].

Numerous advances in software architecture have helped to promote the adoption of cloud computing. These advances support the goal of efficient application development while helping applications to be elastic and to scale gracefully and automatically [Sun Microsystems 2009]. Cloud computing is seen by some as an important forward-looking model for the distribution of and access to computing resources because it offers the following potential advantages:

—Scalability. Applications designed for cloud computing need to scale dynamically with workload demands so that performance and compliance with service level agreements remain on target [Eucalyptus 2011; Sun Microsystems 2009].
—Security. Applications need to provide access only to authorized and authenticated users, and the users need to be able to trust that their data is secure [Sun Microsystems 2009].
—Availability. Regardless of the application being provided, users of cloud applications expect them to be up and running every minute of every day [Sun Microsystems 2009].
—Reliability and Fault Tolerance. Reliability means that applications do not fail and, most importantly, do not lose data [Sun Microsystems 2009]; that is, it is the ability to perform and maintain functions even under unexpected circumstances.

Many of the desirable features of a cloud system are related to the concept of dependability. There is no unique definition of dependability. By one widely adopted definition, it is the ability of a system to deliver the required services that can justifiably be trusted [Avizienis et al. 2004]. It is also defined as the property that prevents a system from failing in an unexpected or catastrophic way. Dependability is closely related to attributes such as availability and reliability. Availability is the ability of a system to perform its stated function at a specific instant of time [Trivedi et al. 2009; Xie et al. 2004; Musa 1998]. Dependability is a very important property for a cloud system, as it should provide services with high availability, high stability, high fault tolerance, and dynamic extensibility. Because cloud computing is a large-scale distributed computing paradigm whose applications are accessible anywhere and anytime, dependability in cloud systems becomes more important and yet more difficult to achieve [Sun et al. 2010].

The software aging effects in cloud systems may affect the performance of communication among the cloud components, as well as their dependability. The degradation of communication performance, in turn, may have an impact on the dependability of the system. Therefore, the presence of different and complex software layers in cloud systems raises the need for appropriate and effective monitoring of aging effects, as well as for proper rejuvenation mechanisms, in order to assure the dependability attributes previously mentioned.
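The availability attribute discussed above is commonly quantified through the standard steady-state relation between mean time to failure (MTTF) and mean time to repair (MTTR). This is a textbook relation, not a result of this paper:

```python
# Steady-state availability: the long-run fraction of time a system is up.
#   A = MTTF / (MTTF + MTTR)

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run probability that the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Example: a failure every 1000 h on average and 2 h to restore service.
print(f"{availability(1000.0, 2.0):.5f}")  # 0.99800
```

Rejuvenation improves availability by trading short, scheduled restarts (small MTTR, controlled timing) for longer unplanned outages.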


2.3. Time Series

A time series can be represented by a set of observations of a variable arranged sequentially over time [Kedem and Fokianos 2002]. The series values form a stochastic process, that is, a collection of random variables X(t), for each t ∈ T, where T is the index set. A probability distribution associated with the random variables is another component of a stochastic process. In most situations, the variable t represents time, but it can also represent another physical quantity, for example, space. The main applications of time series are description, explanation, process control, and prediction. Time series analysis enables one to build models that explain the behavior of the observed variable, and the types of analyses may be divided into frequency-domain [Bloomfield 2000; Kedem and Fokianos 2002] and time-domain methods [Akaike 1969; Box and Jenkins 1970; Chatfield 1996]. Note that a model is a probabilistic description of a time series. The modeler must decide how to use the chosen model according to his goals. Many forecasting models are based on the method of least squares, which provides the basis for all related theoretical studies. Usually they are classified according to the number of parameters involved. The regression models are among the most used for description and prediction of time series. This article adopts four models, namely the linear model, the quadratic model, the exponential growth model, and the Pearl-Reed logistic model, which are briefly described as follows, based on the predicted value Ŷ_t of E[X(t)]:

—Linear Trend Model (LTM). This is the default model used in the analysis of trends. Its equation is given by Ŷ_t = β₀ + β₁·t + e_t, where β₀ is known as the y-intercept, β₁ represents the average rate of growth per unit time, and e_t is the error of fit between the model and the real series [Montgomery et al. 2008].
—Quadratic Trend Model (QTM). This model takes into account a smooth curvature in the data. Its representation is given by Ŷ_t = β₀ + β₁·t + β₂·t² + e_t, where the coefficients have similar meanings as in the previous case [Montgomery et al. 2008].
—Growth Curve Model (GCM). This is the model of trend growth or decay in exponential form. Its representation is given by Ŷ_t = β₀·β₁^t + e_t.
—S-Curve Trend Model (SCTM). This model fits the Pearl-Reed logistic. It is usually used for time series that follow the shape of the S-curve. Its representation is given by Ŷ_t = 10^a / (β₀ + β₁·β₂^t) + e_t.

Error measures [Schwarz 1978] are adopted for choosing the model that best fits the monitored environment. MAPE, MAD, and MSD are the error measures adopted in this article:

—MAPE (Mean Absolute Percentage Error) represents the accuracy of the fitted values of the time series, expressed as a percentage. This estimator is given by

MAPE = ( Σ_{t=1}^{n} |(Y_t − Ŷ_t)/Y_t| / n ) × 100,

where Y_t is the actual value observed at time t (Y_t ≠ 0), Ŷ_t is the value calculated by the model, and n is the number of observations.
—MAD (Mean Absolute Deviation) represents the accuracy of the calculated values of the time series, expressed in the same unit as the data. MAD is an indicator of the error size and is given by the statistic

MAD = Σ_{t=1}^{n} |Y_t − Ŷ_t| / n,

where Y_t, t, Ŷ_t, and n have the same meanings as in MAPE.
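The linear trend model and the error measures used in this article (MAPE, MAD, and MSD) can be exercised with a short, self-contained sketch. The series below is synthetic (hypothetical memory-usage samples in MB), chosen only to illustrate the formulas; it is not data from the paper's experiments.

```python
# Least-squares fit of the linear trend model (LTM), plus the MAPE, MAD,
# and MSD error measures, in pure Python.

def fit_linear_trend(y):
    """Return (b0, b1) minimizing the squared error of y_hat = b0 + b1*t."""
    n = len(y)
    t_mean = (n - 1) / 2                 # mean of t = 0, 1, ..., n-1
    y_mean = sum(y) / n
    b1 = (sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
          / sum((t - t_mean) ** 2 for t in range(n)))
    b0 = y_mean - b1 * t_mean
    return b0, b1

def error_measures(y, y_hat):
    """Return (MAPE, MAD, MSD) between observed y and fitted y_hat."""
    n = len(y)
    mape = sum(abs((a - f) / a) for a, f in zip(y, y_hat)) / n * 100
    mad = sum(abs(a - f) for a, f in zip(y, y_hat)) / n
    msd = sum((a - f) ** 2 for a, f in zip(y, y_hat)) / n
    return mape, mad, msd

usage = [100, 112, 119, 131, 140, 152, 161, 170]   # synthetic samples (MB)
b0, b1 = fit_linear_trend(usage)
fitted = [b0 + b1 * t for t in range(len(usage))]
mape, mad, msd = error_measures(usage, fitted)

# The same extrapolation idea supports rejuvenation scheduling: with a
# hypothetical 4096 MB capacity, the trend crosses it near t* below.
t_exhaustion = (4096 - b0) / b1
```

Among candidate trend models, the one with the lowest error measures would be selected, and its extrapolated crossing time of a resource limit suggests when to schedule rejuvenation.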


Fig. 1. Example of Eucalyptus-based environment.

—MSD (Mean Squared Deviation) is more sensitive to larger deviations than the MAD index. Its expression is given by

MSD = Σ_{t=1}^{n} (Y_t − Ŷ_t)² / n,

where Y_t, t, Ŷ_t, and n have the same meanings as in the previous indices.

2.4. Eucalyptus Framework: An Overview

Eucalyptus is a software framework that implements scalable IaaS-style private and hybrid clouds [Eucalyptus 2010]. It was created for cloud computing research and is interface-compatible with the commercial Amazon EC2 (Elastic Compute Cloud) service [Jones 2008; Eucalyptus 2011]. The API compatibility enables one to run the same application on Amazon and Eucalyptus environments without modification. In general, the Eucalyptus cloud computing platform uses the virtualization capabilities (hypervisor) of the underlying computer system to enable flexible allocation of computing resources decoupled from specific hardware [Eucalyptus 2010].

There are five high-level components in the Eucalyptus architecture, each with its own web service interface: Cloud Controller (CLC), Cluster Controller (CC), Node Controller (NC), Storage Controller (SC), and Walrus [Eucalyptus 2010]. Figure 1 shows an example of a Eucalyptus-based cloud computing environment with two clusters (A and B). Each cluster has one Cluster Controller, one Storage Controller, and several Node Controllers. The components in each cluster communicate with the Cloud Controller and Walrus in order to service user requests. A user is able to employ EC2 tools as an interface to the Cloud Controller, or S3 (Amazon's Simple Storage Service) tools to access Walrus. A brief description of each component follows.

The Cloud Controller (CLC) is the front-end to the entire cloud infrastructure. The CLC is responsible for exposing and managing the underlying virtualized resources (servers, network, and storage) via the Amazon EC2 API [Sun Microsystems 2009]. This component uses web service interfaces to receive the requests of client tools on one side and to interact with the rest of the Eucalyptus components on the other side.
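As an illustration of the EC2-compatible tooling mentioned above, the sketch below assembles euca2ools command lines for a typical instance and volume life cycle. The command names are standard euca2ools; the image ID, instance ID, volume ID, and device path are hypothetical placeholders, and a real client would hand each list to subprocess.run.

```python
# Assembling euca2ools (EC2-compatible) command lines; IDs are placeholders.

def run_instances_cmd(image_id: str, count: int, vm_type: str) -> list[str]:
    """Request `count` new VM instances from the Cloud Controller."""
    return ["euca-run-instances", "-n", str(count), "-t", vm_type, image_id]

def attach_volume_cmd(volume_id: str, instance_id: str, device: str) -> list[str]:
    """Attach an EBS volume to a running instance as a block device."""
    return ["euca-attach-volume", "-i", instance_id, "-d", device, volume_id]

def detach_volume_cmd(volume_id: str) -> list[str]:
    """Detach an EBS volume so it can be re-attached elsewhere."""
    return ["euca-detach-volume", volume_id]

def terminate_instances_cmd(instance_ids: list[str]) -> list[str]:
    """Destroy instances at the end of their life cycle."""
    return ["euca-terminate-instances", *instance_ids]

# A driver script would execute these, e.g. subprocess.run(cmd, check=True):
for cmd in (run_instances_cmd("emi-12345678", 1, "m1.small"),
            attach_volume_cmd("vol-0001", "i-0001", "/dev/sdb"),
            detach_volume_cmd("vol-0001"),
            terminate_instances_cmd(["i-0001"])):
    print(" ".join(cmd))
```

These are exactly the kinds of operations (instance creation, volume attachment and detachment) that the workloads described in Section 4 exercise intensively.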
The Cluster Controller (CC) usually executes on a cluster front-end machine [Eucalyptus 2010, 2009], or on any machine that has network connectivity to both the


nodes running Node Controllers (NCs) and the machine running the Cloud Controller. CCs gather information about a set of VMs and schedule VM execution on specific NCs. The Cluster Controller has three primary functions: scheduling incoming requests to create VM instances on specific NCs, controlling the instance virtual network overlay, and gathering/reporting information about a set of NCs [Eucalyptus 2009].

A Node Controller (NC) runs on each node and controls the life cycle of the VM instances running on that node. Hence, the NC must interact with the operating system and the hypervisor running on the target node. A Cluster Controller (CC) manages the actions of NCs. NCs control the execution, inspection, and termination of VM instances on the node where they run, and they fetch and clean up local copies of VM images. NCs query and control the system software on their nodes in response to queries and control requests from the Cluster Controller [Eucalyptus 2010]. An NC makes queries to discover the node's physical resources (number of CPU cores, amount of memory, available disk space) as well as to learn about the state of VM instances on the node [Eucalyptus 2009; Johnson et al. 2010].

The Storage Controller (SC) provides persistent block storage for use by the VMs. It implements block-accessed network storage, similar to that provided by Amazon Elastic Block Storage (EBS) [Amazon 2011a], and it is capable of interfacing with various storage systems (e.g., NFS, iSCSI). An elastic block storage volume is a block device that can be attached to a virtual machine but sends disk traffic across the locally attached network to a remote storage location. An EBS volume cannot be shared across VM instances [Johnson et al. 2010].

Walrus is a file-based data storage service, which is interface-compatible with Amazon's Simple Storage Service (S3) [Eucalyptus 2009].
Walrus implements a REST interface (through HTTP), sometimes called the "Query" interface, as well as SOAP interfaces that are compatible with S3 [Eucalyptus 2009; Johnson et al. 2010]. Users who have access to Eucalyptus can use Walrus to stream data into and out of the cloud, as well as from VM instances they have started on the nodes. Additionally, Walrus acts as a storage service for VM images: root filesystem, OS kernel, and ramdisk images used to instantiate VMs on the nodes can be uploaded to Walrus and accessed from the nodes.

3. RELATED PAPERS

The characteristics, architectures, and applications of several popular cloud computing platforms are analyzed and discussed in Peng et al. [2009], which aims to clarify the differences among the investigated platforms. The authors conclude that although each cloud computing platform has its own strengths, there are many unsolved issues in all of them. Such issues include continuous or high availability mechanisms for cluster failover in cloud environments, consistency guarantees, synchronization across different clusters, interoperation, standardization, and security.

In Cordeiro et al. [2010], a comparative analysis of the three most popular cloud computing solutions (Xen Cloud Platform, Eucalyptus, and OpenNebula) is presented. The paper also describes illustrative examples of the use of each platform and proposes that, by understanding some of the main differences between them, one may decide where and when each solution is more appropriate.

Iosup et al. [2011] investigate the performance of cloud computing services for scientific computing workloads, quantifying the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, users who employ loosely coupled applications comprising many tasks to achieve their scientific goals. That study was followed by an empirical evaluation of the performance of four commercial cloud computing services. Finally, trace-driven simulation was used to compare the performance characteristics and cost models of clouds and other scientific computing platforms, for


general and MTC-based scientific computing workloads. The results indicate that current clouds need an order-of-magnitude performance improvement to be useful to the scientific community, and they show which improvements should be considered first to address this discrepancy between supply and demand.

Mihailescu et al. [2011] propose improvements to the resilience of cloud applications against infrastructure anomalies, by means of OX, a runtime system that uses application-level availability constraints and application topologies discovered on the fly. This system allows application owners to specify groups of highly available virtual machines. To discover application topologies, OX transparently monitors network traffic among virtual machines and, based on this information, dynamically applies VM placement optimizations to enforce application availability constraints and to reduce or alleviate application exposure to network communication anomalies, such as traffic bottlenecks.

A technique for providing high availability in virtualized environments, called Remus, is presented in Cully et al. [2008]. It is an extension to the Xen hypervisor that works by continually live-migrating a VM from the primary host to a backup. Such an approach prevents outages due to hardware failures and unusual software bugs, but it cannot avoid or fix problems commonly caused by software aging. In fact, a continuous software replication mechanism may copy bad aspects of system state, such as memory leaks or fragmentation, resulting from a faulty application.

Considering software aging in application domains other than cloud computing, Matias and Filho [2010] present a study in which they explored the Linux OS kernel using instrumentation techniques to measure software aging effects. Carrozza et al. [2010] propose a practical approach to detect aging phenomena caused by memory leaks in distributed objects in an off-the-shelf middleware that is commonly used to develop critical applications. The approach, validated on a real-world case study from the Air Traffic Control domain, defines algorithms and support tools to perform data filtering and to trade off experimentation time against the statistical accuracy of aging trend estimates.

Machida et al. [2010] present an availability analysis of virtualized servers, focusing on aging and rejuvenation of virtual machine monitors (VMMs, or hypervisors), which are important components of every cloud computing infrastructure. Machida et al. [2010] used stochastic reward nets to analyze the events of failure, repair, and preventive maintenance. Those analytical models helped to find an optimal combination of time intervals for performing VM and VMM rejuvenation, aiming to achieve high service availability and minimal loss of transactions. Similar works on rejuvenation in virtualized systems are found in Paing and Thein [2012] and Rezaei and Sharifi [2010].

4. TESTBED ENVIRONMENT

We built a test bed composed of six machines (2.66-GHz Core 2 Quad processors, 4-GB RAM, 500-GB SATA hard disks). Three experimental studies were carried out in this infrastructure. For experiment #1, five of the six physical machines ran Ubuntu Server 10.04 (Linux kernel 2.6.38-8) and Eucalyptus version 2.0.2. One machine, used as a client for the cloud, ran Ubuntu Desktop 11.04 (Linux kernel 2.6.38-8 x86-64). The operating system running in the virtual machines was a customized version of Ubuntu Server Linux 9.04 running an HTTP server. For experiments #2 and #3, the operating system used was Ubuntu Server 10.04 (kernel 2.6.35-24) with Eucalyptus version 1.6.1. The cloud environment under test was fully based on the Eucalyptus framework and the KVM hypervisor [KVM 2012].

Figure 2 shows the components of our test bed. The Cloud Controller, Cluster Controller, Storage Controller, and Walrus were installed on the same machine (host 1 in our environment), and the VMs were instantiated on four physical machines (hosts 2, 3, 4, and 5), so that each of them ran a Node Controller. Each host has the capacity to run at most four VMs. A single machine (the client host) was used to monitor the entire environment and also to send requests to the Cloud Controller, acting as a client of the cloud infrastructure implemented on our test bed. All nodes were connected to a private local area network by means of a dedicated switch.

Fig. 2. Components of the testbed environment.

The first experimental study considered the management of elastic block storage volumes assigned to the virtual machines. The second and third studies targeted the effects of the virtual machine life cycle, stressing the instantiation of virtual machines in the nodes of the cloud. Each study used a specific workload designed to accelerate the corresponding aging effects.

4.1. Description of the Workload #1: Management of EBS Volumes

For the first experiment, the environment was monitored for 300 hours. This duration was defined based on previous works [Araujo et al. 2011b, 2011c] and on empirical observation of the time elapsed until the occurrence of aging symptoms under the adopted workload, which was designed to accelerate possible faults and the manifestation of the related aging effects. The workload exercised the Eucalyptus features that manage remote storage volumes assigned to the virtual machines; the Eucalyptus commands euca-attach-volume and euca-detach-volume were used for this purpose. The workload generation was implemented by a set of scripts that started 10 VMs and repeatedly attached and detached 1-gigabyte remote volumes to and from the VMs. The use of elastic block storage enables failover mechanisms such as rebooting a VM on a different physical machine: when a host fails, data and applications are kept in a consistent state. Figure 3 represents the workload used in the first experiment. There are 50 storage volumes (Volume1, ..., Volume50) available in the test environment, and 10 virtual machines (VM1, ..., VM10). At the beginning of the experiment, each VM has one volume assigned to it, so Volumei is attached to VMi, for i = 1 to 10. Every 30 seconds, the script detaches the current volumes from all VMs, waits 10 seconds, and attaches the volumes in the next range, from Volume11 to Volume20. When the current volumes to be detached are in the range from Volume41 to Volume50, the assignment returns to the initial range, from Volume1 to Volume10. This workload script executes these operations for 300 hours, while measurement scripts collect data at 1-minute intervals in each


Fig. 3. EBS management workload.

physical machine of the test bed. We collect memory usage for the Eucalyptus-related processes by means of the Linux /proc pseudo-filesystem [Canonical 2011], which exposes detailed information about the running processes and the overall system. The system programs mpstat and free [Blum 2008] were also used to gather measures such as CPU utilization, swap space utilization, and the number of zombie processes. The processes responsible for the cloud controller, the node controllers, and the virtual machines were monitored as well; for each of these processes, we tracked CPU usage and resident and virtual memory utilization.

4.2. Description of the Workload #2: Management of VM Lifecycle

For the study on VM lifecycle management, we changed the previously mentioned test bed infrastructure: three nodes ran a 32-bit (i386) version of the Ubuntu Server Linux OS, whereas one ran the 64-bit (amd64) version of the same OS, which allowed us to capture possible aging effects related to the system architecture. The environment was monitored for 30 days in experiment #2 and for 72 hours in experiment #3. As in experiment #1, the duration of these experiments was based on empirical observations of the time elapsed until the manifestation of aging symptoms under the workload we adopted to accelerate the lifecycle of the virtual machines. Such a lifecycle is composed of four states: Pending, Running, Shutting down, and Terminated, as shown in Figure 4. Scripts are used to start, reboot, and kill the VMs within a short time period. Such operations are essential to this kind of environment because they enable quick scaling of capacity, both up and down, as the computing requirements change (the so-called elastic computing pattern) [Amazon 2011b]. Cloud-based applications adapt to increases in demand by instantiating new virtual machines, and save resources by terminating underused VMs when the request load is low. VM reboots are also essential to high-availability mechanisms that automatically restart VMs on other physical servers when a server failure is detected. Our workload was implemented by means of shell script functions that perform the following operations:

—Instantiate Function. This function instantiates 8 VMs in a cluster. The VMs are instances of an Ubuntu Server running an HTTP server.
—Kill Function. This function finds out which VM instances are running in the cloud and kills all of them.
—Reboot Function. Much like the previous function, it also finds all the existing VM instances, but instead of killing them, it reboots them.
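The three functions above, together with the two-minute/two-hour control cycle described in the next section, can be sketched as follows. This is an illustrative reconstruction, not the authors' scripts: the euca2ools command names match those used in the paper, but the image id, instance type, and the parsing of euca-describe-instances output are assumptions.

```python
# Sketch of the VM lifecycle workload functions. The image id "emi-0example",
# the instance type, and the describe-output format are illustrative.
import subprocess

def shell(cmd):
    """Run a euca2ools command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def running_instances(run=shell):
    """Parse euca-describe-instances output for ids of running instances."""
    ids = []
    for line in run(["euca-describe-instances"]).splitlines():
        fields = line.split()
        if fields and fields[0] == "INSTANCE" and "running" in fields:
            ids.append(fields[1])
    return ids

def instantiate(n=8, run=shell):
    """Instantiate Function: start n VMs of the HTTP-server image."""
    for _ in range(n):
        run(["euca-run-instances", "emi-0example", "-t", "m1.small"])

def kill_all(run=shell):
    """Kill Function: terminate every running instance."""
    ids = running_instances(run)
    if ids:
        run(["euca-terminate-instances"] + ids)

def reboot_all(run=shell):
    """Reboot Function: reboot every running instance."""
    ids = running_instances(run)
    if ids:
        run(["euca-reboot-instances"] + ids)

def decide(elapsed_s):
    """Control cycle, checked every 2 minutes: kill all VMs after 2 hours
    since the last initialization, otherwise reboot them."""
    return "kill" if elapsed_s > 7200 else "reboot"
```

The `run` parameter exists only to make the functions testable without a live cloud; a production script would call the commands directly.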


Fig. 4. VM lifecycle management workload.

Every two minutes, the script checked whether more than two hours had passed since the last initialization. If so, all VMs were killed; otherwise, all VMs were rebooted. The two-minute reboot interval and the two-hour kill interval were chosen empirically, based on the number of executions obtained within the monitoring period, so as to generate a workload that constantly stresses the Eucalyptus infrastructure. The environment was first monitored for about two hours without workload; then, the control script instantiated all VMs and followed the workload cycle previously described. The monitoring period without workload was included to enable comparing the data obtained with and without workload.

5. EXPERIMENTAL RESULTS

In this section, we present the results of the three experiments performed in the Eucalyptus environment. Experiment #1 addresses the management of EBS volumes; experiment #2 examines the usage of system-wide resources during virtual machine lifecycle management; and experiment #3 shows the degradation of application-specific resources during virtual machine lifecycle management.

5.1. Experiment #1: Management of EBS Volumes

The data collected in this experiment show important aging effects in the elastic block storage management of Eucalyptus. Specifically, the analysis of resource utilization in one of the cluster nodes (host 2) indicates aging symptoms in some components of this cloud infrastructure. Figure 5 shows that the virtual memory consumed by the Eucalyptus Node Controller process (apache2 /var/run/eucalyptus/httpd-nc.conf) increases linearly during the entire experiment. We modeled this increase through a linear regression, obtaining the equation Ŷt = 356842 + 2.32t, where Ŷt is the predicted amount of memory usage (in KB) at time t. The MAPE for this linear model is 0.00099, which confirms the goodness of fit of the obtained model. Figure 6 shows the plot of real and fitted values, besides the three error indices. The mean absolute deviation (MAD) is 368 KB, a small value if we consider that only increases in the order of megabytes would deserve attention.

Fig. 5. Virtual memory utilization by the node controller process.

Fig. 6. Trend analysis for the virtual memory of the node controller.

Table I. Prediction for Virtual Memory Values

Time (month)    Predicted (KB)
2               557,290
4               757,738
6               958,186
8               1,158,634
10              1,359,082
12              1,559,530

Table I shows the values predicted if this Eucalyptus infrastructure received the workload for a larger period of time, expressed in months. The node controller process alone is expected to reach about 1.5 GB of virtual memory in a period of one year. This behavior may have important consequences for the performance of the virtual machines that run on this node. It is important to highlight that Eucalyptus is the software infrastructure for the management and execution of VMs, and it should not consume, to a large extent, the resources supposed to be provisioned for the VMs.

The utilization of resident memory, presented in Figure 7, also shows a growth trend, but it drops at around 10,000 minutes, due to the Linux memory management, which starts transferring part of the data from RAM to the swap area on the hard disk. This phenomenon is confirmed by the swap utilization, shown in Figure 8, which starts increasing at the same time. Swap space in Linux is commonly used when the amount


Fig. 7. Resident memory utilization by the node controller process.

Fig. 8. Swap utilization in host 2.

of available physical memory (RAM) is low. If the system needs more memory resources and the RAM is full, inactive pages in memory are moved to the swap space. While swap space can help systems with a small amount of RAM, it should not be considered a replacement for physical memory. Swap space resides on hard drives, which have much slower access times than physical memory: memory latency is measured in nanoseconds, while disk latency is measured in milliseconds, so accessing the disk can be tens of thousands of times slower than accessing physical memory. In such situations, the system struggles to find free memory and keep applications running at the same time. A system administrator could add more physical RAM to the system, but it is better to mitigate the aging issues, or to fix the leak through a software update. In Figure 8, we can also see that the used swap space reaches about 450 MB, whereas the resident memory of the node controller decreases by only about 100 MB; the remaining amount comes from the virtual machines allocated on that host. Figure 9 shows the resident memory utilization for both virtual machines (KVM processes) that were running on host 2. It is important to emphasize that we measure the memory used by the KVM process responsible for each VM at the host machine, not inside the guest operating system running in the VM. A growth tendency similar to the one detected in the node controller process was observed, as well as a drop at the


Fig. 9. Resident memory utilization for VMs at host 2.

Fig. 10. CPU utilization for VMs at host 2.

mark of about 10,000 minutes. Notice that each virtual machine exhibits a more significant memory utilization growth than the node controller. They reach almost 256 MB, which is the total amount of RAM configured for each VM in our environment. Therefore, each VM was near the maximum usage of physical memory, reinforcing the harmfulness of this aging phenomenon due to its possible performance consequences. The aging effects in the virtual machines also appear as increasing CPU utilization as time passes. Figure 10 shows that CPU utilization in the virtual machines reaches almost 90%, confirming that system performance degrades as the workload requests are processed over time. This can be considered one of the most critical results observed in this experiment, because such high CPU utilization can make the system take too long to respond, and can even cause failures in the execution of new requests [Witkon 2007; Sousa et al. 2009]. These results reveal that the system must be carefully evaluated and tuned, possibly by developing and applying patches to Eucalyptus as a preventive measure for further workload demands. Figure 11 shows another important effect of system degradation: the number of unsuccessful attachment/detachment requests during the experiment. Since 10 attachments and 10 detachments are performed in each workload cycle, the maximum number of errors reported is 20 for each graph point. The increase in the number of errors begins at a point close to the beginning of the swap usage growth in the node, indicating


Fig. 11. Errors in workload requests for volumes attachment/detachment.

some relationship between these aging symptoms. The errors returned by the workload requests are due to attempts to attach a remote volume to a device that is already in use by the VM. This highlights the increasing trend of the detachment time as the experiment goes on, a tendency that becomes critical when the 10-second interval between detachments and new attachments is no longer sufficient.

5.2. Experiment #2: Virtual Machines Instantiation Workload - System-Wide Resources

We analyzed the utilization of hardware and software resources in a scenario where the adopted workload performed operations related to the instantiation of virtual machines. The main goal of this case study was to verify the existence of software aging symptoms during VM lifecycle management. Some indicators of software aging had already been observed, but there was a need for confirmation and characterization of this phenomenon. Therefore, the experiment was run for a 30-day period, a duration based on empirical observations of the time elapsed until the manifestation of possible aging symptoms under the workload adopted to stress the system. The CPU utilization and swap space usage of the cloud controller machine (host 1) and of the 64-bit node controller machine (host 4) are among the most representative results found in this scenario; other results can be found in Araujo et al. [2011a] and Matos et al. [2011]. Figure 12 shows the CPU utilization of the cloud controller machine. During almost the entire experiment, the CPU usage does not exceed 5%. However, some major growth spurts following a nearly linear pattern can be observed throughout the experiment. Such peaks of resource usage increase over time, which may be a sign of progressive performance degradation. There was also considerable growth in swap space usage on the cloud controller machine and on the node controller machine running the 64-bit OS. In Figure 13, we can see that this growth came close to 14 MB and 3.5 MB, respectively. The growth is constant, without drops, since the host continued responding to VM instantiation requests throughout the entire experiment. Even without stopping the service, however, this behavior deserves attention, because swap space is a limited resource and, over a longer period, this growth may lead to resource depletion and then to a system crash. It may also cause performance issues, similar to the case observed in the previous experimental study.
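Growth patterns like the swap usage above are flagged by trend analysis over the sampled series. A minimal version of such a check, an illustrative helper rather than the authors' tooling, is an ordinary least-squares slope over the monitored values:

```python
# Least-squares slope of a resource series sampled at a fixed interval
# (e.g., swap KB per 1-minute sample). A persistently positive slope over
# a long window is the kind of aging indicator discussed above.
def slope(series):
    """Ordinary least-squares slope of the values against their index."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

A slope near zero indicates a stable resource; a sustained positive slope (in KB per sample) motivates the model fitting used in the next sections.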


Fig. 12. CPU utilization in the cloud controller machine (Host 1).

Fig. 13. Swap memory used.

5.3. Experiment #3: Virtual Machines Instantiation Workload - Application-Specific Resources

Based on experimental study #2 and previous works [Araujo et al. 2011a], we noticed software aging symptoms in the Eucalyptus node controller related to virtual machine instantiations. The repeated operations (instantiation, reboot, and termination) revealed an ever-increasing usage of virtual memory, which disrupted the VM processes on node controllers running 32-bit operating systems; the node controller then stopped responding to VM instantiation commands. This behavior constitutes an aging phenomenon of the Eucalyptus framework related to memory leaks. A manual restart of the Eucalyptus node controller service made the virtual memory usage fall to less than 110 MB. After restarting the service (by means of manual intervention), the same pattern repeats: as can be seen in Figure 14, the process's virtual memory grew to about 3064 MB, and again the node controller was not able to serve the requests for virtual machine instantiation. This section further explains the rejuvenation method adopted to mitigate the harmful consequences of this aging phenomenon.
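A plain threshold watchdog for the symptom above would read the node controller's virtual memory from the /proc pseudo-filesystem and restart the service when the critical limit is reached. The sketch below is illustrative: the service restart command is an assumption, and only the VmSize field of /proc/<pid>/status is real Linux behavior.

```python
# Threshold-style watchdog sketch: restart the NC service once its virtual
# memory (VmSize) reaches the ~3-GB limit at which it stopped responding.
# The "service eucalyptus-nc restart" command is an illustrative assumption.
import subprocess

CMU_KB = 3 * 1024 * 1024  # critical memory utilization, in kB (~3 GB)

def vmsize_kb(pid, procfs="/proc"):
    """Virtual memory size of a process in kB, from /proc/<pid>/status."""
    with open(f"{procfs}/{pid}/status") as f:
        for line in f:
            if line.startswith("VmSize:"):
                return int(line.split()[1])  # VmSize is reported in kB
    return 0

def check_and_restart(pid, procfs="/proc",
                      restart=lambda: subprocess.run(
                          ["service", "eucalyptus-nc", "restart"])):
    """Restart the NC service if its virtual memory reached the limit."""
    if vmsize_kb(pid, procfs) >= CMU_KB:
        restart()
        return True
    return False
```

As discussed next, sampling such a check at 1-minute intervals can miss the exact crossing point, which is precisely the shortcoming the proposed method addresses.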


Fig. 14. Virtual memory used in the NC process at Host3 [Araujo et al. 2011a].

Fig. 15. Classification of rejuvenation strategies.

Adopted Rejuvenation Method. We propose an automated method for triggering a rejuvenation mechanism in private cloud environments backed by the Eucalyptus cloud-computing framework. Our method is based on the rejuvenation mechanism originally presented in Matias and Freitas Filho [2006], which sends a software signal (SIGUSR1) to the apache master process, so that all idle apache slave processes are terminated and new ones are created to replace them. This action cleans up the accumulated memory-leak effects inside the replaced apache processes, and has a small impact on the service, since the master process is still able to wait for new connections while the slave processes' rejuvenation occurs. Creating a new process to replace the old one causes a downtime of around 5 seconds, due to the loading of Eucalyptus configurations during process startup, as observed in previous experiments. In a production environment, the interval between process restarts should be as large as possible, in order to reduce the accumulated effect of small downtimes over a long runtime period.

One approach to determining that maximum interval is based on high-frequency monitoring of the process's memory usage: at the exact moment when a memory limit is reached, the rejuvenation is triggered. Since a small sampling interval may affect system performance, we consider a 1-minute interval the minimum needed to avoid interference with the system. Despite its ability to provide good results, such an approach has a problem: the node controller process may reach its memory limit between two monitoring points in time. An additional downtime is introduced in this way, which we describe as monitor-caused downtime. Our proposed triggering approach aims to remove this additional downtime, while keeping the interval between process restarts as large as possible.
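The signal-based trigger from Matias and Freitas Filho [2006] reduces, in code, to sending SIGUSR1 to the apache master process; everything else (worker replacement) is apache's own graceful-restart behavior. In this sketch, the pidfile path is an illustrative assumption:

```python
# Sketch of the signal-based rejuvenation trigger: SIGUSR1 makes the apache
# master gracefully recycle its idle worker processes. The pidfile location
# is an assumption and varies across installations.
import os, signal

def rejuvenate(pidfile="/var/run/apache2.pid", kill=os.kill):
    """Send SIGUSR1 (graceful restart) to the apache master process."""
    with open(pidfile) as f:
        master_pid = int(f.read().strip())
    kill(master_pid, signal.SIGUSR1)  # workers replaced; master keeps listening
    return master_pid
```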
The prediction of when the critical memory utilization (CMU) will be reached is used for this purpose. Therefore, considering the classification of rejuvenation strategies presented in Figure 15, our approach has characteristics of two categories: it is a threshold-based rejuvenation, but it is aided by predictions. Time-series fitting


Fig. 16. Forecasting and threshold-based hybrid rejuvenation approach.

enables us to perform a trend analysis and therefore to state, with an acceptable error, the time remaining until the process reaches the CMU. This information makes it possible to schedule the rejuvenation for a given time (Trej), which takes into account a safety margin (Tsafe) to complete the rejuvenation process before the CMU is reached. The safety margin should encompass the time spent during the rejuvenation and the time corresponding to the time-series prediction error. Therefore, Trej = TCMU − Tsafe = TCMU − (Trestart + TPredError). As can be seen in Figure 16, the trend analysis is started only after the monitoring script detects that the node controller process has grown beyond a time-series computation starting point, TSCSP, which in our case is 80% of the critical memory utilization. This starting point was adopted to avoid unnecessary interference on the system due to the periodic computation of the time-series fitting. When a limit of 95% of the CMU is reached, the last prediction generated by the trend analysis is used to schedule the rejuvenation action; that is, the last computed TCMU is used to assess Trej, and the system is prepared so that the rejuvenation occurs gracefully at time Trej. Upon reaching this time-series computation final point, TSCFP, there is no benefit in continuing to compute new estimates, and it would be risky to postpone the scheduling of the rejuvenation action. Note that the values adopted for TSCSP and TSCFP are specific to our environment, and may vary if this strategy is instantiated for other kinds of systems.

Result Analysis. We carried out an experimental study to verify the effectiveness of our proposed rejuvenation method in the described Eucalyptus cloud computing environment. We focused on the rejuvenation of the Node Controller process, since it showed the most pronounced aging effects among all monitored components.
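The multiple-threshold policy just described can be written compactly. The constants below are taken from the text (3-GB CMU, 80% and 95% thresholds, ~5-second restart, 5-minute prediction-error margin); this is a sketch of the decision logic, not the authors' monitoring script:

```python
# The hybrid triggering policy: start time-series fitting at TSCSP (80% of
# the CMU) and schedule rejuvenation at TSCFP (95%), using the last forecast.
CMU_KB = 3 * 1024 * 1024       # critical memory utilization (~3 GB, in kB)
TSCSP_KB = 0.80 * CMU_KB       # time-series computation starting point
TSCFP_KB = 0.95 * CMU_KB       # time-series computation final point

def action(mem_kb):
    """What the monitoring script does at a 1-minute sample."""
    if mem_kb >= TSCFP_KB:
        return "schedule_rejuvenation"
    if mem_kb >= TSCSP_KB:
        return "fit_time_series"
    return "monitor"

def t_rej(t_cmu_min, t_restart_min=5 / 60, t_pred_error_min=5):
    """T_rej = T_CMU - (T_restart + T_PredError), all in minutes."""
    return t_cmu_min - (t_restart_min + t_pred_error_min)
```

With the experiment's predicted TCMU of 809 minutes, `t_rej(809)` reproduces the scheduling expression 809 − (5/60 + 5) used in the result analysis.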
This experiment was executed for 72 hours because it was already known, from experiment #2, that such a time interval was enough to obtain the expected aging results. It is noteworthy that the use of time series to reduce system downtime during the execution of a rejuvenation action may also be applied to other computing environments: the time series itself is not involved directly in the system, but indicates the appropriate time at which an action should be taken. We used data collected in preliminary experiments to find out which kind of time series provides the best fit for the growth of virtual memory usage in Node Controller processes. We used four models (LTM, QTM, GCM, and SCTM) for trend analysis. A summary of the fitting results and their errors is shown in Table II, where Ŷt is the predicted value of the memory consumption at time t. It can be seen from Table II that the values of the indices MAPE, MAD, and MSD are smaller for the LTM and QTM models, so the choice must be made between these two models. Although the MAPE values are the same for both, the other index values of the QTM model are smaller than those of the LTM model. The QTM model was therefore chosen as the best fit for the trend analysis of virtual memory utilization in Eucalyptus node controllers, and it was used for our rejuvenation scheduling.

Table II. Summary of the Accuracy Indices for Each Model (NC Virtual Memory)

Model   Ŷt                                         MAPE   MAD    MSD
LTM     44157.1 + 2.85t                            1%     900    1472447
QTM     43354.8 + 2.95830t − 0.000002t²            1%     872    1343698
GCM     53013.6(1.00003^t)                         6%     6014   52619294
SCTM    10^6 / (4.70218 + 16.8036(0.999942^t))     2%     1259   3449812
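The four candidate models and the selection rule from Table II can be stated directly in code. This is an illustrative restatement of the table, with the selection implemented as a lexicographic comparison of (MAPE, MAD, MSD):

```python
# The four fitted trend models of Table II and the selection rule used in
# the text: keep the model with the smallest error indices (QTM).
models = {
    "LTM":  lambda t: 44157.1 + 2.85 * t,
    "QTM":  lambda t: 43354.8 + 2.95830 * t - 0.000002 * t ** 2,
    "GCM":  lambda t: 53013.6 * 1.00003 ** t,
    "SCTM": lambda t: 1e6 / (4.70218 + 16.8036 * 0.999942 ** t),
}

# (MAPE %, MAD, MSD) as reported in Table II.
table_ii = {
    "LTM":  (1, 900, 1472447),
    "QTM":  (1, 872, 1343698),
    "GCM":  (6, 6014, 52619294),
    "SCTM": (2, 1259, 3449812),
}

def best_model(errors):
    """Model with the smallest (MAPE, MAD, MSD), compared lexicographically."""
    return min(errors, key=lambda name: errors[name])
```

The lexicographic comparison encodes the paper's tie-breaking: MAPE ties between LTM and QTM, so the smaller MAD (and MSD) selects QTM.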

Fig. 17. Quadratic trend analysis of virtual memory.

Table III. Comparison of Experiments

                   Threshold-based rejuvenation   Proposed rejuvenation method
Availability       0.999584                       0.999922
Number of nines    3.38                           4.11
Downtime           108 seconds                    20 seconds
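The derived figures in Table III can be cross-checked from the measured availabilities alone (number of nines as −log10(1 − A), and yearly downtime as the unavailability over a 365-day year):

```python
# Cross-checking Table III: number of nines and projected yearly downtime
# from the measured availabilities.
import math

def nines(availability):
    """Number of nines of availability: -log10(1 - A)."""
    return -math.log10(1 - availability)

def yearly_downtime_min(availability):
    """Expected downtime over one year, in minutes."""
    return (1 - availability) * 365 * 24 * 60
```

These reproduce the 3.38 and 4.11 nines of Table III and the roughly 218-minute versus 41-minute yearly downtimes discussed below.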

The experimental study for verifying the rejuvenation method was performed in two parts. First, the cloud environment was stressed with the workload described in Section 4.2, and the rejuvenation action was triggered only when the monitoring script detected that the critical limit had been reached. Next, the same workload was used, but the rejuvenation was scheduled based on the time-series predictions. Figure 17 shows the trend analysis for the growth of virtual memory utilization, fit by the quadratic function Ŷt = 94429 + 3825.3t − 0.0686t². This analysis provided a value of 809 minutes for TCMU, that is, the predicted time to reach the 3-GB limit. Knowing the beginning time of the experiment, we scheduled the rejuvenation for Trej = 809 − (5/60 + 5) minutes, counting from the beginning of the experiment. After the rejuvenation, the memory usage is reduced, and another trend analysis is carried out when the 80% limit is reached again. The results show that the proposed rejuvenation triggering method brings system availability to a higher level when compared to the threshold-based rejuvenation approach: the number of nines increases from 3.38 to 4.11 (see Table III). Over a time span of one year, this difference means a decrease from 218 minutes to 40 minutes of downtime; that is, the unavailable time was reduced by about 80%. Such an enhancement in system availability avoids the loss of requests for the instantiation of virtual machines, or of any other similar user requests. Table IV presents the absolute percentage error between the predictions and the actual values of the virtual memory utilization in this experiment. The error varies

Table IV. Comparison of Virtual Memory Predictions and Actual Values

Time (min)   Predicted (KB)   Actual (KB)   Error (%)
120          608181           695624        12.57%
240          1069185          1064728       0.42%
360          1530189          1433776       6.72%
480          1991193          1977272       0.70%
600          2452197          2601880       5.75%

in this range from two to ten hours of experiment, but it stays below 10% at most analyzed points, which means that our approach achieved an acceptable accuracy in its predictions; it could be further enhanced by using the latest prediction errors to adjust the related threshold. The best approximations are obtained at four and eight hours of experiment, which are the points where the "fit" line intercepts the "actual" line in Figure 17. The worst errors in Table IV correspond to the regions where the lines in Figure 17 are farthest apart, showing that the 12.57% error at 120 minutes is close to the upper bound of the prediction errors in this experiment. Such a bounded maximum prediction error makes our approach usable in other environments with similar aging characteristics.

6. FINAL REMARKS

This article investigates software aging effects in the Eucalyptus-based cloud computing infrastructure. In addition to detecting different aging effects in Eucalyptus, we propose a rejuvenation method for mitigating the identified aging effects. We found indicators of memory leaks in Eucalyptus processes related to the handling of elastic block storage. Such problems may be harmful to the dependability of Eucalyptus, or of any other cloud application running in its environment. Performance degradation due to RAM exhaustion and the subsequent use of swap space was detected and discussed. The high CPU utilization of the virtual machines also pointed to possible faults related to the management of EBS volumes, whether in the guest operating system or in the KVM hypervisor. Memory leaks in VM management, especially during instantiations, caused system crashes that blocked the creation of new VMs. As for the aging mitigation approach, the proposed rejuvenation method used multiple thresholds and time-series forecasting to reduce the virtual memory utilization before the system reached a critical point. The experimental results show that our approach offers reduced downtime when compared to a threshold-based method. Note that the proposed method is not tied to the characteristics of our experimental cloud computing environment, so it can be adapted to handle aging issues in practically any other software system.

ACKNOWLEDGMENTS

We would like to thank the following Brazilian agencies for research support: CNPq, FACEPE, FAPEMIG, and CAPES. We also give our thanks to the MoDCS Research Group.

REFERENCES AKAIKE, H. 1969. Fitting autoregressive models for prediction. Ann. Institute Stat. Math. 21, 1, 243–247. AMAZON. 2011a. Amazon Elastic Block Store (EBS). Amazon.com, Inc. Available in: http://aws.amazon. com/ebs. AMAZON. 2011b. Amazon elastic compute cloud - ec2. Amazon.com, Inc. ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., AND MATIAS, R. 2011a. Software aging issues on the eucalyptus cloud computing infrastructure. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC’11). Anchorage.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 10, No. 1, Article 11, Pub. date: January 2014.

Software Aging in the Eucalyptus Cloud Computing Infrastructure

11:21

ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., MATIAS, R., AND BEICKER, I. 2011b. Experimental evaluation of software aging effects on the Eucalyptus cloud computing infrastructure. In Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware'11) (Lisbon).
ARAUJO, J., MATOS JUNIOR, R., MACIEL, P., VIEIRA, F., MATIAS, R., AND TRIVEDI, K. S. 2011c. Software rejuvenation in Eucalyptus cloud computing infrastructure: A method based on time series forecasting and multiple thresholds. In Proceedings of the 3rd International Workshop on Software Aging and Rejuvenation (WoSAR'11), in conjunction with the 22nd Annual International Symposium on Software Reliability Engineering (ISSRE'11) (Hiroshima).
ARMBRUST, M., FOX, A., GRIFFITH, R., JOSEPH, A. D., KATZ, R., KONWINSKI, A., LEE, G., PATTERSON, D., RABKIN, A., STOICA, I., AND ZAHARIA, M. 2009. Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, UC Berkeley Reliable Adaptive Distributed Systems Laboratory. Feb.
AVIZIENIS, A., LAPRIE, J., RANDELL, B., AND LANDWEHR, C. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Depend. Secure Comput. 1, 11–33.
BAO, Y., SUN, X., AND TRIVEDI, K. S. 2005. A workload-based analysis of software aging and rejuvenation. IEEE Trans. Reliab. 54, 541–548.
BLOOMFIELD, P. 2000. Fourier Analysis of Time Series: An Introduction. Wiley Series in Probability and Statistics. Wiley.
BLUM, R. 2008. Linux Command Line and Shell Scripting Bible. Wiley.
BOX, G. AND JENKINS, G. 1970. Time Series Analysis. Holden-Day Series in Time Series Analysis. Holden-Day, San Francisco, CA.
CANONICAL. 2011. Manual pages about using a GNU/Linux system. Canonical Ltd. Available at: http://manpages.ubuntu.com/manpages/hardy/man5/proc.5.html.
CARROZZA, G., COTRONEO, D., NATELLA, R., PECCHIA, A., AND RUSSO, S. 2010. Memory leak analysis of mission-critical middleware. J. Syst. Softw. 83, 1556–1567.
CHATFIELD, C. 1996. The Analysis of Time Series: An Introduction, 5th Ed. Chapman & Hall/CRC, New York.
CORDEIRO, T., DAMALIO, D., PEREIRA, N., ENDO, P., PALHARES, A., GONCALVES, G., SADOK, D., KELNER, J., MELANDER, B., SOUZA, V., AND MÅNGS, J.-E. 2010. Open source cloud computing platforms. In Proceedings of the 9th International Conference on Grid and Cloud Computing (GCC'10) (Jiangsu). 1–5.
CULLY, B., LEFEBVRE, G., MEYER, D., FEELEY, M., HUTCHINSON, N., AND WARFIELD, A. 2008. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI'08) (San Francisco). 161–174.
EUCALYPTUS. 2009. Eucalyptus Open-Source Cloud Computing Infrastructure - An Overview. Eucalyptus Systems, Inc., 130 Castilian Drive, Goleta, CA 93117, USA.
EUCALYPTUS. 2010. Cloud Computing and Open Source: IT Climatology is Born. Eucalyptus Systems, Inc., 130 Castilian Drive, Goleta, CA 93117, USA.
EUCALYPTUS. 2011. Eucalyptus - the open source cloud platform. Eucalyptus Systems, Inc. Available at: http://open.eucalyptus.com/.
GROTTKE, M., MATIAS, R., AND TRIVEDI, K. 2008. The fundamentals of software aging. In Proceedings of the 1st International Workshop on Software Aging and Rejuvenation (WoSAR'08), in conjunction with the 19th IEEE International Symposium on Software Reliability Engineering (ISSRE'08) (Seattle).
HUANG, Y., KINTALA, C., KOLETTIS, N., AND FULTON, N. D. 1995. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th Symposium on Fault Tolerant Computing (FTCS-25) (Pasadena). 381–390.
IOSUP, A., OSTERMANN, S., YIGITBASI, N., PRODAN, R., FAHRINGER, T., AND EPEMA, D. 2011. Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Trans. Parallel Distrib. Syst. (TPDS), Special Issue on Many-Task Computing 22, 931–945.
JOHNSON, D., MURARI, K., RAJU, M., RB, S., AND GIRIKUMAR, Y. 2010. Eucalyptus Beginner's Guide, UEC Ed. For Ubuntu Server 10.04 - Lucid Lynx, v1.0.
JONES, M. T. 2008. Cloud computing with Linux - cloud computing platforms and applications. IBM Corporation. 12.
KEDEM, B. AND FOKIANOS, K. 2002. Regression Models for Time Series Analysis. Wiley.
KOURAI, K. AND CHIBA, S. 2007. A fast rejuvenation technique for server consolidation with virtual machines. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07) (Washington). 245–255.
KVM. 2012. Kernel-based virtual machine. Project home page. Available at: http://www.linux-kvm.org.
MACHIDA, F., KIM, D. S., AND TRIVEDI, K. 2010. Modeling and analysis of software rejuvenation in a server virtualized system. In Proceedings of the 2nd IEEE International Workshop on Software Aging and Rejuvenation (WoSAR'10). 1–6.

ACM Journal on Emerging Technologies in Computing Systems, Vol. 10, No. 1, Article 11, Pub. date: January 2014.


MATIAS, R. AND FREITAS FILHO, P. J. 2010. Measuring software aging effects through OS kernel instrumentation. In Proceedings of the 2nd International Workshop on Software Aging and Rejuvenation (WoSAR'10), in conjunction with the 21st IEEE International Symposium on Software Reliability Engineering (ISSRE'10) (San Jose).
MATIAS, R. AND FREITAS FILHO, P. J. 2006. An experimental study on software aging and rejuvenation in web servers. In Proceedings of the 30th Annual International Computer Software and Applications Conference (COMPSAC'06) (Chicago).
MATIAS JR., R., BARBETTA, P. A., TRIVEDI, K. S., AND FREITAS FILHO, P. J. 2010. Accelerated degradation tests applied to software aging experiments. IEEE Trans. Reliab. 59, 1, 102–114.
MATOS JR., R., ARAUJO, J., MACIEL, P., VIEIRA, F., MATIAS, R., AND TRIVEDI, K. S. 2011. Software rejuvenation in Eucalyptus cloud computing infrastructure: A hybrid method based on multiple thresholds and time series prediction. Int. Trans. Syst. Sci. Appl. 7, 295–303.
MCKINLEY, P. K., SAMIMI, F. A., SHAPIRO, J. K., AND TANG, C. 2006. Service clouds: A distributed infrastructure for composing autonomic communication services. In Proceedings of the 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing (DASC'06) (Indianapolis, IN). 341–348.
MIHAILESCU, M., RODRIGUEZ, A., AND AMZA, C. 2011. Enhancing application robustness in infrastructure-as-a-service clouds. In Proceedings of the 1st International Workshop on Dependability of Clouds, Data Centers and Virtual Computing Environments (DCDV'11), in conjunction with the 41st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'11) (Hong Kong).
MONTGOMERY, D. C., JENNINGS, C. L., AND KULAHCI, M. 2008. Introduction to Time Series Analysis and Forecasting. Wiley Series in Probability and Statistics. Wiley.
MUSA, J. D. 1998. Software Reliability Engineering: More Reliable Software, Faster Development and Testing, 2nd Ed. McGraw-Hill, New York, NY.
NIST. 2011. National Institute of Standards and Technology, Information Technology Laboratory, U.S. Department of Commerce. Available at: http://csrc.nist.gov.
PAING, A. M. M. AND THEIN, N. L. 2012. Stochastic reward nets model for time based software rejuvenation in virtualized environment. Int. J. Comput. Sci. Telecommun. 3, 1, 1–10.
PENG, J., ZHANG, X., LEI, Z., ZHANG, B., ZHANG, W., AND LI, Q. 2009. Comparison of several cloud computing platforms. In Proceedings of the 2nd International Symposium on Information Science and Engineering (ISISE'09) (Shanghai). IEEE Press, 23–27.
REZAEI, A. AND SHARIFI, M. 2010. Rejuvenating high available virtualized systems. In Proceedings of the International Conference on Availability, Reliability, and Security (ARES'10). 289–294.
SCHWARZ, G. 1978. Estimating the dimension of a model. Ann. Statist.
SOUSA, E., MACIEL, P. R. M., ARAUJO, C., ALVES, G., AND CHICOUT, F. 2009. Performance modeling for evaluation and planning of electronic funds transfer systems. In Proceedings of the IEEE Symposium on Computers and Communications (ISCC'09). 73–76.
SUN, D., CHANG, G., GUO, Q., WANG, C., AND WANG, X. 2010. A dependability model to enhance security of cloud environment using system-level virtualization techniques. In Proceedings of the 1st International Conference on Pervasive Computing, Signal Processing and Applications. 6.
SUN MICROSYSTEMS. 2009. Introduction to Cloud Computing Architecture, 1st Ed. Sun Microsystems, Inc.
TRIVEDI, K. S., KIM, D. S., ROY, A., AND MEDHI, D. 2009. Dependability and security models. In Proceedings of the 7th International Workshop on the Design of Reliable Communication Networks (DRCN'09).
VAIDYANATHAN, K. AND TRIVEDI, K. S. 2005. A comprehensive model for software rejuvenation. IEEE Trans. Depend. Secure Comput. 2, 124–137.
WITKON, E. 2007. Using load testing to meet your SLA. RadView Software, RadView Executive White Paper.
XIE, M., DAI, Y.-S., AND POH, K.-L. 2004. Computing System Reliability: Models and Analysis. Kluwer Academic Publishers.
Received April 2012; revised September 2012; accepted November 2012