Keep It Moving: Proactive workload management for reducing SLA violations in large scale SaaS clouds

Arpan Roy, Rajeshwari Ganesan and Santonu Sarkar
Next Gen Computing Lab, Infosys Labs, Electronics City, Bangalore 560100, India
Email: {Arpan_Roy02, Rajeshwari_Ganesan, Santonu_Sarkar01}@infosys.com

Abstract—Software failures, workload-related failures and job overload conditions bring about SLA violations in software-as-a-service (SaaS) systems. Existing work does not address mitigation of SLA violations completely because (i) none of it addresses mitigation of SLA violations in business-specific scenarios (SaaS, in our case), (ii) while some approaches do not address software and workload-related failures, others do not address the problem of target PM selection for workload migration comprehensively (leaving out vital considerations such as workload compatibility checks between the migrating VM and the VMs at the target PM) and (iii) a clear mathematical mapping between workload, resource demand and SLA is lacking. In this paper, we present the Keep It Moving (KIM) software framework for the cloud controller that helps minimize service failures due to SLA violations of availability, utilization and response time in SaaS cloud data centers. Though we consider migration to be the primary mitigation technique, we also try to mitigate SLA violations without migration. We achieve this by performing a capacity check on the host physical machine (PM) before the migration to identify if enough capacity is available on the current PM to address the upcoming SLA violations by restart/reboot or VM resizing. In certain cases, such as workload-related failures due to corrupt files, we prefer workload rerouting to a replica VM over migration. We formulate the selection of a target PM as a multi-objective optimization problem. We validate our proposed approach using a trace-based discrete event simulation of a virtualized data center where failure and workload characteristics are simulated from data extracted from real SaaS business server logs. We found that a 60% reduction in SLA violations is possible using our approach, along with a reduction in VM downtime of approximately 10%.

Keywords—SLA violation, business SaaS data center, application logs, failures, multi-objective optimization

I. INTRODUCTION

On cloud platforms, service providers need to abide by certain service level agreements (SLAs) with the customer. For every SLA violation, the service provider pays the client a pre-determined penalty. In SaaS platforms, SLA agreements usually consist of two measures of interest (MOI): (i) availability of service and (ii) service response time. In order to provide the customer with proof of SLA compliance, the service provider provides them with a snapshot of these MOIs over a certain time interval. For each MOI, a threshold is defined in the SLA; if the MOI value exceeds the pre-specified threshold, it leads to an SLA violation. These SLA violations are typically viewed as service failures. VM failures (specifically, we investigate software failures and workload-related failures) are more frequent and are usually brought about by exhaustion of operating system resources (e.g., memory leaks), fragmentation, accumulation of errors over time or corrupt files provided by the workload. These kinds of failures are manifested in virtual machine crashes (availability violations) or response time violations (performance degradation) [1]. We use a proactive SLA violation detection model to generate proactive triggers for remedial actions. Typical remedial actions include virtual machine (VM) restarts/reboots, increasing the VM's allocated capacity, rerouting the workload to a replica VM and live VM migration to a new PM. In this paper, we present a unified software framework that helps minimize service failures and VM downtimes in SaaS cloud data centers. More specifically, we present:

∙ the Keep It Moving (KIM) software framework for the cloud controller, which is aware of SLA violations of workload response time thresholds, of PM and VM utilization thresholds, and of availability at the PM and the VM, together with a workload management system (within the KIM framework) that performs SLA-violation-specific mitigation by migrating/rerouting the workload or by increasing the VM's allocated capacity for load balancing (using Little's law),



∙ a classification of SLA violations in SaaS scenarios based on failure and workload data from a business SaaS platform [2], and



∙ a validation of our proposed approach via data-trace-based discrete event simulation of a relevant case study, which shows that a 60% reduction in the number of SLA violations and close to a 10% improvement in VM downtime are possible.

The system we propose can be viewed as a dynamic system with several nodes (PMs) where numerous tokens (workloads/VMs) are periodically in motion between these nodes (as shown in Figure 1) in order to mitigate upcoming SLA violations. We refer to the PM on which the VM selected for migration initially resides as the host PM and the PM to which this VM migrates as the target PM. Our approach to select the target PM is based on (i) the migrating VM's utilization, (ii) the target PM's utilization, (iii) the response time of jobs of the migrating workload on the target PM and (iv) the workload collocation compatibility of the migrating VM with the VMs on the target PM. Though we refer to our implementation as the Keep It Moving (KIM) framework, we perform a capacity check before the migration to identify if enough capacity is available on the current PM to address the SLA violations on one of its VMs (mitigation without migration). If the answer is yes, migration is avoided, which helps to cut down expenses. Furthermore, we generate proactive triggers for the detection of an SLA violation (proactive so as to respond to upcoming

SLA violations, as opposed to mitigation after the occurrence of an SLA violation). For our validation, we use real server logs from one of our SaaS data centers to extract failure and workload data. We use this data to simulate the behavior of a real system where the workload is more predictable.

Fig. 1. SaaS workload dynamics with Keep It Moving (KIM) software framework

The rest of this paper is organized as follows. Some related work and motivation is presented in Section II. Section III discusses our analysis of data from a business SaaS data center [2]. Our approach for the Keep It Moving software framework is presented in Section IV. In Section V, we discuss our Ptolemy-based discrete event simulation [3] of a business SaaS data center along with simulation results and their implications. We discuss some of the planned extensions for the KIM module in Section VI. Finally, we conclude the paper in Section VII.

II. RELATED WORK

Wood et al. [4] proposed the Sandpiper system for automated mitigation of increasing utilization and response time of the host PM in a virtualized data center due to workloads. Sandpiper's view of migration is based only on utilization threshold increases, and not on SLA violations, failures or software aging. Xu et al. [5] decoupled resource management from a central controller by proposing the use of a local resource controller in each VM and a global resource controller on each PM. However, due to the unavailability of application usage logs, they used fuzzy modeling to infer resource demand from the workload in a virtualized data center. We have application logs and OS logs from a SaaS data center hosting a business-specific application [2]. We use these logs to create an abstraction of the failure characteristics of the sample data center for our simulation. We select the target for migration or workload rerouting using Little's law to compute the resource demand of a workload on a specific target PM. This is important as we assume that the data center consists of a heterogeneous set of machines. Hence, predicting the resource demand, which in turn determines the response time on the target PM, is important before considering a migration.

Grottke et al. [1] classified the types of software failures in virtualized servers and suggested restart and migration as possible solution strategies for some of these failures. Commercial hypervisors such as VMware vCenter have a Distributed Resource Scheduler (DRS) in place that allows live VM migration in response to physical machine failure [6] or resource imbalance in the cloud. VMware DRS, though it addresses availability violations, does not address performance degradation due to software failures (as its load balancing technique is not application-SLA-violation specific). Shen et al. [7] proposed the CloudScale system that employs proactive prediction of upcoming SLA violations (PRESS [8]) and performs dynamic allocation of resources as well as workload migration. CloudScale uses a Markov chain based state space approach for prediction; we use similar but more detailed stochastic activity network (SAN) models [9], which we describe in Section III. Also, CloudScale does not consider workload rerouting.

We present a unified framework for mitigation of SLA violations using VM migration and workload rerouting to a replica VM, as well as resource scaling on the same host PM for load balancing. Unlike most of the above mentioned approaches, we use a business SaaS cloud for validation, where the workload is predictable. Khanna et al. [10] and, more recently, Eyraud-Dubois et al. [11] deal with on-demand provisioning of resources to a VM in order to mitigate response time SLA violations. Specifically, [10], [11], [12] suggest a bin-packing based approach to allocate as many VMs as possible to a given server without violating their SLAs. Verma et al. [13] discussed the frequent occurrence of higher SLA violations at the target PM when migrating VMs with bursty workloads, i.e., workloads with Coefficient of Variation (CoV) > 1. Also, workloads from complementary time zones may share the same PM without causing SLA violations. Such additional compatibility considerations lead us to reformulate the problem of VM placement as a multi-objective optimization problem, and we use a metaheuristic tabu search algorithm to solve it. A preliminary version of our SLA policy definition, a reactive (mitigate after an SLA violation) migration-based solution, along with a preliminary validation using discrete event simulation in MATLAB under commonly used (non-workload-specific) distributional assumptions, was presented in our earlier paper [14].

III. ANALYSIS OF USAGE AND APPLICATION LOGS FROM A BUSINESS SAAS DATA CENTER

Business SaaS applications have two major components: (i) batch processing of data that comes from various subscribers (business transaction data, as well as data from the business context, such as subscriber access patterns, social and other multimedia information) and (ii) online transaction processing (OLTP). The major component of the SaaS application under consideration is of type (i) [2]. The control flow of this data analysis module is shown in Figure 2.

TABLE I. SAMPLE LOGS FROM BUSINESS DATA CENTER [2]

Manual failure report — ID: 1; Date created: 31-Dec-12; Time created: 7:45 AM; Production (Y/N): Y; Failure reason: Timeout issue; File name: EEDMS_IS_Daily_20121230; Identified from database (OR logs) (D/L): D; Date/Time resolved: 1 Jan '13, 6:30 PM; File event ID: 647856; Event code: PRC4, PRC4_TRNF; Details: "Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding"; Action taken: "We have re-processed the file. It processed successfully."; Status: closed.

Windows event (Application error / Service Control Manager) log — Event ID: 7011; "A timeout (30000 milliseconds) was reached while waiting for a transaction response from the UxSms service".

SaaS error log — Timestamp: 10/1/2013 4:52:03 AM; Severity: Error; Process Id: 4304; Win32 Thread Id: 4368; Message: "Load failed-1205: Transaction (Process ID 104) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

Fig. 2. Control flow for the application and usage log analysis module (SaaS error logs → time coalescence → coalesced error log → goodness-of-fit (χ2) tests for the time-to-error and time-to-failure distributions → SAN model execution → failure model)

a) Error Logs: This SaaS system receives workloads of various types and sizes from various subscribers, who are retail companies from geographies such as North America, Europe, Asia and Africa. The workload files can be of different sizes, ranging from 100KB to 700MB. The application runs on a virtualized platform by VMware. As the SaaS platform processes the transaction data, various components of the platform fail due to workload-related issues such as faulty workloads as well as system-related problems. These faults are manifested in an increase in response time, leading to SLA violations. Four types of error logs are evaluated: resource utilization logs, computer-generated application logs (application-specific logs), operating system logs (for both the application server and the database server) and human-generated software failure reports. Samples of the latter three logs are shown in Table I. During the processing of the workload data through multiple stages, various components of the SaaS system encounter failures. The reasons for failures are (i) arrival of faulty workloads leading to application failure, (ii) VM and OS related errors and (iii) software aging related issues. The application maintenance team resorts to manual detection and recovery of failures. Furthermore, the maintenance team performs periodic maintenance of the system to improve the overall dependability of the platform.

b) Data extracted: Information related to faults and the recovery actions is typically maintained in the application and system log files. In order to build our failure model, we have considered 283 days of application and system logs (Windows event logs), covering 275790 transactions, 42 customers and 11 different countries. The platform is a 3-tier application; we have collected these log files from the application server, web server and database server tiers. Upon analysis, we found 86.23% successful and 13.87% failed transactions for the above period of time. For the failed transactions we further observed the trends enumerated in Table II.

TABLE II. FAILURE DATA GATHERED

Failure Cause | Percentage of failures | MTBF
Failure due to system/OS issues | 96.1% | 0.18h
Failure due to faulty workload | 3.9% | 4.18h

After analyzing the logs as well as discussing with the maintenance team, we found that 75% of the failures are due to various software exceptions (such as the inability to start a task, timeout errors and database deadlocks) and erroneous data file formats. The data has been analyzed with the heuristic proposed in [15] for the analysis of data center failures. Multiple reports for the same error event can appear in the error logs. Such redundancy is removed by applying the time coalescence technique discussed in [16], where multiple error logs concerning the same error event are tupled into the same error report. This grouping is performed for error logs generated within a pre-specified small time window [15].

c) Goodness of fit test: We use a chi-square goodness-of-fit (GOF) test [18] to identify the distribution of the time to generation of failure logs from the sample. We test the sample data against uniform, Gaussian, exponential, Weibull, Pareto and log-normal distributions. The coalesced error event reports are used to generate failure signatures with a mean occurrence rate for the corresponding failure log. For any distribution, the distribution parameter(s) (e.g., $\mu$ for Poisson and $\{\mu,\sigma\}$ for normal) are obtained from their corresponding maximum likelihood estimates. Here, we first construct the unknown statistical frequency distribution ($o_i$, where $o_i$ is the observed value) from the coalesced error data. Next, using the mean and standard deviation of this sample, we create expected frequencies ($e_i$) for the known target distributions (such as Normal, Poisson, etc.). Then we use the standard chi-square metric $\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$ [17]. We consider the sample to come from the chosen target distribution if the value of the $\chi^2$ variate is less than the threshold value (available in standard tables) at the relevant number of degrees of freedom.
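As an illustration of the two preprocessing steps above, the following Python sketch coalesces raw error-log timestamps into error reports and computes the chi-square statistic for one candidate distribution (exponential, with its rate set to the maximum likelihood estimate). The 300-second window, the bin count and the equal-probability binning are illustrative assumptions, not the exact settings used in the paper.

```python
import math
from collections import Counter

def coalesce(timestamps, window=300.0):
    """Tuple error-log entries that refer to the same event: entries within
    `window` seconds of the previous one are merged into a single error report."""
    events, last = [], None
    for t in sorted(timestamps):
        if last is None or t - last > window:
            events.append(t)
        last = t
    return events

def chi_square_exponential(inter_arrival_times, bins=4):
    """Chi-square GOF statistic for an exponential time-to-error distribution,
    with the rate taken as its maximum likelihood estimate (1 / sample mean)."""
    n = len(inter_arrival_times)
    rate = n / sum(inter_arrival_times)
    # Equal-probability bin edges of the fitted exponential distribution.
    edges = [-math.log(1.0 - i / bins) / rate for i in range(bins)] + [math.inf]
    observed = Counter()
    for x in inter_arrival_times:
        for b in range(bins):
            if edges[b] <= x < edges[b + 1]:
                observed[b] += 1
                break
    expected = n / bins
    return sum((observed[b] - expected) ** 2 / expected for b in range(bins))

# Example: coalesce raw timestamps, then test the inter-event gaps against the fit.
events = coalesce([0, 10, 900, 905, 2000, 3100, 4700, 6900, 9500])
gaps = [b - a for a, b in zip(events, events[1:])]
print(chi_square_exponential(gaps))
```

The same routine would be repeated for the other candidate distributions listed above, accepting a fit whenever the statistic falls below the tabulated threshold for the corresponding degrees of freedom.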

d) Failure and workload models: Stochastic Activity Network (SAN) models are designed to represent the underlying behavior of the SaaS system (interdependent components, dependence between failures). We construct separate SAN models for the workload and the application infrastructure. For our purposes, we use the SAN models proposed in [16] for the analysis of supercomputer failure data. We specialize these models with parameters gathered from the analysis of the failure data (e.g., the empirical MTBF). Specifically, we use the SAN models to model the failure behavior of our SaaS at the level of CPU, workload, memory, software and IO. The model is generated using an approach similar to [18]. We have reused the SAN models for workload, CPU, memory, IO and network from [16], specializing them for our specific business SaaS data center. The SAN model for the software subsystem (SW) specific to the SaaS system is shown in Figure 3. Failure occurrence rates obtained from the error logs are used in the timed activity of this SAN model. The model generates the mean time to failure (MTTF) for the workload as well as the mean time to occurrence of different types of SLA violations, such as (i) utilization threshold violations and (ii) response time threshold violations.

Next, for the sake of clarity, we provide a description of the software SAN model. The SW subsystem failure distribution is chosen at the startup of the model by the action select_TTF_SW, depending on the SaaS platform configuration (e.g., number of nodes). This failure distribution is as observed from the application logs [2]. Each input gate is connected to multiple places. The enabling function of each input gate depends on the number of tokens in each place and on the specific signatures (character data) associated with each extended place connected to it. For example, the extended place cpu_error_signature may contain various signature strings. When a failure is activated it can cause a local propagation (places propagateToNet, propagateToIO). For instance, when a token is placed in the propagateToNet place, it will also be placed in the net_correlated place of the network subsystem model. This in turn enables a propagated failure activation in the corresponding SAN submodel. Output gates check1 and check2 at the right of Figure 3 govern this token propagation, making sure the necessary conditions for successful propagation are satisfied. A model receiving a propagation is notified by means of the Correlated_fault_start output gate. The failure will be kept in the submodel until the originating subsystem recovers from it (the activity TTR, when it fires, puts a token in the place recovered). The distribution of the TTR is selected depending on the TTF action that caused the failure. The signature (sw_error_signature) is used to keep track of failure propagations. The SAN models are implemented and evaluated using the Mobius tool [19] to obtain unavailability and MTTF values for each unique failure event. Another set of goodness of fit tests is implemented using the ExternalLib C library to perform distributional fits for major failure events. The SAN models along with the goodness of fit tests help us perform proactive SLA violation detection. The distributions generated from historical data were found to predict test sets accurately within an acceptable error margin.

Fig. 3. Software (SW) subsystem SAN model
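To make the token-passing behavior concrete, here is a deliberately simplified Python sketch of the failure/propagation/recovery cycle described above: software faults arrive with an exponential time to failure, occasionally propagate a correlated fault to the network or IO submodel, and each affected subsystem accrues downtime until its own repair completes. The parameter values (only the 0.18h software MTBF echoes Table II), the exponential choice and the independence assumptions are illustrative; the actual SAN models evaluated in Mobius capture far richer structure.

```python
import random

# Illustrative parameters (hours); the real models use the fitted distributions.
MTTF = {"SW": 0.18, "NET": 12.0, "IO": 24.0}   # mean time to failure
MTTR = {"SW": 0.05, "NET": 0.5,  "IO": 1.0}    # mean time to repair
P_PROPAGATE = {"NET": 0.10, "IO": 0.05}        # chance a SW fault propagates

def simulate(horizon_h=1000.0, seed=1):
    """Sample SW fault arrivals, propagate correlated faults, accumulate downtime."""
    random.seed(seed)
    downtime = {k: 0.0 for k in MTTF}
    failures = {k: 0 for k in MTTF}
    t = 0.0
    while True:
        t += random.expovariate(1.0 / MTTF["SW"])        # next SW fault (TTF draw)
        if t >= horizon_h:
            break
        affected = ["SW"] + [s for s in ("NET", "IO")
                             if random.random() < P_PROPAGATE[s]]
        for s in affected:                               # correlated-fault tokens
            failures[s] += 1
            downtime[s] += random.expovariate(1.0 / MTTR[s])  # TTR draw
    # Report failure counts and a crude unavailability estimate per subsystem.
    return {s: (failures[s], downtime[s] / horizon_h) for s in MTTF}

print(simulate())
```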

IV. KEEP IT MOVING (KIM) SOFTWARE FRAMEWORK FOR THE CLOUD CONTROLLER

Based on the failures observed in these logs, we propose a framework that uses virtual machine (VM) restarts/reboots, stretching of the allocated VM capacity, workload rerouting and live VM migration to a new PM as remedial actions for different types of SLA violations in a virtualized cloud environment. Specifically, we discuss the particular SLA violations our framework addresses, a prediction model that detects upcoming SLA violations and a method for optimal placement of the migrating/rerouted workload. The Keep It Moving (KIM) software framework consists of an SLA policy definition module that stores the SLA requirements specific to each application. The SLA violation definition module works in conjunction with an SLA violation prediction module that uses historical data specific to each workload (application) to predict upcoming SLA violations. The detection of an upcoming SLA violation triggers a suitable remedial action. Deciding on and executing a remedial action is addressed by a workload management module. The target selection module outputs the target PM best suited to host a rerouted/migrating workload. A basic workflow for our proposed KIM software module is shown in Figure 4.

Fig. 4. Control flow of Keep It Moving (KIM) software framework

A. SLA Violation Predictor

As discussed earlier, SLA agreements in SaaS scenarios usually consist of two MOI [2]: (i) availability violations, namely hardware and software (VM) infrastructure failures, and (ii) performance degradation owing to software failures or due to insufficient hardware capacity for meeting workload requirements. Our SLA Violation Predictor (Figure 4) considers these two main causes for SLA violation. For proof of SLA compliance, the service provider needs to provide the client with a snapshot of these MOIs taken over a sufficiently large period of time (large enough to characterize the variability of the particular MOI). We refer to this period of time as the user sensitivity time (UST). For instance, if the MOI value exceeds the specified MOI threshold within the UST, it will lead to an SLA violation. For each MOI, a threshold is defined in the SLA which is used to measure whether there is an SLA violation.
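As a concrete, simplified illustration of how an MOI snapshot is judged against its threshold over one UST window, and of the autocorrelation-based choice of UST mentioned below, consider the following Python sketch. The 5-second threshold and the 90th percentile come from the sample SLA in [2]; the 0.2 autocorrelation cutoff and the "first lag below the cutoff" rule are illustrative assumptions, as are the function names.

```python
def window_violates_sla(samples, threshold_s=5.0, percentile=0.90):
    """True if the MOI snapshot for one UST window violates the SLA, i.e. the
    90th-percentile response time in the window exceeds the agreed threshold."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] > threshold_s

def ust_from_acf(series, cutoff=0.2):
    """Pick the UST (in time-series steps) as the smallest lag at which the
    autocorrelation of the workload time series drops below `cutoff`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        return 1
    for lag in range(1, n):
        acf = sum((series[t] - mean) * (series[t - lag] - mean)
                  for t in range(lag, n)) / (n * var)
        if acf < cutoff:
            return lag
    return n

# Example: one 10-minute window of per-job response times (seconds).
print(window_violates_sla([1.2, 0.9, 3.4, 5.6, 2.1, 4.8, 0.7, 6.2, 1.1, 2.9]))
```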

User Sensitivity Time (UST) Computation: For our simulation-based validation, we use the UST value used in [2]. This business data center processes files of size varying from 100KB to 700MB. For smaller files (<1 MB), a sample SLA requirement [2] is that the average response time over any 10-minute interval will remain below 5 seconds, 90% of the time (90th percentile). The system's ability to meet the SLAs is evaluated against the UST, and not against other time units such as "per second". The interval of time chosen for the UST is derived from the maximum expected response time of the system from the user's or the business requirement point of view. For instance, a UST of 1 sec may be acceptable for applications with response times in milliseconds but not for applications with response times in seconds. The optimal value of the UST in any sample scenario can be computed using the autocorrelation function on workload time-series data.

1) SLA Violation Policy Definition: The SLA Violation Predictor module predicts upcoming SLA violations of different types, and we characterize the remedial actions that can be associated with them. In the absence of remedial action, the frequency of SLA violations increases over time. The types of SLA violations are enumerated below.

(i) Availability violations: Service failures causing availability violations in a SaaS environment can be caused by software failures and workload-related failures.

(a) Software failures: Over time, exhaustion of operating system resources, memory fragmentation and accumulation of errors can bring about software failure or progressive performance degradation (resulting in an increase in response time). Other possible causes of software failures include transient errors due to the integration of COTS products on a SaaS platform (generating system errors and timeout errors), database saturation (e.g., system log databases need periodic cleaning to make room for new logs) and memory leakage in some underlying software infrastructure. When an SLA violation (availability or response time) due to an aging-related software failure is triggered, usually a current-state replica of the VM is created on the same or another PM. When this new VM is up and running, the original VM enters preventive maintenance. The replica is created on the same PM only if the PM has enough available capacity. At the end of preventive maintenance, the original VM is restarted and jobs are redirected back to this VM, whereas the replica VM is dropped after the completion of all the jobs in its queue. However, our approach does not cover all kinds of software failures. Our trigger only addresses SLA violations due to software aging related bugs [1] (a subset of Mandelbugs). It has been observed that the higher the application load, the faster the rate of software aging [1]. Also, the mean failure rate increases over time, whereas the service rate of the workload degrades with time owing to accumulating errors.

(b) Workload-related failures: We observed a workload-related availability violation in the business SaaS application logs [2] where the arrival of certain files caused the workload to fail. The geographical region of the file's point of origin is a contributory factor to such failures, where Unicode or Chinese character conversions or region-specific viruses bring about failures. Large file sizes may cause resource locking or contention-time related failures. There may also be cascading failures brought about by the aforementioned causes.
Mitigating this kind of failure does not involve migration, as the arrival of a faulty file crashes the VM, rendering it unavailable. Hence, to deal with this form of service failure, workload jobs arriving after the faulty/corrupt file are rerouted to a new VM created from an application-specific template on the same or a different PM. It is important to note that these failures are different from transient errors at runtime; transient errors need to be prevented using retry, checkpointing and recovery techniques. At the hardware level, the failure of a PM can be another trigger for availability violations (though such triggers will be few). In our earlier paper [14], we showed that our approach does not offer anything new for the reduction of PM failures over the state-of-the-art [6]. Hence, we do not analyze PM failures and focus on the other types of SLA violations in the rest of this paper.

(ii) Utilization threshold violations: In any PM, when CPU utilization exceeds a certain threshold, thrashing sets in. We have observed that in several cloud data centers [2], each PM has an assigned threshold (typically 60%) to indicate the point where thrashing begins.

(a) PM utilization threshold violation: The sum of the utilizations of all the SaaS application workloads cannot exceed this threshold. We consider this to be the trigger for migration/workload rerouting (depending on the specific SaaS scenario). Usually the workload with the highest peak utilization is selected for migration (to free up maximum capacity on the host PM). PM utilization violations can only be remedied by workload migration/rerouting; other remedies will not help.

(b) VM utilization threshold violation: In cloud data centers, VM utilization is measured from the hypervisor. An increase in VM utilization above a certain threshold will lead to an increase in the response time of workloads running on the VMs that are in direct contention with this overutilized VM. Consider two VMs running on one PM, each VM assigned 50% of the PM's utilization. Now if one VM (or vCPU) starts overperforming and using more than 30% of the host PM's utilization, this will lead to a decline in the service rate of the other collocated contending VMs, subsequently leading to SLA violations in those VMs as well. In our investigation of cloud data centers [2], we found that this VM utilization thrashing threshold is typically 60%.

(iii) Response time threshold violation: While PM and VM utilization thresholds are set by service providers, response time SLA violation thresholds are specified by the SaaS customer. Response time violations can happen due to performance degradation of software or due to overload conditions in the system. For the same reason, when there is a high load over a prolonged period of time, we treat this as a trigger for VM migration.

2) SLA Violation Prediction Module: As discussed in Section III and shown in Figure 2, the SLA violation predictor module takes as input the error logs generated on a business SaaS platform [2]. Duplicate notifications are filtered out of the error logs and occurrence rates for a unique set of failure logs are computed. The distributional fits for these occurrence rates are associated with input transitions of a set of Stochastic Activity Network (SAN) [9] models that model the failure behavior of the system [16]. The MTTF values for different forms of failures are fitted to relevant distributions using chi-square goodness of fit tests. These inferred failure distributions are constantly updated as new historical data pours in. Prediction accuracy using this approach is constantly checked using training and test sets from the error logs.
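The policy definition above boils down to a small mapping from violation type to first-choice remedial action, which the workload management module (Section IV-B) then refines with a capacity check. A minimal sketch of that mapping follows; the enum values and action strings are hypothetical names, not identifiers from the KIM implementation.

```python
from enum import Enum

class Violation(Enum):
    AVAIL_SOFTWARE = "(i)a"    # software/aging-related failure
    AVAIL_WORKLOAD = "(i)b"    # faulty/corrupt input file
    PM_UTILIZATION = "(ii)a"   # PM thrashing threshold exceeded
    VM_UTILIZATION = "(ii)b"   # VM thrashing threshold exceeded
    RESPONSE_TIME = "(iii)"    # customer response-time SLA

# First-choice remedy per violation type, as described above; PM-level
# utilization violations always require moving a workload, the rest may be
# handled on the host PM if its allowed capacity suffices.
PREFERRED_ACTION = {
    Violation.AVAIL_SOFTWARE: "restart/reboot VM (replica during maintenance)",
    Violation.AVAIL_WORKLOAD: "reroute jobs to a replica VM",
    Violation.PM_UTILIZATION: "migrate/reroute the highest-peak workload",
    Violation.VM_UTILIZATION: "expand allocated VM capacity",
    Violation.RESPONSE_TIME: "expand VM capacity, else migrate/reroute",
}
```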

B. Workload Management

In our proposed KIM module, once an SLA violation is detected, the SLA violation predictor sends out a proactive trigger to the Capacity Manager module. First, the available capacity on the host PM is estimated. If this available capacity is enough to remedy the specific SLA violation, the VM capacity is adjusted by the Decision Module. If not, the control flow is passed on to the target selection module for workload migration/rerouting. For validation purposes, the underlying behavior of the system proposed in this section is generated via simulation.

1) Mitigating SLA violations without workload migration (Capacity manager): PM utilization violations (SLA violations (ii)a) can only be remedied by moving the workload. However, most other SLA violations may be remedied without moving the workload to a different PM. Availability violations related to VM or software failures (SLA violation (i)) can be remedied by restarting/rebooting the VM on the host PM itself, if the available capacity of the host PM allows for it. Similarly, performance degradation related violations such as VM utilization violations (SLA violation (ii)b) and response time violations (SLA violation (iii)) can be remedied by resizing the VM on the host PM. Typically this involves increasing the VM's allocated capacity from the available free capacity of the host PM (e.g., allocating any free cores of the host PM to the VM experiencing response time violations). If the PM's available capacity is not enough to remedy the particular SLA violation, the workload has to be moved to a new PM. Recall from the previous section that it is necessary to ensure that a PM's utilization (0 ≤ $\rho_{PM}$ < 1) does not exceed a certain threshold, say $\rho^T_{PM}$. When a PM's entire capacity is utilized, $\rho_{PM} \approx 1$. Given $\rho^T_{PM}$ (= 60%), we compute the maximum fraction of the capacity (0.6) that can be allocated to VMs; any additional capacity allocation will lead to a violation of $\rho^T_{PM}$. We refer to this capacity as the allowed capacity for a PM. When a VM's capacity needs to be increased on the same PM, the capacity manager first checks whether allocating the capacity required by the VM would make the PM's total allocated capacity exceed the allowed capacity. Using hypervisor-specific probes, we determine at runtime the capacity available on the PM for VM allocation or resource scaling (AllowedAvailableCapacity) such that $\rho_{PM}$ stays below $\rho^T_{PM}$. For CoV > 1 workloads (bursty workloads [13]), the 90th percentile value of the arrival rate is used in the above computation.
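The following Python sketch illustrates the capacity-check-and-resize decision just described, using the Little's-law style relations that also appear in Algorithm 1 later in this subsection (required service rate $\mu'_s = \lambda/\rho^T_{VM}$ for a utilization violation, $\mu''_s = \lambda + 1/E[R]_T$ for a response-time violation, capacity factor $cf = \mu'_s/\mu_s$). The function and variable names and the simple capacity arithmetic are assumptions of this sketch, and an M/M/1-style mean response time $E[R] = 1/(\mu_s - \lambda)$, consistent with the Equation (1) relations used later, is assumed.

```python
RHO_T_PM = 0.6   # PM utilization (thrashing) threshold observed in [2]
RHO_T_VM = 0.6   # VM utilization threshold

def allowed_available_capacity(pm_capacity, allocated_capacity):
    """Capacity still usable for resizing/placement on this PM without pushing
    its utilization past the allowed fraction (the 'allowed capacity')."""
    return RHO_T_PM * pm_capacity - allocated_capacity

def capacity_factor(arrival_rate, service_rate, violation, resp_threshold=None):
    """Factor cf by which the VM's capacity must be scaled so that the violated
    MOI returns below its threshold (cases TYPE(ii)b and TYPE(iii) of Algorithm 1)."""
    if violation == "vm_utilization":        # need rho_VM = lambda/mu <= threshold
        required_mu = arrival_rate / RHO_T_VM
    elif violation == "response_time":       # need E[R] = 1/(mu - lambda) <= E[R]_T
        required_mu = arrival_rate + 1.0 / resp_threshold
    else:
        raise ValueError("availability violations are handled by restart/replica")
    return required_mu / service_rate

def mitigate_on_host(vm_capacity, pm_capacity, allocated_capacity,
                     arrival_rate, service_rate, violation, resp_threshold=None):
    """Resize on the host PM when the allowed capacity suffices; otherwise hand
    over to target selection for migration/rerouting (Section IV-B.3)."""
    cf = capacity_factor(arrival_rate, service_rate, violation, resp_threshold)
    extra_needed = vm_capacity * (cf - 1.0)
    if extra_needed <= allowed_available_capacity(pm_capacity, allocated_capacity):
        return ("resize", vm_capacity * cf)
    return ("migrate_or_reroute", None)

# Example: a 4-core VM on a 16-core PM with 8 cores already allocated, seeing
# 45 jobs/s against a 48 jobs/s service rate and violating a 0.2 s response SLA.
print(mitigate_on_host(4.0, 16, 8, 45.0, 48.0, "response_time", 0.2))
```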
2) Moving the workload in specific SaaS scenarios: Though live-VM migration is supported in private clouds and most non-private commercial cloud infrastructures, we also consider workload rerouting as an alternative to live-VM migration. In large data centers, processing is done over thousands of nodes (map-reduce like architectures) with high job arrival rates and stringent response time SLA requirements. In such data centers, live-VM migration is supported to prevent job dropping due to failures. However, in smaller data centers with lower job arrival rates, sensitive user input data and higher response time SLAs, a section of the past history of user input files is archived [2]. Once a workload fails, the job currently being processed and the jobs waiting in the queue are lost; however, the processing can be restarted using the archived user input files.

The jobs arriving after the point of the proactive migration trigger are rerouted to the target PM selected by the optimization algorithm. A new VM image of the same VM type is created and loaded on this target PM beforehand from VM templates loaded in the cloud controller. In order to cut expenses, small-scale SaaS providers may not consider live-VM migration a viable option; for such small-scale SaaS clouds, workload rerouting forms an attractive solution. Irrespective of the solution strategy (migration or workload rerouting), the same process of target selection (described below) is used.

3) Selection of a target PM for moving workload (Target selector): Before workload rerouting or live-VM migration, the cloud controller needs to select a suitable target PM for the workload. In a cloud data center composed of heterogeneous physical machines, the service rates of different PMs differ (owing to their difference in processing power). Hence, when migrating a VM from one such PM to another, this difference in performance needs to be taken into consideration. Expressed using the rPerf benchmark, the underlying capacity of the source IBM 780 machine is 115.86 whereas that of the IBM 570 machine is 58.96 [22]. The capacity factor for these machines is 58.96/115.86 ≈ 0.5, which implies that the service rate ($\mu_s$) for jobs will be halved due to migration to the IBM eServer 570. Now, the mean response time (E[R]) is given as in Equation 1. At the target PM, $E[R] = \frac{2}{\mu_s - 2\lambda}$, implying an increase in response time at the target PM. This may result in response time SLA violations at the IBM eServer 570. Since the utilization at the new PM is $\rho = \frac{2\lambda}{\mu_s}$, the VM utilization gets doubled when the VM is migrated from the IBM 780 to the IBM 570 (since $\lambda$ is unchanged), and such an increase can result in an SLA violation of the VM's utilization threshold after migration. We take these observations into consideration when selecting a target PM. During target selection, checking for collocation compatibility between the migrating workload and the workloads on the target PM is very important. Workloads serving geographical regions with complementary working time zones can be collocated. Also, a workload pattern characterized by a Coefficient of Variation (CoV) > 1 is typical for internet traffic; such a workload's peak utilization is often very high compared to its average. While migrating such types of workloads, we need to consider the number of peak utilization overlaps between the migrating VM's workload and the workloads running on the target PM, and verify that the migrating workload has a minimum number of peak overlaps with the workloads on the target PM. We mine historical workload data to extract tuples {peak utilization value, time of value}. Then we compute the number of peak overlaps ($WC_{PM_i}$) between the migrating workload and all the workloads on the target PM (PM$_i$). This WC matrix has to be computed a priori and kept handy; it can be periodically refreshed, but it has to be available at migration decision time. We select the PM with the minimum number of peak overlaps. We extend Algorithm 1 from Ganesan et al.'s earlier paper [23] to compute the workload collocation compatibility matrix. Hence, the selection of a target PM for workload migration can be formulated as a multi-objective optimization problem.

Algorithm 1: Workload Management
input : SLA violation type: viotype
input : VM under consideration: VM
input : Host machine: PM
output: Actions
Let $\rho_{PM}$ be the current utilization of the PM and $\rho_{VM}$ the current utilization of the VM;
Let $\rho^T_{PM}$ be the PM utilization threshold and $\rho^T_{VM}$ be the VM utilization threshold, specified in the SLA violation policy definition;
avlCap ← ComputeAllowedAvlCapacity(PM, $\rho^T_{PM}$);
if avlCap ≥ VMcapacity then
    switch viotype do
        case TYPE(i)   /* VM avail. violation */
            VM′ ← createReplica(VM); start VM′; stop VM;
        case TYPE(ii)b /* VM util. thr. violation */
            Compute $\mu'_s \leftarrow \lambda / \rho^T_{VM}$ and $cf \leftarrow \mu'_s / \mu_s$;
            Modify VMcapacity ← VMcapacity × cf;
        case TYPE(iii) /* Resp. time thr. violation */
            $\mu''_s \leftarrow \lambda + 1/E[R]_T$; $cf \leftarrow \mu''_s / \mu_s$;
            Modify VMcapacity ← VMcapacity × cf;
    endsw
else
    COMPUTE_WORKLOAD_TARGET_PM(VM); VM_SIZING(VM);
end

The objective functions include (i) response time, (ii) migrating VM utilization, (iii) target PM utilization and (iv) the number of workload peak overlaps. We select the PM that offers the minimum job response time. We also select the PM which after migration will provide the lowest workload VM utilization and target PM utilization (leaving more scope for addressing later SLA violations with resource elasticity). The objective function for target PM selection can be given as below (where PM$_i$ is a physical machine in the pool of physical machines PM):

$F_{multobj} = \min_{\forall PM_i \in PM - \{hostPM\}} \{f_1, f_2, f_3, f_4\}$

such that
$f_1(PM_i) = E[R] \;\; \forall \; \text{jobs} \in \text{UST}$
$f_2(PM_i) = \rho_{VM}$
$f_3(PM_i) = \rho_{PM_i}$
$f_4(PM_i) = WC_{PM_i}$     (2)

where min{f1 (PM𝑖 ), f2 (PM𝑖 ), f3 (PM𝑖 ), f4 (PM𝑖 )} is satisfied by the Pareto optimal solution composed of one non-dominated vector or a set of non-dominated vectors subject to constraints,

$E[R] < E[R]_{threshold}$
$\rho_{VM} < \rho^{threshold}_{VM}$
$\rho_{targetPM} < \rho^{threshold}_{PM}$     (3)
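A hedged Python sketch of this selection step follows: candidate PMs that violate the constraints in (3) are dropped, the four objectives are evaluated (including the peak-overlap count $WC_{PM_i}$ computed from the mined peak tuples), and the non-dominated (Pareto) set is returned; the final pick from that set, e.g. by lowest expected response time or by the tabu search mentioned below, is left open. All names and the overlap "slack" are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    resp_time: float      # f1: E[R] of the migrating workload's jobs on this PM
    vm_util: float        # f2: utilization of the migrated VM on this PM
    pm_util: float        # f3: target PM utilization after placement
    peak_overlaps: int    # f4: WC_PMi, peak overlaps with workloads already hosted

def count_peak_overlaps(migrating_peaks, hosted_peaks, slack=1):
    """Number of times the migrating workload's utilization peaks coincide
    (within `slack` time slots) with peaks of workloads already on the PM."""
    return sum(1 for t in migrating_peaks for u in hosted_peaks if abs(t - u) <= slack)

def pareto_targets(cands: List[Candidate], r_thr: float, vm_thr: float, pm_thr: float):
    """Keep candidates satisfying constraints (3), then return the non-dominated
    set over (f1, f2, f3, f4); all four objectives are minimized."""
    feasible = [c for c in cands
                if c.resp_time < r_thr and c.vm_util < vm_thr and c.pm_util < pm_thr]

    def dominates(a: Candidate, b: Candidate) -> bool:
        fa = (a.resp_time, a.vm_util, a.pm_util, a.peak_overlaps)
        fb = (b.resp_time, b.vm_util, b.pm_util, b.peak_overlaps)
        return all(x <= y for x, y in zip(fa, fb)) and fa != fb

    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]

# Example with three candidate PMs; constraint thresholds come from the SLA policy.
pms = [Candidate("PM2", 0.8, 0.45, 0.50, 3),
       Candidate("PM3", 0.6, 0.40, 0.55, 1),
       Candidate("PM4", 0.7, 0.35, 0.58, 0)]
print([c.name for c in pareto_targets(pms, r_thr=1.0, vm_thr=0.6, pm_thr=0.6)])
```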

Since this is a multi-objective combinatorial optimization problem, we use a state-space search technique to look for one or more optimal solutions. For small scale SaaS clouds (