Metaheuristics-based Planning and Optimization for SLA-aware Resource Management in PaaS Clouds

Edwin Yaqub*, Ramin Yahyapour*, Philipp Wieder*, Ali Imran Jehangiri*, Kuan Lu* and Constantinos Kotsokalis†
*Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), Göttingen, Germany
Email: {edwin.yaqub, ramin.yahyapour, philipp.wieder, ali.jehangiri, kuan.lu}@gwdg.de
†AfterSearch P.C., Athens, Greece
Email: {costas}@aftersear.ch

Abstract—The Platform as a Service (PaaS) model of Cloud Computing has emerged as an enabling yet disruptive paradigm for accelerated development of applications on the Cloud. PaaS hides administration complexities of the underlying infrastructure, such as the physical or virtual machines. This abstraction is achieved through advanced automation and OS-level multi-tenant containers. However, on-demand procurement, unpredictable workloads and auto-scaling result in a rapid increase and decrease of containers. This causes undesired utilization of Cloud resources and energy wastage that can be avoided with real-time planning. Hence, the main challenge of a PaaS Cloud provider is to regularly plan and optimize the placement of containers on Cloud machines. However, the service-driven constraints regarding containers and the spatial constraints regarding machines make SLA-aware resource allocation non-trivial. This relatively novel "Service Consolidation" problem is a variant of multi-dimensional bin-packing and hence NP-hard. In this work, we concretely frame this problem by leveraging the definition of the Machine Reassignment model proposed by Google for the ROADEF/EURO challenge and characterize it for the OpenShift PaaS. We apply Metaheuristic search to discover the best (re)allocation solutions on Clouds of varying scales. We compare four state-of-the-art algorithms as problem properties change across datasets and evaluate their performance against a variety of metrics including objective function score, machines used, utilization, resource contention, SLA violations, migrations and energy consumption. Finally, we present a policy-led ranking of solutions to obscure the complexity of individual metrics and decide on the most preferred solution. Hence, we provide valuable insights for SLA-aware resource management in PaaS Clouds.

Keywords—Cloud Computing; PaaS; OpenShift; Resource Allocation; Optimization; Service Consolidation; Service Migration; Service Level Agreement (SLA); SLA Violations

I. INTRODUCTION

Cloud Computing provides on-demand network access to a shared pool of computing resources that can be provisioned and released rapidly with minimal management effort [1]. Unlike the Infrastructure as a Service (IaaS) model, Platform as a Service (PaaS) delivers software platforms as a utility by providing an ecosystem to Cloud-enable software services and manage their lifecycle. PaaS also hides the administration complexities of the underlying IaaS from the user. Due to its vast potential, PaaS is gaining market traction. IDC estimated the PaaS market to reach $14 billion by 2017 [2], while GigaOM predicted it to reach $20.1 billion in 2014 [3].

So far, much research in Cloud Computing has concentrated on efficient management of virtual machines in the IaaS [6][7][8]. Unlike IaaS, the deployment model in PaaS is based on densely allocated OS-level containers, which securely host different modules of a software service. Thus, PaaS allows fine-grained management of services as lightweight, portable virtual containers that can be efficiently scaled, migrated, or placed near other services or data sources according to business dependencies. This allows service-oriented applications to overlay the Cloud infrastructure, which opens new avenues for added-value SLA propositions. However, our work on PaaS reveals that on-demand procurement, unpredictable workloads and auto-scaling result in a rapid increase and decrease of containers that are automatically provisioned and cause undesired utilization of the underlying machines. At global scale, this poses a huge challenge: Google recently revealed that it runs all its services, from Search to Gmail, in containers, and that it launches more than 2 billion containers per week across its global data centers [5]. RedHat's public PaaS (OpenShift Online [4], built over Amazon EC2) currently hosts over 2 million applications. Hence, the main challenge of a PaaS provider is to regularly plan and optimize the placement of containers on machines, such that SLA commitments are fulfilled and resource utilization is maximized using a minimum number of machines. On the other hand, non-trivial service-driven constraints regarding containers and spatial constraints regarding machines render allocations that satisfy only resource capacity requirements infeasible. This relatively novel "Service Consolidation" problem is a variant of variable-sized multi-dimensional bin-packing and hence NP-hard [10][11]. Further, feasible solutions need to be evaluated not just against the objective function score but also against additional criteria, e.g., total machines used, utilization, resource contention, migrations, container scale-ups, SLA violations and energy consumption. The latter raises economic and ecological concerns, with data centers consuming between 1.1% and 1.5% of global electricity use in 2010 [9]. Well-defined metrics for these criteria need to be systematically assessed for optimal trade-offs [8][14]. Therefore, in this work we define or reuse formal models for machine utilization, power consumption and SLA violations.

Before a production-scale PaaS Cloud is developed, it is vital to investigate and hypothesize SLA-aware resource management in Clouds of various scales, configurations and workloads. Hence, to concretely frame the Service Consolidation problem, we leverage and extend the Machine Reassignment model proposed by Google for the ROADEF/EURO challenge [13] and characterize it for the OpenShift PaaS Cloud. Our main contribution is the application of Metaheuristic local search to discover (re)allocation plans (consolidated solutions), with the intent to introduce SLA management in PaaS Clouds. We evaluate solutions discovered by the Tabu Search, Simulated Annealing, Late Acceptance and Late Simulated Annealing algorithms as problem properties change across datasets, and also present a policy-led method to automate decision making in middlewares.

The remainder of this paper is organized as follows. In Section II, we discuss related work. In Section III, we present the architecture of our OpenShift-based SLA management system. The problem is formally described in Section IV and the performance evaluation of the algorithms is presented in Section V. We conclude the paper in Section VI.

II. RELATED WORK

Resource allocation in Cloud Computing has focused mostly on the IaaS model. [7] presents energy-aware resource allocation using a modified Best Fit Decreasing heuristic under a single-resource model. The work is extended in [6] using real workloads but remains confined to CPU utilization. Our objective, on the other hand, is to examine consolidation algorithms under the high stress imposed by multiple resources and constraints, without which the feasibility of a solution remains questionable. In [8], profit-maximizing models are tested using Worst Fit and Best Fit heuristics for VM-based provisioning. These works used CloudSim and evaluated VM migrations, energy use and SLA violations. In contrast, we focus on resource allocation in PaaS. We abstract over multiple resource dimensions, use high-variability datasets from Google where the base state space is exponentially large, and characterize these for OpenShift. As an OpenShift Cloud can span multi-domain IaaS, which makes allocations hard, we solve Service Consolidation under said settings. Other popular PaaS tools are CloudFoundry, AppEngine, Docker and, more recently, projects Atomic, GearD and Kubernetes. Growing interest in these tools to exploit the untapped potential of the Cloud consequently highlights associated research problems such as Service Consolidation. Service Consolidation is addressed in [11] using integer linear programming and compared with constraint programming; like them, our results also confirm that the latter always finds a feasible solution in a short time. In [12], queuing network theory is used to generate complex data center models and a heuristic is devised to optimally allocate applications on servers. Our work is Cloud-based and we use Metaheuristics instead.

In [21], Simulated Annealing is shown to improve resource utilization better than First Come First Serve, but very small datasets are used. [22] solves network-sensitive hosting of application components on multiple Cloud data centers and proves the efficiency of Tabu Search over mixed-integer linear programming.

III. SYSTEM ARCHITECTURE

GWDG is a shared IT and research facility of the Max Planck Society and the University of Göttingen. In this role, GWDG provides Cloud services to the scientific community. To this end, we used OpenShift to implement the GWDG Platform Cloud [16]. OpenShift is well positioned by Gartner in its 2014 Magic Quadrant for Enterprise Application PaaS and is IaaS-agnostic. OpenShift's ecosystem supports multiple languages, databases and middlewares as execution environments called Cartridges, which are plugged into a container. A template approach called QuickStart [20] Cloud-enables an application and provides hooks to control its life cycle. Containers can be of varying types based on the capacity of CPU, RAM, bandwidth, disk, etc., assigned using technologies like kernel namespaces and control groups (cgroups). Service levels for QoS terms like response times, throughput or availability can be measured using different types, quantities and co-locations of containers, while SELinux securely isolates multi-tenant containers sharing the OS. This facilitates dynamic SLA negotiation on QoS terms, which makes better business sense to customers than negotiating low-level resource parameters. Interestingly, QoS satisfaction becomes synonymous with cgroups' assurance that the resource limits of containers are respected.

For automating SLA management in future service economies, the SLA@SOI project developed a generic SLA Management (SLAM) architecture which can be instantiated to incorporate SLA management on Cloud stacks [19]. Fig. 1 shows a simplified architecture we propose for an OpenShift-based SLAM. Here, we summarize the major interactions:

1) An SLA is negotiated to agree on QoS values using a Protocol Engine component that we developed and introduced in [17]. It uses the Planning and Optimization component to process an offer and generate a counter offer.
2) When the SLA is successfully agreed, the Planner triggers provisioning via the Provisioning and Adjustment component.
3) Provisioning invokes the OpenShift broker to allocate resources accordingly.
4) Monitoring Management starts monitoring the resource usage against the SLA using an OpenTSDB-based mechanism [18].
5) Monitoring also triggers a Service Consolidation request to the Planner if underutilization is observed over time.
6) The Planner optimizes the usage model of allocated containers and machines to produce a migration plan.
7) Provisioning executes the migration plan in off-peak times.

An OpenShift Cloud comprises broker(s) that receive API calls to control services and Cloud machines (see Fig. 1).

Figure 1. Architecture for OpenShift SLAM and OpenShift Cloud. District: logically groups machines to host containers of the same type, e.g., Small or Medium. Zone: a site that provides machines lying in a geographic neighborhood; machines in a district may belong to different zones. A machine has a safety capacity for each resource. Service Green depends on Service Blue: a container of Blue must be allocated within the zone of Green. Multiple containers of a service run on distinct machines hosting the same container size (Small for Service Red) or scaled up to a bigger size (Medium/Large for Service Orange and Service Magenta).

Machines are grouped as districts and may belong to different IaaS providers. An IaaS site is tagged as a zone, and multiple zones constitute a region. Thus, an OpenShift Cloud can span multi-domain infrastructures (zones) in different continents (regions) to increase resilience and provide compliance with regional data protection regulations. Its plug-in interface can integrate custom algorithms to control container placement, which can be augmented with the Planner of our SLAM.

IV. PROBLEM DESCRIPTION AND MODEL

For the ROADEF/EURO Challenge 2012, Google presented the Machine Reassignment problem [13], whose remarkable applicability to PaaS adds to the motivation behind this work. The datasets provided by Google represent realistic challenges and common grounds for further exploitation. Hence, we adhere closely to the specification in [13] but consider OpenShift for adjusting some constraints and for dataset characterization, without which the problem remains largely theoretical. This approach builds on widely accepted definitions and avoids the bias of self-generated workloads. The general aim of the problem is to improve the utilization of a set of machines given a set of services. A service comprises a set of processes which are assigned to machines and consume their resources. We characterize processes as containers because the minimum unit of deployment in PaaS is a container. Containers can be migrated between machines to improve utilization, but moves are limited by hard and soft constraints. Since an OpenShift PaaS can be developed using physical or virtual machines, both notions of machine apply. Next, we formally describe the problem and the models.

A. Notations

Consider M a set of machines, V a set of virtual containers, and R a set of resources common to all machines. S is a set of services, where a service s ∈ S is a set of containers ⊂ V; services are disjoint. M(v) = m is an assignment of a container v ∈ V to a machine m ∈ M, while Mo(v) represents the initial assignment of container v. C(m, r) is the total capacity of resource r ∈ R for machine m, R(v, r) is the requirement of resource r by container v, and SC(m, r) is the safety capacity of resource r for machine m. Let T = {small, medium, large} be the set of supported container types such that small < medium < large along all resource dimensions. Let D be a set of districts, where a district d ∈ D is a set of machines; districts are disjoint sets. d(m) identifies the district of machine m, t(v) ∈ T gives the type of container v, and t(d(m)) gives the (subsuming) container type of the district to which machine m belongs, e.g., a district of type large can host large, medium and small containers, a medium district can host medium and small containers, but a small district only hosts small containers. Let Z be a set of disjoint zones, where a zone z ∈ Z is a set of machines lying in a spatial neighborhood.¹

B. Hard Constraints

Hard constraints represent capacity-related requirements, which must be satisfied by a feasible solution.

Definition 1: Capacity Constraint. Given an assignment M, the usage U of a machine m for a resource r is defined as:

  U(m, r) = Σ_{v ∈ V : M(v) = m} R(v, r)

A container can run on a machine only if the machine has enough capacity available on every resource. A feasible assignment satisfies the capacity constraint:

  ∀m ∈ M, r ∈ R: U(m, r) ≤ C(m, r)

Definition 2: Type Constraint. A container can be placed only on a machine whose district supports the container's type:

  ∀v ∈ V with M(v) = m: t(v) ≤ t(d(m))

This allows vertical scale-up but not down-scaling, as the latter could violate the resource capacity guaranteed in the SLA.

Definition 3: Conflict Constraint. Containers of a service s ∈ S must run on distinct machines:

  ∀s ∈ S, (vi, vj) ∈ s²: vi ≠ vj ⇒ M(vi) ≠ M(vj)

Definition 4: Transient Usage Constraint. When a container is migrated from a machine m to a machine m′, some resources, e.g., disk or memory, are consumed twice; hence both m and m′ must have enough capacity during the migration. Let TR ⊆ R be the resources needing transient usage; the transient usage constraints are:

  ∀m ∈ M, r ∈ TR: Σ_{v ∈ V : Mo(v) = m ∨ M(v) = m} R(v, r) ≤ C(m, r)

¹We equate the neighborhood of the Machine Reassignment problem with a zone.
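For illustration, the hard constraints translate directly into a feasibility check over a candidate assignment. The following is a minimal sketch, not the paper's implementation; the array-backed encoding and all names are assumptions made for brevity, and only the logic mirrors Definitions 1-4:

// Sketch: hard-constraint check for a candidate assignment M.
final class FeasibilityChecker {

    static boolean isFeasible(int[][] req,        // req[v][r]: R(v, r)
                              int[][] cap,        // cap[m][r]: C(m, r)
                              int[] assign,       // assign[v]: M(v)
                              int[] initial,      // initial[v]: Mo(v)
                              int[] type,         // type[v]: small=0, medium=1, large=2
                              int[] districtOf,   // districtOf[m]: d(m)
                              int[] districtType, // districtType[d]: t(d)
                              int[] serviceOf,    // serviceOf[v]: service of container v
                              boolean[] isTransient) { // isTransient[r]: r in TR
        int machines = cap.length, resources = cap[0].length;
        int[][] usage = new int[machines][resources];
        int[][] transientUsage = new int[machines][resources];

        java.util.Set<Long> serviceMachinePairs = new java.util.HashSet<>();
        for (int v = 0; v < assign.length; v++) {
            int m = assign[v];
            // Definition 2 (Type): the district must subsume the container's type.
            if (type[v] > districtType[districtOf[m]]) return false;
            // Definition 3 (Conflict): at most one container per service per machine.
            if (!serviceMachinePairs.add(((long) serviceOf[v] << 32) | m)) return false;
            for (int r = 0; r < resources; r++) {
                usage[m][r] += req[v][r];
                transientUsage[m][r] += req[v][r];
                // Definition 4 (Transient): migrated containers also count
                // against their source machine for transient resources.
                if (isTransient[r] && initial[v] != m) {
                    transientUsage[initial[v]][r] += req[v][r];
                }
            }
        }
        for (int m = 0; m < machines; m++) {
            for (int r = 0; r < resources; r++) {
                // Definition 1 (Capacity) and Definition 4 (Transient usage).
                if (usage[m][r] > cap[m][r]) return false;
                if (isTransient[r] && transientUsage[m][r] > cap[m][r]) return false;
            }
        }
        return true;
    }
}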

C. Soft Constraints

Soft constraints model costs associated with resource contention, collocation requirements or migrations.

Definition 5: Load Cost. Let SC(m, r) be the safety capacity of a resource r ∈ R on a machine m ∈ M. The load cost is defined per resource and refers to the capacity used above SC:

  loadCost(r) = Σ_{m ∈ M} max(0, U(m, r) − SC(m, r))

Definition 6: Dependency Constraint². A service sa depending on a service sb requires each container of sa to run in the zone of at least one container of sb:

  ∀va ∈ sa, ∃vb ∈ sb such that M(va) ∈ z and M(vb) ∈ z

The OpenShift API allows migrating a service to another zone (or region). After migration, its software-based inter-container networking techniques can be used to conveniently implement or reset service dependencies.

Definition 7: Balance Cost. Availability of one resource could be useless without sufficient availability of another. When migrating containers, an availability ratio between two resources is targeted for future assignments. This is modeled by a set B of balance triples defined on N × R². For a triple b = <r1, r2, target> ∈ B:

  balanceCost(b) = Σ_{m ∈ M} max(0, target · A(m, r1) − A(m, r2))

where A(m, r) = C(m, r) − U(m, r).

Definition 8: Container Move Cost. This cost models the difficulty some containers pose for migration, e.g., due to size, replicas or dependencies. Let CMC(v) be the cost of migrating a container v; the container move cost is the sum over all migrations:

  containerMoveCost = Σ_{v ∈ V : M(v) ≠ Mo(v)} CMC(v)

Definition 9: Service Move Cost. The service move cost is the maximum number of migrated containers over all services. This balances migrations among services:

  serviceMoveCost = max_{s ∈ S} |{v ∈ s | M(v) ≠ Mo(v)}|

Definition 10: Machine Move Cost. Let MMC(msource, mdestination) be the cost of migrating a container from msource to mdestination. This models the complexity due to spatial distance, machine or topology properties. The machine move cost is computed as:

  machineMoveCost = Σ_{v ∈ V : M(v) ≠ Mo(v)} MMC(Mo(v), M(v))

Definition 11: Objective Function. Combining the costs of all soft constraints, the objective function is defined as a weighted sum:

  totalCost = Σ_{r ∈ R} weight_loadCost(r) · loadCost(r)
            + Σ_{b ∈ B} weight_balanceCost(b) · balanceCost(b)
            + weight_containerMoveCost · containerMoveCost
            + weight_serviceMoveCost · serviceMoveCost
            + weight_machineMoveCost · machineMoveCost

This value is to be minimized by searching for feasible assignments that better satisfy the soft constraints, e.g., increase the utilization of machines while balancing resource contention among the used machines. Conversely, constraint dissatisfaction is measured as a negative value (called score), which is maximized as improved solutions are discovered.

²We consider Dependency as a soft constraint due to the consolidation stress posed by OpenShift's zone- and district-based setup.
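To make the weighted sum concrete, the sketch below transcribes Definition 11 directly; it assumes the individual cost terms and weights have already been computed, and all names are illustrative rather than taken from the paper's implementation:

// Sketch of the weighted-sum objective (Definition 11).
final class ObjectiveFunction {

    static long totalCost(long[] loadCost, long[] loadWeight,       // per resource r
                          long[] balanceCost, long[] balanceWeight, // per balance triple b
                          long containerMoveCost, long containerMoveWeight,
                          long serviceMoveCost, long serviceMoveWeight,
                          long machineMoveCost, long machineMoveWeight) {
        long total = 0;
        for (int r = 0; r < loadCost.length; r++) total += loadWeight[r] * loadCost[r];
        for (int b = 0; b < balanceCost.length; b++) total += balanceWeight[b] * balanceCost[b];
        total += containerMoveWeight * containerMoveCost;
        total += serviceMoveWeight * serviceMoveCost;
        total += machineMoveWeight * machineMoveCost;
        return total; // the search minimizes this, i.e., maximizes score = -total
    }
}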

D. OpenShift Characterization

To comply with the OpenShift Cloud model, we characterize each process p ∈ P given in the original datasets as an OpenShift container v ∈ V of type t ∈ T, based on the classification method shown in Fig. 2. minR(P, r) and maxR(P, r) give the minimum and maximum requirements of resource r among all processes, while R(p, r) gives the requirement of resource r by process p. Lines 7-15 record a process's usage of each resource by incrementing the small, medium and large (s, m, l) flags after comparing with the global usage of that resource. A process using any resource in large proportion is classified as a large container, else as medium, and finally as a small container (lines 16-22).

Next, we characterize each group of machines, called a location l ∈ L in the original datasets, as an OpenShift district d ∈ D hosting containers of subsuming types, based on the classification method shown in Fig. 3. Here, dominanceCount shows that a location dominates another over a resource and is measured for all resources of all machines belonging to each location (lines 14-24). Using this criterion, we characterize the most dominating locations as districts of type large, and in descending order of type medium and small (lines 27-35). The characterization is proportional to the amount of large, medium and small containers in the dataset (lines 1-3). Finally, we ran a controlled search to discover an initial feasible solution. This indicated redundancy among the bigger containers, as our problem does not allow down-scaling while scale-up is permitted. We therefore reduced randomly selected (mostly large) containers to introduce some slack, as our problem already hardens the search more than the base definition [13].

class CharacterizeProcessAsContainer
method Classify(P, R)
 1: for each resource r ∈ R do
 2:   minr ← minR(P, r); maxr ← maxR(P, r)
 3:   portionr ← (maxr − minr) / 3
 4: end for
 5: for each p ∈ P do
 6:   usagep = (s, m, l) ∈ Z³
 7:   for each resource r ∈ R do
 8:     if R(p, r) ≤ (minr + portionr) then
 9:       usagep.s++
10:     else if R(p, r) ≤ (minr + 2 * portionr) then
11:       usagep.m++
12:     else if R(p, r) ≤ maxr then
13:       usagep.l++
14:     end if
15:   end for
16:   if usagep.s ≥ 1 and usagep.m = 0 and usagep.l = 0 then
17:     p.containerType ← small
18:   else if usagep.m ≥ 1 and usagep.l = 0 then
19:     p.containerType ← medium
20:   else if usagep.l ≥ 1 then
21:     p.containerType ← large
22:   end if
23: end for

Figure 2. Container Characterizer

E. Utilization and Power Model

Improving machine utilization remains both a challenge and an advantage for Cloud providers. Reports show that data centers commonly utilize only 10-20% of their server resources [23]. Utilization can be raised by consolidating workloads, which reduces the peak-to-average utilization ratio by executing more load on fewer machines. Since most machines consume up to 70% of their peak-utilization energy even when idle, consolidation can reduce energy and the associated cooling costs by switching idle machines into a low-power mode or turning them off. Power usage has been modeled on the utilization of one or two resources, especially the CPU [7][14]. However, recent advances in processor technologies have resulted in energy-efficient CPUs, while memory, disk and network components are rising contributors to total power consumption. In [15], Barroso and Hölzle show that the CPU contribution to the power consumption of Google servers in 2007 was less than 30%. Since our datasets resemble point-in-time system snapshots rather than time-series data, we aggregate the utilization of all resources to analytically model the utilization u of a machine m:

  u = (Σ_{r ∈ R} U(m, r) / C(m, r)) / |R|   (1)

where U is the usage and C is the total capacity of resource r for machine m. We use u with the power model from [7]:

  P(u) = k · Pmax + (1 − k) · Pmax · u   (2)

class CharacterizeLocationAsDistrict
Let vs ⊂ V : t(vs) = small, vm ⊂ V : t(vm) = medium, vl ⊂ V : t(vl) = large
method Classify(L, M, R, V)
 1: smallDistrictCount ← min(1, ⌊(|L| / |V|) * |vs|⌉)
 2: mediumDistrictCount ← min(1, ⌊(|L| / |V|) * |vm|⌉)
 3: largeDistrictCount ← min(1, ⌊(|L| / |V|) * |vl|⌉)
 4: Map resources
 5: Map locationCapacity
 6: for each location l ∈ L do
 7:   for each machine m ∈ M : l(m) = l do
 8:     for each resource r ∈ R do
 9:       resources[r].addOrUpdate(getResourceCapacity(m, r))
10:     end for
11:   end for
12:   locationCapacity[l] ← resources; resources.reset()
13: end for
14: for each location lout ∈ L do
15:   for each location lin ∈ L : lin ≠ lout do
16:     for each resource r ∈ R do
17:       rout ← locationCapacity[lout].resources[r].capacity
18:       rin ← locationCapacity[lin].resources[r].capacity
19:       if rout > rin then
20:         lout.dominanceCount++
21:       end if
22:     end for
23:   end for
24: end for
25: locationsList ← locations.sortOnDominanceCountDescending()
26: largeCounter = 0, mediumCounter = 0, smallCounter = 0
27: for each location l ∈ locationsList do
28:   if largeCounter < largeDistrictCount then
29:     largeCounter++; l.districtType ← large
30:   else if mediumCounter < mediumDistrictCount then
31:     mediumCounter++; l.districtType ← medium
32:   else if smallCounter < smallDistrictCount then
33:     smallCounter++; l.districtType ← small
34:   end if
35: end for

Figure 3. District Characterizer

where k is the fraction of power used by a machine in the idle state and Pmax is the maximum power consumed by a fully utilized machine. We use k = 0.7 and Pmax = 250 Watt, values that modern machines generally exhibit and that are also used in [7]. Given the mean utilization umean of all used machines, the energy efficiency e of a Cloud can be loosely measured as:

  e = umean / P(umean) × 100   (3)
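Equations (1)-(3) are straightforward to compute. The sketch below is a direct transcription of the formulas under the stated parameter assumptions (k = 0.7, Pmax = 250 W); it is illustrative, not the paper's code:

// Sketch of Equations (1)-(3); names are illustrative.
final class PowerModel {
    static final double K = 0.7;        // idle fraction of peak power
    static final double P_MAX = 250.0;  // Watts at full utilization

    // Eq. (1): mean utilization of a machine across all resources.
    static double utilization(double[] usage, double[] capacity) {
        double sum = 0;
        for (int r = 0; r < usage.length; r++) sum += usage[r] / capacity[r];
        return sum / usage.length;
    }

    // Eq. (2): power drawn at utilization u.
    static double power(double u) {
        return K * P_MAX + (1 - K) * P_MAX * u;
    }

    // Eq. (3): loose energy-efficiency measure from mean utilization.
    static double energyEfficiency(double uMean) {
        return uMean / power(uMean) * 100;
    }
}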

F. SLA Violation Model

Improving utilization to save power costs risks over-provisioning, which negatively affects performance. This power-performance trade-off [14] translates into controlling SLA violations in order to maintain profits and a good reputation. Mindful of prior art [6], we model SLA violations (SLAV) as upper-bound estimates in terms of i) performance degradation due to migration (PDM) and ii) performance degradation due to contention on a machine's resources (PDC). PDM is the time during which containers are underperforming or unavailable due to migrations. To analyze the upper bound, we assume that migrations are performed serially; migrating in parallel surely reduces this value. PDM is defined as:

  PDM = ts · |vs| + tm · |vm| + tl · |vl|   (4)

where ts, tm and tl are the maximum times needed to migrate a small, medium or large container, while vs, vm and vl ⊂ V are the candidate sets of small, medium and large containers to be migrated. In our experiments, we use ts = 10 seconds, tm = 20 seconds and tl = 40 seconds.

The amount of resource r used above its safety capacity is saved as load(r). PDC is measured as risk tertiles, where tertile1 represents low contention, tertile2 medium contention and tertile3 high contention. An SLA violation only occurs when resource use exceeds 100% of the total capacity, but low contention (loadCost) is desired by balancing load among all used machines. The tertiles allow estimating the SLA violation risk if contention requirements can be relaxed by extending the safety capacity to tertile 1, 2 or 3. This increases utilization at the cost of some extra power but can reduce the mean time to failure (MTTF) of a machine. PDC is defined as:

  Cont_r = C(m, r) − SC(m, r), r ∈ R   (5)

  PDC:  if load(r) ≤ 1/3 · Cont_r       → tertile1++
        else if load(r) ≤ 2/3 · Cont_r  → tertile2++
        else (load(r) > 2/3 · Cont_r)   → tertile3++   (6)

For a dataset, the PDC tertiles are incremented over all machines. Finally, SLAV is defined as the product of PDM and PDC:

  SLAV = PDM · max(1, PDC)   (7)

Here, PDC can be: i) the sum of all tertiles, ii) the sum of tertiles 2 and 3, or iii) only tertile 3. This gives up to three possible SLAV values, as plotted in Figs. 5(i-l) for the solved datasets.
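A minimal sketch of Equations (4)-(7) follows. The migration times are the experimental values stated above; the assumption that only machine/resource pairs with positive load contribute a tertile, and all names, are ours:

// Sketch of Equations (4)-(7); names are illustrative.
final class SlaViolationModel {
    static final int T_SMALL = 10, T_MEDIUM = 20, T_LARGE = 40; // seconds

    // Eq. (4): serial (upper-bound) migration time in seconds.
    static long pdm(int smallMoves, int mediumMoves, int largeMoves) {
        return (long) T_SMALL * smallMoves + (long) T_MEDIUM * mediumMoves
             + (long) T_LARGE * largeMoves;
    }

    // Eqs. (5)-(6): increment contention tertiles over all machine/resource pairs.
    // load[m][r] is usage above safety capacity; cont[m][r] = C(m,r) - SC(m,r).
    static int[] pdcTertiles(double[][] load, double[][] cont) {
        int[] tertiles = new int[3];
        for (int m = 0; m < load.length; m++) {
            for (int r = 0; r < load[m].length; r++) {
                if (load[m][r] <= 0) continue;               // assumption: no contention, no tertile
                if (load[m][r] <= cont[m][r] / 3) tertiles[0]++;
                else if (load[m][r] <= 2 * cont[m][r] / 3) tertiles[1]++;
                else tertiles[2]++;
            }
        }
        return tertiles;
    }

    // Eq. (7), using the strictest criterion (sum of all tertiles) as PDC.
    static long slav(long pdm, int[] tertiles) {
        int pdc = tertiles[0] + tertiles[1] + tertiles[2];
        return pdm * Math.max(1, pdc);
    }
}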

V. EXPERIMENTAL EVALUATIONS

A. Datasets

We used simulations to evaluate the algorithms over the given datasets in a reproducible manner. Table I details the unconsolidated datasets; we retain the dataset names from [13] for recognition. A dataset represents a Cloud configuration and workload. Datasets A.2.2 and A.2.5 represent Clouds of small scale, with 100 and 50 machines hosting 170 and 153 services respectively. Dataset B.1 represents a medium-scale Cloud with 100 machines running 15 times more services than the smaller datasets. Dataset B.4 represents a large-scale Cloud of 500 machines running fewer services than B.1 but a much larger number of containers. A.2.2, A.2.5 and B.1 have 12 resources while B.4 has 6, which makes the search very hard. Overall, small containers make up 60-70%, medium 25-32% and large 3-8% of the total population, showing that small districts may grow faster. Capacity is planned accordingly.

B. Algorithms

Machine heterogeneity and different resource usages provide the desired variability to resemble real-world dynamics. The large state spaces, the high dimensionality due to multiple resources and the excessive constraints all provide strong motivation to use Metaheuristic search algorithms, which also avoid the trap of getting stuck in local optima. The open-source OptaPlanner [24] solver implements various Metaheuristics and was used in this work. Due to space restrictions, we only mention key configurations here. A random mix of change and swap moves was used for all algorithms. Tabu Search (TS) was used with a tabu list of size 7, where one element represents a feasible assignment for a single container. TS produces numerous feasible moves, a subset of which is evaluated in each step; for TS, this evaluation size was set to 2000. Simulated Annealing (SA) requires a starting temperature value to factor the score difference; this was set to 0 for hard and 400 for soft constraints. The algorithm allows some non-improving moves in the beginning but becomes elitist depending on the time gradient. For SA, the evaluation size was set to 5. Like SA, Late Acceptance (LA) makes fewer moves and improves the score over certain late steps. This late size was set to 2000 and the evaluation size to 500 for LA. Late Simulated Annealing (LSA) hybridizes SA with LA, allowing improvement but also controlled random decrements in score. For LSA, the late size was set to 100 and the evaluation size to 5.
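These parameters map onto OptaPlanner's local-search configuration. The following is a minimal sketch in OptaPlanner's XML solver-config format (shown for Tabu Search; element names follow the OptaPlanner documentation of that era and may differ across versions — this is an assumption, not the paper's actual configuration):

<!-- Sketch: local-search phase for the Tabu Search run. -->
<solver>
  <localSearch>
    <unionMoveSelector>
      <changeMoveSelector/> <!-- reassign one container to another machine -->
      <swapMoveSelector/>   <!-- swap the machines of two containers -->
    </unionMoveSelector>
    <acceptor>
      <entityTabuSize>7</entityTabuSize> <!-- tabu list of size 7 -->
    </acceptor>
    <forager>
      <acceptedCountLimit>2000</acceptedCountLimit> <!-- evaluation size -->
    </forager>
  </localSearch>
</solver>

<!-- SA: <simulatedAnnealingStartingTemperature>0hard/400soft</...>, acceptedCountLimit 5 -->
<!-- LA: <lateAcceptanceSize>2000</lateAcceptanceSize>, acceptedCountLimit 500 -->
<!-- LSA: <lateAcceptanceSize>100</lateAcceptanceSize>, acceptedCountLimit 5 -->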

Table I. CLOUD DATASET DETAILS

Name  | Resources | Districts (S, M, L) | Zones | Machines (S, M, L) | Services | Containers (S, M, L) | State Space | Utilization/Machine | Consumed Power (kWh) | Energy Efficiency | Initial Solution Score
A.2.2 | 12        | 15, 8, 2            | 5     | 60, 32, 8          | 170      | 434, 222, 48         | 10^1148     | 35%                 | 19.320               | 45.25%            | -18312546
A.2.5 | 12        | 15, 8, 2            | 5     | 30, 16, 4          | 153      | 455, 245, 63         | 10^1004     | 37%                 | 10.151               | 45.5%             | -185103629
B.1   | 12        | 6, 3, 1             | 5     | 60, 30, 10         | 2512     | 2572, 1272, 147      | 10^6598     | 48%                 | 21.122               | 56.75%            | -945528368
B.4   | 6         | 35, 13, 2           | 5     | 350, 130, 20       | 1732     | 10964, 3955, 543     | 10^36959    | 37%                 | 101.199              | 45%               | -18475614730

C. Simulation Results

We ran each algorithm for 5 minutes. The results of the consolidated solutions are presented in Table II. Figs. 5(a-d) reveal the search pattern for score improvement. In (a), SA finds the best score in less than 15 seconds while LSA trails closely behind. In (b), SA outshines the rest after 20 seconds and LSA lands second, but with a larger gap. In (c), LA takes a wide lead after 175 seconds and is trailed by TS. In (d), LA takes the lead after 100 seconds but is closely followed by TS. The solved solution scores show that SA is best suited for small state spaces (A.2.2 and A.2.5) and LA for medium state spaces (B.1), while for large state spaces (B.4), LA is slightly better than TS. Fig. 4 summarizes the score improvement over the initial solution after consolidation. Averaging these results over all datasets places SA as the best score-improving algorithm with 32.65%, LSA second with 32.04%, LA third with 31.79% and TS fourth with 28.89%.

Table II. CONSOLIDATION USING TABU SEARCH (TS), SIMULATED ANNEALING (SA), LATE ACCEPTANCE (LA) AND LATE SIMULATED ANNEALING (LSA)

Dataset (Alg.) | Solved Solution Score | Machines (S+M+L) | Utilization/Machine | Consumed Power (kWh) | Reduced Load on Resources | Energy Efficiency | Migrations | PDM (min) | Container Scaleup | PDC Tertiles (1, 2, 3)
A.2.2 (TS)     | -17742166             | 95               | 35%                 | 19.141               | 50%                       | 43.43%            | 206        | 66        | 85%               | (1, 1, 1)
A.2.2 (SA)     | -15012210             | 95               | 35%                 | 19.126               | 83%                       | 43.46%            | 220        | 69        | 80%               | (0, 0, 1)
A.2.2 (LA)     | -17741965             | 95               | 35%                 | 19.144               | 50%                       | 43.42%            | 205        | 66        | 85%               | (1, 1, 1)
A.2.2 (LSA)    | -15023979             | 95               | 35%                 | 19.092               | 83%                       | 43.54%            | 369        | 95        | 40%               | (0, 0, 1)
A.2.5 (TS)     | -50622223             | 47               | 40%                 | 9.618                | 58%                       | 48.87%            | 345        | 119       | 34%               | (17, 4, 4)
A.2.5 (SA)     | -28774923             | 46               | 41%                 | 9.454                | 71%                       | 49.87%            | 351        | 119       | 30%               | (3, 6, 8)
A.2.5 (LA)     | -47511660             | 47               | 40%                 | 9.621                | 51%                       | 48.85%            | 352        | 120       | 34%               | (22, 3, 4)
A.2.5 (LSA)    | -45311736             | 47               | 40%                 | 9.619                | 56%                       | 48.86%            | 461        | 136       | 21%               | (20, 2, 4)
B.1 (TS)       | -642113741            | 99               | 49%                 | 20.951               | 12%                       | 57.89%            | 2237       | 531       | 9%                | (77, 10, 6)
B.1 (SA)       | -751389259            | 98               | 49%                 | 20.766               | 23%                       | 57.81%            | 2362       | 551       | 10.5%             | (64, 14, 4)
B.1 (LA)       | -544596716            | 99               | 49%                 | 20.948               | 15%                       | 57.89%            | 2237       | 532       | 9%                | (72, 14, 4)
B.1 (LSA)      | -691537717            | 99               | 49%                 | 20.947               | 11%                       | 57.9%             | 2935       | 647       | 7%                | (73, 15, 6)
B.4 (TS)       | -17116530373          | 500              | 36%                 | 101.122              | 100%                      | 44.5%             | 14883      | 3384      | 17%               | (0, 0, 0)
B.4 (SA)       | -17136833433          | 500              | 36%                 | 101.173              | 93%                       | 44.48%            | 15176      | 3433      | 14%               | (18, 4, 2)
B.4 (LA)       | -17116172349          | 500              | 36%                 | 101.153              | 100%                      | 44.49%            | 14598      | 3333      | 17%               | (0, 0, 0)
B.4 (LSA)      | -17136527221          | 500              | 37%                 | 101.21               | 93%                       | 45.7%             | 15207      | 3437      | 14%               | (17, 3, 4)

Figure 4. Score improvement over initial solution

Figs. 5(e-h) show how many container migrations (and the corresponding PDM) are proposed by the best solution of each algorithm. In (e), TS and LA have a slight advantage over SA, whose solution score was the best. In (f), TS again proposes the fewest migrations but the margin with SA is very low; considering its high lead in solution score, SA is a clear winner. In (g), LA and TS propose the same number of migrations, and LA can be preferred due to its wide margin over TS in terms of solution score. In (h), LA proposes the fewest migrations and distinguishes itself from TS. LSA is a clear outlier and proposes the most migrations; if few migrations are a strict preference, LSA is not a good choice.

Figs. 5(i-l) show SLA violations (SLAV) with relaxing PDC tertile values (see Table II). Plot (i) shows three regions against the three PDC values. Using least contention (PDC = sum of all tertiles) as our evaluation criterion, in (i) SA and LSA cut even, but SA proposes lower SLAV due to its lower PDM. In (j), SA wins over TS with lower SLAV, as PDM is the same for both. In (k), SA again creates low contention and hence low SLAV, while LA comes second. In (l), LA and TS remarkably eliminate all contention on 500 machines; LA gives the least SLAV due to its lower PDM compared to TS. Using the same evaluation criterion, Fig. 6 shows how consolidation significantly reduced SLA violations in comparison to the unconsolidated system state. The only exception is LSA, which increased SLAV by 0.02% on dataset B.1 for the used PDC criterion; it does, however, perform well for the relaxed criteria. Taking the mean of Fig. 6's results over all datasets, SA takes the lead by reducing SLA violations by up to 70.38%, TS is second with 60.18%, LA third with 59.43% and LSA last with 58.38%.

Overall, consolidation improved the mean utilization of machines while using fewer machines and less energy in most cases. In A.2.2, load on resources is reduced by up to 83%, total energy consumption is reduced by 1%, utilization remains the same, but 5% fewer machines are used. In A.2.5, load on resources is reduced by up to 71%, energy is reduced by 7%, utilization improves by up to 3%, and 8% fewer machines are used. In B.1, load on resources is reduced by up to 23%, energy is reduced by 2%, utilization improves by 1%, and 2% fewer machines are used. In B.4, load on resources is reduced by up to 100%, energy savings and utilization show little improvement, and the same number of machines is used, hinting that a longer search could be tried at larger scale. The energy consumption is taken as the mean value of hourly consumption. Note that the energy savings of 1-7% observed here multiply into reasonable monthly savings, as shown in Fig. 7. The mean value over all datasets positions SA as our most environmentally friendly choice with monthly savings of 229.14 kWh; TS is second with 172.8 kWh, LA third with 166.68 kWh and LSA last with 166.32 kWh. Our results confirm that the major gains from Service Consolidation lie in reduced energy costs and reduced SLA violations. This is achieved through a balanced redistribution of workload on machines, performing migrations such that the resultant loadCost is substantially reduced. Table II also shows the energy-efficiency and container scale-up values. The energy efficiency must be considered in reference to the machines used; hence, we normalize it to the maximum energy efficiency possible with the used machines.

Figure 5. (a), (b), (c), (d): Score improvement pattern for the 5-minute search. (e), (f), (g), (h): Number of migrations and PDM. (i), (j), (k), (l): Drop in SLA violations with relaxing PDC (as i) sum of all tertiles (north-east region), ii) sum of tertiles 2 and 3 (mid-region), iii) tertile 3 (south-west region)).

Figure 6. Reduction in SLA violations over the initial solution

Figure 7. Energy savings per month

D. Policy-led Ranking of Solutions

Our analysis so far unfolds the decision-making challenge faced by Cloud service providers. We argue that the presented metrics must be harnessed and traded off according to a high-level policy. To this goal, we rank solutions by aggregating the weighted utility of individual metrics, where the weights reflect the business preferences of the policy in use. This is achieved by the following utility function:

  U(sol) = Σ_{i=1}^{N} wi · ui(xi)   (8)

Here, U is the utility of a solution sol, ui(xi) gives the utility of metric xi normalized as a real value in the range [0, 1], wi is the weight assigned to ui, and Σ_{i=1}^{N} wi = 1. This policy-led, utility-oriented scheme obscures the complexity of individual metrics and decides on the most preferred solution. We normalized nine metrics to derive individual utilities. For the solution score, the difference between the initial and solved scores was normalized over the maximum difference; the same was applied to the machines used and the energy consumed before and after consolidation. The values for mean utilization and load reduction were used as such, since they are already normalized. Some metrics represent negative properties: the sum of PDC tertiles, the number of migrations, PDM and container scale-ups. The differences in their values were first normalized and then subtracted from 1 to obtain their positive utility. Energy efficiency was not considered since it is more complex to interpret. We present five business policies to rank the algorithms on each dataset.
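A minimal sketch of this ranking scheme follows, assuming the per-metric utilities have already been normalized to [0, 1]; the metric ordering and the example utility values are made up for illustration:

// Sketch of Equation (8): weighted-sum utility and policy-led ranking.
import java.util.*;

final class PolicyRanker {

    static double utility(double[] u, double[] w) {
        double total = 0;
        for (int i = 0; i < u.length; i++) total += w[i] * u[i];
        return total;
    }

    // Rank solutions (e.g., one per algorithm) under a given policy's weights.
    static List<String> rank(Map<String, double[]> utilitiesByAlgorithm, double[] weights) {
        List<String> names = new ArrayList<>(utilitiesByAlgorithm.keySet());
        names.sort(Comparator.comparingDouble(
                (String n) -> utility(utilitiesByAlgorithm.get(n), weights)).reversed());
        return names;
    }

    public static void main(String[] args) {
        // High Score Policy over nine metrics: score weighted 0.5,
        // the remaining eight metrics 0.0625 each (0.5 + 8 * 0.0625 = 1).
        double[] highScore = new double[9];
        Arrays.fill(highScore, 0.0625);
        highScore[0] = 0.5; // index 0: normalized solution-score utility
        Map<String, double[]> u = new HashMap<>();
        u.put("TS", new double[]{0.7, 0.8, 0.6, 0.7, 0.5, 0.6, 0.7, 0.8, 0.6}); // made-up values
        u.put("SA", new double[]{0.9, 0.7, 0.6, 0.8, 0.6, 0.7, 0.6, 0.7, 0.7}); // made-up values
        System.out.println(rank(u, highScore)); // e.g., [SA, TS]
    }
}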

High Score Policy: This policy prefers a high solution score and assigns it a weight of 0.5, while the remaining metrics are weighted 0.0625 each. Results are shown in Table III.

Table III. ALGORITHMS RANKED ON HIGH SCORE POLICY

Rank | A.2.2 | A.2.5 | B.1 | B.4
1    | SA    | SA    | LA  | TS
2    | LSA   | TS    | TS  | LA
3    | TS    | LSA   | SA  | SA
4    | LA    | LA    | LSA | LSA

Low Migration Policy: This policy favors low migrations and weighs PDM and the number of migrations at 0.25 each, while the remaining metrics are weighted equally at 0.071428571. Results are shown in Table IV.

Table IV. ALGORITHMS RANKED ON LOW MIGRATION POLICY

Rank | A.2.2 | A.2.5 | B.1 | B.4
1    | SA    | SA    | SA  | TS
2    | TS    | TS    | LA  | LA
3    | LA    | LA    | TS  | SA
4    | LSA   | LSA   | LSA | LSA


Low Contention Policy: This policy favors low resource contention and weighs PDC and the load reduced on resources at 0.25 each, while the other metrics are equally weighted at 0.071428571. Results are shown in Table V.

Table V. ALGORITHMS RANKED ON LOW CONTENTION POLICY

Rank | A.2.2 | A.2.5 | B.1 | B.4
1    | SA    | SA    | SA  | TS
2    | LSA   | TS    | LA  | LA
3    | TS    | LSA   | TS  | SA
4    | LA    | LA    | LSA | LSA

Low SLA Violations Policy: This policy favors low SLA violations and hence assigns a weight of 0.25 to both PDM and PDC. The remaining metrics are equally weighted at 0.071428571. Results are shown in Table VI.

Table VI. ALGORITHMS RANKED ON LOW SLA VIOLATIONS POLICY

Rank | A.2.2 | A.2.5 | B.1 | B.4
1    | SA    | SA    | SA  | TS
2    | LSA   | TS    | LA  | LA
3    | TS    | LA    | TS  | SA
4    | LA    | LSA   | LSA | LSA

Low Energy Policy: This environmentally friendly policy favors low energy consumption and assigns a weight of 0.5 to the energy saved, while the remaining metrics are weighted equally at 0.071428571. Results are shown in Table VII.

Table VII. ALGORITHMS RANKED ON LOW ENERGY POLICY

Rank | A.2.2 | A.2.5 | B.1 | B.4
1    | LSA   | SA    | SA  | TS
2    | SA    | TS    | LA  | LA
3    | TS    | LA    | TS  | SA
4    | LA    | LSA   | LSA | LSA

Analyzing at a coarse-grained level, the results reveal that for four of the five policies, Simulated Annealing ranked first, with a total of 13 wins on the small and medium datasets (A.2.2, A.2.5, B.1). Tabu Search ranked first for all policies on the large dataset (B.4) and accumulated 5 wins. Late Acceptance ranked first only for the high score policy on the medium dataset (B.1) and, similarly, Late Simulated Annealing ranked first on the small dataset (A.2.2) for the low energy policy. For the presented work, Simulated Annealing can be regarded as the maximum-yielding algorithm for implementing most policies.

VI. CONCLUSION AND FUTURE WORK

In this work, we presented valuable insights for SLA-aware resource management and capacity planning in PaaS Clouds by solving the Service Consolidation problem framed for the OpenShift PaaS. To our knowledge, this is one of the first works on the subject using multiple Metaheuristic algorithms and holistic evaluations. Future prospects include expanding the evaluations to more metrics, constraints and datasets. Further, we are investigating negotiated SLAs to yield maximum return on investment for Cloud customers and providers.

ACKNOWLEDGMENT

We are grateful to Geoffrey De Smet of RedHat for his guidance on using the OptaPlanner solver. This research is supported by the EU (FP7/2007-2013) project PaaSage under grant no. 317715 and by Gesellschaft für wissenschaftliche Datenverarbeitung mbH, Göttingen (GWDG).

REFERENCES

[1] P. Mell and T. Grance, "The NIST Definition of Cloud Computing", 2011.
[2] IDC Press Release, [Online]. Available: http://www.idc.com/getdoc.jsp?containerId=prUS24435913, 2013.
[3] GigaOM, [Online]. Available: http://research.gigaom.com/report/sector-roadmap-platform-as-a-service-in-2012, 2012.
[4] OpenShift, [Online]. Available: https://www.openshift.com.
[5] Google Cloud Platform, [Online]. Available: http://google-opensource.blogspot.de/2014/06/an-update-on-container-support-on.html, 2014.
[6] A. Beloglazov and R. Buyya, "Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in Cloud data centers", in Concurrency and Computation: Practice and Experience, 24:1397-1420, 2012.
[7] A. Beloglazov, J. Abawajy, and R. Buyya, "Energy-aware resource allocation heuristics for efficient management of data centers for Cloud computing", in Future Generation Computer Systems (FGCS), vol. 28, no. 5, pp. 755-768, 2012.
[8] L. Wu, S. K. Garg, and R. Buyya, "SLA-based Resource Allocation for Software as a Service Provider in Cloud Computing Environments", in 11th IEEE/ACM CCGrid, pp. 195-204, 2011.
[9] J. Koomey, "Growth in data center electricity use 2005 to 2010", Oakland, CA, USA, [Online]. Available: http://www.analyticspress.com/datacenters.html
[10] G. Zhang, "On Variable-Sized Bin Packing", in Proc. of 3rd International Workshop on ARANCE, Roma, Italy, 2002.
[11] K. Dhyani, et al., "A Constraint Programming Approach for the Service Consolidation Problem", in Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, Springer, pp. 97-101, 2010.
[12] J. Anselmi, E. Amaldi, and P. Cremonesi, "Service Consolidation with End-to-End Response Time Constraints", in 34th Euromicro Conf. Softw. Eng. and Adv. App., pp. 345-352, 2008.
[13] Google ROADEF/EURO Challenge 2012, [Online]. Available: http://challenge.roadef.org/2012/en/index.php, 2012.
[14] S. Srikantaiah, A. Kansal, and F. Zhao, "Energy Aware Consolidation for Cloud Computing", in Proc. of the 2008 Conference on Power Aware Computing and Systems, 2008.
[15] L. Barroso and U. Hölzle, "The Case for Energy-Proportional Computing", IEEE Computer, 40(12), pp. 33-37, 2007.
[16] E. Yaqub, R. Yahyapour, P. Wieder, C. Kotsokalis, K. Lu, and A. I. Jehangiri, "Optimal Negotiation of Service Level Agreements for Cloud-based Services through Autonomous Agents", in 11th IEEE SCC, USA, 2014.
[17] E. Yaqub, R. Yahyapour, P. Wieder, and K. Lu, "A Protocol Development Framework for SLA Negotiations in Cloud and Service Computing", in 9th GECON, pp. 1-15, Springer, 2012.
[18] A. I. Jehangiri, R. Yahyapour, P. Wieder, E. Yaqub, and K. Lu, "Diagnosing Cloud Performance Anomalies using Large Time Series Dataset Analysis", in 7th IEEE CLOUD, USA, 2014.
[19] W. Theilmann, et al., "A Reference Architecture for Multi-Level SLA Management", in J. Internet Eng., pp. 289-298, 2010.
[20] OpenShift QuickStarts, [Online]. Available: https://github.com/openshift-quickstart, 2014.
[21] D. Pandit, S. Chattopadhyay, et al., "Resource Allocation in Cloud using Simulated Annealing", in AIMoC, pp. 21-27, 2014.
[22] F. Larumbe and B. Sanso, "A Tabu Search Algorithm for the Location of Data Centers and Software Components in Green Cloud Computing Networks", in IEEE TCC, 1(1), 2013.
[23] J. Hamilton, "Infrastructure Innovation Opportunities", [Online]. Available: http://mvdirona.com/jrh/talksAndPapers/JamesHamilton_FirstRoundCapital20121003.pdf, 2012.
[24] OptaPlanner, [Online]. Available: optaplanner.org, 2014.