Predicting Web Service Levels During VM Live Migrations - DMTF

Predicting Web Service Levels During VM Live Migrations

Predicting Web Service Levels During VM Live Migrations 5th International DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and the Cloud Helmut Hlavacs, Thomas Treutner

24. 10. 2011

1 / 31


Table of Contents

1

Scenario

2

Experiment

3

Data overview

4

Model selection

5

Conclusions

2 / 31


Abstract

• VM live migration important for energy efficiency • Establish energy efficient target distribution of VMs • No perceivable downtime while live migrating?

• Live migration is resource intensive (iterative page copying) • Experiments: Influence on service levels while migrating? • Modelling: Predict service levels based on utilization?

3 / 31

Predicting Web Service Levels During VM Live Migrations Scenario

Scenario

Scenario • Virtualized data center, static consolidation (P2V)

• Provisioning for peak load, still bad energy efficiency • E.g., 9-5-cycles, very low utilization at night

• Live migration enables dynamic consolidation

• But: Seldom used, fear of possible side effects

• ⇒ Identify and quantify effects on (web) service levels

4 / 31

Predicting Web Service Levels During VM Live Migrations Scenario

Scenario

 





  





















 





5 / 31

Predicting Web Service Levels During VM Live Migrations Experiment

Experiment

Experiment • 2 servers, 1 VM, migrating forth and back

• VM disk image on central node (Gbit, open-iscsi)

• qemu-kvm VM: Linux, Apache2, PHP5, MediaWiki • SQL VM and load generation on an extra nodes

• Logging utilization of servers and VM, >100 variables

• Rise load from 50 to 600 concurrent virtual users and back

• Migrate every 15min, track response time of last 5min (SLA=1s)

6 / 31

Predicting Web Service Levels During VM Live Migrations Experiment

Experiment 

 

  



 

   



      



 7 / 31

Predicting Web Service Levels During VM Live Migrations Data overview

CPU utilization

CPU Utilization (VM normed)

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

20

40

60

80 100 120 140 160 180 200 Interval

Abbildung: Effect of increasing/decreasing HTTP workload in equal steps on the VM’s CPU utilization 8 / 31


Load average 6

Load5 Average

5 4 3 2 1 0 0

20

40

60

80 100 120 140 160 180 200 Interval

Abbildung: Effect of increasing/decreasing HTTP workload in equal steps on the VM’s UNIX load average. 9 / 31


Markov Queues

• UNIX load as approximation to Q, the average number of jobs

in an M/M/1 queue, CPU utilization as ρ, the system utilization

• Q M/M/1 =

ρ2 1−ρ

• UNIX load exponentially averaged by definition, service times

are not necessarily exactly exponentially distributed, a systematic deviation from the theoretical solution is expectable.

• Q M/G/1 =

ρ2 1−ρ

·

(1+cB2 ) , 2

cB : coefficient of service time variation

• Exponentially distributed service times: cB2 = 1 • Deterministic service times: cB2 = 0

• Simple linear regression: cB2 = 0.42

10 / 31


8 Measured M/M/1 M/D/1 M/G/1, c2B=0.42

7 Load5 Average

6 5 4 3 2 1 0 0

0.1

0.2 0.3 0.4 0.5 0.6 0.7 CPU Utilization (VM normed)

0.8

0.9

Abbildung: Queuing issues are rapidly increasing at 60-70% CPU utilization. The measured data follow the trend of the theoretical solution very closely.

11 / 31


Response Time

Response Time [ms]

25000 20000 15000 10000 5000 0 06

12

18

00

06 12 Hours

18

00

06

12

Abbildung: HTTP response times are massively increased during phases with higher workload. 12 / 31


Service Level 1

Service Level Ratio

0.98 0.96 0.94 0.92 0.9 0.88 0.86 0.84 0.82 0

20

40

60

80 100 120 140 160 180 200 Interval

Abbildung: Service levels decrease significantly during phases of high workload. The maximum allowed response time is 1 s, which is extremely conservative.

13 / 31


Load average vs Service Level 1

Measured Theoretic

0.98

Service Level

0.96 0.94 0.92 0.9 0.88 0.86 0.84 0.82 0

1

2 3 4 Load5 Average

5

6

Abbildung: Higher load average is an indicator for queuing issues which are causing decreased service level ratios. The measured data follow the theoretical solution closely for higher load values.

14 / 31


Response Time: Low Load 4

Response Time [s]

3.5

HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time 1 0.99 0.98 0.97 0.96 0.95

3 2.5 2

SLA Ratio

4.5

1.5 1 0.5 0 Time

Abbildung: Influence of live migration on HTTP response time and service level during low load. 15 / 31


Response Time: Medium Load 16 14

HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time

1 0.95

10

0.9

8

0.85

SLA Ratio

Response Time [s]

12

6 4 2 0 Time

Abbildung: Influence of Live Migration on HTTP Response Time and service level during medium load. 16 / 31


Response Time: High Load 20

HTTP Response Time Migration started Migration finished SLA Ratio Maximum allowed response time

1 0.9 0.8

10

0.7

SLA Ratio

Response Time [s]

15

0.6 5 0.5

0 Time

Abbildung: Influence of Live Migration on HTTP Response Time and service level during high load. 17 / 31

Predicting Web Service Levels During VM Live Migrations Model selection

Model selection Stepwise: AIC • Akaike Information Criterion (AIC), lower values are better • AIC = 2k − 2 ln(L)

• k: the number of independent parameters of a model • L: the maximum likelihood

• Stepwise model selection approach

• Find trade-off between number of parameters (model size) and

goodness of fit (model quality)

• Start with empty model, in each step a variable can be

added/removed, until the AIC can not be decreased further.

• Does not guarantee to find a global optimum, but typically gives a

near-optimum result within reasonable time.

18 / 31


Model selection

Exhaustive: LEAPS • Comparison with the stepwise AIC approach • Exhaustive all-subsets-regression

• Find best of all possible models for given range of model size

• Computationally intensive even if number of variables is limited

19 / 31


AIC -4200 -4250

AIC value in current step Final AIC value

AIC value

-4300 -4350 -4400 -4450 -4500 -4550 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Step

Abbildung: The AIC value quickly improves during the first steps and then slowly converges to the final value, thus allowing a trade-off between model complexity and precision.

20 / 31


AIC Standard Deviation of Residuals

0.018

Best value

0.017 0.016 0.015 0.014 0.013 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Step

Abbildung: The residuals’ standard deviation quickly improves during the first steps and then slowly converges to the final value, when using stepwise AIC. 21 / 31


AIC

Abbildung: QQ-Plot of standardized residuals for the model produced by stepwise AIC. 22 / 31


AIC

Abbildung: There is no visible correlation between the residuals’ variance and the time series. 23 / 31


LEAPS 0.97

Best value

0.96

R2Adj

0.95 0.94 0.93 0.92 0.91 0.9 1

2

3

4

5

6 7 8 Model Size

9

10 11 12

Abbildung: Higher model complexity yields better model quality when using LEAPS, but the increase is rather small. 24 / 31


LEAPS Standard Deviation of Residuals

0.025

Best value

0.024 0.023 0.022 0.021 0.02 0.019 0.018 0.017 0.016 0.015 1

2

3

4

5

6 7 8 Model Size

9

10 11 12

Abbildung: Higher model complexity yields better regression error when using LEAPS. 25 / 31


AIC: Most influencing variables of final model Variable Intercept wp01 load5 wp01 swapUsed wp01 residentSize SQ src host cpu proc s src host cpu proc s SQ wp01 cpu util vmnorm SQ wp01 cpu util vmnorm wp01 load5 SQ wp01 freeMemRatio SQ

Meaning VM UNIX load5 VM swap used The squared amount of resident memory used by the qemu-kvm process. Tasks created/s, source host. CPU util measured inside VM The squared UNIX load5 of the VM. The squared ratio of free memory inside the VM.

Estimate 2.395e+00 -1.871e-02 -7.656e-07 -4.652e-14

Std. Error 5.069e-01 2.627e-03 8.809e-08 1.652e-14

Pr (> |t|) 3.00e-06 3.67e-12 < 2e-16 0.00506

-2.475e+00

9.166e-01

0.00716

1.091e+00 -1.328e-01 9.517e-02

4.133e-01 3.187e-02 2.316e-02

0.00856 3.63e-05 4.64e-05

-1.140e-03

1.918e-04

5.22e-09

1.976e-02

9.462e-03

0.03727

Tabelle: The most influencing and significant variables of the final model produced by stepwise AIC. 26 / 31

Predicting Web Service Levels During VM Live Migrations Conclusions

Conclusions

Qualitative 1

Impact of live migration on SL depends on amount of workload

2

Tighter SLAs can be fulfilled for low workload situations

3

High workload: SL decreases massively

4

⇒ Live migrating when VM has only medium load

5 6 7

Avoid multiple live migrations within short time ⇒ Inertia effect for recently migrated VMs

Avoid senseless flip-flop migrations and stabilize service level

27 / 31


Conclusions

Quantitative 1

SL variance during a live migration to 90% predictable using only a single variable, the UNIX load5 average.

2

Models with 12 variables can explain 95% of variance

3

Models selected by AIC/LEAPS are of comparable quality and robust to statistical tests

4

AIC is way faster, so LEAPS is not really necessary

28 / 31


Conclusions

Technical 1

Systems using live migration as a mechanism to realize a more energy efficient target distribution need to consider the UNIX load average, at least for web servers and comparable

2

Typical hypervisors do not collect and export this information

3

Needs to be done by VM introspection

4

Related efforts by qemu-kvm and libvirt developers to pass-through the VMs’ memory utilization (free mem)

5

Should be extended to export load information

29 / 31


Conclusions

Future Work 1

Influence of additional VMs • idle • utilized • and a mixed scenario

2

Linux and qemu-kvm: Kernel Samepage Merging (KSM)

3

Database VM migration

4

Bandwidth limits, maximum allowed downtime

5

Migration delay, energy consumption, service downtime

30 / 31


Q&A Thank you for your attention!

31 / 31

Predicting Web Service Levels During VM Live Migrations - DMTF

Predicting Web Service Levels During VM Live Migrations - DMTF

Suggest Documents

Predicting Web Service Levels During VM Live Migrations - DMTF

Predicting Query Reformulation During Web Searching

DSP1025 - DMTF

Web Services for Management (WS-Management ... - DMTF

Circulating IGF-I levels in monitoring and predicting efficacy during ...

DSP1025 - DMTF

Circulating IGF-I levels in monitoring and predicting efficacy during ...

DSP0263 - DMTF

DSP2024 - DMTF

Cloud Management Architecture - DMTF

Predicting Geomorphic Change during Large Floods ... - Forest Service

Scheduling Live-Migrations for Fast, Adaptable and ...

A Survey on Live Virtual Machine Migrations and its Techniques

The Efficacy of Live Virtual Machine Migrations ... - Clemson University

Predicting Web Service Maintainability via Object ... - Semantic Scholar

Predicting Web Service Maintainability via Object ... - Semantic Scholar

The Efficacy of Live Virtual Machine Migrations Over the ... - CiteSeerX

Averaged Vibration Levels During Courier Parcel Delivery Service in ...

OVF 2.0 - DMTF

SMBIOS Specification - DMTF

Predicting Service - NIST

SMBIOS Specification - DMTF

Performance Analysis of Live and Offline VM Migration ... - MECS Press

Plasma catecholamine levels during