Timely Virtual Machine Migration for Pro-Active Fault ... - CiteMaster

Timely Virtual Machine Migration for Pro-Active Fault Tolerance

Keywords-meta-learning, virtualization, failure prediction, live migration, monitoring

Andreas Polze, Peter Tröger
Operating Systems and Middleware Group
Hasso-Plattner-Institute at University of Potsdam
Potsdam, Germany
(andreas.polze/peter.troeger)@hpi.uni-potsdam.de

I. INTRODUCTION

Achieving system dependability through redundancy in space is a traditional approach for reliable server systems. Various protocols cope with transient and permanent faults through redundant resources on or above the operating system level. Analyses of large-scale systems have shown a mean time between failures (MTBF) in the order of 6.5 to 40 hours [1], depending on installation maturity. Google, for example, experiences an MTBF in the order of one hour, which is hidden from the running server software (and therefore from the users) through fault-tolerant middleware and file systems [2].

With the advent of multi-core and many-core CPUs in commodity servers such as blade centers, dependability challenges that once were of interest only to a small community of researchers and HPC users are about to seriously impact tomorrow's server environments. One example is the technique of effectively overclocking one core while suspending all remaining (currently unused) cores. This dynamic voltage and frequency scaling (DVFS) capability is used in multi-core CPUs to achieve further improvements in serial application performance. However, a side effect of DVFS can be an increased processor exposure to soft errors that may severely affect system reliability [3][4]. Memory also shows an increasing failure rate due to shrinking structure sizes and increasing scales [5].

The IT industry has already started to address this upcoming dependability challenge with a set of new hardware monitoring and fault tolerance solutions. Examples are the Intel Machine Check Architecture (MCA) and the Predictive Failure Analysis (PFA) features offered in HP and Fujitsu server systems. They typically rely on improved hardware monitoring and threshold-based analysis of reported corrected errors. The reaction to detected error states in hardware is left to the operating system. Recent versions of Solaris, Windows, and Linux, for example, already consider corrected and uncorrected memory errors reported by the MCA interface.

An alternative to operating system level error recovery is management at higher layers of the software stack. One group of methods uses the live migration capabilities of virtual machines running the actual service. In such frameworks, a virtual machine (VM) can be migrated during runtime from one physical machine to another without explicit interruption (see Figure 1). Recent virtualization products utilize virtual machine movement not only for load balancing or hardware maintenance scenarios, but also feature migration as reaction on

Figure 1. Migration of Virtual Machines (diagram labels: Live Migration, VMM-Based Monitoring)

II. APPROACH

The overall goal of our approach is to maintain an upper bound on service timeliness by migrating failure-prone virtual machines executing the service to other hosts. Feasibility of this approach depends on three major factors:

• The ability to accurately anticipate the occurrence of failures, including timing failures such as missed service response deadlines. This requires continuous monitoring and evaluation of the system's state. More precisely, measurements taken at runtime have to be evaluated to determine whether the current state of the system might lead to a failure such that the application deadline cannot be met. To predict such failures, a large variety of faults (the root causes of failures) has to be taken into account, such as transient hardware faults or unusual data records being sent to the application.
• The ability to determine a host with acceptable (meaning better) reliability parameters to which the virtual machine can be migrated.
• Timely live migration of virtual machines, such that the small downtime (still) necessary for live migration does not result in a missed deadline. This requires profound knowledge of the factors that determine overall migration time and blackout time.

We show an architectural blueprint for pro-active virtual machine migration in Figure 2. The architecture comprises monitoring of system variables at multiple levels of the system stack, ranging from hardware to the application level. At each level we employ failure predictors trying to anticipate failures based on the monitoring data at the
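To make the interplay of the three feasibility factors more concrete, the following Python fragment is a minimal sketch and not the implementation described in this paper. It assumes that a failure predictor already delivers a failure probability per host and that an estimate of the blackout time per target host is available. The names Host, failure_probability, expected_blackout_s, failure_threshold and deadline_slack_s are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Host:
    """Hypothetical per-host view as seen by a migration controller."""
    name: str
    # Assumed predictor output: probability of failure within the look-ahead window.
    failure_probability: float
    # Assumed estimate of the blackout (downtime) when migrating to this host, in seconds.
    expected_blackout_s: float


def choose_target(
    current: Host,
    candidates: Sequence[Host],
    deadline_slack_s: float,
    failure_threshold: float = 0.1,
) -> Optional[Host]:
    """Pick a migration target, or return None if the VM should stay.

    Factor 1: migrate only if the current host is considered failure-prone.
    Factor 2: the target must have better (lower) predicted failure probability.
    Factor 3: the blackout must fit into the remaining deadline slack.
    """
    if current.failure_probability < failure_threshold:
        return None  # current host is not failure-prone enough to act

    feasible = [
        h for h in candidates
        if h.name != current.name
        and h.failure_probability < current.failure_probability
        and h.expected_blackout_s < deadline_slack_s
    ]
    if not feasible:
        return None  # no host is both more reliable and reachable in time

    # Prefer the most reliable host; break ties by the shorter blackout.
    return min(feasible, key=lambda h: (h.failure_probability, h.expected_blackout_s))


# Example with made-up numbers.
current = Host("blade-a", failure_probability=0.30, expected_blackout_s=0.0)
candidates = [
    Host("blade-b", 0.05, 0.40),
    Host("blade-c", 0.02, 1.50),
    Host("blade-d", 0.35, 0.10),
]
target = choose_target(current, candidates, deadline_slack_s=0.5)
print(target.name if target else "stay")  # prints "blade-b"
```

With a slack of 0.5 seconds, blade-c is the most reliable candidate but is rejected because its assumed blackout of 1.5 seconds would miss the deadline, and blade-d is rejected because it is less reliable than the current host. A real controller would additionally have to account for the overall migration time, which this sketch leaves out.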

[Figure 2: architectural blueprint for pro-active virtual machine migration]
