Timely Virtual Machine Migration for Pro-Active Fault Tolerance

Andreas Polze, Peter Tröger
Operating Systems and Middleware Group
Hasso-Plattner-Institute at University of Potsdam
Potsdam, Germany
(andreas.polze/peter.troeger)@hpi.uni-potsdam.de

Keywords: meta-learning, virtualization, failure prediction, live migration, monitoring

I. INTRODUCTION

Achieving system dependability through redundancy in space is a traditional approach for reliable server systems. Various protocols cope with transient and permanent faults through redundant resources on or above the operating system level. Analyses of large-scale systems have shown a mean time between failures (MTBF) on the order of 6.5 to 40 hours [1], depending on installation maturity. Google, for example, experiences an MTBF on the order of one hour, which is hidden from the running server software (and therefore from the users) through fault-tolerant middleware and file systems [2].

With the advent of multi-core and many-core CPUs in commodity servers such as blade centers, dependability challenges that were once of interest only to a small community of researchers and HPC users are about to seriously impact tomorrow's server environments. One example is the technique of effectively overclocking one core while suspending all remaining (currently unused) cores. This dynamic voltage and frequency scaling (DVFS) capability is used in multi-core CPUs to achieve further improvements in serial application performance. However, a side effect of DVFS can be an increased processor exposure to soft errors that may severely affect system reliability [3][4]. Memory also shows an increasing failure rate due to shrinking structure sizes and increasing scales [5].

The IT industry has already started to address this upcoming dependability challenge with a set of new hardware monitoring and fault tolerance solutions. Examples are the Intel Machine Check Architecture (MCA) and the Predictive Failure Analysis (PFA) features offered in HP and Fujitsu server systems. They typically rely on improved hardware monitoring and threshold-based analysis of reported corrected errors. The reaction to detected error states in hardware is left to the operating system. Recent versions of Solaris, Windows and Linux, for example, already consider corrected and uncorrected memory errors reported by the MCA interface.

Figure 1. Migration of Virtual Machines (diagram labels: Live Migration, VMM-Based Monitoring)

An alternative to operating system level error recovery is management at higher layers of the software stack. One group of methods uses live migration capabilities for virtual machines running the actual service. In such frameworks, a virtual machine (VM) can be migrated at runtime from one physical machine to another without explicit interruption (see Figure 1). Recent virtualization products use virtual machine movement not only for load balancing or hardware maintenance scenarios, but also feature migration as a reaction to

II. APPROACH

The overall goal of our approach is to maintain an upper bound on service timeliness by migrating failure-prone virtual machines executing the service to other hosts. The feasibility of this approach depends on three major factors:

• The ability to accurately anticipate the occurrence of failures, including timing failures such as missed service response deadlines. This requires continuous monitoring and evaluation of the system's state. More precisely, measurements taken at runtime have to be evaluated to determine whether the current state of the system might lead to a failure such that the application deadline cannot be met. To predict such failures, a large variety of faults (the root causes of failures) has to be taken into account, such as transient hardware faults or unusual data records being sent to the application.
• The ability to determine a host with acceptable (meaning better) reliability parameters to which the virtual machine can be migrated.
• Timely live migration of virtual machines, such that the small downtime (still) necessary for live migration does not result in a missed deadline. This requires profound knowledge of the factors that determine overall migration time and blackout time.

We show an architectural blueprint for pro-active virtual machine migration in Figure 2. The architecture comprises monitoring of system variables at multiple levels of the system stack, ranging from the hardware to the application level. At each level we employ failure predictors that try to anticipate failures based on the monitoring data at the
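The three feasibility factors listed in this section can be read as a simple decision rule. The following Python fragment is only an illustrative sketch under stated assumptions, not the mechanism used in this paper: the `Host` type, the `failure_probability` field, the `risk_threshold` parameter, and the comparison of an expected blackout time against the remaining deadline slack are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Host:
    """Hypothetical host record; not part of the described architecture."""
    name: str
    failure_probability: float  # assumed predictor output in [0, 1]


def pick_target(current: Host, candidates: Sequence[Host]) -> Optional[Host]:
    """Return the most reliable candidate that is strictly better than the current host."""
    better = [h for h in candidates
              if h.failure_probability < current.failure_probability]
    # min() with default=None returns None when no better host exists.
    return min(better, key=lambda h: h.failure_probability, default=None)


def should_migrate(current: Host,
                   candidates: Sequence[Host],
                   risk_threshold: float,
                   deadline_slack_s: float,
                   expected_blackout_s: float) -> Optional[Host]:
    """Return a migration target, or None if the VM should stay where it is."""
    # Factor 1: act only when a failure of the current host is anticipated.
    if current.failure_probability < risk_threshold:
        return None
    # Factor 3: the migration blackout must fit into the remaining deadline slack,
    # otherwise the migration itself would cause the missed deadline.
    if expected_blackout_s >= deadline_slack_s:
        return None
    # Factor 2: a host with better reliability parameters must exist.
    return pick_target(current, candidates)
```

A caller would invoke `should_migrate` whenever new monitoring data updates the assumed failure probabilities, and trigger a live migration only when a target host is returned. The check order is a design choice of this sketch: the cheap local tests run first, so the candidate list is scanned only when migration is both warranted and feasible in time.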

[Figure 2: architectural blueprint for pro-active virtual machine migration (diagram)]