Recovery for Virtualized Environments

Frederico Cerveira, Raul Barbosa, Henrique Madeira, Filipe Araujo
CISUC, Department of Informatics Engineering
University of Coimbra
P-3030 290, Coimbra, Portugal
{fmduarte, rbarbosa, henrique, filipius}@dei.uc.pt

Abstract—Cloud infrastructures provide elastic computing resources to client organizations, enabling them to build online applications while avoiding the fixed costs associated with a complete IT infrastructure. However, such organizations are unlikely to fully trust the cloud for their most critical applications. Among other threats, soft errors are expected to increase with the shrinking geometries of transistors, and many errors are left for the software layers to correct and mask. This paper characterizes the behavior of a virtualized environment, using Xen with CentOS as the hypervisor, in the presence of soft errors. One of the main threats arises from soft errors directly affecting the hypervisor, as these faults have the potential to disrupt several virtual machines at once. With this in mind, we develop a fault-tolerant architecture for cloud applications, which relies on experimental data collected using fault injection to guide its design. This architecture recovers from bit-flip errors with the help of a watchdog timer, used to securely reboot the hypervisor. Nevertheless, errors might still propagate outside the system, for example to a client in a client-server interaction. Despite this, our results suggest that our architecture and a few simple techniques, like timers on the client, can recover a very large fraction of errors in client-server applications with small hardware and performance overhead. Conversely, the fraction of errors requiring Byzantine fault-tolerant techniques is quite small, thus restricting those expensive approaches to highly critical applications.

Keywords—Virtualization, fault injection, cloud computing, fault tolerance, dependability.
I. INTRODUCTION
Cloud computing infrastructures are increasingly trusted as the means for organizations to reduce investment and management costs, by outsourcing computational resources and paying only for the resources and services they need. This trend has created several dependability challenges. Both the hardware and the software have changed in significant ways, such as the widespread use of virtualization and the separation of concerns between the software deployed by the cloud provider and the application designer. As a result, in order to keep up with the growing trust placed in cloud infrastructures, one must first understand how these changes affect the chain of events leading to failures, and then develop effective protection mechanisms.

Transient hardware faults are one of the threats to dependability that require close attention. The so-called soft errors are incorrect states in hardware elements, caused by transient events such as particle strikes. Although the root cause of a soft error is a transient event, the error may well become permanent and lead to failures unless corrective action is taken. The semiconductor industry continues to make progress in manufacturing processes, and gate geometries of less than
10 nm are expected within the next few years, with gate sizes significantly smaller than 10 nm becoming the norm within the next decade [1]. The decrease in feature sizes expected for the forthcoming generations of processor cores also reduces the collected charge required to cause soft errors, due to smaller nodal capacitances and operating voltages. This trend, together with higher operating frequencies, has increased the vulnerability of circuits to soft errors and constitutes a major concern for semiconductor manufacturers and for the computer industry in general. The benefits of increased performance, reduced power consumption, and increased portability come at the price of forcing dramatic changes in the computer industry. Namely, the expected increase of soft errors, compared to the rates observed in current hardware technology, requires a considerable part of the burden of coping with soft errors to be moved to the software layers.

The prospect of less reliable baseline hardware represents a major challenge, especially for very large datacentres with many thousands of processor cores, such as the ones used for cloud infrastructures. In fact, when an application is moved to the cloud, it becomes exposed to the problems arising from soft errors in two different ways. First, a virtualized environment implies a hypervisor responsible for handling low-level requests on behalf of virtual machines (VMs). A soft error occurring during the execution of the hypervisor may lead to several VMs being affected. Second, soft errors often require software mechanisms for correction and recovery, which are often application-specific. Given that a cloud provider deploys and manages the platform software, and another team develops the application software, it is often unclear who should develop such mechanisms.

This paper experimentally characterizes the behavior of applications running in a virtualized system in the presence of soft errors. The detailed characterization is performed using fault injection as the means to emulate soft errors, targeting virtual machines as well as the hypervisor. To this end, we developed a software-implemented fault injector that introduces bit flips in processor registers and memory locations. One relevant observation is that errors targeting the hypervisor are either masked, thereby having no effect, or lead to multiple virtual machines failing to provide correct service. In other words, when an error injected in the hypervisor is effective, most often the effect is that the hypervisor and the virtual machines fail simultaneously. A detailed analysis of such situations revealed that the system hangs in a way that is recoverable, provided that some mechanism is able to trigger a restart. Based on this observation, we designed, implemented,
and evaluated a mechanism for restarting the hypervisor, and show that even such a simple mechanism can greatly improve the behavior of the virtualized system in the presence of soft errors. The proposed mechanism acts like an external watchdog timer, with the time interval calibrated by fault injection experiments, resetting the physical machine after the time elapses. The implementation uses Intel's AMT hardware infrastructure to remotely restart the physical machine, the hypervisor (Xen) and the virtual machines using initialization scripts. This technique relies on applications keeping their state in ACID stores (so that VMs may simply be restarted) or on periodic snapshots of VMs being stored for transparent recovery. These two approaches are known from the literature and our recovery strategy is compatible with both.

The remainder of the paper is organized as follows. Section II describes the research articles most related to our work. Section III provides details concerning fault injection, focusing on the tool and method used for experimentation, and Section IV describes the experimental setup. A detailed characterization of the effects of soft errors on virtualized servers, conducted experimentally, is presented in Section V. Based on those results, Sections VI and VII propose how to recover from and handle errors that were found to lead to system failures. Section VIII concludes the paper with the conclusions and implications for practice.
II. RELATED WORK
With the increasing volume of applications, data, and processing being deployed in cloud infrastructures, their dependability becomes particularly important. However, as the underlying hardware uses smaller manufacturing processes [1] and reduces energy consumption, concerns about a potential increase in the soft error rate should be adequately addressed. Soft errors are temporary errors due to transient hardware faults, caused either by manufacturing defects that result in parasitic nodal capacitances or by particle strikes [2].

Virtualization technologies, supported by adequate hardware [3] and software like Xen [4], [5], allow multiple guest operating systems, along with the application stack, to be consolidated on a single physical machine, reducing the total cost through multi-tenancy. However, the current trend towards hardware with increased susceptibility to soft errors calls for additional research to ensure that the reliability of virtual machines is not affected by the forthcoming hardware. These concerns have prompted many researchers to propose fault handling techniques that deal with this problem at different levels of a virtualized system. For example, techniques handling the problem at the application level typically use two or more replicas of the same software to detect errors and tolerate faults by voting or rerunning the processes [6], [7], [8]. Other approaches work at the virtual machine level and focus on recovering the software inside virtual machines using service migration, virtual machine migration, and lightweight checkpointing mechanisms, among other techniques [9], [10]. Although these approaches are necessary, and in fact complement the approach proposed in this paper, the hypervisor itself may fail, thereby requiring a recovery mechanism. We
propose and evaluate a simple mechanism to trigger hypervisor recovery, whose design was driven by the observations from the fault injection campaigns carried out to study the impact of soft errors on the behavior of a virtualized environment.

Fault injection has been largely used in the past to evaluate the impact of faults on systems and to help design and evaluate fault tolerance mechanisms. Nevertheless, although there are research efforts addressing fault injection for cloud environments, these are still relatively rare and most focus on injecting faults in guest virtual machines alone [11]. The new software-implemented fault injection (SWFI) tool used in this paper was inspired by classic fault injection tools, namely Xception [12] and Goofi [13].
III. ERROR INJECTOR: TOOL AND METHOD
Fault injection is a technique for studying the behavior of systems under the effects of faults, aiming at verifying the correctness of fault-handling mechanisms, obtaining relevant metrics such as error detection latencies, measuring error coverage, identifying weaknesses in fault-tolerant systems, and so on. It is recognized as an important technique for building dependable systems, and in this paper it is the main technique for characterizing the behavior of virtualized cloud applications with respect to soft errors.

Since all systems are subject to the fault-error-failure chain of events, it is possible to study the effect of faults in a system either by introducing actual faults or by emulating their effects through the introduction of errors. In other words, one may emulate faults by directly altering the state of a system, in a way which represents the effects of faults. Hence, in this paper we use the term fault injection to refer to the general technique, and the term error injector to refer to the tool that we developed and used for our experiments. The tool emulates the effects of hardware faults by injecting errors in microprocessor registers and memory locations.

A. Fault model

We apply the commonly used single bit-flip error model to emulate transient hardware faults. Such faults are frequently referred to as soft errors due to the non-damaging nature of the interference from the environment on a hardware circuit. Soft errors are incorrect values in state-holding circuits, and the term soft indicates that the circuit continues to operate correctly in spite of the incorrect value. Hence, if the value is corrected, the error vanishes, while it may remain permanently in the system if left uncorrected.

The error injector introduces one random bit-flip in each experiment, and classifies the outcome by examining the output and the state of key components in the target system. Errors are injected in microprocessor registers and memory locations. Injection in microprocessor registers emulates faults affecting the processor directly, and injection in memory locations emulates faults affecting circuits (including the CPU) while a value is in transit for storage in main memory. The goal is not to emulate direct memory errors, since main memory is typically protected with error-correcting codes and caches are protected with parity codes.
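The single bit-flip model amounts to inverting exactly one bit of an otherwise correct value. The following Python fragment is a minimal, purely illustrative sketch of that operation (the injector itself is a kernel module, described below); the register width and the uniformly random choice of bit position are assumptions made for the example.

import random

REGISTER_WIDTH = 64  # width of an x86-64 general-purpose register

def flip_one_bit(value, bit=None, width=REGISTER_WIDTH):
    """Return value with exactly one bit inverted, emulating a soft error."""
    if bit is None:
        bit = random.randrange(width)   # pick a random bit position
    return value ^ (1 << bit)           # XOR mask inverts the selected bit

# Example: a correct value of 0x2A may become 0x2B, 0x6A, ..., depending on
# which bit is flipped; the software may mask the error or propagate it.
corrupted = flip_one_bit(0x2A)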
B. Injection technique

Error injection is achieved through a software-implemented technique. The core of the error injector is a loadable kernel module for Linux, which receives a set of parameters specifying when and where to introduce a bit-flip. The kernel module is loaded into the target system, during execution, through an SSH connection. Cloud services are typically managed through an SSH connection, and therefore the virtual machines as well as the hypervisor accept SSH connections without any modification. Once loaded, the module locates the kernel-specific data structures related to the targeted process. To corrupt the value of a register, the location in which processor registers are held during context switches is XORed with a mask specifying the bit to be flipped. Once the kernel performs a context switch to the specified process, that process continues execution with a corrupted value. To inject errors in memory locations, the module maps one of the memory pages belonging to the target process into the module's active memory address space and performs the bit-flip operation.

In both register and memory injections, the target for the error injection is determined by the pid of a process. In order to inject register errors in a specific process, the tool actually manipulates the register values while these are stored in memory, before a context switch loads the state into the microprocessor. Memory bit-flips are also injected in a specific process, by manipulating a memory location belonging to the specified process while that process is executing. In this paper, kernel modules are used to inject errors in virtual machines and in the hypervisor's dom0 component, which also uses a Linux kernel (the section that follows describes this component in detail). However, in order to inject errors in Xen (the hypervisor engine) we modified its source code and recompiled it with a minimalist hypercall. This hypercall receives the memory location and the bit to be flipped, and performs the injection. Only memory errors are presently supported in this part of the hypervisor.

C. Implementation details

The error injection module resides in the target system and can be instantiated through a command, using an SSH session or similar approaches. The kernel module is loaded with a set of parameters specifying the location for injecting the error and the temporal trigger for injection. Figure 1 shows the experiment controller machine, instantiating the error injector and executing JMeter as the means to impose not only a faultload but also a workload on the target system. The parameters passed at module loading time specify the essential aspects of an injection, namely which process is to be targeted, the location and the bit to be flipped, and a path to stable storage. To analyze the results of one injection, even after a host crash, the tool requires stable storage to save key information. In our experiments, the values of the injected location before and after injection are stored, along with the value of the Instruction Pointer register, and a memory dump of the code area neighboring the address to which the IP register was pointing at the moment of injection.
Fig. 1. Error injector and experiment controller.
This memory dump is essential for conducting many post-injection analyses. The present version of the injector only supports temporal triggers, and experiments are not precisely repeatable, as in many injectors, due to timing imprecisions. For our present use, such triggers suffice, and allow us to achieve statistical reproducibility of results. The processor registers presently supported for error injection are: IP (Instruction Pointer), user-space and kernel-space SP (Stack Pointer), AX, BX, CX, DX, CS, SS, SP, BP, Flags, SI, DI, ES, DS, FS, GS and R8 to R15. The error injector module remains dormant until the temporal trigger is activated, only then modifying the contents of the injection location, so that temporal intrusiveness is as low as possible.

An error injection experiment begins by resetting the target system, extracting the contents of the stable storage (from the preceding experiment), waiting for the hypervisor to restart along with the virtual machines, loading the kernel module, and running the workload. Given the length of this process, it is currently possible to perform a little over 100 injections per day.
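As a rough illustration of this procedure, the sketch below outlines one experiment iteration as it could be driven from the controller machine. The host name, helper script, injector module name and its parameter names are assumptions made for the example; the actual scripts and module interface used in our setup may differ.

import subprocess
import time

TARGET = "virt-server"                            # hypervisor host (assumption)
STABLE_STORE = "/var/lib/injector/last_run.bin"   # assumed stable-storage path

def ssh(command, check=True):
    """Run a command on the target over SSH and return the completed process."""
    return subprocess.run(["ssh", TARGET, command], check=check,
                          capture_output=True, text=True)

def run_experiment(pid, register, bit, delay_s):
    # 1. Reset the target system (hypothetical helper script).
    subprocess.run(["./reset_target.sh", TARGET], check=True)

    # 2. Wait for the hypervisor and the virtual machines to come back up.
    while ssh("uname -r", check=False).returncode != 0:
        time.sleep(10)

    # 3. Extract the stable-storage contents left by the preceding experiment.
    subprocess.run(["scp", f"{TARGET}:{STABLE_STORE}", "results/"], check=True)

    # 4. Load the injector module, specifying the target and the temporal
    #    trigger (module and parameter names are hypothetical).
    ssh(f"insmod bitflip_injector.ko pid={pid} reg={register} "
        f"bit={bit} delay_s={delay_s}")

    # 5. Impose the workload with JMeter (non-GUI mode) and record all
    #    responses for post-experiment classification.
    subprocess.run(["jmeter", "-n", "-t", "workload.jmx",
                    "-l", "results/responses.jtl"], check=True)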
IV. EXPERIMENTAL SETUP
The experimental setup was chosen to have characteristics as similar as possible to those used in real cloud infrastructures. The hardware was carefully selected so that the CPU supported all virtualization extensions and the motherboard included the necessary chipset for virtualization to work as in a cloud datacentre. Xen 4.4.1 was chosen as the hypervisor of our virtualization platform, and CentOS 7 was chosen to be the dom0, responsible for managing Xen. In order to be able to act as the dom0, the CentOS kernel was recompiled using the same default parameters as the standard CentOS kernel, while adding the required options. The choice of CentOS as the dom0 operating system took into account the fact that it is also the default dom0 operating system for the Citrix XenServer package.
Xen runs in a CPU mode with higher privilege than the remaining software. It controls memory management and CPU scheduling for all virtual machines (so-called "domains") and instantiates the most privileged domain, which is referred to as dom0. The dom0 domain runs the only virtual machine that has direct access to the hardware, and consists of a server operating system offering a console for managing Xen remotely. Figure 2 depicts the complete virtualization stack. The guest virtual machines run in the so-called domU domains. In our case the virtual machines run Debian 7.7, with kernel version 3.11.1. Our experiments used two similar virtual machines running on the virtualization server, both executing the same workload, while errors were injected in only one of these machines. This setup allowed us to run injection campaigns, for example to measure the error isolation provided by Xen.

The physical machine used to perform the experiments was a Fujitsu Celsius workstation, including all necessary hardware components supporting the state of the art in virtualization technology. The workstation is equipped with an Intel Core i7-4770 CPU, which features 4 physical cores capable of Hyper-Threading, as well as support for virtualization technologies such as VT-x, VT-d and EPT. This processing power is complemented by 8GB of DDR3 RAM and two different kinds of physical storage, in the form of one 120GB SSD and a 1TB HDD. The SSD drive allowed us to reduce the time between experiments spent reloading the state of VMs from a previous snapshot.
Fig. 2. A virtualized environment using Xen with CentOS.

Xen is capable of deploying virtual machines through two different virtualization modes, referred to as paravirtualization (PV) and hardware-assisted virtualization (HVM). In the paravirtualization approach, Xen provides an environment very similar to the real one, but without emulating some functionality (e.g., network and disk hardware). This allows the use of hardware that lacks virtualization support, at the expense of requiring a modified kernel that knows how to operate in this environment with reduced abilities. The HVM mode harnesses the hardware extensions that provide the system with additional privilege rings, which Xen then uses to provide an environment similar to the one that applications and operating systems were designed for. Hence, the guest operating system is unmodified. Hardware-assisted virtualization is now quite widespread, given that many hardware platforms fully support virtualization.

A. Workload

The workload executed by the target system consisted of the Apache 2.2.22 Web server, with JMeter performing HTTP requests continuously. Responding to each request involved computing a SHA1 hash of a 1GB array and sending the reply to the client. The workload is representative of typical servers deployed in the cloud, which provide services that are accessed via the HTTP protocol, and the Apache Web server is also a common choice. The clients run on an external workstation, which controls the experiments and uses JMeter 2.12 to simulate 10 different clients. A ramp-up time of 30 seconds is followed by 5 minutes of execution, during which errors were injected. The load imposed by the clients kept the server's CPU load at 100% during most of the experiment time. Both virtual machines run the same workload.

V. EXPERIMENTAL RESULTS

In order to characterize the behavior of a virtualized application under the occurrence of soft errors, a series of fault injection campaigns was conducted, aiming to understand the failure modes resulting from errors in the hypervisor, in a guest operating system, and in an application running in a virtual machine. The results presented in this section may be grouped into:

• Errors injected in application processes, to characterize the failure modes exhibited to external users of the system, and to determine if errors originating within one virtual machine are able to propagate to other virtual machines or to the hypervisor. In other words, the application client-server interaction results in information flow which allows errors to propagate to the client. However, the hypervisor and other co-located virtual machines should remain unaffected by errors originating in one virtual machine.

• Errors injected in guest operating system processes, aiming to examine the failure modes exhibited to external users, considering that an error directly affecting a guest operating system has a greater potential to cause failures. Similarly to the previous category of experiments, the virtualization platform is designed to prevent such errors from propagating to the hypervisor and other co-located virtual machines.

• Errors injected in the hypervisor, in order to characterize the failure modes arising from errors occurring during the execution of the hypervisor's functions.

In all experiments, the outcome of injecting an error was classified according to its impact on the output produced by the workload running inside the virtual machines, and we configured the system to run two virtual machines on the virtualized platform. Moreover, the impact of all errors on the hypervisor was classified by running correctness tests at the end of each experiment. Calibration experiments, conducted along with the development of the error injector and the experimental setup, aimed at characterizing the behavior of faulty components from a qualitative point of view, that is, with the intent to identify the failure modes of the virtual machines and
the hypervisor, and to calibrate the tool for the remaining experiments. After one injection, there are five distinct failure modes of virtual machines, along with the possibility of an experiment having no effect on the service provided by a virtual machine. All the failure modes represent the view from the client side (i.e., the external view).

• Incorrect content. The application running within the virtual machine produces syntactically correct HTML content with wrong values. Hence, incorrect content becomes visible to the service user, which may be another machine or a human. This failure mode is the most serious in its consequences, as it allows errors to propagate to other components in a system, and also the most difficult to handle at the system level, as it is undetectable unless the output is checked against a redundant computation or some other form of redundancy.

• Corrupted output. The application produces a corrupted stream of data, while the socket remains open. The output is syntactically incorrect, and the condition is detectable by clients because the server fails to comply with the HTTP protocol, sends invalid HTML code, or sends code which is not HTML at all. Consequently, the server keeps the TCP connection open but the output is corrupted. This behavior is detectable by clients, and a Web browser would display an error message; as such, this behavior may be considered less harmful than incorrect content, although recovery is necessary.

• Connection reset. The TCP connection between a client and the server is reset by the server's network stack. This corresponds to an incorrect behavior which is, at least in part, detected by the guest operating system running inside a virtual machine. To deal with such a situation, the operating system closes the socket and sends a packet with the RST flag set to the client. Therefore, the connection is lost and a client rapidly detects the problem.

• Client-side timeout. One or more clients fail to receive a response to their request, and issue a client-side timeout. The timeout is configured at 20 seconds, which is reasonably high for HTTP interactions (the keep-alive mechanism, for example, typically maintains a connection open for 5–15 seconds at the server side). Note that numerous client requests are handled simultaneously, and some responses may be correct while some are missing and lead clients to time out.

• Hang. The virtual machine stops producing output and fails to answer any subsequent requests. In this case, the application running inside the virtual machine no longer produces results, and eventually all connected clients will issue a client-side timeout. A failure classified as a VM hang will not be classified, in our experiments, as a client-side timeout. Nevertheless, due to non-deterministic timing aspects, some experiments are classified as client-side timeouts when the actual behavior might be a VM hang. If a few, isolated client requests are unanswered, the failure mode is a client-side timeout; but when this occurs with many client requests, the classification may be either a hang or a client-side timeout.

• No effect. The injected error has no visible consequences on the service provided by any virtual machine. Neither the performance nor the correctness of the service is affected in a way which could be classified as a failure.
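To make the classification concrete, the sketch below shows one way a single client request/response record could be mapped onto these failure modes. It is an illustration under stated assumptions (the record fields, the availability of a reference response for detecting incorrect content, and the crude HTML syntax check are all ours), not the exact post-processing used in our experiments.

def looks_like_html(body):
    # Crude syntactic check standing in for "complies with HTTP and is HTML".
    text = body.strip().lower()
    return text.startswith(("<!doctype html", "<html")) and text.endswith("</html>")

def classify_request(record, reference_body):
    """Map one observed interaction onto a failure mode.

    record is assumed to be a dict with the keys 'timed_out',
    'connection_reset' and 'body'; a VM hang is decided later, at the end of
    the experiment, when all remaining requests have timed out.
    """
    if record["timed_out"]:
        return "client-side timeout"
    if record["connection_reset"]:
        return "connection reset"
    if not looks_like_html(record["body"]):
        return "corrupted output"
    if record["body"] != reference_body:
        return "incorrect content"   # only detectable with redundancy
    return "no effect"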
The workload consists of a Web server continuously accepting new connections and replying to clients, in a typical client-server model. For each experiment, the above classification is performed by examining the output produced for every client request. For this reason, a single error may cause more than one kind of failure in a single experiment. An example would be a single error leading to a connection reset for one client request and a corrupted output for another client. Although the occurrence of multiple failure modes in a single experiment is relatively infrequent, some results add up to slightly more than 100% as a consequence.

In addition to the failure modes, we also tested the hypervisor responsiveness after each injection. To achieve this, we developed a hypervisor correctness test that consists of establishing an SSH connection to obtain a console on the hypervisor, reading a file stored on the file system, and running the uname command. We observed that a test suite like this, although simple, is comprehensive enough to accurately detect the situations in which the hypervisor is unresponsive. The hypervisor correctness test leads to two possible classifications of the outcome of a single experiment.

• Unresponsive. The hypervisor hangs and an external probe is unable to execute the correctness tests.

• Responsive. After an experiment ends, the correctness tests are executed on the hypervisor successfully.
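A minimal sketch of such a correctness test is shown below, assuming the probe runs on an external machine with SSH access to the hypervisor; the host name, the probed file and the timeout values are assumptions made for the example.

import subprocess

def hypervisor_responsive(host="virt-server", probe_file="/etc/hostname",
                          timeout_s=30):
    """Return True if the hypervisor passes the correctness test."""
    checks = [
        "true",               # can a console (shell) be obtained at all?
        f"cat {probe_file}",  # can a file be read from the file system?
        "uname -a",           # does a basic command still execute?
    ]
    for command in checks:
        try:
            result = subprocess.run(
                ["ssh", "-o", f"ConnectTimeout={timeout_s}", host, command],
                capture_output=True, timeout=timeout_s + 10)
        except subprocess.TimeoutExpired:
            return False      # classified as unresponsive
        if result.returncode != 0:
            return False      # classified as unresponsive
    return True               # classified as responsive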
A. Errors injected in application processes

In order to examine the behavior of a virtualized application under soft errors originating in a virtual machine, we conducted an injection campaign targeting application processes. Errors were injected during the execution of processes belonging to the Apache Web server in one virtual machine, and the behavior of the hypervisor as well as of the other virtual machine was monitored to look for any evidence of error propagation.

This campaign was designed taking into account three major goals. The first was to understand the effect of hardware faults (i.e., soft errors) on the virtual machine in which these errors originated. In a production environment, most hardware errors will originate and be confined to the state of a single virtual machine. In fact, assuming that virtual machines represent the majority of a physical machine's CPU usage, it is likely that a large proportion of hardware errors affect only one virtual machine, before possibly propagating to other parts of the system. The second goal was to determine whether the hypervisor guarantees isolation, so as to prevent errors from propagating to other co-existing VMs or to the hypervisor itself. The third goal was to determine whether there are differences in error manifestation between the HVM and PV modes of virtualization.

The three goals are aligned with the ambition to characterize the behavior of a virtualized system under soft errors, and to guide the design of error-handling mechanisms. To this end, errors were injected during the execution of Apache processes, by selecting one such process randomly, along with a random location and time. Only CPU registers were targeted for injection. The results of error injection targeting Apache processes, within one virtual machine, running in HVM mode, are shown in Table I.
TABLE I. OUTCOMES OF ERROR INJECTION TARGETING APPLICATION PROCESSES WITHIN A VIRTUAL MACHINE, IN HVM MODE.

Failure mode          Virtual machine 1 (faulty)   Virtual machine 2 (fault-free)
Incorrect content     4 (0.4%)                     0 (0%)
Corrupted output      2 (0.2%)                     0 (0%)
Connection reset      12 (1.2%)                    0 (0%)
Client-side timeout   130 (12.7%)                  0 (0%)
Hang                  5 (0.5%)                     0 (0%)
No effect             876 (85.3%)                  1027 (100%)
Hypervisor responsive: 1027 (100%)
We may observe, in Table I, that the faulty virtual machine (i.e., the one in which errors were injected) exhibits diverse failures, including a large proportion of client-side timeouts and a small proportion distributed across the remaining failure modes. About 85.3% of the injected errors had no effect on the target virtual machine. Of the 14.7% of errors which were effective, the vast majority resulted in client-side timeouts. The duration of these failures was relatively short (as examined in Section V-E) and affected only a few client requests. Of all the errors injected in the Apache processes in HVM mode, only 4 (0.4%) caused incorrect content to be sent in response to client requests. These are transient failures at the server side, as the server continues correctly after a short duration, but with a potentially permanent effect on the client, as the values received from the server are incorrect. In the same experiments, only 5 (0.5%) led to a permanent failure of the virtual machine, whereby all subsequent requests were left unanswered.

Table I also summarizes the results concerning error propagation. We can observe that errors occurring during the execution of an application (within one VM) are correctly isolated by the hypervisor. Soft errors are transient and unintentional in nature, but it is nevertheless important to observe that the hypervisor and co-existing VMs are always protected from such unintentional errors. This observation is common to all other experiments as well.

TABLE II. OUTCOMES OF ERROR INJECTION TARGETING APPLICATION PROCESSES WITHIN A VIRTUAL MACHINE, IN PV MODE.

Failure mode          Virtual machine 1 (faulty)   Virtual machine 2 (fault-free)
Incorrect content     6 (0.6%)                     0 (0%)
Corrupted output      1 (0.1%)                     0 (0%)
Connection reset      8 (0.8%)                     0 (0%)
Client-side timeout   71 (7.3%)                    0 (0%)
Hang                  6 (0.6%)                     0 (0%)
No effect             876 (90.5%)                  968 (100%)
Hypervisor responsive: 968 (100%)
Table II shows the results of targeting application processes, within a virtual machine, running in paravirtualization mode. The results are analogous to those presented in Table I and, effectively, reinforce the observation that the hypervisor
correctly isolates virtual machines and prevents soft error propagation. Moreover, the failure modes of the virtual machine in which errors were injected are similarly distributed when comparing the results of paravirtualization in Table II with those of hardware-assisted virtualization in Table I. There is a slightly higher proportion of non-effective injections in paravirtualization mode. Hence, although the set of possible failure modes exhibited by a virtual machine running in PV or HVM mode is the same, the exact proportions differ slightly.

B. Errors injected in guest operating system processes

The software running within a virtual machine includes a guest operating system providing fundamental functionality to the applications. As such, errors occurring during the execution of the operating system are also relevant, and we conducted two campaigns targeting operating system processes, injecting errors in microprocessor registers. One campaign targeted a virtual machine running in HVM mode, and the other targeted a paravirtualized system. The fundamental goals of these two campaigns were the same as the ones described in the preceding section, namely to understand the effects of soft errors originating in one virtual machine (specifically in the operating system), to determine if the error isolation provided by the hypervisor is adequate, and to examine possible differences between hardware-assisted virtualization and paravirtualization. The results are summarised in Tables III and IV.
TABLE III. OUTCOMES OF ERROR INJECTION TARGETING OS PROCESSES WITHIN A VIRTUAL MACHINE, IN HVM MODE.

Failure mode          Virtual machine 1 (faulty)   Virtual machine 2 (fault-free)
Incorrect content     0 (0%)                       0 (0%)
Corrupted output      0 (0%)                       0 (0%)
Connection reset      0 (0%)                       0 (0%)
Client-side timeout   5 (1.0%)                     0 (0%)
Hang                  4 (0.8%)                     0 (0%)
No effect             493 (98.2%)                  502 (100%)
Hypervisor responsive: 502 (100%)
The results presented in Table III indicate that errors injected in operating system processes are less likely to affect the application service than errors injected directly in application processes. It is also worth noting that all effective errors manifested as client-side timeouts or virtual machine hangs. These results do not necessarily exclude the possibility of a soft error in a guest operating system leading the application software to produce incorrect content, but they provide some evidence that such errors are much less likely (compared to injections directly affecting application processes).
TABLE IV. OUTCOMES OF ERROR INJECTION TARGETING OS PROCESSES WITHIN A VIRTUAL MACHINE, IN PV MODE.

Failure mode          Virtual machine 1 (faulty)   Virtual machine 2 (fault-free)
Incorrect content     0 (0%)                       0 (0%)
Corrupted output      0 (0%)                       0 (0%)
Connection reset      0 (0%)                       0 (0%)
Client-side timeout   2 (0.9%)                     0 (0%)
Hang                  1 (0.4%)                     0 (0%)
No effect             225 (98.7%)                  228 (100%)
Hypervisor responsive: 228 (100%)
TABLE V. OUTCOMES OF ERRORS INJECTED IN PROCESSOR REGISTERS, TARGETING THE HYPERVISOR IN DOM0.

Hypervisor     Experiments   Virtual machine 1                  Virtual machine 2                  Both VMs affected
Unresponsive   182           Client-side timeout 75, Hang 107   Client-side timeout 75, Hang 107   182 (100%)
Responsive     783           No effect 783                      No effect 783                      —
Total          965
The same observation can be made from the results in Table IV, for paravirtualization. Moreover, similarly to errors injected in application processes, all errors injected into a guest operating system were isolated from other parts of the system. Both in hardware-assisted virtualisation and paravirtualization the fault-free virtual machine remains correct throughout the experiments, as does the hypervisor.
C. Errors injected in hypervisor dom0

We conducted a campaign targeting the hypervisor's dom0 component, which in our setup consisted of CentOS 7. The dom0 component is characterized by having full hardware access. Hence, although memory protection prevents it from directly corrupting Xen's ring 0 address space, the dom0 privileges allow dom0 code to interfere with the execution of the entire system. The aim of this campaign was to characterize the virtualization platform with respect to hardware errors affecting dom0 processes. The campaign focused on injecting errors in processes related to Xen, as these are part of the virtualization infrastructure. By analyzing the running processes, six were found to be directly related to Xen, namely oxenstored, xenconsoled, qemu, xenwatchdogd, the xenbus kernel process, and the xenbus-frontend kernel process. Each of these processes has a specific role within the hypervisor, and is therefore a necessary component of a virtualized infrastructure which would not be present in other, non-virtualized environments.

The outcome of errors injected in microprocessor registers during the execution of dom0 hypervisor processes is summarized in Table V. One may observe that 18.9% of the injected errors were effective and caused the hypervisor to hang, remaining unresponsive and failing the correctness tests. The remaining 81.1% of the errors had no manifestation within the duration of the experiments, neither affecting the hypervisor nor any of the virtual machines. In those cases for which the hypervisor failed, both virtual machines failed as well. Furthermore, the failure mode exhibited by the virtual machines was either a hang or a client-side timeout. As described earlier, in our experiments some virtual machine failures are classified as a client-side timeout when the behavior is also consistent with a hang. Nevertheless, we believe that the two different classifications have a common root cause, in which virtual machines are unable to continue executing.

Hence, one of the main observations obtained from the campaign targeting the dom0 component is that whenever the hypervisor's dom0 fails, both virtual machines fail, and the converse is also true (i.e., whenever a virtual machine fails due to an error occurring in the hypervisor, the hypervisor itself and the other virtual machine also fail). Moreover, in such cases, the failure mode exhibited by virtual machines is detectable by clients and, perhaps most importantly, there is no incorrect content or corrupted output.

The six processes belonging to the dom0 operating system (oxenstored, xenconsoled, qemu, xenwatchdogd, the xenbus kernel process, and the xenbus-frontend kernel process) had different effects on the system when injected with errors. Of these six processes, four are user-space processes and two are kernel-space processes.
The two kernel-space processes (xenbus and xenbus-frontend) had several cases in which the system was brought down into a hung state. Two user-space processes running within dom0, qemu and xenwatchdogd, also had the same effect. The other two dom0 user-space processes never produced any effect on the system when targeted with errors. This means that dom0 processes, even when running in user space, have the potential to cause the system to hang. The effect of injecting errors into a dom0 process depends more on its responsibility within the system than on its execution mode.

D. Errors in hypervisor Xen

We conducted a campaign injecting errors into Xen. These were memory injections, targeting the entire 1GB memory region reserved for Xen. Table VII summarizes the outcomes of this campaign. It is possible to observe that the effectiveness of the experiments was very low: only one experiment caused the hypervisor to hang, along with the virtual machines. The results shown in Table VII are comparable to those in Table VI, which summarizes the results of injecting errors in memory, targeting the dom0 operating system. In order to improve the efficiency of injections, registers should be targeted, and a pre-injection analysis would further improve the results. Nevertheless, in the single experiment which brought down the hypervisor with memory injections, the watchdog timer (described in the next section) was effective in recovering the system.

TABLE VI. OUTCOMES OF ERRORS INJECTED IN MEMORY, TARGETING THE HYPERVISOR IN DOM0.

Hypervisor     Experiments   Virtual machine 1   Virtual machine 2   Both VMs affected
Unresponsive   0             —                   —                   —
Responsive     102           No effect 102       No effect 102       —
Total          102

TABLE VII. OUTCOMES OF ERRORS INJECTED IN MEMORY, TARGETING THE HYPERVISOR IN XEN.

Hypervisor     Experiments   Virtual machine 1   Virtual machine 2   Both VMs affected
Unresponsive   1             Hang 1              Hang 1              1 (100%)
Responsive     275           No effect 275       No effect 275       —
Total          276

E. Error manifestation latency and duration

Along with the failure modes examined in the preceding sections, understanding error manifestation with respect to latency and duration is particularly relevant to the design of appropriate fault tolerance mechanisms (such as the ones proposed in the sections ahead). To this end, we analyzed the results presented in the preceding sections with respect to the time between error injection and the first manifestation resulting from the injection. This analysis was conducted for errors injected in Apache processes, and excluded those that resulted in hangs and client-side timeouts. Client-side timeouts are configurable and set to 20 seconds in our experiments; hangs are classified at the end of each experiment; error manifestation latency in such cases would therefore provide meaningless values. Figure 3 shows the error manifestation latency distribution, in both HVM and PV modes (values stacked on top of each other). The x-axis shows time using a logarithmic scale. There are a few clusters of errors, and one may observe that many errors remain dormant for a little over 10 seconds before the first manifestation. One extreme outlier had an error manifestation latency of 326 seconds. As we discuss below, these times have a potential impact on the design of error-recovery mechanisms, given that backward recovery techniques are designed to eliminate errors, and may consequently require lengthy rollbacks.

Fig. 3. Manifestation latency, for effective errors, both in PV and HVM.

The duration of erroneous behavior is also an important concern when designing error-handling mechanisms. Figure 4 summarizes the duration of such events, that is, the number of requests that were affected by a single bit-flip.

Fig. 4. Number of client requests affected by a single bit-flip error.

The temporal duration, between the first and the last manifestation of an error, was always below 1 second (the granularity with which we conducted the analysis). For this reason, we use the number of requests affected by a single bit-flip on the x-axis of Figure 4. Excluding hangs (which have a permanent duration), the vast majority of errors affected only a single client request. Nevertheless, in the course of our experiments, we observed that a single bit-flip error may leave up to ten client requests unanswered, thereby leading to client-side timeouts.

F. Analysis of individual processor registers
The results presented in the preceding sections focus on the behavior of the different components in the system, while the root causes require a more detailed analysis. This section presents an analysis of the root causes leading to failures, regarding the processor registers which led to the different observed failure modes. The goals of this analysis are to examine the CPU resources which contribute the most to failures, to identify differences between the resources that lead to failures in hardware-assisted virtualization and paravirtualization, and to compare the root causes of failures originating in the hypervisor with those originating within a virtual machine.

Fig. 5. Distribution of failure modes across processor registers, for injections in application processes, in PV mode.

Figure 5 summarizes the failure modes observed for each register, for all experiments in which an error was effective. The results in that figure concern errors injected in microprocessor registers during the execution of Apache processes. Of all the targeted registers, eight resulted in effective injections. Similarly to what many other studies have shown in the past, the Instruction Pointer and the Stack Pointer have a high contribution to VM failure modes. The BX register is also relevant, both due to its overall contribution to effective errors and its specific contribution to the incorrect content classification. All six experiments in PV mode classified as incorrect content (silent data corruptions) were caused by errors in the BX register.

Fig. 6. Distribution of failure modes across processor registers, for injections in application processes, in HVM mode.

Figure 6 shows the distribution of outcomes of injections in individual registers, corresponding to errors injected in a system running with hardware-assisted virtualization. Compared to the PV mode, the HVM mode has the same set of registers contributing the most to effective errors, while there are twelve registers in total. Hence, while the registers that contribute the most to failures are similar in PV and HVM modes, the two modes are apparently different with respect to the root causes of failures, if we also take into account registers with a lower probability of manifestation.

Fig. 7. Distribution of failure modes across processor registers, for injections in the hypervisor dom0.

Errors injected in the hypervisor's dom0 showed a very different outcome when compared to injections in the application. Figure 7 summarizes the results, from which it is possible to observe that the root cause of failures is quite diverse. From these results, it is possible to observe marked differences between errors injected in the hypervisor's dom0 and errors injected inside virtual machines. Regarding errors injected in application processes, there may be differences between HVM and PV modes, which will require further experimentation to confirm. Accordingly, one may not conclude that a fault tolerance mechanism that works for HVM mode will also work for PV mode, thereby requiring some caution when designing and evaluating such mechanisms.

G. Limitations

Our experimental platform is limited in its reachability. Inside the hypervisor, reachability is sufficient for the dom0 software, but clearly insufficient for Xen. An alternative method, aiming at injecting register errors in Xen, has been examined, but an implementation providing low intrusiveness will require further development. The main difficulty lies in avoiding intrusive instrumentation of Xen (which we were able to do in the setup presented in this paper). Regarding the classification of failure modes, small temporal delays are not detectable, as the typical variations in processor load cause noise that greatly exceeds the temporal precision with which we record events. Another, similar problem is the classification of client-side timeouts and system hangs, which are often interchanged. This limitation is relatively benign, given that the client-side perception of the failure is always a timeout.

Lastly, although we ran many weeks of experiments, statistical significance is difficult to assure, especially for the events with very low probabilities. Hence, our study provides evidence regarding the possibility of such events (e.g., it is possible for a VM to produce incorrect content) but the confidence intervals are necessarily wide in the cases with low probabilities.
VI. TOWARDS RECOVERY FOR VIRTUALIZED SERVERS
The experimental results in the preceding sections led to several important observations. Namely, from the campaign
targeting the hypervisor's dom0, one may observe that whenever the hypervisor fails, both virtual machines fail. The converse is also true, i.e., if an error in the hypervisor causes a virtual machine to fail, then the hypervisor and the other virtual machine fail as well. Furthermore, for errors injected in the hypervisor, the resulting virtual machine failures were detectable by clients. No such experiments resulted in incorrect content or corrupted output being produced. Hence, this leads us to believe that a relevant problem introduced by virtualization, compared to traditional IT infrastructures, is the possibility for an entire physical machine, and all its virtual machines, to hang. This failure mode is detectable by clients, as soon as connections time out.

Another important observation is that the hypervisor correctly isolated all errors occurring within one virtual machine. Although the number of faults injected is far from full coverage (which would be infeasible in any case, as the fault space is infinite), our experiments indicate that virtualization technology is quite mature in this aspect. Since isolation is robust, one is left with virtual machines failing independently due to errors that only affect them. We argue that such errors should be treated by means of well-known redundancy techniques, such as process pairs, control-flow checking, or redundant virtual machines with output comparison.

Lastly, regarding hypervisor errors, both the dom0 component and the Xen software may disrupt the service provided by a virtualization server, including all virtual machines. Although this observation is perhaps unsurprising, since both dom0 processes and Xen have full access to the hardware, it implies that in order to recover a virtualization server after failure one requires a mechanism that is able to recover Xen and the dom0 software as well as the virtual machines.

A. Recovery using an external secure watchdog

To enable recovery from hypervisor hangs, we developed a reset mechanism that is triggered externally by a watchdog timer running on a different machine. The watchdog process periodically tests the hypervisor, by attempting to run the correctness tests described earlier, via SSH. Any other tests are feasible, including correctness tests on the service provided by virtual machines, by acting as a client and sending service requests. In the implementation examined in this paper, only the correctness tests targeting the hypervisor were used.

A first attempt at designing the watchdog process sent ping requests to the virtualized server, but we observed that this detection mechanism is ineffective. In fact, a fault injection campaign revealed that in all cases in which the hypervisor hung, it would still respond to ping requests. Based on this observation, we replaced the ping mechanism with the correctness tests through an SSH session. This observation emphasizes the need to execute correctness tests on the target system, rather than using simple but ineffective ping mechanisms or heartbeats.

Once the watchdog process determines that the hypervisor is in a hung state, it issues a remote command to physically reset the hardware. This is achieved by using Intel's AMT technology [3], which supports remote power cycling.
Using Intel's Ethernet interface, network packets are inspected by the hardware immediately upon reception. Specific network packets, recognized by the Intel AMT hardware, lead the hardware to physically reset the machine. In order to prevent unintended remote restarts, which could potentially lead to security vulnerabilities, Intel AMT supports Transport Layer Security (TLS). Power cycling commands, used to reset the physical machine, may therefore be encrypted with a key shared between the watchdog and the physical machine running the virtualization server. Once the virtualisation server restarts, Xen is started, along with CentOS in dom0, and the virtual machines are restarted. Dealing with application restarts has been the subject of much research in the past, and the following section examines how it may be achieved in a distributed system.

It is worth noting that Xen uses a watchdog timer to monitor the dom0 operating system. That is the purpose of the xenwatchdogd process, which runs within the hypervisor dom0 operating system (in our case CentOS). This mechanism enables Xen to restart the dom0 software in case it crashes. Our experiments showed that the mechanism is able to recover dom0 from all hangs. However, this mechanism is unable to recover Xen itself, since the dom0 software is unable to reboot Xen. For this reason, Xen's watchdog is ineffective in recovering Xen itself. Furthermore, as described earlier, our proposed external watchdog may execute correctness tests on the target host (e.g., performing an HTTP request) in order to identify other classes of problems as well.

B. Experimental evaluation

In order to validate the proposed external watchdog, we conducted a fault injection campaign to measure the efficiency – the proportion of errors which are correctly recovered – and the latency – the time it takes to reset the physical machine, reboot the hypervisor, and restart the virtual machines. In this campaign we shortened the execution of the workload to 2 minutes, and selected only the IP, BX, and SP registers for fault injection, in order to promote a high error activation (these registers are among the most sensitive). After an injection, we left the system running, and the external watchdog recovered the hypervisor correctly in 100% of the cases in which the system hung. The time to recover was monitored and, as soon as possible, an SSH connection was established to the hypervisor to run the correctness tests, and one minute of HTTP requests was issued to the Web server running inside the virtual machine. The HTTP service was also resumed correctly in all cases.

TABLE VIII. EVALUATION OF THE RECOVERY IMPLEMENTATION.

System state: Total experiments 206, Hang 203, No effect 3
Hypervisor recovery: Min 30 s, Max 34 s, Avg 31.9 s
VM recovery: Min 54 s, Max 103 s, Avg 75.2 s
Table VIII shows the evaluation results, in which we focused the experiments on injecting errors in the hypervisor dom0 to determine effectiveness and recovery latency. The external watchdog recovered the system in all 203 experiments in which the system hung. The recovery time for the hypervisor is on average 31.9 seconds, with a worst case of 34 seconds; the subsequent recovery time for a virtual machine is 75.2 seconds on average, with a worst case of 103 seconds.

TABLE IX. SUMMARY OF THE FAULT TOLERANCE MECHANISMS.

Failure mode          Frequency   Detection              Action / Coverage
No effect             High        —                      1) No action
Corrupted output      Very low    Output not parseable   2) Request repetition (idempotent requests);
Connection reset      Very low    Socket exception          3) Request repetition & history of responses (non-idempotent requests)
Client-side timeout   Low         Timeout
Hang                  Very low    Timeout
Incorrect content     Very low    Redundancy / Voting    1) No action (occasional errors acceptable);
                                                            4) Checkpoints & rollbacks (errors not acceptable);
                                                            5) Byzantine algorithms (all cases)
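The external watchdog evaluated above can be summarized as a short loop: probe the hypervisor with the SSH correctness test, and power-cycle the machine through the out-of-band management interface when the probe fails. The sketch below illustrates this idea; the host names, the probe period and boot delay, and the use of the amttool command for the AMT power cycle are assumptions made for the example, not the exact implementation used in our setup.

import subprocess
import time

TARGET = "virt-server"        # physical machine running Xen (assumption)
AMT_HOST = "virt-server-amt"  # management address of the AMT interface (assumption)
CHECK_PERIOD_S = 30           # probe period, calibrated by fault injection
BOOT_DELAY_S = 180            # time allowed for Xen, dom0 and the VMs to restart

def hypervisor_responsive(host):
    # Minimal probe in the spirit of the correctness test from Section V:
    # obtain a console over SSH and run a basic command.
    result = subprocess.run(["ssh", "-o", "ConnectTimeout=10", host, "uname -a"],
                            capture_output=True)
    return result.returncode == 0

def power_cycle():
    # Out-of-band reset through Intel AMT; the exact tooling is an assumption,
    # and the command should be sent over a TLS-protected session in practice.
    subprocess.run(["amttool", AMT_HOST, "powercycle"], check=True)

def watchdog_loop():
    while True:
        if not hypervisor_responsive(TARGET):
            power_cycle()              # hard reset of the physical machine
            time.sleep(BOOT_DELAY_S)   # init scripts bring Xen, dom0 and VMs back
        time.sleep(CHECK_PERIOD_S)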
VII. TOLERATING FAULTS IN DISTRIBUTED APPLICATIONS
Once incorrect data starts circulating among distributed peers, nothing short of a distributed rollback can bring the state back to coherency. Needless to say, such an effort can hardly succeed, because peers might already be using, or may have persisted, the incorrect data. However, the fact that distributed interactions are often client-server, with little or no interaction between clients, is a great simplification that makes this problem tractable. This is precisely the case of the HyperText Transfer Protocol [14], which we consider throughout this paper.

In light of our results in Tables I to VII, most errors occurring in the cloud environment cause one of the following effects: the VM hangs, the client socket closes (connection reset), or a request goes unanswered (client-side timeout). Apart from delayed interactions, the first two cases are somewhat benign for a distributed application: the VM restarts, or the client reopens the socket, and the interaction proceeds as before.

Recovering from a client-side timeout (the third case) depends on the idempotence of the request. Informally, the outcome of an “idempotent” operation is the same regardless of the number of times it runs. For example, deleting a file is an idempotent operation, but transferring money is not. The kind of operation causing the timeout thus makes an important difference. If the request is idempotent, the client can simply resend it with the help of a timer. If the request is non-idempotent, the client must not blindly send it again: if the application has no means to tell whether the request was executed, the server might have to roll back to a previously coherent state, so that the client can repeat its request. To avoid this, programmers may carefully implement a history of responses [15] that minimizes the effects of soft errors, through write-ahead logging and persistent transactional storage. This lets the client resubmit the same request if it gets no response.
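To illustrate the history-of-responses idea, the sketch below (illustrative Python, not the mechanism of [15] itself) shows a server that remembers the reply associated with each request identifier, so that a retried non-idempotent request is answered from the log instead of being executed twice. The request identifier, the transfer() function, and the in-memory dictionary are assumptions made for the example; a real server would keep this history in the write-ahead-logged, persistent transactional storage mentioned above.

    import uuid

    # Server side: a history of responses keyed by a client-chosen request id.
    # The in-memory dictionary stands in for the persistent, write-ahead-logged
    # store that a real server would use.
    response_history = {}
    balances = {"alice": 100, "bob": 0}

    def transfer(request_id, src, dst, amount):
        # If the request was already executed, return the recorded response
        # instead of applying the non-idempotent operation a second time.
        if request_id in response_history:
            return response_history[request_id]
        balances[src] -= amount
        balances[dst] += amount
        result = "transferred %d from %s to %s" % (amount, src, dst)
        response_history[request_id] = result   # record the reply before sending it
        return result

    # Client side: one identifier per logical operation, reused on every retry,
    # so a resend after a client-side timeout cannot move the money twice.
    request_id = str(uuid.uuid4())
    first = transfer(request_id, "alice", "bob", 25)
    retried = transfer(request_id, "alice", "bob", 25)  # e.g., resent after a timeout
    assert first == retried and balances == {"alice": 75, "bob": 25}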
According to our measurements, a rarer, but possible, event is a corrupted output. Since the client can immediately recognize the problem in this case, the solution once again depends on the correct handling of non-idempotent operations. Also rare, but slightly more frequent, is an incorrect yet undetectable result (“incorrect content”). This case is not necessarily a problem: errors in text, which is highly redundant by nature, might be tolerable. Other cases require protection; a train timetable must display accurately in the browser, and a bank transfer must move the exact amount of money. These interactions must not contain errors. One possible approach is to minimize their occurrence by making the VMs more robust. Techniques known from slightly different contexts, such as control-flow checking [16] or VM introspection [17], could make hazardous events less likely.

Redundancy is key for detecting incorrect content. Interestingly, for large classes of unintentional mistakes, we may employ relatively cheap solutions that do not imply complete server replication. When we think about HTML content, possibly only one or a few numbers in a web page need protection. For example, a server might compute a subset of the results twice, to enable comparison, possibly at the client, e.g., via JavaScript. This approach assumes that error-correcting codes, such as Reed-Solomon [18] or low-density parity-check codes [19], protect the disks and memory of the server, thus restricting errors mostly to computation. We could also think of other application-independent approaches, where the programmer annotates certain parts of the code for automatic re-execution in a container. This opens the possibility for specific redundancy services offered by the cloud provider.

To overcome the remaining errors, we consider the simple approach of taking periodic checkpoints of the system state. Upon error detection, the server rolls back to a previous (presumably) correct state. This, however, raises a problem for clients, as their previous interactions may vanish with the rollback. Again, we can draw a dividing line between operations that can be lost and operations that cannot. For example, the train timetable request does not need repetition, whereas a request to upload text to a web site, or to transfer money, must be repeated. Unfortunately, in the latter case, it might be unreasonable to repeat the operation: the client cannot simply be warned of the failure at a later time, learning after the HTTP confirmation that the transfer failed or that the text was not actually uploaded. A reasonable compromise might be possible here. As seen in Figure 3, the effects of errors are usually detectable within a few seconds; hence, checkpoints could be taken at short intervals, making it possible for clients to block until the next checkpoint.
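The checkpoint-and-rollback step just described can be summarized in a few lines. The sketch below is an illustrative Python fragment in which the CheckpointedState class and its in-memory state are assumed placeholders, whereas a real deployment would snapshot the virtual machine or its storage. Rolling back discards every request applied since the last checkpoint, which is why affected clients must block until the next checkpoint or repeat their interactions.

    import copy

    class CheckpointedState:
        """Minimal sketch of periodic checkpoints with rollback upon error detection."""
        def __init__(self, initial):
            self.state = initial
            self._checkpoint = copy.deepcopy(initial)

        def checkpoint(self):
            # Taken periodically (e.g., every few seconds, so that clients can
            # block until the next one, as discussed above).
            self._checkpoint = copy.deepcopy(self.state)

        def rollback(self):
            # Invoked when error detection fires; work done since the last
            # checkpoint is lost and must be repeated by the clients.
            self.state = copy.deepcopy(self._checkpoint)

    store = CheckpointedState({"balance": 100})
    store.checkpoint()
    store.state["balance"] -= 25   # request applied after the checkpoint
    store.rollback()               # an error is detected: roll back
    assert store.state["balance"] == 100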
Despite all best efforts, we must accept the possibility that the checkpoint interval is too large for a client to wait, or that, once in a while, some error will evade these measures, for example by having a very large manifestation latency (again, according to Figure 3, this is rare, but may happen). In this case, Byzantine fault tolerance might be the appropriate solution, if developers accept to pay a 300% overhead in resources; the general solution for these schemes requires 3f + 1 replicas to tolerate f failures. We have previously evaluated a Byzantine fault tolerant system in [20].

As Table IX shows, a relatively large number of cases require different solutions. Overall, we can cover the events we observed with five levels of increasing complexity, from doing nothing to using Byzantine fault tolerance:
1) take no action;
2) simply repeat the request;
3) repeat the request with a history of responses on the server (for the non-idempotent cases);
4) use redundancy in the execution and checkpoints for rolling back;
5) use Byzantine fault tolerance.
As discussed, levels up to 3 should cover almost all error cases, whereas level 5 should only be necessary for extremely critical applications.

VIII. CONCLUSION
In this paper we characterized the failure modes and effects of soft errors in virtualized systems, which are a key component of cloud computing infrastructures. Using fault injection, we targeted hypervisor processes, belonging to Xen and to the dom0 operating system, and found that errors frequently led to a failure mode in which the hypervisor became unresponsive and all virtual machines hung. These failures are detectable by clients, as they manifest as timeouts. There were no injections targeting the hypervisor's dom0 in which virtual machines produced incorrect content. These results lead us to believe that a relevant failure mode introduced by virtualization technologies, under the effect of soft errors, is that an entire physical machine, and all of its virtual machines, may hang.

To recover from hypervisor hangs, we developed and evaluated an external watchdog that monitors the virtualization server and issues a physical reset whenever it fails to respond. Such situations are best detected by periodically running a correctness test on the target system, such as establishing an SSH connection and executing some commands. The reset mechanism is implemented using existing hardware support for remote power cycling, by means of specific network packets that are sniffed by the network interface and that may be secured using TLS. In the course of our experiments, this watchdog was able to recover the hypervisor and the virtual machines from all hangs.

Soft errors originating within one virtual machine were unable to propagate to the hypervisor or to other co-located virtual machines. This serves as further evidence that the isolation mechanisms are adequate and sufficiently robust. Nevertheless, errors originating in a virtual machine may lead it to produce undetected content errors, which may propagate through the values passed to clients or to other users of those values.
To improve the robustness of virtual machines, including the application software and the operating system, efficient mechanisms are needed to handle an increasing soft error rate. Implementing such mechanisms in software may yield the best results, but doing so is challenging. One observation from our experiments is that the behavior of a virtual machine running in hardware-assisted virtualization mode may differ from its behavior in paravirtualization mode. Observations of this kind often make it difficult to develop fault tolerance mechanisms that are both effective and cost-efficient.

In the proposed recovery approach, virtual machine failures are handled at the level of the distributed system. Several alternatives are offered to practitioners, resulting in a spectrum of solutions that depend on the design decisions, the ability to accept incorrect content at the client side, and the overall cost of implementation. Distributed redundancy may resolve nearly all errors, at a high cost, while retrying an operation with a suitable request-response protocol can serve a very wide spectrum of applications, in which a marginal probability of error is acceptable.

ACKNOWLEDGMENT

This work has been supported by the FCT, Fundação para a Ciência e a Tecnologia, in the scope of Programa Operacional Temático Factores de Competitividade (COMPETE) and Fundo Comunitário Europeu FEDER, through project DECAF, An Exploratory Study of Distributed Cloud Application Failures (EXPL/EEI-ESS/2542/2013).

REFERENCES

[1] ITRS, "International Technology Roadmap for Semiconductors," 2013.
[2] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: An architectural perspective," in Proc. IEEE Int'l Symposium on High Performance Computer Architecture (HPCA), 2005, pp. 243–247.
[3] Intel, "4th Generation Intel Core vPro Processor Family Overview," Intel, Tech. Rep., 2013.
[4] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164–177, Oct. 2003.
[5] D. Chisnall, The Definitive Guide to the Xen Hypervisor. Prentice Hall, 2007.
[6] D. J. Scales, M. Nelson, and G. Venkitachalam, "The design of a practical system for fault-tolerant virtual machines," SIGOPS Oper. Syst. Rev., vol. 44, no. 4, pp. 30–39, Dec. 2010.
[7] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, "Using process-level redundancy to exploit multiple cores for transient fault tolerance," in Proc. 37th Int'l Conf. on Dependable Systems and Networks (DSN 2007), 2007.
[8] Y. Zhang, J. Lee, N. Johnson, and D. August, "DAFT: Decoupled acyclic fault tolerance," International Journal of Parallel Programming, vol. 40, no. 1, pp. 118–140, 2012.
[9] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High availability via asynchronous virtual machine replication," in Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2008.
[10] L. Wang, Z. Kalbarczyk, R. K. Iyer, and A. Iyengar, "Checkpointing virtual machines against transient errors," in Proc. IEEE International On-Line Testing Symposium (IOLTS), 2010, pp. 97–102.
[11] M. Le, A. Gallagher, and Y. Tamir, "Challenges and opportunities with fault injection in virtualized systems," in Proc. 1st Int'l Workshop on Virtualization Performance: Analysis, Characterization, and Tools, 2008.
[12] J. Carreira, H. Madeira, and J. G. Silva, "Xception: A technique for the experimental evaluation of dependability in modern computers," IEEE Trans. Softw. Eng., vol. 24, no. 2, pp. 125–136, Feb. 1998.
[13] D. Skarin, R. Barbosa, and J. Karlsson, "GOOFI-2: A tool for experimental dependability assessment," in Proc. IEEE/IFIP Int'l Conference on Dependable Systems and Networks (DSN), 2010, pp. 557–562.
[14] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1," RFC 2616, 1999.
[15] A. Z. Spector, "Performing remote operations efficiently on a local computer network," Communications of the ACM, 1982.
[16] N. Oh, P. Shirvani, and E. McCluskey, "Control-flow checking by software signatures," IEEE Transactions on Reliability, vol. 51, no. 1, pp. 111–122, Mar. 2002.
[17] T. Garfinkel and M. Rosenblum, "A virtual machine introspection based architecture for intrusion detection," in Proc. Network and Distributed Systems Security Symposium (NDSS), 2003, pp. 191–206.
[18] S. B. Wicker, Reed-Solomon Codes and Their Applications. Piscataway, NJ, USA: IEEE Press, 1994.
[19] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Information Theory, pp. 21–28, 1962.
[20] R. Nogueira, F. Araujo, and R. Barbosa, "CloudBFT: Elastic Byzantine fault tolerance," in Proc. 20th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2014), Singapore, Nov. 2014.