How to Realize Self-Healing Operating Systems? - IEEE Xplore

Hossein Momeni Computer Engineering Department Iran University of Science and Technology [email protected]

Omid Kashefi Computer Engineering Department Iran University of Science and Technology [email protected]

Abstract: Operating systems serve as execution platforms and resource managers for applications. As computer systems and applications grow more complex, the operating systems they require grow more complex too, and managing such operating systems manually has proven impractical. Self-managing concepts provide the basis for mechanisms that handle complex systems with less human intervention. Although the implications of deploying self-managing and autonomic attributes at the application level have been studied, their deployment in system software such as operating systems has not been fully studied. Self-managed applications may not enjoy the full benefit of self-management if the platform on which they run, especially its operating system, is not self-managed. Given this requirement, this paper highlights the most commonly experienced faults and anomalies of operating systems, and proposes a tiered operating system architecture and a corresponding self-healing mechanism to show how self-management can be realized at the operating system level. The main objective is to make the operating system resilient to operating system faults without restarting it.

Keywords: Operating System; Autonomic; Self Healing; Kernel; Supervisor; Hypervisor

I. INTRODUCTION

The need for more powerful, complex, and reliable systems and applications is rapidly increasing. As such applications are introduced, keeping them healthy, well maintained, well performing, and secure has become more difficult. This difficulty is mostly attributed to the desire for ubiquity of heterogeneous and distributed computing resources, services, and mechanisms; in a nutshell, the heterogeneity challenge. Full human control over such a heterogeneous and ubiquitous world is not practical. So what is the solution? Autonomic computing presents the idea of reducing (ideally, removing) human intervention and making system management autonomous and more accurate [1]. IBM [2] defines an autonomic system as a system with four basic features: self-healing, self-protecting, self-optimizing, and self-configuring. A self-healing system is aware of its own state and is able to detect and recover from known and unknown faults.

Hadi Sharifi Computer Engineering Department Iran University of Science and Technology [email protected]

A self-protecting system can detect, identify, and protect itself from arbitrary attacks. Self-optimization is the ability to monitor and control the resources of a system to obtain optimal performance according to performance requirements. Self-configuration is the autonomic (re)configuration of a system's components. Autonomic computing presents a promising prospect for managing modern complex systems, such as operating systems and applications, by lowering or removing the need for human intervention in their management. Self-managing applications can be immune to faults of their execution platforms if the operating systems under which they run are self-managed too [3]. Otherwise, applications are vulnerable to all kinds of platform faults and anomalies. It is thus necessary to have self-healing support at the operating system level as well. The deployment of self-managing and autonomic attributes and concepts at the application level has been studied, but their deployment in system software, such as at the operating system level, has not been fully studied. In this paper, we investigate how one can apply autonomic properties in general, and self-healing properties specifically, to operating systems to reduce human involvement in their management and configuration. We further propose a new three-layered operating system architecture in support of self-healing attributes. The rest of the paper is organized as follows. Section II presents the common operating system faults. Section III presents some notable related work. Section IV presents our proposed architecture. Section V discusses the prototyping of this architecture, and finally, Section VI concludes the paper.

II. OPERATING SYSTEMS FAULTS

Generally speaking, all types of software, including operating systems and user applications, mostly suffer from the following faults [4, 5]:

• Syntactic faults: input parameter faults.
• Semantic faults: inconsistent behavior and incorrect results.
• Service faults: QoS faults like real-time violations.
• Communication and interaction faults: time-out and service unavailability.
• Exceptions: I/O-related exceptions and security-related exceptions.

Anomalies in operating systems, as execution platforms for user applications, may affect the behavior and correctness of some or even all user applications, whereas anomalies of an application are usually confined to that application and do not propagate to other applications or to the operating system; this confinement is in fact ensured by the operating system [6]. Therefore, the fault resiliency of operating systems is paramount to the reliability of systems. Application crashes are caused by faults that can be considered soft faults: in most cases, the system running such an application can recover from them without restarting the operating system. On the other hand, the most critical faults in operating systems take the form of exceptions and system call faults when accessing I/O. These faults are mostly caused by the failure of device drivers and are considered hard faults that require restarting the operating system.
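The soft/hard distinction above can be written down as a small data model. The following Python sketch is only an illustration: the `origin` labels and the rule that every non-exception fault is treated as soft are assumptions made for the example, not a classification taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    SYNTACTIC = "input parameter fault"
    SEMANTIC = "inconsistent behavior or incorrect result"
    SERVICE = "QoS fault such as a real-time violation"
    COMMUNICATION = "time-out or service unavailability"
    EXCEPTION = "I/O-related or security-related exception"


class Severity(Enum):
    SOFT = "recoverable without restarting the operating system"
    HARD = "conventionally requires restarting the operating system"


@dataclass(frozen=True)
class Fault:
    fault_class: FaultClass
    origin: str  # hypothetical label: "application", "device_driver", or "kernel"


def severity(fault: Fault) -> Severity:
    """Classify a fault as soft or hard (simplified, assumed rule)."""
    # Application faults stay confined to the application: soft.
    if fault.origin == "application":
        return Severity.SOFT
    # Exceptions raised on the operating system side (drivers, kernel): hard.
    if fault.fault_class is FaultClass.EXCEPTION:
        return Severity.HARD
    # Assumption of this sketch: every other OS-side fault is treated as soft.
    return Severity.SOFT
```

For instance, an I/O exception raised by a device driver is classified as hard, while a semantic fault inside an application is classified as soft.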

We primarily intend to propose an operating system architecture that avoids the occurrence of such hard faults in the first place, and that detects and automatically heals hard faults without restarting the operating system.

III. RELATED WORK

There have been three notable efforts on introducing autonomic operating systems. Chronologically, the first one is the Minix3 [7] operating system, developed at the Vrije Universiteit by Andrew S. Tanenbaum. Minix3 focuses on reliability and self-healing features by reducing the size of the kernel and moving device drivers to the application layer [8]. The Minix3 architecture is shown in Figure 1. It recovers from some failures in device drivers and applications with minimal user intervention. Although this technique recovers from most operating system faults and, by moving code up, removes nearly 70% of the operating system kernel code [8, 12], many kernel faults remain uncontrolled and need human intervention to be recovered; examples are resource manager faults in lock management, kernel panics, and processes in harmful states (e.g., zombies).

The second work is the predictive self-healing mechanism included in Solaris 10 [9] by Sun Microsystems. The fault manager and service manager in Solaris 10 are responsible for predicting the failures of components before they occur, and reconfiguration agents take proactive actions to handle failures. Simplified administration, fast and easy repair, and maximum availability are reported as the strengths of this mechanism.

Roy Campbell et al. have taken a different approach to improve the reliability of the Choices operating system [10] via a special exception handling technique. An error handler manages kernel errors [3, 11]. Component isolation using wrappers prevents error propagation, and, using a speculative recovery plan, recovery orders are directed only to specific components, as shown in Figure 2. Code reloading is used for fixing transient faults and recovering faulty components.

Figure 1. The Minix3 architecture

Figure 2. The Choices component isolation using wrappers

The work done so far on self-managed operating systems reduces human intervention and, to some extent, prevents and remedies operating system faults, but it falls short of presenting a comprehensive architecture that realizes all autonomic features. Therefore, the challenges of developing autonomic platforms remain.

IV. PROPOSED ARCHITECTURE

As mentioned earlier, hard faults are harmful and can prevent the system from functioning as required. To decrease the probability of hard faults in the operating system, we opt to minimize the functionality of the core of the operating system's kernel, as is done in Minix3 [7], and then try to make this minimal core as stable as possible. This core is placed in a three-layered architecture in support of self-healing attributes. The architecture has a user layer, a supervisor layer, and a hypervisor layer, as shown in Figure 3. We require that any operating system conforming to this architecture be able to heal operating system faults and anomalies by itself, without human intervention or a restart of the operating system. To achieve this, the operating system must be able to perceive when it is performing healthily and when it is performing unhealthily due to operating system faults. When it is unhealthy, the operating system must recover from its current unhealthy state to a healthy state by making the necessary adjustments.

To have an operating system with the above features, we propose a self-healing mechanism that works in four phases:

1. Monitoring
2. Analysis
3. Detection
4. Healing

In the monitoring phase, the two healing units shown in Figure 3 periodically monitor system states. The super-healing unit in the supervisor layer monitors the activity states of the application layer, and the hyper-healing unit in the hypervisor layer monitors the activity states of the kernel. In the analysis phase, system states are analyzed and separated into healthy and unhealthy states. Healthy states are learned: during execution of the operating system, the healing units train a classifier function with healthy system states, and states that do not comply with the healthy criteria are considered unhealthy.
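As an illustration of the analysis phase, the following Python sketch trains a one-class classifier on healthy states only and flags any state that falls outside the learned envelope. The paper does not specify the learning method; the per-metric mean and standard deviation envelope, the `tolerance` parameter, and the metric names used in the usage example are assumptions of this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, List

State = Dict[str, float]  # hypothetical representation: metric name -> sampled value


class HealthClassifier:
    """One-class classifier trained only on states observed while healthy."""

    def __init__(self, tolerance: float = 3.0) -> None:
        self.tolerance = tolerance  # allowed deviation, in standard deviations
        self.samples: List[State] = []
        self.centre: Dict[str, float] = {}
        self.spread: Dict[str, float] = {}

    def train(self, healthy_state: State) -> None:
        """Add one healthy observation and refresh the per-metric envelope."""
        self.samples.append(healthy_state)
        for key in healthy_state:
            values = [s[key] for s in self.samples if key in s]
            self.centre[key] = mean(values)
            # Floor the spread so that a constant metric does not yield a zero-width envelope.
            self.spread[key] = max(pstdev(values), 1e-9)

    def is_healthy(self, state: State) -> bool:
        """A state is healthy if every learned metric lies inside its envelope."""
        for key, centre in self.centre.items():
            # A missing metric becomes infinity and is therefore reported as unhealthy.
            value = state.get(key, float("inf"))
            if abs(value - centre) > self.tolerance * self.spread[key]:
                return False
        return True
```

A healing unit could call `train` on each periodic sample taken while the system is known to be healthy, for example `{"cpu_load": 0.25, "zombies": 0}`, and call `is_healthy` on later samples; a sample with a sudden jump in `zombies` would then be reported as unhealthy and handed to the detection phase.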

In the detection phase, system faults and anomalies are detected and appropriate plans for handling them are prepared. In the healing phase, the healing units heal the system by moving it from an unhealthy state to a healthy state, as shown in Figure 4. Since most operating system faults are attributed to device drivers [12], we have chosen to move parts of the kernel, such as device drivers, to the user layer in our operating system architecture. With this approach we minimize the kernel core and make the super-healing unit responsible for healing device driver faults. Kernel faults are healed by the hyper-healing unit in the hypervisor layer.

Figure 3. Proposed architecture for a self-healing operating system.

The transition in the healing phase may be achieved by a number of recovery and repair mechanisms such as microreboot, i.e., individually rebooting fine-grain application components without disturbing the rest of the application. Microreboot can achieve many of the same benefits as whole-process restarts, but an order of magnitude faster and with an order of magnitude less lost work. It is a fine-grain technique for surgically recovering faulty application components. This low-cost recovery technique makes some operating system faults, such as deadlocks and environment-dependent errors, disappear. Microreboot can be done at multiple levels until the faults disappear [13]. Since our proposed architecture for a self-healing operating system is modular, only crashed modules need to be rebooted to recover the system. Deploying the above self-healing mechanism at the operating system level provides a more stable environment for running applications and reduces the cases in which applications that are executing correctly, with no faults of their own, are stopped.

Figure 4. State chart of a self-healing operating system (states: Healthy State and Unhealthy State; transitions: Detection of Fault and Healing).
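The two-state chart and the idea of microrebooting at multiple levels can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions: `HealingUnit`, `probe`, `reboot_levels`, and `FakeDriver` are hypothetical names, and the escalation order (finest scope first) is one plausible reading of "multiple levels", not the authors' implementation.

```python
from enum import Enum
from typing import Callable, List, Sequence


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


class HealingUnit:
    """Two-state machine: a detected fault leads to UNHEALTHY, healing leads back to HEALTHY."""

    def __init__(self, probe: Callable[[], bool],
                 reboot_levels: Sequence[Callable[[], None]]) -> None:
        self.probe = probe                        # returns True when the monitored state is healthy
        self.reboot_levels = list(reboot_levels)  # ordered from finest to coarsest scope
        self.state = Health.HEALTHY
        self.log: List[str] = []

    def step(self) -> Health:
        """Run one monitoring period: check health and heal if a fault is detected."""
        if self.probe():
            self.state = Health.HEALTHY
            return self.state
        self.state = Health.UNHEALTHY
        self.log.append("fault detected")
        # Healing: microreboot at increasingly coarse levels until the fault disappears.
        for level, reboot in enumerate(self.reboot_levels):
            reboot()
            self.log.append(f"microreboot level {level}")
            if self.probe():
                self.state = Health.HEALTHY
                break
        return self.state


class FakeDriver:
    """Simulated device driver whose fault only clears when the whole module is restarted."""

    def __init__(self) -> None:
        self.broken = True

    def healthy(self) -> bool:
        return not self.broken

    def restart_component(self) -> None:
        pass  # too fine-grained to clear this particular fault

    def restart_module(self) -> None:
        self.broken = False


driver = FakeDriver()
unit = HealingUnit(driver.healthy, [driver.restart_component, driver.restart_module])
final_state = unit.step()
```

In this simulated run the component-level reboot does not clear the fault, so the unit escalates to the module-level reboot and ends in the healthy state. If every level fails, the unit stays unhealthy, which is the point where a coarser mechanism, such as rerunning the kernel from the hypervisor layer, would have to take over.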

V. IMPLEMENTATION

We implemented a prototype self-healing operating system based on the proposed architecture, using the Linux kernel as our kernel and enabling Xen (a virtual machine hypervisor) on Linux as the hypervisor layer's micro-kernel [14, 15]. We moved device drivers up as daemons and added a device driver polling middleware to handle devices and communicate with the kernel. We added all operating system resource control mechanisms, such as the deadlock manager, paging manager, and lock manager, to the super-healing unit. The system state is monitored periodically in the monitoring phase, and in each period the health of the state is analyzed. Unhealthy states are made healthy by recovering from the detected faults using the common resource control mechanisms. When these control mechanisms cannot help, recovery is achieved by simply restarting the super-healing unit. The hyper-healing unit monitors the kernel state. When a hard kernel fault such as a kernel panic is detected, only the kernel is rerun instead of restarting the whole system. Owing to the minimal size and functionality of the kernel in the hypervisor layer, we experienced noticeably fewer faults in the microkernel, even in the presence of kernel faults.

VI. CONCLUSION

In this paper, we investigated how one can apply autonomic properties in general, and self-healing properties specifically, to operating systems to reduce human involvement in their management and configuration. A new three-layered operating system architecture in support of self-healing attributes, as well as an overview of a self-healing mechanism, was proposed. We argued that adding the self-healing property to operating systems offers a more stable execution environment for running applications. This is because correctly behaving applications are not unnecessarily and adversely affected by operating system faults, or at least not by all kinds of them. Although we have addressed self-healing at the operating system level, a comprehensive autonomic operating system additionally requires support for self-protection, self-configuration, and self-optimization.

ACKNOWLEDGMENT

We would like to acknowledge the Iran Telecommunication Research Center (ITRC), which partially funded the research reported in this paper. We would also like to thank our colleagues in the distributed systems research laboratory of the Computer Engineering Department of IUST, who helped us in prototyping the ideas reported in this paper.

REFERENCES

[1] J. O. Kephart and D. M. Chess, "The Vision of Autonomic Computing", IEEE Computer, Vol. 36, No. 1, pp. 41–50, 2003.
[2] M. R. Nami and M. Sharifi, "A Survey of Autonomic Computing Systems", IFIP, Springer Boston, pp. 101–110, 2007.
[3] F. M. David and R. H. Campbell, "Building a Self-Healing Operating System", The 3rd IEEE International Symposium on Dependable, Autonomic and Secure Computing, Columbia, MD, USA, 2007.
[4] A. Avizienis, J. C. Laprie, B. Randell and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing", IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, pp. 11–33, 2004.
[5] L. Mariani, "A Fault Taxonomy for Component-based Software", Proceedings of the International Workshop on Test and Analysis of Component-Based Systems (TACoS), Electronic Notes in Theoretical Computer Science, Vol. 82, No. 6, pp. 55–65, Elsevier Science, 2003.
[6] H. Bos, G. Homburg and A. S. Tanenbaum, "Construction of a Highly Dependable Operating System", The 6th European Dependable Computing Conference, Coimbra, Portugal, 18–20 Oct. 2006.
[7] MINIX3 Operating System, URL: http://www.minix3.org, Last Accessed on November 2007.
[8] A. S. Tanenbaum, J. N. Herder and H. Bos, "Can We Make Operating Systems Reliable and Secure?", IEEE Computer, Vol. 39, No. 5, pp. 44–51, 2006.
[9] Sun Microsystems, "Predictive Self-Healing in the Solaris™ 10 Operating System", A Technical Introduction, Santa Clara, USA, 2004.
[10] R. H. Campbell, N. Islam, R. Johnson and P. Kougiouris, "Choices, Frameworks and Refinement", IEEE Computer, pp. 9–15, 1991.
[11] F. M. David, J. C. Carlyle, E. M. Chan, D. K. Raila and R. H. Campbell, "Exception Handling in the Choices Operating System", LNCS, No. 4119, pp. 42–61, 2006.
[12] A. Chou, J. Yang, B. Chelf, S. Hallem and D. Engler, "An Empirical Study of Operating System Errors", In Proceedings of the 18th ACM Symposium on Operating Systems Principles, New York, NY, USA, pp. 73–88, 2001.
[13] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman and A. Fox, "Microreboot – A Technique for Cheap Recovery", In Proceedings of the 6th ACM Symposium on Operating Systems Design & Implementation, San Francisco, CA, pp. 3-3, 2004.
[14] I. Habib, "Xen", Specialized Systems Consultants, Inc., Linux Journal, Vol. 2006, No. 145, May 2006.
[15] Xen, The Official Project Site, URL: http://www.xen.org, Last Accessed on November 2007.