Building a Self-Healing Embedded System in a Multi-OS Environment

Tomohiro Katori
Waseda University 63-505 3-4-1 Okubo Shinjyuku-ku Tokyo, Japan
[email protected]. waseda.ac.jp

Lei Sun
Waseda University 63-505 3-4-1 Okubo Shinjyuku-ku Tokyo, Japan
[email protected]. waseda.ac.jp

Dennis K. Nilsson
Chalmers University of Technology SE-412 96 Gothenburg, Sweden
dennis.nilsson@chalmers.se

Tatsuo Nakajima
Waseda University 63-505 3-4-1 Okubo Shinjyuku-ku Tokyo, Japan
[email protected]. waseda.ac.jp
ABSTRACT
In this paper we describe our approach to improving the dependability of a commodity OS for embedded systems. It is usually too difficult for end-users to resolve problems inside a single OS, especially on embedded systems. We propose a self-healing mechanism for the Linux kernel that improves system dependability without any intervention by administrators. This paper presents our white-box approach to monitoring and recovering the Linux kernel. The key components are a system monitor and a virtual machine monitor. The system monitor detects inconsistencies in data structures inside the Linux kernel. The virtual machine monitor provides a multi-OS environment and isolates the system monitor from the Linux kernel. In a multi-OS environment, the system monitor is able to resolve failures inside the Linux kernel without stopping crucial services running on another OS. We have developed a prototype for an embedded system to verify our approach. The experimental results show that our system can remove hidden processes and reload buggy kernel modules. The performance evaluation results show that our self-healing mechanism can be used even when the Linux kernel is heavily loaded, and that the overhead of the system monitor is vanishingly small in actual use.
Categories and Subject Descriptors D.4.5 [Operating Systems]: Reliability; D.4.6 [Operating Systems]: Security and Protection
General Terms Reliability
Keywords self-healing, monitoring, multi-OS environment
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’09 March 8-12, 2009, Honolulu, Hawaii, U.S.A. Copyright 2009 ACM 978-1-60558-166-8/09/03 ...$5.00.
1. INTRODUCTION
Embedded systems such as consumer electronics products are vital to our lives. The OSes of embedded systems have a strong requirement for dependability, since administering embedded systems manually is difficult. Self-healing [17] is a suitable way to improve the dependability of embedded systems. Web servers, for example, are typically managed by system administrators. When a system is exploited by malicious attacks or exhausts its resources, a veteran system administrator may analyze logs and recover the system with graphical tools. However, a TV located in a living room is rarely managed by a system administrator. End-users typically do not have enough knowledge of their computer systems. Furthermore, embedded systems generally have very limited user interfaces and low-capacity disks. A keyboard, a screen, and log files cannot always be used to recover an embedded system.
To improve the availability of embedded systems, an OS should be dependable enough to continue running without any (or with only a few easy) instructions from an administrator. To meet this requirement, bugs and vulnerabilities must be removed from the OS. If possible, the system should be able to heal itself.
The goal of this research is to build a self-healing system that recovers from problems that occur inside an embedded commodity OS, such as system inconsistencies and security attacks. We chose Linux as the target commodity OS since it is one of the most successful open source operating systems. Today, Linux is also used in embedded systems such as cellular phones, TVs, and DVD recorders. Sometimes the OSes of these consumer electronics products crash, and users reluctantly pull the cord out of the wall to stop and reboot them. We cannot say that this is an appropriate way to recover the system.
Linux is still continuously evolving. Many kernel developers customize the kernel, add new functions, and port it to various architectures. The structure of the Linux kernel has become increasingly complicated as its code size has grown. As a result, it has become more and more difficult to keep Linux dependable, since more complicated software results in more possibilities for bugs and vulnerabilities [6]. Moreover, a problem inside the kernel could be fatal. For example, previous research shows that there are many bugs that can potentially cause a lockup, and that bugs remain in the Linux kernel for an average of 1.8 years before being fixed [8]. Once the kernel hangs up, the whole system is out of service.