Software Implemented Fault Tolerance: Technologies and Experience

Yennun Huang and Chandra Kintala
AT&T Bell Laboratories
Murray Hill, NJ 07974

Abstract

By software implemented fault tolerance, we mean a set of software facilities to detect and recover from faults that are not handled by the underlying hardware or operating system. We consider those faults that cause an application process to crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. We define 4 levels of software fault tolerance based on availability and data consistency of an application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up to the 3rd level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported to a number of UNIX¹ platforms and can be used by any application with minimal programming effort. Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due to these components depends on the level and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.

1 Introduction

There are increasing demands to make the software in the applications we build today more tolerant to faults. From the users' point of view, fault tolerance has two dimensions: availability and data consistency of the application. For example, users of telephone switching systems demand continuous availability, whereas bank teller machine customers demand the highest degree of data consistency[13]. Most other applications have lower requirements for fault tolerance in both dimensions, but the trend is to increase those requirements as costs, performance, technologies and other engineering considerations permit; see Figure 1. In this paper, we discuss three cost-effective software technologies to raise the degree of fault tolerance in both dimensions of the application.

[Figure 1: Dimensions of Fault Tolerance. Applications such as bank teller machines are placed along availability and data consistency axes.]

¹UNIX is a registered trademark of UNIX System Laboratories, Inc.


Following Cristian[7], we consider software applications that provide a "service" to clients. The applications in turn use the services provided by the underlying operating or database systems, which in turn use the computing and network communication services provided by the underlying hardware; see Figure 2. Tolerating faults in such applications involves detecting a failure, gathering knowledge about the failure and recovering from that failure. Traditionally, these fault tolerance actions are performed in the hardware, operating or database systems used in the underlying layers of the application software. Hardware fault tolerance is provided using Duplex, Triple-Module Redundancy (TMR) or other techniques[19]. Fault tolerance in the operating and database layers is often provided using replicated file systems[22], exception handling[23], disk shadowing[6], transaction-based checkpointing and recovery[18], and other system routines.

[Figure 2: Layers of Fault Tolerance. The application layer sits above the operating and database systems (FT-DBMS, replicated file systems, ...), which in turn sit above the hardware (duplex, TMR, ...).]

These methods and technologies handle faults occurring in the underlying hardware, operating and database system layers only. An increasing number of faults, however, occur in the application software layer, causing application processes to crash or hang. A process is said to be crashed if the working process image is no longer present in the system. A process is said to be hung if the process image is alive and its entry is still present in the process table, but the process is not making any progress from a user's point of view. Such software failures arise from incorrect program designs or coding errors but, more often than not, they arise from transient and nondeterministic errors[9]; for example, unsatisfactory boundary value conditions, timing and race conditions in the underlying computing and communication services, performance failures, etc.[7]. Due to the complex and temporal nature of the interleaving of messages and computations in a distributed system, no amount of verification, validation and testing will eliminate all those software faults in an application and give complete confidence in its availability and data consistency. So, those faults will occasionally manifest themselves and cause the application process to crash or hang.

It is possible to detect a crash and restart an application at a checkpointed state through operating system facilities, as in IBM's MVS[24]. In their paper on End-to-End Arguments[21], Saltzer et al. claim that such hardware and operating system based methods to detect and recover from software failures are necessarily incomplete. They show that fault tolerance cannot be complete without knowledge and help from the endpoints of an application, i.e., the application software. We claim that such methods, i.e., services at a lower layer detecting and recovering from failures at a higher layer, are also inefficient. For example, replicating files on a mirrored disk through a facility in the operating system is less efficient than replicating only the "critical" files of the application in the application layer, since the operating system has no internal knowledge of that application. Similarly, generalized checkpointing schemes in an operating system checkpoint the entire in-memory data of an application, whereas application-assisted methods checkpoint only the critical data[17, 3]. A common but misleading argument against embedding checkpointing, recovery and other fault tolerance schemes inside an application is that such schemes are not transparent, efficient or reliable because they are coded by application programmers; we claim that well-tested and efficient fault tolerance methods can be built as libraries or reusable software components executing in the application layer, and that they provide as much transparency as the other methods do. All three components discussed in this paper have those properties, i.e., they are efficient, reliable and transparent.
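
To make the contrast concrete, the following C sketch illustrates application-assisted checkpointing in which only data registered as critical by the programmer is saved. The function names ft_critical(), ft_checkpoint() and ft_recover() and the sample state are illustrative placeholders, not the actual libft interface.

/* Illustrative sketch of application-assisted checkpointing: only data
 * registered as "critical" is saved.  These are NOT the libft routines;
 * the names and structures here are hypothetical. */
#include <stdio.h>
#include <string.h>

#define MAX_REGIONS 32

static struct { void *addr; size_t len; } regions[MAX_REGIONS];
static int nregions = 0;

/* Register a piece of volatile data as critical. */
void ft_critical(void *addr, size_t len)
{
    if (nregions < MAX_REGIONS) {
        regions[nregions].addr = addr;
        regions[nregions].len  = len;
        nregions++;
    }
}

/* Save only the registered regions, not the whole process image. */
int ft_checkpoint(const char *file)
{
    FILE *fp = fopen(file, "wb");
    if (fp == NULL)
        return -1;
    for (int i = 0; i < nregions; i++)
        fwrite(regions[i].addr, 1, regions[i].len, fp);
    return fclose(fp);
}

/* Restore the registered regions after a restart. */
int ft_recover(const char *file)
{
    FILE *fp = fopen(file, "rb");
    if (fp == NULL)
        return -1;
    for (int i = 0; i < nregions; i++)
        if (fread(regions[i].addr, 1, regions[i].len, fp) != regions[i].len) {
            fclose(fp);
            return -1;
        }
    return fclose(fp);
}

int main(void)
{
    static long balance = 100;          /* critical application state */
    static char last_msg[64] = "none";  /* critical application state */

    ft_critical(&balance, sizeof balance);
    ft_critical(last_msg, sizeof last_msg);

    balance += 42;
    strcpy(last_msg, "deposit 42");

    ft_checkpoint("app.ckp");   /* writes only the registered bytes */
    ft_recover("app.ckp");      /* would normally run after a restart */
    printf("balance=%ld last=%s\n", balance, last_msg);
    return 0;
}

The cost of a checkpoint in such a scheme grows with the amount of declared critical data rather than with the size of the process image, which is why the overhead reported for these components varies with the amount of critical data being checkpointed.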

The above observations and the End-to-End argument[21] lead to our notion of software fault tolerance as defined below:

Software fault tolerance is a set of software facilities to detect and recover from faults that cause an application process to crash or hang and that are not handled by the underlying hardware or operating system.

Observe that this includes software faults as described earlier, as well as faults in the underlying hardware and operating system layers if they are undetected in those layers. Thus, if the underlying hardware and operating system of an application system are not fault-tolerant, because of performance/cost trade-offs or other engineering considerations, then that system can increase its availability more cost-effectively through software fault tolerance as described in this paper. Software fault tolerance is also an easier migration path for making existing applications more fault-tolerant.
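
As an illustration of the kind of facility this definition calls for, the sketch below shows one way a user-level monitor on UNIX can tell the two failure modes apart. It is only a sketch under simple assumptions (the monitor is the parent of the application process, and the heartbeat is simulated); it is not the watchd implementation, and the timeout value is arbitrary.

/* Illustrative detection of crashed vs. hung processes by a user-level
 * monitor.  A sketch only; this is not how watchd is implemented. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define HANG_TIMEOUT 30   /* seconds without progress => assume hung */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                /* child stands in for the application */
        execlp("sleep", "sleep", "5", (char *)NULL);
        _exit(127);
    }

    /* last_heartbeat would normally be refreshed whenever the application
     * reports progress, e.g. through a pipe or socket message. */
    time_t last_heartbeat = time(NULL);

    for (;;) {                     /* monitor loop */
        int status;

        /* Crash: the working process image is no longer present.  (For a
         * process that is not our child, kill(pid, 0) failing with ESRCH
         * gives the same information.) */
        if (waitpid(pid, &status, WNOHANG) == pid) {
            printf("application crashed; restart it here\n");
            break;
        }

        /* Hang: the process table entry is still there, but the process
         * has made no visible progress for HANG_TIMEOUT seconds. */
        if (time(NULL) - last_heartbeat > HANG_TIMEOUT) {
            printf("application hung; kill and restart it here\n");
            break;
        }
        sleep(1);
    }
    return 0;
}

A production monitor would restart the failed process instead of exiting; automatic detection and restart of this kind is the role watchd plays among the three components.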

2 Model

For simplicity in the following discussions, we consider only client-server based applications running in a local or wide-area network of computers in a distributed system; the discussion also applies to other kinds of applications. Each application has a server process executing at the user level on top of vendor-supplied hardware and operating system. To get services, clients send messages to the server; the server process acts on those messages one by one and, in each of those message-processing steps, updates its data. We sometimes call the server process the application. For fault tolerance purposes, the nodes in the distributed system will be viewed as being in a circular configuration such that each node will be a backup node for its left neighbor in that circular list. As shown in Figure 3, each application will be executed on a primary node, and the backup of that node will serve as the application's backup.
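
A minimal sketch of the kind of server process this model assumes is given below: requests are handled one at a time, and the server's volatile data is updated at each message-processing step. The message format and the state variables are hypothetical, and client messages simply arrive on standard input instead of over the network.

/* Minimal sketch of the server process assumed by the model: one request
 * at a time, with the volatile state updated at every step.  The message
 * format and state are hypothetical examples. */
#include <stdio.h>

struct server_state {
    long requests_served;      /* volatile data updated on every message */
    long balance;              /* hypothetical critical application data */
};

static void process_message(struct server_state *st, const char *msg)
{
    long amount;

    /* Each client message is acted on individually and updates the data. */
    if (sscanf(msg, "deposit %ld", &amount) == 1)
        st->balance += amount;
    else if (sscanf(msg, "withdraw %ld", &amount) == 1)
        st->balance -= amount;

    st->requests_served++;
    printf("reply: balance=%ld (request %ld)\n",
           st->balance, st->requests_served);
}

int main(void)
{
    struct server_state st = { 0, 0 };
    char line[256];

    /* Server loop: take the next client message, process it, update the
     * state, send a reply, then wait for the next message. */
    while (fgets(line, sizeof line, stdin) != NULL)
        process_message(&st, line);

    return 0;
}

Running, say, echo 'deposit 100' | ./server exercises one such step. It is this volatile state, together with any persistent files the server maintains, that the checkpointing, logging and replication tasks listed below are meant to protect.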

In the primary site approach to fault tolerance, the service to be made fault tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary. This model for fault tolerance has been analyzed for frequency of checkpointing, degree of service replication and the effect on response time by Huang and Jalote[11, 12]. This model was slightly modified, as described below, to build the three technologies described in this paper. The tasks in our modification of the primary site approach are:

- a watchdog process running on the primary node watching for application crashes or hangs,

- a watchdog process running on the backup node watching for primary node crashes,

- periodically checkpointing the critical volatile data in the application,


- logging of client messages to the application,


- replicating the application's persistent data and making them available on the backup node,