ARCHITECTURE OF TAMER: A TOOL FOR DEPENDABILITY ANALYSIS OF DISTRIBUTED FAULT-TOLERANT SYSTEMS

Richard A. DeMillo, Tsanchi Li, and Aditya P. Mathur
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907

September 29, 1994
Abstract
The need for fault-tolerant software has grown significantly with the need for providing computer-based continuous service in a variety of areas that include telecommunications, air and ground transportation, and defense. TAMER (a Testing, Analysis, and Measurement Environment for Robustness) is a tool designed to assess the dependability of such systems. Three key ideas make TAMER different from several existing tools aimed at dependability assessment of distributed fault-tolerant systems. These three ideas are incorporated in: (a) a two-dimensional criterion for dependability assessment, (b) interface fault injection, and (c) a scheme for partitioning the system under assessment into subsystems that can be analyzed "off-line". The interactive nature of TAMER allows an assessor to identify portions of software that may need attention for additional testing, redesign, or recoding. Such identification becomes possible with the help of code and fault coverage information derived during the testing process under TAMER's control. Here we describe the architecture of TAMER.
1 Introduction

Fault-tolerant design becomes increasingly important with the growth in demand for non-stop computer-based service. The criticality of fault-tolerant software suggests the importance of testing it to examine how well it can tolerate unforeseen faults. In the past, researchers have proposed several techniques and built and used tools toward this end. A representative among these efforts is the AAS fault tolerance testing [19]. Implicitly built into the above tools is a notion of "adequacy" of a test for fault tolerance, to reflect how well the test has been conducted. The "adequacy" suggests possible improvements. It also serves as a basis for deciding the release time of the software.

One limitation of current methods for assessing the adequacy of testing the fault tolerance of a system stems from their lack of coverage information. To understand why this lack can be a problem, consider the following scenario. This scenario is based upon the premise that a fault-tolerant system should not only be good at tolerating faults that arise during operation but also be relatively free from faults. Suppose that a system has been tested for fault tolerance and is found acceptable at tolerating faults. Such a conclusion could be derived using tools such as FIAT. Such a conclusion, based on fault-injection testing, does not indicate the chances of faults arising in the system during operation. Thus, for example, if a system is released because it has passed the fault-tolerance test, then such a system may spend most of its time in operation tolerating faults. This might happen because inadequate testing of the code has failed to reveal hidden faults that arise frequently during operation, forcing the well-designed fault-handling mechanism to tolerate them. The following example further elucidates the need for code coverage.
Example 1 Two (fault tolerance) tests on the same system were carried out by X and Y.
X obtained CF(X) = 0.8, with only 50% of the statements of the source code exercised. Y obtained CF(Y) = 0.8, while exercising 80% of the statements.
Both X and Y scored identically in fault coverage. But is it sufficient to validate the system for release solely based on this numerical data? What if some critical functions, such as the billing or accounting subsystem, lie in the code left untested by X? On the other hand, how can we account for the extra effort made by Y? An overview of the existing metrics reveals the lack of consideration of code coverage (e.g., statement coverage, decision coverage, all-uses coverage, and mutation coverage). Though it is true that testing the fault tolerance of a fault-tolerant system requires an added dimension of fault coverage, the evidence that improved code coverage improves software quality [13] suggests a new approach. Improved code coverage implies improved code quality (fewer faults), while improved fault coverage implies improved tolerance of any residual faults. Thus, we consider fault coverage as well as code coverage in the evaluation of test sets for a fault-tolerant system. We propose a new quantitative adequacy criterion that overcomes the limitation pointed out above. This criterion can be measured with the aid of TAMER. Our criterion, the underlying rationale, and the architecture of TAMER are described below.

The remainder of this paper is organized as follows. Section 2 reviews previous work in fault injection testing. Section 3 describes the two-dimensional adequacy criterion used to assess the dependability of a fault-tolerant system. The ideas of interface fault injection and system partitioning are elaborated in Sections 4 and 5. The architecture of TAMER is described in Section 6. An example illustrating the use of TAMER and the two-dimensional adequacy criterion appears in Section 7, with some preliminary results included. A summary of our work appears in Section 8.
2 Previous work in fault injection testing

There are several dimensions in the Software Fault Insertion Testing (SFIT) area. The major differences among approaches lie mainly in the target systems, fault types, and method of injecting faults. The target systems range from real-time distributed dependable systems [1] to large-scale operating systems [4]. The fault types injected into the target systems also vary greatly, from simple memory bit faults and processor-level faults to system/communication faults. TAMER aims at large distributed applications, such as telecommunications systems, using high-level faults. The insertion method applied to software is inspired by the success of hardware fault injection testing, in the expectation that this kind of testing can be used to gain direct control over fault occurrence.

There are two forms of implementing fault injection: state injection and code injection. The former is achieved by altering the state or behavior of a running system. More specifically, a system-level error "manifestation" is injected as an erroneous state. This faulty state is intended to mimic the error produced by a fault. For example, data inconsistency, a common error between two copies of data in a system, can be simulated by corrupting the data of either one. There are several approaches to implementing state injection:
Process-based: Injection is accomplished by a high-priority process modifying the state of the computation [17]. Such an approach often relies on the support of the underlying operating system (a sketch of this style of injection appears after this list).

Debugger-based: Using a debugger (e.g., dbx, gdb), errors can be injected into a running process through the composition of built-in facilities such as breakpointing, evaluating, and jumping.

Message-based: For message-oriented communication between two software components, the erroneous state can be created by disrupting message sequences using message-based tools. Such fault types have been employed in the AAS testing project [19].
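As a concrete illustration of process-based state injection on a Unix-like system, the following sketch uses ptrace to overwrite one word of a running target's memory; the pid, address, and corrupted value are placeholders, and this is only one possible mechanism, not necessarily the one used in [17].

    /* Minimal sketch (assumptions: Unix-like OS with ptrace; the target
     * address and corrupted value are placeholders for illustration). */
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int inject_word(pid_t pid, void *addr, long bad_value)
    {
        if (ptrace(PTRACE_ATTACH, pid, 0, 0) == -1) {  /* stop the target   */
            perror("attach");
            return -1;
        }
        waitpid(pid, NULL, 0);                         /* wait for the stop */
        long old = ptrace(PTRACE_PEEKDATA, pid, addr, 0);
        printf("old value at %p: %ld\n", addr, old);
        ptrace(PTRACE_POKEDATA, pid, addr,             /* corrupt the state */
               (void *)bad_value);
        return ptrace(PTRACE_DETACH, pid, 0, 0);       /* resume the target */
    }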
Yet another form of implementation, code injection, is possible. Unlike state injection, code injection changes some tangible object (e.g., input data, object code, source code) before the system runs. One way is to feed the system with erroneous inputs and examine its input-checking capability [18]. Another way is to apply a faulty patch to the object code at a designated location [12]. This approach may save compilation time; however, it is not a systematic and flexible way of testing. TAMER uses source code injection, which draws much on the authors' experience with the Mothra project [6]. Mothra is a testing environment that supports mutation testing by changing a predetermined portion of the source code to produce a syntactically correct but semantically incorrect mutant [6]. To expedite compilation, TAMER needs just a single compilation for all "mutants", as detailed later.
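As a hypothetical illustration of source-code mutation in the Mothra style (the routine below is invented, not taken from any target system), the mutant differs from the original by a single operator and is syntactically correct but semantically incorrect:

    /* original (hypothetical) routine */
    int max_of(int a, int b) { return (a > b) ? a : b; }

    /* a mutant: ">" has been replaced by "<", so the routine now computes
     * the minimum; a test case such as max_of(1, 2) distinguishes it from
     * the original by their outputs */
    int max_of_mutant(int a, int b) { return (a < b) ? a : b; }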
3 An adequacy criterion
3.1 Adequacy as a measure of correctness
Criteria for test data adequacy are rules for deciding when enough testing has been performed. They have long been an important objective of research in software testing. Although there are many different forms of test adequacy criteria [11, 16, 15], the goal is unique, namely, the correctness of the tested program. Therefore, we may consider the adequacy of testing fault tolerance as how close the fault manager (Fig. 1) functionality is to correctness. In other words, we postulate the following: the "closer" test adequacy is to 100%, the "closer" the program is to being correct.

[Figure 1: A zoom-in of the entire system. The subject subsystem (SSS) consists of a fault manager (FM) and two application modules (AM-1 and AM-2); each application module reports failure states to the fault manager, which responds with recovery actions.]

Listed below are some useful notations for the following discussion.

F: the desired function to be computed.
P: the fault-tolerant program that computes F. We may think of P as the subprogram SSS (Fig. 1), consisting of, say, an FM and two AMs.
T: a test set for SSS.
t: a test case in T.
To simplify the following discussion, no fault injection is considered for the moment in testing the fault tolerance capability of a system; instead, designing new test data is the only means. Assuming the simplistic scheme of fault tolerance in Fig. 1, the correctness of P in responding to any input is guaranteed by either of the following conditions: (1) both AMs are correct, or (2) at least one AM is incorrect, but FM reacts correctly. The first condition is not our present concern in fault tolerance testing. To maximally exercise the functionality of FM, we may assume, by selection of T, that each test case in T will drive the program P to step through at least one hidden fault in some application module. The fault manager is thereby forced to react to this triggered fault. As will be defined later, fault coverage (FC) is, roughly speaking, the percentage of successful fault handlings on T. Indeed, FC as well as other coverage estimators [21, 18, 14] do indicate the capability of FM to some extent. Simply using such metrics is, however, insufficient to infer the correctness of FM.

Let us first recall the scenario of Example 1 in Section 1. Suppose a test with T has been conducted on P. The result shows that T attains 100% fault coverage and also indicates that only two modules (the shaded boxes in Fig. 1) have been well exercised, with AM-2 (the white box) left untested. Judged solely by FC, the system would have passed the test. However, there is a potential danger behind this simple numerical figure, namely, the incompleteness of the test set T. During the test, the interaction between FM and AM-1 may be perfectly tested, in that FM handles all the faults stemming from AM-1. This does not imply the same satisfaction for the interaction between FM and AM-2, since the faults hidden in AM-2 were not activated at all. This observation suggests that the effort of Y in Example 1 may be worthwhile. Moreover, although testing the fault tolerance of a fault-tolerant system requires an added dimension of fault coverage, the evidence that improved code coverage improves software quality [13] suggests a new approach. Improved code coverage implies improved code quality (fewer faults), while improved fault coverage implies improved tolerance of any remaining faults. To cover the weakness of the current adequacy criteria that use FC and the like alone, we incorporate code coverage (CC) into the evaluation of test adequacy. To be exact, we define test adequacy as follows.
Definition 1 Let P be a fault-tolerant program, and let T be a set of test data for P's fault tolerance capability. Then the adequacy of T is defined as

    Ad(P, T) = CC(P, T) × FC(P, T).    (1)
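As a minimal sketch (not a part of TAMER itself), the adequacy of Definition 1 can be computed directly from the two coverage figures; the numbers below are those of Example 1.

    #include <stdio.h>

    /* Test adequacy per Definition 1: Ad = CC * FC. */
    static double adequacy(double code_coverage, double fault_coverage)
    {
        return code_coverage * fault_coverage;
    }

    int main(void)
    {
        /* Example 1: both testers reach FC = 0.8, but X exercises only 50%
         * of the statements while Y exercises 80%. */
        printf("Ad(X) = %.2f\n", adequacy(0.5, 0.8));   /* prints 0.40 */
        printf("Ad(Y) = %.2f\n", adequacy(0.8, 0.8));   /* prints 0.64 */
        return 0;
    }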
The rationale of the above definition is illustrated in Fig. 2. Along the CC-axis, the percentage of the code that has been exercised is "tested-%", and the remaining "untested-%" represents the percentage of unexecuted code. Similarly, along the FC-axis, "recovered-%" and "unrecovered-%" denote the percentages of successful and unsuccessful recoveries, respectively. Taking a conservative approach, we trust only the (shaded) region bounded by "tested" and "recovered", while the other rectangles, marked either "untested" or "unrecovered", are not deemed accountable for program correctness. The above test adequacy is defined to be the degree of confidence in program correctness, namely, the area of the shaded rectangle.

[Figure 2: Contributors to test adequacy. The code coverage (CC) axis, running from 0 to 1.0, is divided into tested-% and untested-%; the fault coverage (FC) axis, also running from 0 to 1.0, is divided into recovered-% and unrecovered-%. The shaded rectangle is bounded by "tested" and "recovered".]

Applying the above definition to Example 1, the adequacy achieved by X is 0.4 (= 0.5 × 0.8), while Y achieves 0.64 (= 0.8 × 0.8); thus Y has conducted a better test on the target system. Ad defines a family of adequacy criteria as CC is varied. Note that the above definition also extends the notion of the traditional adequacy criterion in two ways. First, as in testing non-fault-tolerant software, we may consider FC = 1.0; thus Ad = CC, making traditional code coverage a special case of our adequacy criterion. Second, if code coverage has been improved to 1.0, then fault tolerance is responsible for the remaining deficiency, since Ad = FC in this instance. Furthermore, to enhance the adequacy, one has to increase both coverages. On one hand, increasing the code coverage necessitates improving the test set by incorporating new test cases, i.e., the quality of the test set. On the other hand, enhancing the fault coverage requires improving the fault tolerance capability. As both improvements proceed, the quality of both the test set and the code are enhanced. In the ideal case, when Ad = 1, program correctness can be assured with a certain probability.

The adequacy Ad defined above is significant in several respects. To begin with, it extrapolates the fault tolerance using currently available test results such as fault coverage and code coverage. In addition, this measurement provides a basis for assessing the effectiveness of the testing carried out so far and of the system's fault tolerance capability. More importantly, this measurement may also be used to decide on system release for field operation.

[Figure 3: The two-dimensional input space for fault tolerance testing: test data along the X-axis and faults along the Y-axis.]
3.2 The need for code coverage and fault coverage

By incorporating test data alone, testing for fault tolerance cannot be efficient; therefore, as emphasized earlier, fault injection techniques have to be exploited. In Fig. 3, the X-axis stands for test data (functional inputs), and the Y-axis for faults (non-functional inputs). Listed below are some useful notations.
Let X = {x1, x2, ..., xm} be a set of test data for P. Let Y = {y1, y2, ..., yn}, where each yj is a mutant program of P. Such a mutant is induced by injecting a failure mode at some location in the program P.

As we may sample the fault space as well as the test data, the earlier restriction (made for convenience of discussion) of not using fault injection must now be removed, and the definitions of some traditional coverage criteria need to be amended, since they have not previously been used in combination with fault injection. First, the definition of code coverage can be refined as follows.
Definition 2 [Code coverage (CC)]

    CC(P, X, Y) = CC_trad(P*(X, Y)),    (2)

where CC_trad denotes the traditional code coverage computation and P* stands for the generic mutant of P. In this notation, P*(xi, yj) equals yj(xi), namely, the execution of the j-th mutant on the i-th input datum. We may think of P*(X, Y) as the execution result of the program P on the input X × Y. Here, "execution result" should be understood as some execution history (e.g., a profile or trace file), from which the code coverage CC can be computed.

Regarding FC, there does not exist any coverage estimator that can assess the fault tolerance capability in terms of the two-dimensional input space (Fig. 3). As discussed in Section 1, the EDC used by Vrsalovic et al. [21] or Siewiorek et al. [18] is good only for testing with erroneous inputs; it cannot be used with two-dimensional inputs. Moreover, the FC defined by Powell et al. [14] is useful only for hardware fault injection, where the concept of a "mutant" does not exist. Therefore, a more appropriate definition of FC is needed. We say a mutant yj is triggered by a test case x if the location of the planted fault lies on the execution path of yj on x. An injected fault is tolerated, or recovered, if the mutant is triggered by x and the mutant behaves identically to the original program P. A quantity evaluating the recovery capability of P is defined as follows.
Definition 3 [Fault coverage (FC)]

For the entire test execution, let T be the total number of "triggers" and R the total number of triggered faults that have been recovered. More explicitly,

    T = Σ_{i=1..m, j=1..n} t(i, j), where t(i, j) = 1 if yj has been triggered by xi, and 0 otherwise;

    R = Σ_{i=1..m, j=1..n} r(i, j), where r(i, j) = 1 if yj has been triggered by xi and yj(xi) = F(xi), and 0 otherwise.

Note that R ≤ T, and we define

    FC(P, X, Y) = R / T.
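A minimal sketch of how FC might be tallied once the indicators t(i, j) and r(i, j) of Definition 3 have been collected; the array layout and the example values are assumptions for illustration, not TAMER's data structures.

    #include <stdio.h>

    #define M 2   /* number of test cases x_1..x_m (example sizes) */
    #define N 3   /* number of mutants    y_1..y_n                 */

    /* t[i][j] = 1 if mutant y_j was triggered by test case x_i;
     * r[i][j] = 1 if, in addition, y_j(x_i) = F(x_i).             */
    static double fault_coverage(const int t[M][N], const int r[M][N])
    {
        int i, j, T = 0, R = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++) {
                T += t[i][j];
                R += r[i][j];
            }
        /* FC = R / T; the value returned for T = 0 is our own convention */
        return T ? (double)R / (double)T : 1.0;
    }

    int main(void)
    {
        const int t[M][N] = { {1, 1, 0}, {1, 0, 1} };  /* illustrative data */
        const int r[M][N] = { {1, 0, 0}, {1, 0, 1} };
        printf("FC = %.2f\n", fault_coverage(t, r));   /* 3 of 4 triggers recovered: 0.75 */
        return 0;
    }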
4 Interface-based fault injection
The concept of encapsulation is one of the main characteristics of object-oriented design, which emphasizes abstract data types and information hiding. An object consists of some data structures and operations that act on those data. The internal data structure can be accessed from outside the object only through well-defined routines, called interfaces. We take advantage of this restriction on data access to minimize the effort needed for fault injection, as described below. Recall that the term fault manager (FM) refers to the software component within the system that detects and recovers from various kinds of error. The level at which faults are injected may range from hardware, such as heavy-ion radiation [9] and memory bits [8, 4], to software, such as program mutation [6]. However, there are several incentives to injecting faults through a module interface. First, we believe that injection at a higher level (i.e., the language level) can reduce cost and increase portability without significant loss of test validity. Second, by our assumption of strong encapsulation, faults inside a module will manifest themselves at the module interface; in other words, module interfaces are the gateways of fault propagation. Furthermore, fault injection at a relatively high level in the software architecture tends to give the designer/programmer direct insight into the weaknesses of the function under examination.

A large number of possible faults can hide inside a module. Fortunately, there is only a small set of significant common fault manifestations, or failure modes. It is this set of failure modes that is visible to the fault manager. Our idea is to sample possible module faults by injecting these failure modes at the interface level and then to examine the fault manager's ability to tolerate such failures. Which failure modes are candidates for testing? We may start with the specification of the fault manager, which expresses the expected types of failure from the standpoint of the designer. Another possibility, based on empirical data [2, 5, 4], is to choose a set of typical common failure modes such as crash, abort, late response, hang, incorrect answer, and message omission.

    process_crash()
    {
        exit(ERROR_CODE);
    }

    incorrect(val)
    int val;
    {
        return (val + OFF_VAL);
    }

Figure 4: Possible implementation of some failure modes

A possible implementation of a crash failure for a process is shown in Fig. 4, where ERROR_CODE is some non-zero integer. When a tester selects to inject a crash failure at a certain location in the software, the statement process_crash(); will be inserted at the desired place. At the point where process_crash executes, the process exits immediately, which is perceived as a crash by the system. Similarly, the incorrect failure routine incorrect, shown right below in Fig. 4, perturbs the input value by OFF_VAL and yields the intended failure upon execution.
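The remaining failure modes listed above could be realized in the same spirit. The sketches below are our own illustrations in the style of Fig. 4, not code taken from TAMER: a hung failure that never returns, and a late-response failure that delays the correct answer by LATE_SECS seconds (LATE_SECS is a placeholder).

    #include <unistd.h>   /* for sleep() */

    /* Hung failure: the routine never returns, so the caller blocks forever. */
    void process_hang(void)
    {
        for (;;)
            ;   /* spin; it could equally block on a condition that never holds */
    }

    /* Late-response failure: eventually return the correct value, but only
     * after an artificial delay. */
    #define LATE_SECS 30
    int respond_late(int val)
    {
        sleep(LATE_SECS);
        return val;
    }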
5 System partitioning and off-line execution

It is necessary to conduct analyses such as integration testing or system testing on the original system. However, if some component of the system is the only focus of analysis, it is unpleasant to work with the entire system, especially a large one (e.g., a telecommunications system), for several reasons. To begin with, problems that arise in non-relevant modules are usually an annoyance: they may distract the tester or even interrupt the activity. Furthermore, such component analysis usually requires repeated runs to produce the desired details about the component's behavior, and running the entire system merely to iterate over the intended component is not efficient. Indeed, a scheme of off-line testing is highly desirable for efficiency. In brief, the scheme takes one execution of the entire system in the original environment and then iterates all the follow-up runs only on SSS in TAMER.

[Figure 5: An off-line scheme. System partitioning and data recording take place in the target environment; system emulation, interface injection, and test execution take place in the tool environment.]

To execute the testing on SSS off-line, we need to solve the following.
Problem: How can we reproduce the execution of a subsystem in a "foreign" and "isolated" environment?

This problem may be further divided into two sub-problems, referred to as the F-problem and the I-problem, respectively. The environment is foreign since the tool environment, say SunOS, may differ from that of the target system (e.g., MVS, Unix). Moreover, it is isolated since the tool is ideally hosted on a workstation that is not connected to the original system. To solve the F-problem, the first task is to emulate the target operating system on top of TAMER's operating system. Mach's approach sheds some light here. In fact, it has been a goal of Mach to support multiple operating systems on top of its microkernel, where emulation is done by a software layer outside the kernel [7]. A Mach user can resort to any favorite operating system (emulated by Mach) as easily as invoking a shell (ksh, csh, etc.). Solving the F-problem means the Target Operating System Emulator (TOSE) is available to TAMER. With the support of TOSE, the SSS can then execute in TAMER as if at home, except for the I-problem. To re-execute the SSS in isolation, TAMER must replenish the lost communication and synchronization for SSS. This can only be done by recording all the messages flowing across SSS during the operation of the target system; TAMER then "replays" these recorded data to the exiled SSS.
Solution: To transport a test of SSS to the tool environment (Fig. 5), a bridge is needed: at one end, communication data are recorded in the target environment; at the other, the tool has to emulate the original system. Therefore, emulation of the original system needs TOSE as well as the communication data. To demonstrate the idea of the off-line approach, we are content here to consider the case in which the target system has the same environment (hardware and operating system) as TAMER's, so that emulation of the communication channel is the only task. How the data are recorded and replayed is discussed below.
5.0.1 Data recording

Data recording refers to the effort of capturing the data flowing between SSS and the rest of the system in the original environment. Assume that message passing and file sharing are the only means used for interprocess communication. To reproduce the behavior of the exiled SSS, TAMER needs to replay these recorded data. But is that sufficient? There is no direct answer to this question; however, related work does suggest some promise for data recording. As is well known, the non-deterministic execution behavior of parallel programs makes debugging difficult: consecutive executions in a parallel (distributed) environment, affected by complex race conditions, may deliver different outputs. To reproduce an execution, the history-preserving concept is fundamental to many approaches to this problem. Particularly for parallel programs that use message passing, the execution history is the sequence of messages recorded in the order of occurrence, a basis for several fast replay schemes [3, 20]. Applications of the same concept in other areas are also common. One example is the recovery scheme for a database system, where checkpoint and rollback are commonly used to preserve and repeat past execution. Therefore, we adopt the following heuristic.
Heuristic: To reproduce the execution of SSS in TAMER, it is sufficient to record, in order of occurrence, the data flowing into SSS during the execution of the target system, where data may be received in messages or retrieved from a file.

As described previously, SSS relies on two forms of communication with the remaining system: send/receive of messages and write/read of files. To record all the communication data means to checkpoint each message and each file datum that has been passed. This can be done by modifying the system's interprocess communication routines (e.g., send, receive) as well as its file access routines (e.g., write, read) to include checkpointing in the existing semantics. When the target system is invoked on a system input datum, each communication datum is checkpointed to the communication input profile (CIP) in its order of occurrence, together with important information such as timestamp, sender address, port id, etc. Similarly, each message sent out by SSS is logged to a communication output profile (COP). Therefore, for each input datum to the original system, the CIP captures all the incoming messages for SSS, while the COP represents its outward behavior. In summary, associated with each system input datum, the collection of CIP and COP is what SSS needs in order to execute in TAMER or other systems.
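A minimal sketch of the recording step, assuming a simple message descriptor and a textual CIP format of our own invention (the real routines and message layout of the target system will differ): the wrapper behaves like the original receive and appends each incoming message to the CIP in order of occurrence.

    #include <stdio.h>
    #include <time.h>

    struct message {          /* assumed message descriptor */
        int  sender;          /* sender address             */
        int  port;            /* port id                    */
        int  len;             /* payload length in bytes    */
        char data[256];       /* payload                    */
    };

    /* Checkpoint one incoming message to the communication input profile
     * (CIP), preserving order of occurrence together with a timestamp.   */
    static void checkpoint_to_cip(FILE *cip, const struct message *m)
    {
        fprintf(cip, "%ld %d %d %d ", (long)time(NULL),
                m->sender, m->port, m->len);
        fwrite(m->data, 1, (size_t)m->len, cip);
        fputc('\n', cip);
    }

    extern int original_receive(struct message *m);  /* the system's own routine */

    /* Recording wrapper: same semantics as receive(), plus checkpointing. */
    int receive(struct message *m)
    {
        int rc = original_receive(m);
        if (rc == 0) {
            FILE *cip = fopen("cip.log", "a");
            if (cip) { checkpoint_to_cip(cip, m); fclose(cip); }
        }
        return rc;
    }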
5.0.2 System emulation

Fig. 6 depicts the execution of the tested system in the original environment with n inputs/outputs. Executing the test in a tool environment needs the off-line support shown in Fig. 7: emulation of the target operating system (TOSE) and n communication input/output profiles (CIP/COP).

[Figure 6: Execution in the original environment. The subject subsystem runs on the operating system and hardware alongside the remaining subsystem; system input data 1..n arrive as incoming communication and system output data 1..n leave as outgoing communication.]

[Figure 7: Off-line execution in TAMER. The subject subsystem runs on top of the target operating system emulator (TOSE) inside TAMER, reading from communication input profiles (CIP) 1..n and writing to communication output profiles (COP) 1..n.]

For emulation, the routine receive for interprocess communication is interpreted as checking data out of a CIP, while send is interpreted as checking data in, to be compared against a COP; file read and file write are treated similarly. TAMER is ready for testing SSS when all the previous steps have been completed. Before any further analysis, the following is necessary to assure correct emulation of the target system: each CIP (corresponding to one system input datum) is "replayed" to SSS in TAMER, while the stored COP is compared against the new communication data generated by SSS. Once TAMER completes setting up the virtual environment for SSS, fault injection testing ensues, the results are analyzed, and fault coverage as well as code coverage are assessed.
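Correspondingly, a minimal sketch of the off-line interpretation of receive and send under the same assumed CIP/COP format as in the recording sketch above: receive checks the next record out of the CIP, and send checks the new outgoing data against the stored COP.

    #include <stdio.h>
    #include <string.h>

    /* receive() reinterpreted: read the next recorded message from the CIP
     * instead of waiting on a real communication channel.                 */
    int emulated_receive(FILE *cip, char *buf, int maxlen)
    {
        long ts; int sender, port, len;
        if (fscanf(cip, "%ld %d %d %d", &ts, &sender, &port, &len) != 4)
            return -1;                        /* no more recorded input  */
        fgetc(cip);                           /* consume separator blank */
        if (len > maxlen) len = maxlen;
        fread(buf, 1, (size_t)len, cip);
        fgetc(cip);                           /* consume trailing newline */
        return len;
    }

    /* send() reinterpreted: compare the outgoing data against the stored
     * COP; a mismatch means SSS no longer reproduces its original outward
     * behavior.                                                           */
    int emulated_send(FILE *cop, const char *buf, int len)
    {
        char expected[256];
        long ts; int sender, port, explen;
        if (fscanf(cop, "%ld %d %d %d", &ts, &sender, &port, &explen) != 4)
            return -1;
        fgetc(cop);
        fread(expected, 1, (size_t)explen, cop);
        fgetc(cop);
        return (len == explen && memcmp(buf, expected, (size_t)len) == 0) ? 0 : 1;
    }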
6 TAMER: a tool prototype

As a realization of fault injection with the prescribed features, i.e., interface-based injection, injection with (high-level) failure modes, and the like, a tool prototype named TAMER has been designed and is under development. The following delineates the architecture of TAMER and then elaborates each of its components.

[Figure 8: TAMER architecture. At the user level is the Test Visualizer, comprising the Browser, Monitor, and Displayer. The tool kernel contains the Injector, the Driver (Test Manager, Execution Optimizer, and Scheduler), and the Adequacy Manager. The subject subsystem (SSS) connects to the remaining subsystem (for on-line testing) or to a system emulator (for off-line testing).]
6.1 TAMER architecture

The architecture of TAMER has two levels of abstraction (Fig. 8): the user level and the tool kernel. The Test Visualizer (TV) is the only user interface, via which a user can select where/what/how to fault-inject the target system by browsing the source code, monitor and control test execution, and obtain post-test information. The tool kernel has four major components: before execution, the fault injector (INJ) injects the selected failures into SSS; during execution, the driver (DRV) harnesses the process of test execution (for off-line testing, TAMER is supported by the emulator, EMU); after execution, the adequacy manager (ADQ) updates the user on the test results.
6.2 Tool components
1. Test visualizer (TV): TAMER supports source code injection. Before execution, the Browser takes test parameters from the user by letting the tester navigate through the source code and make the fault injection specification (FiSpec), which is a list of fault injection descriptors (FiD), where an FiD is a triple consisting of three attributes: (code location, failure mode, persistency).
Note that each FiD corresponds to a mutant program in mutation testing [6]. In the text window provided by the Browser, the user identifies the desired text string (function call or variable) by highlighting it, then maps one or more failure modes, with the intended persistency (transient/permanent), to it. To ease highlighting, the user merely clicks the mouse pointer on any character of the text string of interest, and the complete string is highlighted. Fig. 10 shows what a user sees during the interaction with the Browser. During execution, the observation of fault injection is essential to the tester/designer, since it provides the basis for judging the system behavior. Designed for this purpose, the Monitor enables the user to monitor and control test execution. The monitoring facility is useful since the user may be particularly concerned with the micro-behavior at the moment the injected failure is triggered; the reaction of the fault manager to the failure can thereby be closely examined. After the entire execution, what is most expected are the test results such as code coverage, fault coverage, and test adequacy. It is the Displayer that reports these outcomes to the user. In summary, TV, with all its visualization facilities, is a device for specifying test parameters, monitoring/controlling test execution, and retrieving test results.

2. Injector (INJ): Taking the FiSpec from the Browser, the major task of INJ is to instrument the source code for three purposes: fault injection, coverage computation, and execution monitoring. For fault injection, the injector generates a generic mutant (GM) according to the FiSpec, basically by replacing the code of each FiD with a fault injection trigger (FiT). Suppose there is a variable named "Var" in the FiSpec; then the injector will replace the original text with FiT(FiD, Var). Semantically, FiT is a run-time routine called during the execution of the GM, leaving control of the fault activation to the driver. Note that the GM needs just one compilation for all "mutants", a significant reduction in compilation time (a sketch of this idea appears after this list). For code coverage, the injector incorporates ATAC for code instrumentation [10]. For monitoring, the injector attaches probes to the source code for later execution.

3. Driver (DRV): The Test Manager prepares a test input and the associated output from a test database. Taking a test input from the Test Manager, the Execution Optimizer (OPT) filters out any mutant (corresponding to an FiD in the FiSpec) that will not be activated by the test input, avoiding all redundant executions. The extra effort of filtering pays off, since it significantly improves the testing efficiency while producing identical test results. The idea is as follows. For ease of presentation, the i-th CIP (communication input profile) is aliased by xi from now on. Viewing the tested program as a flowgraph G, a test input xi exercises an execution path on G, which requires dynamic analysis to determine. The FiSpec can be considered a marking on G; we may divide the GM into two components, reachable and unreachable, with respect to xi. The reachable parts of the GM correspond precisely to the mutants that will be triggered by xi, while the unreachable ones will not be. The task of the Scheduler, on the other hand, is straightforward: it simply iterates over all reachable mutants for each given test input. DRV runs in two modes, auto-mode and man-mode. In auto-mode, DRV runs the execution without human intervention until the completion of the entire test; in man-mode, DRV waits for commands from the Monitor (TV), which in turn come from the user.

4. System emulator: The architecture in Fig. 8 shows that two options of test execution, on-line testing and off-line testing, are supported by TAMER; system emulation is required only for the latter. To prototype our idea of off-line testing, the environment (hardware, operating system) of the target system and of TAMER are assumed to be identical, so that no TOSE (target operating system emulator) is necessary. System emulation then refers only to checking data out of the CIP (communication input profile) and checking data in against the COP (communication output profile).

5. Adequacy manager (ADQ): After the test, ADQ computes the fault coverage, the code coverage, and the test adequacy, and then sends these results to the Displayer (of TV) for user review.
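A minimal sketch of the generic-mutant idea behind the Injector, under our own assumptions about the data structures involved (TAMER's actual FiT interface is not given in this paper): every selected occurrence is rewritten into a call to a run-time trigger that either passes the original value through or applies the chosen failure mode, so a single compiled program can play the role of every mutant.

    /* All names and types below are assumptions made for illustration. */

    enum failure_mode { FM_NONE, FM_INCORRECT, FM_HUNG };
    enum persistency  { ONCE, ALWAYS };

    struct fid {                        /* fault injection descriptor (FiD)  */
        int               location;     /* code location id                  */
        enum failure_mode mode;
        enum persistency  persist;
        int               armed;        /* set by the driver: mutant active? */
        int               fired;        /* has a ONCE failure been injected? */
    };

    #define OFF_VAL 1

    /* The injector replaces an occurrence of a variable Var at a selected
     * location by FiT(&fid_k, Var); with no FiD armed, the generic mutant
     * behaves exactly like the original program.                           */
    int FiT(struct fid *f, int original_value)
    {
        if (!f->armed || (f->persist == ONCE && f->fired))
            return original_value;                          /* not activated */
        f->fired = 1;
        switch (f->mode) {
        case FM_INCORRECT: return original_value + OFF_VAL; /* perturb value */
        case FM_HUNG:      for (;;) ;                       /* never return  */
        default:           return original_value;
        }
    }

Under this reading, the Driver "iterates over mutants" simply by arming one FiD at a time and re-running the single compiled generic mutant.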
7 A preliminary experiment
7.1 A tour through TAMER
To illustrate how the Browser works, let us first examine a program, Euclidean Distance, which uses the recovery block scheme to attain data precision. We then show how to use TAMER to browse the code and make a FiSpec selection.
Example 2 [A target program]
Fig. 9 sketches the underlying algorithm of the program. In the upper left corner, the diagram depicts the distance between two given points, P and Q. Within the middle box, a recovery block attempts three routines, dist_1, dist_2, and dist_3, sequentially until the result satisfies accept_test, for which two implementations are shown on either side, titled old accept_test and new accept_test. The two programs, identical except for using different accept_test's, are named the old and the new program, respectively.

[Figure 9: A simple program that uses a recovery block. The program computes the distance dist between two points P and Q from the legs a and b and the angle A; the recovery block tries dist_1 = a + b, dist_2 = a*cos(A) + b*sin(A), and dist_3 = sqrt(a*a + b*b) in turn. The old accept_test returns OK if dist >= max(a, b); the new accept_test returns OK if |dist*dist - a*a - b*b| < error.]

For such a simple example, partitioning and data recording are not necessary. Fig. 10 shows how the Browser appears to a user when the code of the above program is examined. To specify the fault injection, the user may identify these interfaces and then attach the failure(s) and the preferred persistency to each location. As the bottom panel shows, two options are available at present. The first option is "once", which indicates that the injected failure will be triggered only on its first occurrence, while the "always" option keeps the failure firing until termination. Fig. 10 also shows that when the interface dist_1 is picked (highlighted), three failure modes, "incor(rect)", "hung", and some "user def" mode, are mapped to this location. Inside TAMER, three different "mutants" are generated, each of which, on return from dist_1, will deviate into the designated failure. For example, the execution of the "incor" mutant will result in a value of d different from that of the original program. To the left of the TextArea, a bright rectangular area labeled "GlobeViewer" displays an array of three horizontal dashes, each corresponding to a selected failure mode. In general, there is a mapping between the dashed arrays and the fault-injected interfaces: the height of an array indicates the location of the interface as a percentile of the file. This facility is particularly useful when the inspected file is large. On a workstation with a color monitor, TAMER lets the user distinguish the failure modes by distinct colors. GlobeViewer also helps the user "edit" the FiSpec in that it allows the user both to view and to modify it. When a modification of the FiSpec is intended, the user can re-center the interface of interest in the text window by clicking the mouse cursor on the dashed array in GlobeViewer; the change of selection may then follow.

[Figure 10: A tour through TAMER (a screenshot of the Browser).]
7.2 Adequacy results

Before the full implementation of TAMER, several experiments were conducted in the GDB environment, where faults were injected using combinations of debugging functions such as breakpointing, evaluating, and jumping. To illustrate our main ideas, the experiment used Example 2 as the target program. Note that the old accept_test is intentionally designed to be imprecise; thus, when faults are injected, the program exposes its deficiency easily. After improvement, the new accept_test works satisfactorily. To exhibit the strength of fault injection, the experimental results are presented below.

The tables in Fig. 11 show the test results for the two programs of the previous example. Listed horizontally in each table are six different failure modes, all variations of the incorrect failure. For example, injecting the failure mode "A1+e" causes the value returned by the function dist_1 to deviate slightly by some erroneous amount "e". Listed vertically are three levels of results: "triggered", "detected", and "recovered". (Before a fault can be recovered by the program, it must have been detected; similarly, to be detected it must first have been triggered.) Each entry is either "0" or "1".

Figure 11: Experimental results of fault injection.

  Old program:
    failure     A1+e  A1-e  A2+e  A2-e  A3+e  A3-e   result
    triggered    1     1     1     1     1     1      6/6
    detected     0     1     1     1     1     1      5/6
    recovered    0     1     0     0     1     1      3/6

  New program:
    failure     A1+e  A1-e  A2+e  A2-e  A3+e  A3-e   result
    triggered    1     1     1     1     1     1      6/6
    detected     1     1     1     1     1     1      6/6
    recovered    1     1     1     1     1     1      6/6
For example, the first column in the table for the old program (the upper table) indicates that the fault "A1+e" was triggered, but FM neither detected nor recovered from it. The lower table shows the increase in fault coverage after FM was modified.

Several lessons were learned from the above experiment. First, a set of test data had been designed earlier to meet the requirement of 100% statement coverage; this was done to emphasize the power of fault injection. As the numerical example shows, even though the test of the original program demonstrates perfect CC, there is still a potential weakness. Therefore, test data alone are not sufficient for testing the fault tolerance of a program; we have to exploit fault injection to catalyze the process. Second, adopting statement coverage as CC (thus CC = 100% = 1.0) and the recovery ratio as FC, the test adequacy of the original program that used old accept_test is
    Ad = CC × FC = 1.0 × 0.5 = 0.5.
After code improvement (i.e., new accept_test is made more precise), the test adequacy also increases:

    Ad = 1.0 × 1.0 = 1.0.
8 Summary

We have described the underlying rationale and architecture of TAMER, a tool for testing fault-tolerant software. TAMER is currently a prototype with several components already in operation. The ADQ and DRV components, which carry out the adequacy computation, are two of the key modules yet to be designed and coded. Once the complete prototype is ready, we plan to perform experiments with a variety of fault-tolerant software to determine the "effectiveness" of our adequacy criterion.
References

[1] A. Avizienis and D. Ball. "On the achievement of a highly dependable and fault-tolerant air traffic control system". IEEE Computer, 20(2):84-90, February 1987.

[2] J. Barton, E. Czeck, Z. Segall, and D. Siewiorek. "Fault injection experiments using FIAT". IEEE Trans. on Computers, 39(4):576-582, April 1990.

[3] T. LeBlanc and J. Mellor-Crummey. "Debugging parallel programs with instant replay". IEEE Trans. on Computers, C-36(4):471-482, April 1987.

[4] R. Chillarege and N. Bowen. "Understanding large system failures - a fault injection experiment". In Proc. 19th Int. Symp. Fault-Tolerant Comput., pages 356-363, 1989.

[5] F. Cristian. "Understanding fault-tolerant distributed systems". Communications of the ACM, 34(2):56-78, 1991.

[6] R. DeMillo, D. Guindi, W. McCracken, A. Offutt, and K. King. "Extended overview of the Mothra software testing environment". In Second Workshop on Software Testing, Verification and Analysis, pages 142-151, July 1988.

[7] R. Rashid et al. "Mach: A foundation for open systems". In Proc. 2nd Workshop on Workstation Operating Systems, pages 109-113, 1989.

[8] Z. Segall et al. "FIAT - fault injection based automated testing environment". In Proc. 18th Int. Symp. Fault-Tolerant Comput., pages 102-107, 1988.

[9] U. Gunneflo, J. Karlsson, and J. Torin. "Evaluation of error detection schemes using fault injection by heavy-ion radiation". In Proc. 19th Int. Symp. Fault-Tolerant Comput., pages 340-347, 1989.

[10] J. Horgan and S. London. "ATAC: A data flow coverage testing tool for C". In Proc. of the IEEE Assessment of Quality Software Development Tools, pages 1-10, New Orleans, LA, 1992.

[11] W. E. Howden. "Reliability of the path analysis testing strategy". IEEE Trans. on Software Eng., SE-2:208-215, 1976.

[12] M. Lai, C. Chen, C. Hood, and D. Saxena. "Using software fault insertion to improve CCS network operation". In GLOBECOM '92, pages 1723-1728, December 1992.

[13] M. Ohba. "Software quality = test quality × test coverage". In Proc. 6th ICSE, Tokyo, pages 287-293, 1982.

[14] D. Powell, E. Martins, J. Arlat, and Y. Crouzet. "Estimators for fault tolerance coverage evaluation". In Proc. 23rd Int. Symp. Fault-Tolerant Comput., pages 228-237, 1993.

[15] R. DeMillo and A. J. Offutt. "Constraint-based automatic test data generation". IEEE Trans. on Software Eng., 17(9):900-910, September 1991.

[16] S. Rapps and E. Weyuker. "Selecting software test data using data flow information". IEEE Trans. on Software Eng., SE-11(4):367-375, April 1985.

[17] H. Rosenberg and K. Shin. "Software fault injection and its application in distributed systems". In Proc. 23rd Int. Symp. Fault-Tolerant Comput., pages 208-217, 1993.

[18] D. Siewiorek, J. Hudak, B. Suh, and Z. Segall. "Development of a benchmark to measure system robustness". In Proc. 23rd Int. Symp. Fault-Tolerant Comput., pages 88-97, 1993.

[19] T. Dilenno, D. Yaskin, and J. Barton. "Fault tolerance testing in the advanced automation system". In Proc. 21st Int. Symp. Fault-Tolerant Comput., pages 18-25, 1991.

[20] K. Tai, R. Carver, and E. Obaid. "Debugging concurrent Ada programs by deterministic execution". IEEE Trans. on Software Eng., 17(1):45-63, July 1991.

[21] D. Vrsalovic, Z. Segall, and J. Ready. "Is it possible to quantify the fault tolerance of distributed/parallel computer systems". In Proc. 20th Int. Symp. Fault-Tolerant Comput., pages 219-225, 1990.