Reliability Testing of Applications on Windows NT

Timothy Tsai
Reliable Software Technologies
21351 Ridgetop Circle, Suite 400
Dulles, VA 20166 USA
[email protected]
Abstract

The DTS (Dependability Test Suite) fault injection tool can be used to (1) obtain fault injection-based evaluation of system reliability, (2) compare the reliability of different applications, fault tolerance middleware, and platforms, and (3) provide feedback to improve the reliability of the target applications, fault tolerance middleware, and platforms. This paper describes the architecture of the tool as well as the procedure for using the tool. Data from experiments with the DTS tool used on the Apache web server, the Microsoft IIS web server, and the Microsoft SQL Server, along with the Microsoft Cluster Server (MSCS) and Bell Labs watchd (part of NT-SwiFT) fault tolerance packages, is presented to demonstrate the utility of the tool. The observations drawn from the data also illustrate the strengths and weaknesses of the tested applications and fault tolerance packages with respect to their reliability.
1. Introduction

Microsoft Windows NT is becoming a platform of choice for many applications, including services with dependability requirements. While the advantages of NT include decreased cost and leveraging of commercial development, testing, and support, the dependability of NT is a concern, especially compared to Unix systems that have traditionally formed the foundation for many high dependability products. In order to address this concern, several vendors, including Microsoft [10] and Bell Labs [13], have produced high availability software solutions that mostly depend on resource and process monitoring coupled with application restarts to handle error conditions. The available commercial solutions all claim to increase availability by tolerating a variety of faults. However, these
This work was performed while the author was with Bell Labs Research, Lucent Technologies, Murray Hill, NJ, USA.
Navjot Singh
Bell Labs Research, Lucent Technologies
600 Mountain Ave, Rm. 2B-413
Murray Hill, NJ 07974 USA
[email protected]
claims are usually not substantiated by rigorous testing but rather are based on a combination of analytical modeling, simulation, component analysis, and experience. One significant obstacle to the task of systematic testing of system dependability, in terms of either availability or another quantity, is the lack of easy-to-use fault injection tools. Fault injection is a necessity when testing the robustness of a system to unintended or unexpected events, because such events are often difficult to produce through traditional testing methods. This work addresses the need for an easy-to-use fault injection tool that can be used for a variety of software projects based on Windows NT.

The Dependability Test Suite (DTS) is a tool for testing the error and failure detection and recovery functionality of a server application. Most of the code for the tool has been written in Java to produce a simple, yet practical graphical interface and to facilitate portability among different applications.

This paper describes the DTS fault injection tool and illustrates its use with actual applications. The DTS tool is described in Section 3. Section 4 gives the results of experiments to illustrate the use of the DTS tool in (1) comparing the reliability of fault tolerance middleware, (2) comparing the reliability of applications with similar functionality, and (3) providing useful feedback to improve the target system. A summary and ideas for future work are given in Section 5.
2. Related Work

The current state of the art in fault injection includes many fault injection mechanisms and tools. Iyer [6] and Voas [17] provide good summaries of many techniques and tools, as well as background and further references. DTS depends on a method of fault injection called software-implemented fault injection (SWIFI). Instead of using hardware fault injectors or simulation environments, SWIFI tools use software code to emulate the effects of hardware and software faults on a real system. Such tools include FIAT [1], FERRARI [7], FINE [8], FTAPE [15],
DOCTOR [5], Xception [2], and MAFALDA [12]. These tools have been implemented for a variety of operating systems, including many Unix variants and real-time operating systems. In contrast to DTS, none of these tools was implemented on Windows NT, although the architectures of these tools do not preclude such an implementation. Rather, interest in Windows NT and its reliability has only recently begun to increase. Also, many of these tools focus on the reliability of the operating system or the platform rather than the reliability of applications. Fuzz [11] is one fault injection tool that tested the reliability of Unix utilities, applications, and services.

The basic DTS architecture is not dependent on a particular fault injection mechanism. However, the initial DTS tool implementation is based on the interception of library calls and corruption of library call parameters. This method of fault injection is not unique. The Ballista [9] tool uses a similar technique to test the robustness of operating systems by fault injecting a set of common system calls used to access the file system. The Ballista work was performed on machines running Mach and various flavors of Unix. Ghosh [4] presents a tool for testing the reliability and robustness of Windows NT software applications. It should be noted that this fault injection technique injects faults during the execution of the target programs and therefore is very different from mutation testing [3], which injects faults into source code before compilation.

None of these tools or studies injects faults into high availability systems. Thus, the focus of the testing is mostly on the target applications or the OS, in the case of Ballista. In addition, most of the tools were developed specifically for the types of fault injection performed, rather than being modular to be compatible with a variety of fault models and target programs.
3. DTS

The main goals in designing the DTS fault injection tool were ease of use, automation, extensibility, portability, and, most importantly, the ability to produce useful results. These considerations were important in determining the architecture, coding language, and user interface. The tool is distributed, with the management and user interface software residing on the control machine and the fault injection mechanism, workload generator, and data collector present on a separate target machine. This separation of the control and target machines is necessary if there is a possibility of a machine crash caused by an injected fault. Otherwise, a machine crash would require human intervention to restart the testing process. In addition, a distributed design allows for testing of distributed systems, especially if failover may occur or if correlated faults on multiple machines are to be injected. Nonetheless, although
the tool is distributed in nature, it may be used with all components on a single machine if none of the above issues is pertinent.

The majority of the DTS code is written in Java. The Java language includes many features that facilitate fast code development. These features include socket creation and use, thread management, object-oriented software reuse, convenient graphical libraries, and portability. The small portion of the code that could not be implemented in Java uses the Java Native Interface (JNI) and C. The JNI-implemented code is used for process control and other system-dependent tasks such as Windows NT event log access. For portability reasons, Java does not support a notion of process identifiers (PIDs), which are needed to properly terminate processes, especially those that have been fault injected and therefore may not be responding to normal termination messages.

DTS is controlled via a graphical interface and a set of configuration files. One main configuration file is used to specify test parameters such as timeout periods, a fault list file name, and workload parameters. The fault list file contains a list of faults to be injected. Workloads are specified by creating parameter files with names of applications or services to execute or by creating Java classes that are used by the DTS workload generator. More details on the DTS architecture are contained in [14]. The user's manual [16] contains detailed information about the steps needed to configure and to use the tool.

The DTS tool injects faults by corrupting the input parameters to library calls. The resulting errors emulate the effects of several different types of faults, including application design and coding defects and unintended interactions of the application with the environment and nonstandard input. For the results in Section 4, the main goals of the experiments are to compare different applications and fault tolerance middleware.
Thus, the main considerations for selecting faults are the ability to trigger error detection and recovery, the ability to discover failure coverage holes, and reproducibility. Other experiments that aim to produce a characterization of a single system's reliability (e.g., in terms of a reliability or availability estimate) will require a real-world profile of the faults being modeled.

The workload is the combined system resource usage (e.g., usage of operating system data structures, communication ports, etc.) caused by the execution of the application programs, the fault tolerance middleware, and the operating system. A workload generator is a set of programs that initiates the programs and creates the program inputs and environments in a controlled and reproducible manner to generate a particular workload. DTS assumes that the workload is created by a client-server set of programs. This assumption is valid for many applications of interest because reliability concerns are particularly important for server programs. The server program is also referred to as the "target program" because the focus of the fault injections is to evaluate the reliability of the server program, in the context of the operating system, fault tolerance middleware, and the client program. Note that the client program affects the overall reliability of the client-server system because client-initiated actions, such as client request retries, may be required for correct operation in the presence of faults. Non-client-server workload scenarios are also supported by DTS, including applications with direct user interaction; however, some additional coding of Java classes may be necessary. The DTS data collector presents results that include the following:
Outcome: The outcome for each injected fault is one of the following:

1. Normal success: The server was able to provide correct responses to all requests without any server restarts or request retransmissions.
2. Server restart with success: After a restart of the server, the server provided a correct response.
3. Server restart and client request retry with success: After a restart of the server and the retransmission of at least one client request, the server provided a correct response.
4. Client request retry with success: After at least one client request was retransmitted, the server provided a correct response.
5. Failure: At least one of the client requests did not succeed, either because no response was received or an incorrect response was received. This means that the server has failed, and the fault tolerance middleware, if present, has not prevented the failure of the server.

Response time: The total time for the client and server programs to complete.

Detailed results: The specific response to each individual request.

Most of the results are client-oriented, which means that most of the results can be determined by examining the client program behavior. Usually the client program is a synthetic program that is specifically written for DTS. Some results, such as whether the server program has been restarted, cannot be determined from examining the client program output. The determination of server program restarts is dependent on the middleware used to perform the restart. Some middleware, such as Microsoft Cluster Server [10], writes output to the Windows NT event log. Other middleware, such as NT-SwiFT [13], creates a separate log file.

Figure 1 shows the sequence of actions performed by the DTS tool for an experiment. An experiment consists of a series of workload sets (e.g., all faults for Apache, for IIS, and for SQL Server). Each workload set consists of a set of fault injection runs. A fault injection run includes the actions associated with the injection of a single fault. For each workload (W), a set of faults is injected. The set of faults depends on the set of functions to inject (F), the number of parameters for a particular function (P), the number of iterations to inject per function (I), and the number of fault types (T). For each fault injection run, the workload programs are started, one fault is injected, and the workload programs are terminated. The fault injection run is repeated until all parameters of all functions have been injected with all fault types (actually, some faults are skipped if DTS determines that the fault will probably not be activated).
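The five outcome categories above can be expressed as a small classifier over three observations per run. The following is a minimal sketch; the names (OutcomeClassifier, classify) are hypothetical and not taken from the actual DTS code:

```java
// Sketch: map three per-run observations to the five DTS outcome categories.
// Names are illustrative, not part of the DTS implementation.
public class OutcomeClassifier {
    enum Outcome {
        NORMAL_SUCCESS, RESTART_SUCCESS, RESTART_RETRY_SUCCESS,
        RETRY_SUCCESS, FAILURE
    }

    // success:   every client request eventually received a correct response
    // restarted: the server program was restarted
    // retried:   at least one client request had to be retransmitted
    static Outcome classify(boolean success, boolean restarted, boolean retried) {
        if (!success)             return Outcome.FAILURE;
        if (restarted && retried) return Outcome.RESTART_RETRY_SUCCESS;
        if (restarted)            return Outcome.RESTART_SUCCESS;
        if (retried)              return Outcome.RETRY_SUCCESS;
        return Outcome.NORMAL_SUCCESS;
    }
}
```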
  Workload Set:
    START
    foreach workload (w0, w1, ..., wA)
      foreach function (f0, f1, ..., fB)
        foreach parameter (p0, p1, ..., pC)
          foreach iteration (i0, i1, ..., iD)
            foreach fault type (t0, t1, ..., tE)
              Fault Injection Run
    END

  Fault Injection Run:
    START FI Run
    Create fault param file
    Prepare workload progs
    Start server prog (fault is injected)
    Wait for server to be up
    Start client prog
    Workload termination
    Gather results
    END FI Run

  (w = workload, f = function, p = parameter, i = iteration, t = fault type)
Figure 1. Experiment flow chart
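The nested iteration of Figure 1 can be sketched as a loop nest. The names below (ExperimentLoop, faultInjectionRun, runExperiment) are illustrative only and are not part of the DTS code:

```java
// Sketch of the Figure 1 iteration structure. Each innermost call stands for
// one fault injection run: start the workload, inject one fault, terminate.
public class ExperimentLoop {
    static void faultInjectionRun(int w, int f, int p, int i, int t) {
        // In DTS: create fault param file, prepare workload programs,
        // start server (fault is injected), wait for server to be up,
        // start client, terminate workload, gather results.
    }

    // Returns the number of runs performed: W workloads x F functions
    // x P parameters x I iterations x T fault types (DTS skips some
    // runs in practice when a fault would not be activated).
    static int runExperiment(int W, int F, int P, int I, int T) {
        int runs = 0;
        for (int w = 0; w < W; w++)
            for (int f = 0; f < F; f++)
                for (int p = 0; p < P; p++)
                    for (int i = 0; i < I; i++)
                        for (int t = 0; t < T; t++) {
                            faultInjectionRun(w, f, p, i, t);
                            runs++;
                        }
        return runs;
    }
}
```

For example, one workload and one function with two parameters, one iteration, and three fault types gives runExperiment(1, 1, 2, 1, 3) == 6, matching the two-parameter example in Section 4.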
4. Experimental results

To demonstrate the utility of the DTS tool, several experiments were performed. The server programs studied were (1) Apache web server version 1.3.3 for Win32, (2) Microsoft Internet Information Server (IIS) version 3.0, and (3) Microsoft SQL Server version 7. Although IIS can serve as an HTTP server, an FTP server, and a gopher server, only the HTTP functionality was tested in these experiments. These programs were executed as NT services in three different configurations: (1) as a stand-alone service, (2) with Microsoft Cluster Server (MSCS), and (3) with the watchd component of NT-SwiFT. All experiments were conducted on the same machines. The hardware platform was a 100 MHz Pentium PC with 48 MB of memory running Windows NT Enterprise Server 4.0 with Service Pack 4. Additional experiments were conducted on a faster 400
MHz Pentium II PC with 128 MB of memory running Windows NT Enterprise Server 4.0 with Service Pack 4. Only the results for the slower 100 MHz Pentium machine are presented here because the faster machine was not yet equipped with MSCS in our lab. However, on the faster machine, the results for Apache, IIS, and SQL Server as stand-alone services and with watchd were essentially identical to those on the slower machine.

For each server program, a simple client program was created to send requests to the server program. For the Apache and IIS web servers, the HttpClient program sends two types of requests: (1) an HTTP request for a 115 kB static HTML file and (2) an HTTP request for a 1 kB static HTML file via the Common Gateway Interface (CGI). For the SQL Server, the SqlClient program sends an SQL select request based on a single table. Both HttpClient and SqlClient check the correctness of the server reply. If the reply is incorrect or if the reply is not received within a timeout period (a default of 15 seconds), the request is retried. A second retry is attempted if necessary. Each client program waits 15 seconds before attempting a retry. After a correct reply is received or the third attempt fails, the client program outputs information about the success or failure of the requests and the number of retries attempted.

For the NT programs, faults were injected by intercepting all calls to the functions in KERNEL32.dll. On our machine, KERNEL32.dll contains 681 functions. Of those 681 functions, 130 functions had no parameters and thus were not candidates for function parameter corruption. The remaining 551 functions were injected. To decrease the total time for the experiments, only the first invocation of each function was injected (i.e., the CreateEventA() function is injected the first time it is called, but not the second or subsequent times).
Further invocations can also be injected, but preliminary experiments showed that such injections produced similar results. For each function, each parameter was injected with three types of faults: (1) reset all bits to zero, (2) set all bits to one, and (3) flip all bits (i.e., one's complement of the parameter value). Thus, for a function with two parameters, 6 different faults will be injected (2 parameters with 3 fault types each). Only one fault is injected for each execution of the server program. Although these types of corruption may seem simplistic, they were already effective in differentiating among different workloads (e.g., MSCS vs. watchd) and in helping to discover bugs that lead to failure scenarios. It may be interesting to introduce additional types of corruption based on data types (e.g., treating pointers and Boolean variables differently). However, this requires symbolic information and is compiler dependent, thus affecting the portability of
the fault injection method. Three server programs were studied in the experiments: the Apache web server, the Microsoft IIS web server, and the Microsoft SQL Server. Each was executed as an NT service (1) with no fault-tolerance middleware, (2) with MSCS, and (3) with watchd. It should be noted that the outcome of these experiments is dependent on the workload (especially the requests issued by the client and the configuration of the application) and the specific faults that are injected. A particular server program will not necessarily call all functions in a DLL. In fact, the majority of functions in KERNEL32.dll are not called. Table 1 shows the number of activated functions for each workload. See Section 4.1 for an explanation of Apache1 and Apache2. To shorten the total time for the experiments, if an injected function is not called, all other injections for that function will be skipped because it is assumed that the function will also not be called if the server program is rerun for the next fault.
Table 1. Number of called KERNEL32.dll functions per workload

                   Fault-Tolerance Middleware
Server Program     None    MSCS    watchd
Apache1            13      17      13
Apache2            22      24      22
IIS                76      76      70
SQL                71      74      70
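The three parameter-corruption fault types described above (zero all bits, set all bits, flip all bits) can be sketched for a 32-bit parameter value. The class and method names here are illustrative only, not the actual DTS code:

```java
// Sketch of the three DTS parameter-corruption fault types applied to a
// 32-bit parameter value. Names are hypothetical.
public class ParamFaults {
    static int zeroAllBits(int v) { return 0; }           // fault type 1
    static int setAllBits(int v)  { return 0xFFFFFFFF; }  // fault type 2: all ones
    static int flipAllBits(int v) { return ~v; }          // fault type 3: one's complement
}
```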
4.1. Comparison of fault tolerance middleware packages

Figure 2 shows NT results for comparisons of the Apache web server, the IIS web server, and the SQL Server as stand-alone NT services, with MSCS, and with watchd. For the Apache web server, the NT service consists of multiple processes. The Apache web server was specifically configured to start only two processes for the purposes of these experiments. The first process is a management process that spawns child processes that actually service requests. By default, Apache spawns multiple child processes. Since the tool only targets one process for injection, if one of the other child processes picks up the request, then injected faults may not be activated in a reproducible manner. Configuring Apache for only one child process guarantees that the same child process will pick up the request each time, thus ensuring reproducible results. Two sets of results are given for injections into the Apache web server, one set for injections into the first process (labeled as "Apache1" in this paper) and a second set for injection into the child process (labeled
as "Apache2"). IIS and SQL Server both consist of a single process and are labeled as "IIS" and "SQL", respectively.

Figure 2. Standalone/MSCS/watchd comparisons for Windows NT

Figure 2 shows NT results for Apache1, Apache2, IIS, and SQL. Each chart shows the results for one workload as a stand-alone service, with MSCS, and with watchd. The normalized outcomes of the workload sets are displayed graphically in the charts and numerically below the charts. The possible outcomes are the five outcomes described in Section 3. Each outcome is given as a percentage of the total number of activated faults for that particular workload set. It should be noted that different workload sets, even for the same server program, can produce a different number of activated faults, due to the effect of the fault tolerance middleware and the influence of non-determinism inherent in the server programs. However, these effects do not change the conclusions that can be drawn from the data. The faults injected into the extra functions that are called by each server program due to the fault tolerance middleware all result in normal success outcomes,
and only one function exhibited non-deterministic behavior: zeroing out all bits in the nNumberOfBytesToRead parameter for ReadFileEx() for SQL Server with the original version of watchd sometimes caused a detected error and sometimes caused a successful restart.

Several interesting observations can be made from Figure 2. First, perhaps the most important and obvious observation is that both MSCS and watchd are effective in increasing the reliability of all three server programs. The solid black portions of the figures represent the fault injection runs that resulted in failures, i.e., cases where the server program was not able to produce the correct response even after repeated client request retries. The failure percentages for all server programs decreased markedly when MSCS or watchd was used. In fact, for Apache1, all failure outcomes were eliminated using watchd. The effectiveness of MSCS and watchd in reducing the number of failures is attributable to their ability to detect situations in which the monitored server program is malfunctioning and then to initiate a recovery action, which entails a server program restart for these experiments. Discounting the effects of non-determinism and additional activated faults caused by using MSCS and watchd, the numbers of normal success and request retry with success outcomes remain essentially the same for each server program. The difference is reflected in the portion of failure outcomes that become success with restart outcomes due to the MSCS and watchd restart mechanisms.

Figure 2 also reveals the effectiveness of the Apache architecture in handling faults. The Apache web server consists of multiple processes. The first process (Apache1) functions as a management process. Its duties include spawning the additional processes (Apache2) that actually service incoming web requests. The first process does not service any web requests itself. If one of the Apache2 processes dies, the Apache1 process will spawn another Apache2 process. This failure detection and restart mechanism within Apache is similar to that for MSCS and watchd. For this reason, MSCS and watchd are effective with the Apache1 process but have no effect on the Apache2 process. The reason for this lack of efficacy is that both MSCS and watchd only monitor the first process that is started for any application. Thus, the child processes that are spawned by the first process are not monitored. Because the Apache1 process does not service any web requests, request retries produce no additional success outcomes, as seen in Figure 2. In addition, because the Apache2 process is not monitored by MSCS or watchd, no restarts initiated by MSCS or watchd occur. However, restarts of the Apache2 process by the Apache1 process do occur and are manifested as normal success and request retry with success outcomes.

Figure 2 shows that while both MSCS and watchd decrease the number of failure outcomes, watchd does a much better job for the fault set used.
In fairness to MSCS, only the generic service resource monitor is used. A custom service resource monitor that is specially tailored to interact with and monitor all aspects of the IIS and SQL Server programs would probably improve the MSCS results. However, Microsoft only provides an API for creating the custom resource monitors and not the actual custom resource monitors. Thus, the comparison between MSCS and watchd is based on the default MSCS and watchd packages.
4.2. Comparison of applications with similar functionality

From the experimental data, some interesting observations about the relative reliability and performance characteristics of Apache and IIS can be made. Figure 3 shows the outcomes of the fault injection runs for Apache and
IIS as stand-alone services, with MSCS, and with watchd. The Apache results are a combination of the Apache1 and Apache2 results because both Apache processes must be considered in a comparison to IIS, which includes its total functionality in a single process. The Apache1 and Apache2 results are weighted based on the relative number of activated faults for each process. Figure 3 shows that the Apache web server exhibits a lower percentage of failure outcomes than IIS as a stand-alone service, with MSCS, and with watchd. As a stand-alone service and with MSCS, the occurrence of failure outcomes for IIS is twice that for Apache. However, if watchd is used, then the difference is not as great (7.60% vs. 5.80%) because far fewer faults result in failure with watchd.

Table 1 shows that many more functions are activated for IIS than for Apache. To view Apache and IIS on a more common basis, Table 2 compares Apache to IIS counting only faults that were activated for both programs. Fewer faults were activated for the Apache1 process because the Apache2 process provides most of the web serving functionality. The third row of data shows the Apache1 and Apache2 outcomes added together. As with Figure 3, Apache exhibits fewer failures than IIS as a stand-alone service, with MSCS, and with watchd. However, the difference is even more pronounced (e.g., 5.7% vs. 26.0% failures for Apache vs. IIS as stand-alone services, compared to 20.58% vs. 41.90% in Figure 3).

It is often useful to consider performance in the presence of faults. Figure 4 shows the average response times for Apache and IIS as stand-alone services, with MSCS, and with watchd. The response times are grouped based on the outcomes of the fault injection runs. The outcome types are the same as those in Figures 2 and 3, with one exception.
Failure outcomes are further subdivided into two outcomes: (1) failures where a response is received from the server program, but the response is incorrect, and (2) failures where no response is received. Obviously, if no response is received, the response time is infinite, and therefore these faults are omitted from Figure 4. In some cases, no occurrences of a particular outcome exist. The response times are in seconds and are given with corresponding 95% confidence intervals (shown as error bars in the figure). Some observations about the relative performance of Apache and IIS can be drawn from Figure 4. First, there is no appreciable difference in the performance overhead due to the use of MSCS or watchd in either application. Second, for faults that result in normal success outcomes, Apache is faster, especially when MSCS is used. The normal success outcome average response times for Apache and IIS as stand-alone services (14.21 vs. 18.94 seconds) are essentially the same as the corresponding average response times for Apache and IIS when no faults are injected. Third, the average response times associated with applica-
Figure 3. Comparison of Apache to IIS
Figure 4. Average response times for Apache and IIS (with 95% confidence intervals)
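The paper does not state how the 95% confidence intervals in Figure 4 were computed; a common choice is the normal approximation, mean ± 1.96·s/√n, sketched here under that assumption with hypothetical names:

```java
// Sketch (assumed method, not from the paper): mean response time with a
// normal-approximation 95% confidence interval. Requires at least 2 samples.
public class ConfInterval {
    // Returns { lower bound, mean, upper bound }.
    static double[] mean95(double[] x) {
        int n = x.length;
        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;
        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double stddev = Math.sqrt(ss / (n - 1));       // sample standard deviation
        double half = 1.96 * stddev / Math.sqrt(n);    // half-width of the interval
        return new double[] { mean - half, mean, mean + half };
    }
}
```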
Table 2. Comparison of Apache to IIS counting only common faults

                                        Fault-Tolerance Middleware
                 Stand-alone service                 With MSCS                           With watchd
Server Program   Failure Restart Retry  Activated    Failure Restart Retry  Activated    Failure Restart Retry  Activated
Apache1          20.0%   0%      0%     30           8.3%    8.3%    0%     36           0%      20.0%   0%     30
Apache2          1.8%    0%      33.3%  111          2.5%    0%      30.8%  120          1.8%    0%      33.3%  111
Apache1+Apache2  5.7%    0%      26.2%  141          3.8%    1.9%    23.7%  156          1.4%    0%      26.2%  141
IIS              26.0%   0%      33.3%  123          9.6%    11.1%   40.0%  135          12.2%   22.0%   43.1%  123
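The Apache1+Apache2 rows in Table 2 are the activation-weighted combinations of the per-process rows: each percentage is weighted by the number of activated faults for that process. A minimal sketch of the arithmetic, with hypothetical names:

```java
// Sketch: combine two per-process outcome percentages, weighting each by
// its number of activated faults. Names are illustrative.
public class WeightedCombine {
    static double combine(int n1, double pct1, int n2, double pct2) {
        return (n1 * pct1 + n2 * pct2) / (n1 + n2);
    }
}
```

For the stand-alone failure column, combine(30, 20.0, 111, 1.8) evaluates to approximately 5.67, consistent with the 5.7% shown for Apache1+Apache2 in Table 2.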
tion restarts are lower for IIS than for Apache. Much of this discrepancy is due to the way that Apache seems to handle some problems during service startup. For some faults, the Apache1 process dies immediately after being started by the Windows NT Service Control Manager (SCM). However, the SCM assumes that the service is in the “Start Pending” state. When any service is in a pending state, the SCM locks its database, which causes any state change requests to the SCM to be denied. Thus, both MSCS and watchd must wait until the “Start Pending” state times out before initiating a restart of the service. Although both Apache and IIS experience this scenario, for Apache the number of occurrences is greater and the wait time for each occurrence before the pending state ends is greater. The main lessons drawn from Figure 4 are (1) both MSCS and watchd are comparable in impacting performance and (2) the application being monitored can affect how quickly the fault tolerance middleware is able to recover from detected problems.
4.3. Fault tolerance middleware improvements

In addition to comparing fault tolerance middleware, the DTS tool also plays an important role in the identification of fault tolerance middleware weaknesses by suggesting ways in which the failure coverage of the fault tolerance middleware can be improved. All outcomes for individual fault injection runs are recorded. Thus, the specific faults that result in failure can be studied to determine the reason for the hole in the failure coverage. This testing and debugging procedure is much more effective with the use of the DTS fault injection tool. Fault injection is necessary to produce the more esoteric problems caused by the combination of faults with such factors as unexpected input or interactions between threads or processes or with the environment. These problems can be especially potent during non-steady state periods of operation, such as process initialization or termination or during periods of stress. The results from the initial experiment involving
watchd were studied to improve the original version of watchd (Watchd1) and to create an improved version (Watchd2). Watchd1 starts monitored processes by calling a startService() function that communicates with the SCM to start the service process. In order to monitor the newly created process, watchd obtains the handle of the new process by calling the getServiceInfo() function. For operation in the absence of faults, calling startService() followed by getServiceInfo() worked well. However, some faults caused the service process to fail after startService() was called and before getServiceInfo() was called. This small window of opportunity was sufficient to prevent watchd from correctly obtaining the necessary process handle, and therefore the failed process could not be monitored and restarted. The Watchd2 version merged the functionality of getServiceInfo() into startService(). Figure 5 shows the results of using Watchd1 and Watchd2 with Apache1, IIS, and SQL. The results for Apache2 are not shown because watchd has no effect on the outcomes for Apache2, as discussed earlier in Section 4.1. As seen in Figure 5, the Watchd2 improvements had mixed success. The failure outcomes for Apache1 actually increased, while no change was seen for SQL. Only IIS with Watchd2 showed an improvement in the results, with a dramatic decrease in the percentage of failure outcomes. A second iteration of studying the Watchd2 data resulted in the creation of another improved version (Watchd3). The Watchd2 version combined the tasks of starting the service process and obtaining a process handle to the new process in a single startService() function. This decreased the time window of opportunity for the new process to fail in between the two tasks. However, the opportunity for the new process to fail still existed. To address this problem, the Watchd3 version explicitly checks for a valid process handle before returning from the startService() function. 
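The race described above can be illustrated with a minimal sketch. The classes and function names below are hypothetical stand-ins for the watchd internals, not its actual code: the two-step Watchd1 sequence leaves a window in which the new process can die before its handle is captured, while the merged Watchd2 call narrows (though, as noted, does not eliminate) that window.

```python
# Hypothetical model of the startService()/getServiceInfo() race window.

class ServiceProcess:
    def __init__(self):
        self.alive = True
    def crash(self):
        self.alive = False

def start_service():
    """Watchd1 style: only starts the process; no handle is captured."""
    return ServiceProcess()

def get_service_info(proc):
    """Returns a process handle, or None if the process already died."""
    return proc if proc.alive else None

def start_service_merged():
    """Watchd2 style: start the process and capture its handle in one
    step, closing most of the window between the two tasks."""
    proc = ServiceProcess()
    handle = get_service_info(proc)
    return proc, handle

# Watchd1: a fault striking inside the window loses the handle,
# so the failed process can never be monitored or restarted.
proc = start_service()
proc.crash()                            # fault strikes inside the window
assert get_service_info(proc) is None

# Watchd2: the handle is captured before the later fault strikes,
# so the monitor can still detect the failure and restart the process.
proc, handle = start_service_merged()
proc.crash()
assert handle is not None
```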
If the process handle is not valid, then a new attempt to start the service process occurs. The check for the valid process handle is further augmented by communication with the SCM to ensure that the service is properly started.

Figure 5. Comparison of original to improved watchd

These changes dramatically improved the results for Apache1 and SQL, as shown in Figure 5. The results for IIS were unchanged compared to the results with Watchd2; however, a dramatic improvement had already been obtained with the Watchd2 improvements. It should be noted that the chart in Figure 2 includes the results using Watchd3. For Apache1, IIS, and SQL, the results with Watchd1 were all slightly worse than with MSCS, whereas the results with Watchd3 were all much better than with MSCS. The conclusion that can be drawn is that the iterative improvements using the DTS tool helped watchd in a significant way.
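The Watchd3 fix can be sketched as a validate-and-retry loop. The fault is modeled here as a coin flip that kills some new processes immediately after startup; the names, the failure probability, and the retry limit are all illustrative, not watchd's actual implementation.

```python
import random

random.seed(1)  # deterministic for the example

class ServiceProcess:
    def __init__(self):
        # Model a fault that kills roughly half of all newly started
        # processes immediately after startup (illustrative rate only).
        self.alive = random.random() > 0.5

def start_service_validated(max_attempts=10):
    """Watchd3 style: do not return from the start routine until a
    valid process handle has been obtained, confirming (here via the
    alive check, in watchd via the SCM) that the service is running."""
    for _ in range(max_attempts):
        proc = ServiceProcess()
        if proc.alive:
            return proc  # handle is valid; service actually started
        # Invalid handle: the new process already died, so try again.
    raise RuntimeError("service failed to start after retries")

proc = start_service_validated()
assert proc.alive
```

The design point is that the retry loop converts a silent loss of monitoring (the Watchd1/Watchd2 failure mode) into either a successful, monitored start or an explicit error.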
5. Conclusion
This paper described the architecture and use of the DTS fault injection tool. Experiments with the tool demonstrated the usefulness of the tool in several ways. First, the most practical use of the tool is in the system validation phase of testing. Individual fault injection runs can be used to provide reproducible feedback for improving the target system. The improvement may target the server program, the fault tolerance middleware, or the operating system. The DTS architecture facilitates the testing of different applications, middleware, and systems. This paper
showed the dramatic fault coverage improvement gained for the watchd middleware. Similar improvements are also possible for other fault tolerance middleware, such as MSCS, or for server programs or the operating system. The essential contribution of fault injection is the triggering of scenarios that would not normally be encountered in the course of conventional functional testing. Second, the results of DTS experiments can be used as a starting point for comparing the reliability of applications on Windows NT. Certainly, attention has to be given to the selection of the experimental fault and workload sets. Nonetheless, the DTS tool is useful as a test bed for performing fault injection-based evaluation of specific systems, although care has to be taken in generalizing conclusions about the intrinsic reliability of a particular application, operating system, or fault tolerance middleware. Experiments using the DTS tool with several server programs and fault tolerance middleware packages on a Windows NT platform were performed. The results indicate that both MSCS and watchd are useful for increasing the failure coverage of the system (defined as one minus the fraction of failure outcomes). In particular, the improved watchd exhibited high failure coverage (greater than 90%) for all tested server programs, higher than that of MSCS. The Apache and IIS server programs were both targeted for testing to demonstrate the use of the DTS tool in comparing
server programs with similar functionality. Both reliability and performance results were obtained. The Apache web server exhibited greater reliability and better performance for situations where no application restart or client request retry was required. However, when a restart was necessary, IIS recovered much faster. The current work has been performed on a Windows NT platform. The DTS tool has already been ported to the Linux platform with minimal effort; only the system-dependent Java Native Interface components needed to be rewritten. Preliminary results have been obtained by testing Apache on Linux with and without watchd. Work is ongoing to determine appropriate fault and workload sets that will allow the Linux results to be compared to the Windows NT results; these sets must be described in a system-independent way that can be applied to both types of systems. The DTS architecture has been designed around Java plugin classes that support different fault injection mechanisms, workloads, and data collection strategies. See the user’s manual [16] for implementation details. Another interesting possible application of DTS is availability modeling. Most commercial systems that are concerned with reliability are described using availability numbers. Usually availability is expressed in orders of magnitude (i.e., number of nine’s of availability). This lack of precision is a result of the lack of tools to measure the availability of a system directly. The state of the art is to combine human experience with analytical models to yield estimates of availability. The DTS tool may play a role in providing testing-based parameters as input to analytical models, which would then be able to yield more precise estimates. This might provide the basis for work in developing availability benchmarks. The DTS tool is available for download at http://www.bell-labs.com/projects/swift/ntdts.
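As a worked example of the “number of nines” convention mentioned above (a generic calculation, not part of DTS): availability 0.999 is “three nines”, and each additional nine cuts the permitted downtime by a factor of ten.

```python
import math

def nines(availability):
    """Number of nines: -log10 of the unavailability."""
    return -math.log10(1.0 - availability)

def downtime_minutes_per_year(availability):
    """Annual downtime implied by a given availability."""
    return (1.0 - availability) * 365 * 24 * 60

for a in (0.99, 0.999, 0.9999):
    print(f"{a}: {nines(a):.0f} nines, "
          f"{downtime_minutes_per_year(a):.0f} min/year downtime")
```

Such coarse, order-of-magnitude figures are exactly the imprecision the text refers to; testing-based parameters from DTS could feed analytical models that resolve availability more finely than whole nines.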
6. Acknowledgments
The authors gratefully acknowledge the design and development effort of Chris Dingman and Michael Vogel, as well as suggestions and feedback from Chandra Kintala. The authors also recognize the role of the reviewers of this paper in providing invaluable comments and suggestions.
References
[1] J. H. Barton et al. Fault injection experiments using FIAT. IEEE Transactions on Computers, 39(4):575–582, Apr. 1990.
[2] J. Carreira, H. Madeira, and J. G. Silva. Xception: Software fault injection and monitoring in processor functional units. In Proceedings 5th International Working Conference on Dependable Computing for Critical Applications, pages 135–149, Urbana, IL, Sept. 1995.
[3] R. A. DeMillo, D. S. Guindi, K. N. King, W. M. McCracken, and A. J. Offutt. An extended overview of the Mothra software testing environment. In Proceedings of the 2nd Workshop on Software Testing, Verification, and Analysis, pages 142–151, Banff, Alberta, July 1988.
[4] A. K. Ghosh and M. Schmid. Wrapping Windows NT binary executables for failure simulation. In Proceedings Fast Abstracts and Industrial Practices, 9th International Symposium on Software Reliability Engineering (ISSRE’98), pages 7–8, Paderborn, Germany, Nov. 1998.
[5] S. Han, K. G. Shin, and H. A. Rosenberg. DOCTOR: An integrated software fault injection environment for distributed real-time systems. In International Computer Performance and Dependability Symposium, pages 204–213, Apr. 1995.
[6] R. K. Iyer and D. Tang. Experimental analysis of computer system dependability. In D. K. Pradhan, editor, Fault-Tolerant Computer System Design, chapter 5, pages 282–392. Prentice Hall PTR, Upper Saddle River, NJ, 1996.
[7] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. FERRARI: A tool for the validation of system dependability properties. In Proceedings 22nd International Symposium on Fault-Tolerant Computing, pages 336–344, Boston, Massachusetts, July 1992.
[8] W.-L. Kao and R. K. Iyer. DEFINE: A distributed fault injection and monitoring environment. In Proceedings of IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994.
[9] N. P. Kropp, P. J. Koopman, and D. P. Siewiorek. Automated robustness testing of off-the-shelf software components. In Proceedings 28th International Symposium on Fault-Tolerant Computing (FTCS-28), pages 231–239, Munich, Germany, June 1998.
[10] Microsoft Windows NT clusters. White Paper, 1997. Microsoft Corporation.
[11] B. P. Miller, D. Koski, C. P. Lee, V. Maganty, R. Murthy, A. Natarajan, and J. Steidl. Fuzz revisited: A re-examination of the reliability of UNIX utilities and services. Technical Report CS-TR-1995-1268, University of Wisconsin, Madison, Apr. 1995.
[12] M. Rodríguez, F. Salles, J.-C. Fabre, and J. Arlat. MAFALDA: Microkernel assessment by fault injection and design aid. In Proceedings 3rd European Dependable Computing Conference (EDCC-3), pages 143–160, Prague, Czech Republic, June 1999. Springer, LNCS 1667.
[13] SwiFT: Software implemented fault tolerance for Windows NT. http://www.bell-labs.com/projects/swift.
[14] T. Tsai and N. Singh. Reliability testing of applications on Windows NT. Technical memorandum, Lucent Technologies, Bell Labs, Murray Hill, NJ, USA, May 1999.
[15] T. K. Tsai and R. K. Iyer. An approach to benchmarking of fault-tolerant commercial systems. In Proceedings 26th International Symposium on Fault-Tolerant Computing, pages 314–323, Sendai, Japan, June 1996.
[16] T. K. Tsai and N. Singh. ntDTS User’s Manual. Lucent Technologies, Bell Labs, Murray Hill, NJ, USA, 2000. http://www.bell-labs.com/project/swift/ntdts.
[17] J. M. Voas and G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. John Wiley & Sons, Inc., New York, 1998.