QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL Qual. Reliab. Engng. Int. 2008; 24:447–465 Published online 1 April 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/qre.917
Research
Computer Systems Availability Evaluation Using a Segregated Failures Model
Sergiy A. Vilkomir1,∗,†, David L. Parnas2, Veena B. Mendiratta3 and Eamonn Murphy4
1 Software Quality Research Laboratory (SQRL), Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, U.S.A.
2 Software Quality Research Laboratory (SQRL), Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland
3 Chief Technology Office, Alcatel-Lucent, Naperville, IL, U.S.A.
4 Department of Mathematics and Statistics, University of Limerick, Limerick, Ireland
This paper presents the segregated failures model (SFM) of availability of fault-tolerant computer systems with several recovery procedures. This model is compared with a Markov chain model and its advantages are explained. The basic model is then extended to the situation where the coverage factor is unknown and the failure escalation rates must be used instead. A simple practical analytical approach to availability evaluation is provided and illustrated in detail by estimating the availability of two versions of a reliable clustered computing architecture. For these examples, numeric values of availability indexes are computed and the contribution of each recovery procedure to total system availability is analysed. Copyright © 2008 John Wiley & Sons, Ltd.

Received 23 November 2006; Revised 28 January 2008; Accepted 31 January 2008

KEY WORDS: software; fault tolerance; availability; reliability; recovery; fault model

1. INTRODUCTION

Usually, when a computer system fails, a variety of recovery procedures are available. Some procedures are expensive, whereas others are cheap; some result in loss of data, whereas others minimize loss of data; some provide full service after restoration, whereas others provide reduced service. In this paper we look only at the characteristics important for system availability evaluation: time and applicability to various classes of failures. Thus, what matters to us is that some recovery procedures are fast but apply only to a limited class of failures, whereas other procedures are slow but restore service in more cases. The main reason for using several recovery procedures is to reduce restoration (recovery) time. A recovery strategy involves applying recovery procedures in a specific order until recovery is successful. In other words, a recovery strategy is divided into several levels, with one recovery procedure assigned to each level.
∗ Correspondence to: Sergiy A. Vilkomir, Software Quality Research Laboratory (SQRL), Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, U.S.A.
† E-mail: [email protected]
When a failure has occurred, the procedure with the shortest recovery time, for example, switching to a waiting redundant computer (hot spare), is usually applied first (level 1). If the first recovery attempt is not successful, the level 2 procedure, for example, a computer restart, is applied, and so on. We assume that the highest-level procedure always guarantees a recovery. This way of defining recovery levels is close to that of [1] but is more general because it does not specify the content of the procedures.

The first recovery procedures are usually automatic. The final procedure is usually a manual repair. In this paper, we assume that the order of procedures is fixed, i.e. the cause of the failure is not examined, except for specific hardware failures for which it can be determined that further use of automatic recovery is not expedient. In this case, the intermediate levels are skipped and the highest-level procedure (the manual repair) is applied.

Several recovery procedures are often used in telecommunication systems. A specific example of practical use is reliable clustered computing (RCC) [2]. The RCC methodology provides an implementation of various fault-tolerance recovery strategies to achieve high availability of commercial non-fault-tolerant systems. We use this system to illustrate our new approach (see Sections 3 and 4).

Other application areas of this recovery method are database systems and operating systems. Thus, a database system with three-level recovery could, for example, use the following recovery techniques [3]:
• Built-in redundant pointers in data structures, to be able to recover from certain types of failures.
• Backup copies of parts of the data structures.
• A complete backup copy of the database on a separate device.

An example of an operating system with several recovery procedures is Sprite, a distributed UNIX-compatible operating system.
The following two-level recovery mechanism is used [4]:
• The system first tries to recover quickly from backup data that it stored in the main memory.
• If this fast recovery fails, the system returns to the traditional disk-based hard reboot.

For availability evaluation of such systems, Markov chains [5–8], Matrix-Geometric solutions [9], and Petri nets [10] have been used. These are powerful mathematical methods. However, analysis using these methods can be quite complicated and often requires special software tools. When such tools are used, calculations that require careful scrutiny are hidden from users. Both the models and the algorithms inside the tools are often based on implicit assumptions; if these assumptions are not valid for the actual application, the results are of doubtful value.

In this paper, we propose a new model and a simple analytical approach to availability. The hardware/software failures are divided into several types, and the availability of the system is calculated separately for each type of failure. This model makes calculations more understandable for users and allows the impact of each type of failure on the availability of the whole system to be determined.

This paper is partially based on an extension and refinement of our earlier investigations [11,12]. In Section 2, we consider the SFM and present a mathematical approach that allows us to calculate the availability of a hardware/software system with several recovery procedures. In Sections 3 and 4, examples of the application of the proposed approach to two different versions of the RCC product are considered. Availability of these products was previously analysed in [6,7] using a Markov chain model. In this paper, we illustrate our approach using the same applications and the same input data. Detailed results of the numerical availability evaluation are provided and the impact of (1) various types of failures and (2) coverage factors on the system down time is analysed. General conclusions are presented in Section 5.
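The multi-level recovery strategy described in this section can be illustrated with a minimal Python sketch. It is not taken from the paper or from the RCC product: the names `RecoveryLevel`, `recover` and `skip_to_last`, the failure labels, and all durations are hypothetical and chosen only for illustration. The sketch assumes that every attempt costs its full duration whether or not it succeeds, that the last level always succeeds, and that a diagnosed failure may be escalated directly to the last level.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RecoveryLevel:
    name: str
    duration: float                  # time spent on one attempt, successful or not
    attempt: Callable[[str], bool]   # hypothetical: True if this procedure recovers the failure


def recover(failure: str,
            levels: List[RecoveryLevel],
            skip_to_last: Callable[[str], bool] = lambda f: False) -> Tuple[int, float]:
    """Apply the levels in a fixed order; return (successful level, total downtime)."""
    if not levels:
        raise ValueError("at least one recovery level is required")
    last = len(levels)
    level, downtime = 1, 0.0
    while True:
        proc = levels[level - 1]
        downtime += proc.duration            # the attempt time is always incurred
        # Assumption from the text: the highest-level procedure always succeeds.
        if level == last or proc.attempt(failure):
            return level, downtime
        # Escalate: normally to the next level, or straight to the last level
        # when further automatic recovery is diagnosed as not expedient.
        level = last if skip_to_last(failure) else level + 1


# Illustrative three-level strategy (all values are made up).
levels = [
    RecoveryLevel("switch to hot spare", 0.5, lambda f: f == "transient"),
    RecoveryLevel("computer restart", 5.0, lambda f: f in ("transient", "software")),
    RecoveryLevel("manual repair", 120.0, lambda f: True),
]

print(recover("software", levels))                                   # served at levels 1 and 2
print(recover("disk", levels, skip_to_last=lambda f: f == "disk"))   # level 2 is skipped
```

In this sketch the downtime of a failure is the sum of the durations of all attempts made for it, which is why a fast but narrowly applicable first level can still reduce the overall restoration time.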
2. SFM OF A SYSTEM WITH SEVERAL RECOVERY PROCEDURES

2.1. Comparison with traditional systems with one recovery procedure
The difference between a traditional system with one recovery procedure and a system with several recovery procedures is illustrated in Figure 1. As usual, it is assumed that the system can be in only two states: a normal (working) state NS and a failed state FS. A transition from the normal state to the failed state is described by the failure rate $\lambda$. For the traditional system (Figure 1(a)), the transition from the failed state to the working state is characterized by a restoration rate $\mu$ or, equivalently, a mean restoration time $\tau = 1/\mu$. It is assumed that only one recovery procedure exists and that the result of recovery is always successful. In these circumstances, the term restoration refers to the process of restoration as well as its result.

Figure 1. System with one recovery procedure (a) and with several recovery procedures (b)

For the system in Figure 1(b), $n$ ($n > 1$) different recovery procedures exist. When a failure occurs, recovery procedures are usually applied sequentially starting from level 1. For every procedure except the $n$th, the result of the recovery can be either successful or unsuccessful. If the recovery procedure at level 1 is unsuccessful, the level 2 procedure is applied, i.e. the failure is escalated from level 1 to level 2, etc. It is assumed that level $n$ recovery is always successful. Similar to a traditional system, every recovery level $i$ is described by a restoration rate $\mu_i$ or a mean restoration time $\tau_i = 1/\mu_i$. However, the meaning of these indexes is slightly different here. Because the time here is the time required for the attempt, whether or not it succeeds, the term restoration refers here only to the process of restoration, not to its result (successful or unsuccessful).

It is important to note that the use of several recovery procedures is different from the recovery block approach (one of the fault-tolerance software techniques [13,14]) despite some similarities between them. The recovery block approach also uses several procedures (alternates). However, all alternates perform the same desired operation and are executed sequentially until the operation performance is accepted. Thus, in systems with several recovery procedures, these procedures are used for restoration, i.e. the procedures restore service so that the normal programs can be run. In the recovery block approach, the procedures are used to perform some system functionalities (operations), i.e.
the alternates are used instead of the normal code. Markov chains have been mainly used to model recovery blocks [15–17].

2.2. Segregated failures model
Let $F$ be the complete set of all possible failures of the system. As mentioned above, the result of recovery from a failure can be either successful or unsuccessful. We say that a failure is served at level $i$ if the level $i$ procedure is applied (with two possible results) to this failure. This means that recovery at previous levels has been unsuccessful. Let $F_i$ be the set of failures that are served at level $i$, $F_i \subseteq F$. Now consider only those failures for which the result of recovery at level $i$ is successful.

Definition 1. A failure $f$ is said to be a failure of type $i$ if and only if $i$ is the lowest level where this failure is successfully served. Denote the set of such failures as $F_{\mathrm{type}_i}$.

It follows from Definition 1 that $F_{\mathrm{type}_i} \subseteq F_i$ and that the sets $F_{\mathrm{type}_i}$ partition $F$, i.e.

$$F = \bigcup_{i=1}^{n} F_{\mathrm{type}_i} \tag{1}$$

and

$$\forall i, j : 1 \le i, j \le n,\ i \ne j, \quad F_{\mathrm{type}_i} \cap F_{\mathrm{type}_j} = \emptyset \tag{2}$$
The ability of a recovery procedure to successfully restore normal operation after a failure is often described by a coverage factor. Adapting it for our model, consider the following definition.

Definition 2. A coverage factor $p_{\mathrm{rec},i}$ of recovery level $i$ is the conditional probability that a failure is successfully served at level $i$ given that this failure is served at level $i$. More formally,

$$p_{\mathrm{rec},i} = P(f \in F_{\mathrm{type}_i} \mid f \in F_i) \tag{3}$$
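To make Definitions 1 and 2 concrete, the following Python sketch estimates coverage factors from a failure log. This is only an illustration under stated assumptions, not the estimation procedure of the paper: the log format (for each failure, the ordered list of levels at which it was served, the last of which succeeded) and the function name `estimate_coverage` are hypothetical, and the empirical ratio of counts is used as a simple estimate of the conditional probability in (3).

```python
from typing import Dict, List

# Hypothetical log entry: the levels at which one failure was served, in order.
# The last entry is the level that succeeded, i.e. the failure type (Definition 1).
FailureRecord = List[int]


def estimate_coverage(records: List[FailureRecord], n: int) -> Dict[int, float]:
    """Estimate the coverage factor of each level as |Ftype_i| / |F_i|."""
    served = {i: 0 for i in range(1, n + 1)}      # |F_i|: failures served at level i
    succeeded = {i: 0 for i in range(1, n + 1)}   # |Ftype_i|: failures of type i
    for levels in records:
        for i in levels:
            served[i] += 1
        succeeded[levels[-1]] += 1
    # Levels that never served a failure have no estimate and are omitted.
    return {i: succeeded[i] / served[i] for i in served if served[i] > 0}


# Illustrative log for n = 3 (made-up data): ten failures in total.
records = (
    [[1]] * 6          # six failures of type 1
    + [[1, 2]] * 2     # two failures of type 2
    + [[1, 2, 3]]      # one failure of type 3, escalated level by level
    + [[1, 3]]         # one failure of type 3, level 2 skipped
)

coverage = estimate_coverage(records, n=3)
print(coverage)
```

Because every failure has exactly one successful level, the counts in `succeeded` sum to the total number of failures, which mirrors the partition stated in (1) and (2). The estimate for the last level is 1 by construction, in line with the assumption that level n recovery is always successful.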
We mentioned above that failures are usually escalated sequentially, from level i to level i+1. We assume that the recovery procedure is independent of the nature of the failure and is applied to all hardware and software failures. However, if at any level it is diagnosed that, for a specific failure, the use of the next recovery levels is not expedient, these levels can be skipped and the failure can be escalated directly to the last level n. Thus, there are three possibilities when a failure recovery is attempted at level i, 1 ≤ i