Availability Modeling of SIP Protocol on IBM® WebSphere®
K. Trivedi†, D. Wang†, D.J. Hunt‡, A. Rindos‡, W.E. Smith‡ and B. Vashaw‡
† Department of ECE, Duke University, Durham, NC 27708, USA
[email protected],
[email protected]
‡ IBM, Research Triangle Park, NC, USA
{djhunt, ajrindos, wesmith, vashaw}@us.ibm.com

Abstract

We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered, along with different types of fault detectors, detection delays, failover delays, restarts, reboots and repairs. Imperfect coverage for detection, failover and recovery is incorporated. Computations are based on a set of interacting sub-models of the system components, each capturing the failure and recovery behavior of one component. The parameter values used in the calculations are based on several sources, including field data, high availability testing, and agreed-upon assumptions. Where a parameter value is uncertain, due to assumptions or limited test data, a sensitivity analysis of that parameter is provided. Our analysis indicates which failure types and recovery parameters are most critical in their impact on overall system availability. These results will help guide system improvement efforts throughout future releases of these products.
Keywords: Availability Modeling, Failure Recovery, Fault Tolerance, Mandelbugs, Software Failures.
1 Introduction
Today we are witnessing a convergence of data, voice and video services that are deployed over the same public IP-based networks and developed using the same communication protocols. This service convergence helps service providers build new services by integrating and combining existing ones, and allows customers to conveniently access multiple services from a single device. As one of the key protocols for service convergence, the Session Initiation Protocol (SIP) [3] is an application-layer control protocol for creating, modifying and terminating sessions, including Internet telephone calls, multimedia distribution, and multimedia conferences. It has been adopted as an IETF standard and is widely used for developing many interactive services. Because of SIP's increasing popularity, several commercial application server software packages, such as the IBM® WebSphere® Application Server (WAS) [5] and BEA WebLogic Server® [4], support and deliver rich SIP functionality throughout their infrastructure. Many SIP applications are developed on top of these servers, for example, proxies for SIP requests, registrars for SIP services, and processing control for voice over IP (VoIP) calls. As people become more dependent on these services in their everyday lives, it is important both to provide high availability and to evaluate that availability accurately. This paper presents an availability model of a SIP service running on the high availability (HA) SIP Application Server platform consisting of the WebSphere Application Server (WAS) software and IBM BladeCenter® hardware.
Figure 1. IBM SIP Application Server cluster (AS 1 through AS 6 are SIP Application Servers; the SIP proxies are stateless SIP proxy servers)

The IBM SIP Application Server cluster configuration is shown in Figure 1. It consists of two BladeCenters, each with four blade servers (which we call nodes from now on). Each blade server has WebSphere Application Server (WAS) Network Deployment v6.1 installed. In the cluster, two nodes (one in each chassis) are configured as proxy servers to balance SIP traffic and perform failover. The other nodes host application servers. The SIP application installed on the application servers is a back-to-back user agent (B2BUA), which acts as a proxy for SIP messages in VoIP call sessions. The software stack on each application server node is shown in Figure 2.
Figure 2. Appserver node software stack (OS / WAS / SIP Container / B2BUA)

Both hardware and software redundancy are used in order to achieve high availability. Each application server node has two WAS instances running, so overall there are 12 application server processes in the cluster, with two processes sharing the hardware/OS resources of the same node. The session information in each application server is replicated in a peer application server on the other chassis; these two application servers constitute a replication domain. When one application server fails, the proxy servers perform a failover to redirect the SIP traffic of the failed server to its peer server in the same replication domain. As shown in Figure 1, there are 6 replication domains (1-6) running on the 6 blades (A-F), as listed in Table 1.

Replication domain   1     2     3     4     5     6
Nodes                A, D  A, E  B, F  B, D  C, E  C, F

Table 1. Replication domain configuration
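To make the redundancy layout concrete, the Table 1 mapping can be written down as data and queried; the short sketch below is our own illustration (the DOMAINS dictionary and the surviving_capacity helper are hypothetical names, not part of the product), showing how many application server replicas and how many complete replication domains survive a given set of node failures.

```python
# Replication domains from Table 1: each domain keeps one replica on each of two
# nodes, one per chassis. Illustrative sketch only; not part of the WAS product.
DOMAINS = {1: ("A", "D"), 2: ("A", "E"), 3: ("B", "F"),
           4: ("B", "D"), 5: ("C", "E"), 6: ("C", "F")}

def surviving_capacity(failed_nodes):
    """Return (#replicas still up, #domains with both replicas up)."""
    failed = set(failed_nodes)
    replicas_up = sum(node not in failed
                      for pair in DOMAINS.values() for node in pair)
    intact_domains = sum(all(node not in failed for node in pair)
                         for pair in DOMAINS.values())
    return replicas_up, intact_domains

# Losing node A takes down the two replicas it hosts (domains 1 and 2), but both
# domains still have a live replica on nodes D and E respectively.
print(surviving_capacity({"A"}))        # -> (10, 4)
print(surviving_capacity({"A", "D"}))   # -> (8, 3); domain 1 has lost both replicas
```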
The two SIP proxy servers connect to an IP sprayer in an active-standby configuration. If one proxy server fails, the IP sprayer will not forward SIP traffic to the failed proxy server until it is recovered. When the active IP sprayer fails, the standby takes over while, in the meantime, the failed IP sprayer is repaired. Since the IP sprayer is not part of the IBM SIP Application Server solution, IP sprayer failures are not included in the availability model.

There are two software-implemented fault detection mechanisms for the application servers: one is fault detection by the workload manager (WLM) of the cluster, and the other is fault detection by the node agent (NA) running on each cluster node. If the WLM detects the failure, the proxy servers are notified and failover is triggered; if the NA detects the failure, the application server process is killed and then restarted by the NA. There are also two software-implemented fault detection mechanisms for the proxy servers: one is IP-sprayer detection, after which a switch-over is performed to stop sending SIP traffic to the failed proxy server; the other is node agent detection, which works in the same way as for the application servers, i.e., the proxy server process is automatically restarted by the NA after successful failure detection.

After a failure in either a proxy server or an application server is detected, a sequence of recovery procedures is performed, which includes automatic process restart (if the failure is detected by the NA), manual process restart, manual reboot, and manual repair. The first three recovery mechanisms might not always be successful; if one fails, the next recovery mechanism in the sequence is tried.
Values for the model parameters were estimated from real measurements, but for confidentiality reasons they are not presented in this paper; the parameter values shown in the tables of this paper are picked arbitrarily and are not related to the real system values. Prior research close to this paper can be found in [1]. This is a practical experience report on the availability modeling and analysis of a system of current interest. The model is comprehensive in that it includes hardware, OS and application software failures; detection, restart and reboot delays; and imperfect detection, imperfect restart and imperfect reboot. The model includes escalated levels of recovery [7]. The use of time redundancy via restart and reboot as a method of mitigating failures caused by software bugs suggests that Mandelbugs are the predominant residual faults in operational software systems [2]. Similarly, the use of identical software copies as failover targets also suggests that Mandelbugs are the predominant cause of software failures.

The rest of this paper is organized as follows. In Section 2 we present the availability models for the system as well as for the software and hardware subsystems/components in the IBM SIP Application Server cluster. They capture server software failures, automatic failure detection (by the WLM, the node agent and the IP sprayer), manual failure detection, recovery mechanisms (process restart, node reboot and manual repair), and imperfect coverage of the detection/recovery mechanisms. Subsequent sections provide numerical results and conclusions.
2 Availability Models

2.1 System availability model

The downtime definition for the SIP Application Server cluster is any time the available application capacity for SIP and HTTP processing drops below the certified engineering limit of the specific HA SIP Application Server cluster configuration. For our modeling purposes, we quantify the capacity of the SIP Application Server cluster by the number of operational application servers, and we assume the proxy will not be the bottleneck, i.e., as long as one of the two proxy servers is up, the system capacity is not affected. We therefore define the system unavailability as the probability that either both proxy servers are down or k or more application servers have failed. The faults we consider for the SIP Application Server cluster are classified as in Figure 3.
[Figure 3 (failure classification tree): failures are divided into software failures (application server (WAS) and proxy processes, each with process hang and process die failure modes) and physical failures (midplane failure, cooling failure, power failure, and blade failures comprising CPU, base, memory, NIC, network, OS and I/O (RAID) failures).]

Figure 3. Failure classification

We use a 3-level hierarchical model. The top level is a fault tree, the middle level consists of several fault trees, and at the bottom level are Markov models for the individual subsystems such as the midplane, blade CPU, power domain, cooling subsystem, application server, proxy, etc. In reality, the middle-level fault trees are embedded into the top-level fault tree, but we show them separately for convenience of drawing. The Markov models at the bottom level remain separate models even for solution. Figure 4 shows the top-level fault tree model including both software and hardware failures. In the figure,

• iX (i = 1..6, X = A..F) represents software failure of the application server in domain i on blade X
• BSX (X = A..F) represents hardware failure on node X
• CMi (i = 1..2) represents common hardware failure on chassis i
• P1 and P2 represent software failure of the two proxy servers
• a leaf node drawn as a square means that a submodel is defined for that component
• a leaf node drawn as a circle means that an alternating renewal model with a given MTTF and MTTR is specified for that component
• a leaf node drawn as an inverted triangle means that it is a shared (repeated) component with a given MTTF and MTTR
[Figure 4 (top-level fault tree): the system fails if k or more of the 12 application servers AS1-AS12 have failed, or if both proxies PX1 and PX2 have failed; each application server ASj fails if its software process (iX), its node hardware (BSX) or its chassis (CMi) fails, and each proxy PXi fails if its process (Pi), its node (BSG or BSH) or its chassis (CMi) fails.]

Figure 4. SIP Application Server availability model

A failure in application server AS1 is due to either the software process for that application server being down (denoted by component 1A), the node being down due to hardware failures in the node (denoted by component BSA), or common failures in the chassis that affect all nodes on that chassis (denoted by CM1); similarly for the other application servers and the proxy servers. The left side of Figure 5 shows the mid-level fault tree model for chassis failure, which includes the midplane (MP), cooling system (Cool) and power system (Pwr); a failure of any of these components results in a chassis failure. The right side of Figure 5 shows the mid-level fault tree model for hardware failures of the node, which include system board failure (Base), CPU failure (CPU), memory failure (Mem), disk failure (RAID), operating system failure (OS), and Ethernet failure (eth1 and eth2); any of these component failures causes a node failure. For the Ethernet component, each blade has two network interface cards (NICs) (nic1 and nic2) that connect to two Ethernet switches (esw1 and esw2) on the BladeCenter, and the Ethernet fails only if both connections are down. The Ethernet switches esw1 and esw2 are shared components across the different blade servers on that chassis.

Figure 5. Chassis and Node fault trees

Each BSX, X = A..F, and CMi, i = 1..2, in Figure 4 is replaced by the model on the right side of Figure 5 and the left side of Figure 5, respectively, to form a single fault tree model. When doing the replacement, the Ethernet switch components are shared by all nodes in the same chassis, i.e., BSA, BSB and BSC share the same esw1 and esw2, while BSD, BSE and BSF share the same esw3 and esw4. The availability sub-model for each component in the fault tree has been developed and is used to compute the component availability. The overall system availability is computed from the fault tree using the SHARPE software package [8]. The bottom-level models for each component/subsystem in the fault tree are described in the following section.
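For intuition about how the top-level fault tree combines the component results (the actual computation uses SHARPE [8] and accounts for the shared node, chassis and switch components; the independence assumption, function name and numbers below are ours), a k-out-of-12 plus proxy-pair combination can be sketched as follows:

```python
from math import comb

def system_unavailability(u_as, u_proxy, k, n_as=12):
    """Illustrative top-level combination: the system is down if k or more of the
    n_as application servers are down, or if both proxy servers are down.
    Components are treated as independent here, which the full SHARPE model does
    not assume (application servers share nodes, chassis and switches)."""
    p_app_ok = sum(comb(n_as, i) * u_as**i * (1 - u_as)**(n_as - i)
                   for i in range(k))        # fewer than k application servers down
    p_proxy_ok = 1 - u_proxy**2              # at least one of the two proxies is up
    return 1 - p_app_ok * p_proxy_ok

# Hypothetical per-component unavailabilities, not results from the paper:
print(f"{system_unavailability(u_as=1e-3, u_proxy=1e-3, k=6):.3e}")
```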
2.2 Subsystem models

In this section we present the availability models for the subsystems and components in the SIP Application Server cluster. Some of the models are directly adopted from [9] and are not shown here for brevity.

Figure 6 shows the midplane model of the blade chassis. The steady-state availability of the midplane model is the probability that the system is in state UP or state U1. State UP represents the midplane with both of its redundant communication paths fully operational. State U1 denotes the midplane with one of the two redundant communication paths fully operational, after the failure of one communication path has been detected and failover has succeeded. The midplane has a transition from state UP to state U1 for most failures; any uncovered cases, such as common mode failures, are represented by the transition to state DN. State DN is a down state, and the transition rate to state DN is determined by the common mode factor cmp. We note here that cmp is not a coverage factor but the fraction of midplane faults that bring the whole midplane down. In both state DN and state U1 the system is in need of timely repair, and the midplane model shows a transition to state RP on the arrival of the service person, with a mean response time to arrival of 1/αsp. State RP is a down state because the chassis must be taken out of service to replace the midplane; 1/μmp is the mean time to repair the midplane.

Figure 6. Midplane availability model (states UP, U1, DN, RP; rates λmp, αsp, μmp and common mode fraction cmp)

The power and cooling subsystem availability models are adopted from [9], as are the blade base, blade NIC and network switch submodels; these are not shown here.

Figure 7 shows the CPU model of the node with 2 processors. Both processors need to be operational in order for the node to be considered up, so state UP is the only up state, representing the case where the node has two operational processors. When one processor fails the model enters state D1, and a service person is summoned with a mean time to response of 1/αsp. When the person arrives, the node must be removed for repair, so the processor subsystem enters state RP for the repair and then returns to state UP once the repair is completed.

Figure 7. Availability model for node CPU (UP to D1 at rate 2λcpu, D1 to RP at αsp, RP to UP at μcpu)
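Each of these bottom-level Markov models is solved for its steady-state probabilities; the paper solves them with SHARPE [8]. As a minimal illustration of the computation (the Python sketch and its variable names are ours), the midplane CTMC of Figure 6 can be solved from its generator matrix:

```python
import numpy as np

# Steady-state solution of the 4-state midplane CTMC of Figure 6 (illustrative).
# States: 0 = UP, 1 = U1 (one path left), 2 = DN (common mode), 3 = RP (repair).
lam_mp = 1.0 / 1e6    # midplane failure rate, per hour (1/lambda_mp from Table 4)
c_mp   = 0.001        # fraction of midplane faults that bring the whole midplane down
alpha  = 1.0 / 2.0    # repair person arrival rate, per hour (1/alpha_sp = 2 hours)
mu_mp  = 1.0          # midplane repair rate, per hour (1/mu_mp = 1 hour)

Q = np.zeros((4, 4))
Q[0, 1] = (1 - c_mp) * lam_mp   # UP -> U1: covered path failure
Q[0, 2] = c_mp * lam_mp         # UP -> DN: common mode failure
Q[1, 3] = alpha                 # U1 -> RP: repair person arrives
Q[2, 3] = alpha                 # DN -> RP
Q[3, 0] = mu_mp                 # RP -> UP: midplane replaced
np.fill_diagonal(Q, -Q.sum(axis=1))

# Solve pi * Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(4)])
rhs = np.zeros(5); rhs[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)

print(f"midplane unavailability ~ {1 - (pi[0] + pi[1]):.2e}")  # ~1.0e-6 with these rates
```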
Figure 8 shows the memory model of the blade server. Each node has two banks of memory, and each bank comprises two memory DIMMs. Both memory banks need to be operational for the node to be up, and a bank is down if either of its two DIMMs is down. The memory model is similar to the CPU model in Figure 7, with different parameter values; 1/λmem is the mean time to failure of one memory DIMM.

Figure 8. Memory availability model (UP to D1 at rate 4λmem, D1 to RP at αsp, RP to UP at μmem)

Figure 9 shows the disk model of the blade server. There are two hard disks in the node, configured as RAID1. In state UP both disks are in operation. States U1 and CP are up states with one operational disk. When the RAID controller chip on the node recognizes that the first drive has failed, the RAID subsystem moves from state UP to state U1, relying on the remaining drive for all data. A repair person is summoned with a mean time to response of 1/αsp, and the subsystem enters down state RP, since the node must be removed from service to replace the drive. If the second drive fails before the arrival of the repair person, the RAID subsystem transitions from state U1 to down state DN, with no remaining drives; the model enters state DW from state DN when the repair person arrives. In state RP, the disk drive is replaced with a mean time to repair of 1/μhd and the subsystem enters up state CP. The data must then be copied onto the new disk drive, with a mean time to completion of 1/χhd. If the second disk drive fails before the copy is completed, the subsystem enters the down state DW. In state DW, both disk drives are replaced with fresh preloaded drives, with a mean time to repair of 1/μ2hd.

Figure 9. Hard disk availability model

Figure 10 shows the availability model of the application server/proxy server software. The model does not include the states and transitions characterizing failover/switchover behavior, because failover and switchover do not affect the availability of individual servers. They do, however, play an important role in reducing SIP message loss and are therefore included in the models that compute service-oriented measures such as call loss; these will be discussed in a future paper. The description of the states is given in Table 2. When a failure occurs in an application server, the model transitions from state UP to state UO. In state UO, failure detection is attempted by two automated detectors: the WLM (workload manager) and the NA (node agent). Assume that the detection probabilities of the WLM and the NA are d and e, respectively. If the WLM detects the failure first, the model enters state 1D, in which case failover is carried out while the node agent keeps trying to detect the failure; we assume that the node agent will not detect the failure before the failover is done. With probability e the failure is then detected by the NA, and the model enters states UA, UR, UB and RE in turn for automatic process restart, manual process restart, manual reboot and manual repair. The use of increasingly complex recovery actions in this manner has been called escalated levels of recovery [7]. With probability (1−d)(1−e) neither method detects the failure, and the model enters state UN, in which the failure is detected manually; after manual detection the model enters states UR, UB and RE for manual process restart, node reboot, and manual repair. The use of identical software copies for failover and of restart/reboot as a means of recovery from failure implies that the major cause of failures is assumed to be Mandelbugs; the small fraction of failures that are treated by repair actions are presumably caused by residual Bohrbugs [2]. The model for the SIP proxy is identical except that, besides the node agent, the IP sprayer is the second automated detector.

Figure 10. Availability model for application server and/or proxy server software

State  Description
UP     server is up
UO     the server is in an undetected failure state
1D     failure detected by WLM (IP sprayer); NA has not yet detected the failure
1N     the WLM (IP sprayer) is unable to detect the failure
2N     the node agent is unable to detect the failure
UN     neither WLM (IP sprayer) nor NA has been able to detect the failure
UA     NA has detected the failure; performing automatic process restart
UR     performing manual process restart on the failed server
UB     manually rebooting the blade assigned to the failed server
RE     performing manual repair of the relevant blade

Table 2. States in the server availability model
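The effect of escalated recovery with imperfect coverage can be illustrated with a deliberately simplified single-server CTMC. The sketch below is ours: it collapses WLM detection, failover and manual detection into a single detection stage, so it is not the full Figure 10 model, but it uses the Table 3 parameter values and yields a per-server unavailability on the order of 1.5 × 10^-5.

```python
import numpy as np

# Simplified escalated-recovery CTMC (not the full Figure 10 model): the server
# fails, the failure is detected, and automatic restart, manual restart, reboot
# and repair are attempted in turn, each succeeding with its coverage probability.
HOUR = 3600.0
gamma  = 1 / (1000 * HOUR)   # server failure rate (1/gamma = 1000 hours)
delta  = 1 / 2.0             # detection rate (2 s mean)
rho_a  = 1 / 10.0            # automatic process restart rate (10 s mean)
rho_m  = 1 / 60.0            # manual process restart rate (60 s mean)
beta_m = 1 / 600.0           # manual reboot rate (10 min mean)
mu     = 1 / (8 * HOUR)      # manual repair rate (8 h mean)
q = r = b = 0.9              # coverage of auto restart, manual restart, reboot

# States: 0 UP, 1 detecting, 2 auto restart, 3 manual restart, 4 reboot, 5 repair.
Q = np.zeros((6, 6))
Q[0, 1] = gamma
Q[1, 2] = delta
Q[2, 0], Q[2, 3] = q * rho_a,  (1 - q) * rho_a
Q[3, 0], Q[3, 4] = r * rho_m,  (1 - r) * rho_m
Q[4, 0], Q[4, 5] = b * beta_m, (1 - b) * beta_m
Q[5, 0] = mu
np.fill_diagonal(Q, -Q.sum(axis=1))

A = np.vstack([Q.T, np.ones(6)])
rhs = np.zeros(7); rhs[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(f"single-server unavailability ~ {1 - pi[0]:.2e}")  # ~1.5e-5 with Table 3 values
```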
Model parameters (which are not for any real system nor based on any real measurements) are shown in Table 3.

Parameter  Description                                   Value
1/γ        mean time to server failure                   1000 hours
1/δ1       mean time for WLM failure detection           2 seconds
1/δ2       mean time for node agent failure detection    2 seconds
1/δm       mean time for manual failure detection        10 minutes
1/φ        mean time for failover                        1 second
1/ρa       mean time for automatic process restart       10 seconds
1/ρm       mean time for manual process restart          60 seconds
1/βm       mean time for manual node reboot              10 minutes
1/μ        mean time for manual repair                   8 hours
c          coverage factor for failover                  0.9
d          coverage factor for WLM detection             0.9
e          coverage factor for node agent detection      0.9
q          coverage factor for auto process restart      0.9
r          coverage factor for manual process restart    0.9
b          coverage factor for manual node restart       0.9

Table 3. Application server parameters

Figure 11 shows the availability model for the node operating system. The model is initially in the UP state, which represents the OS being up. With failure rate λOS the model enters state DN, where the failure is being detected. After failure detection the model enters DT, and the node is rebooted to recover from the failure; 1/βOS is the mean time to reboot the node. With probability bOS the reboot is successful and the model returns to the UP state; with probability 1 − bOS the reboot is unsuccessful and the model enters state DW, where a repair person is summoned. The model enters RP when the repair begins and returns to UP after the repair is done. Table 4 shows the parameters for the subsystem models presented in this section. Once again, these parameter values are purely hypothetical (but reasonable) in nature.

Figure 11. Availability model for node OS

Parameter  Description                                                   Value
1/λmp      mean time for mid-plane failure                               10^6 hours
1/λc       mean time for blower failure                                  10^6 hours
1/λps      mean time for power module failure                            10^6 hours
1/λcpu     mean time for processor failure                               10^6 hours
1/λbase    mean time for base failure                                    10^6 hours
1/λOS      mean time for OS failure                                      4000 hours
1/λswh     mean time for Ethernet switch failure                         10^6 hours
1/λnic     mean time for NIC failure                                     10^6 hours
1/λmem     mean time for memory DIMM failure                             10^6 hours
1/λhd      mean time for hard disk failure                               10^6 hours
1/αsp      mean time for failure detection plus repair person arrival    2 hours
cmp        prob. of mid-plane common mode failure                        0.001
cps        coverage factor for power module failure                      0.99
1/μmp      mean time to repair mid-plane                                 1 hour
1/μc       mean time to repair blower                                    1 hour
1/μ2c      mean time to repair two blowers                               1.5 hours
1/μps      mean time to repair power module                              1 hour
1/μ2ps     mean time to repair two power modules                         1.5 hours
1/μcpu     mean time to repair processor                                 1 hour
1/μbase    mean time to repair base                                      1 hour
1/δOS      mean time to detect an OS failure                             1 hour
bOS        coverage factor for node reboot to recover OS                 0.9
1/βOS      mean time for node reboot                                     10 minutes
1/μOS      mean time to repair OS                                        1 hour
1/μswh     mean time to repair the Ethernet switch                       1 hour
1/μnic     mean time to repair NIC                                       1 hour
1/μmem     mean time to repair memory bank                               1 hour
1/μhd      mean time to repair hard disk                                 1 hour
1/μ2hd     mean time to repair two hard disks                            1.5 hours
1/χhd      mean time to copy disk data                                   10 minutes
k          min. number of failed app. servers for system unavailability  6

Table 4. Parameters for hardware models
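As a consistency check on the OS model (our own back-of-the-envelope derivation from the cycle structure described above, not a formula given in the paper), the per-node OS unavailability implied by the Table 4 values can be computed directly:

```python
# Closed-form check (our derivation) for the OS model of Figure 11: the mean down
# time per failure cycle is detection + reboot + (if the reboot fails) wait + repair.
mttf_os   = 4000.0        # hours, 1/lambda_OS
detect    = 1.0           # hours, 1/delta_OS
reboot    = 10 / 60.0     # hours, 1/beta_OS
b_os      = 0.9           # reboot coverage
wait_sp   = 2.0           # hours, 1/alpha_sp (repair person arrival)
repair_os = 1.0           # hours, 1/mu_OS

down_per_cycle = detect + reboot + (1 - b_os) * (wait_sp + repair_os)
u_os = down_per_cycle / (mttf_os + down_per_cycle)
print(f"per-node OS unavailability ~ {u_os:.2e}")   # ~3.7e-4
# ~3.7e-4 corresponds to roughly 190 minutes per node-year; over the six
# application server nodes this is about 1.2e+3 minutes/year, consistent with the
# OS-only downtime reported for k = 1 in Table 5 if that table is read in minutes
# per year (our assumption).
```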
3 Numerical Results

The system availability is computed using the models in the previous sections and the arbitrarily (but reasonably) chosen parameter values in Table 3 and Table 4; since the real parameter values are confidential, we have used arbitrary values. We also perform a sensitivity analysis on a set of parameters including the MTTR (the mean time of manual repair for a WAS failure), various coverage factors (failover and switchover coverage, WLM detection coverage, node agent detection coverage, automatic and manual process restart coverage, and manual reboot coverage), the manual failure detection delay, the WLM detection delay, the mean time to OS failure, and the mean time to WAS failure.

The set of model input parameters can be divided into four groups: hardware component failure rates (or equivalently MTTFs), software component failure rates, the mean times to carry out the various recovery steps, and the coverage probabilities of the various recovery steps. Hardware failure rates are available in most companies in the form of tables [9], while mean times to detect, fail over, restart and reboot are measured by means of fault injection experiments. Software component MTTFs are not readily available, and experiments are necessary to estimate them; these experiments are hard to conduct because of the time they take, the difficulty of attributing the cause of failures, and the difficulty of ensuring the representativeness of the workload. Coverage probabilities also need to be estimated by means of fault injection experiments.

Using the default values in Table 3 and Table 4, the predicted system unavailability is 2.2 × 10^-6. The contribution of WAS and SIP proxy software failures to unavailability is minuscule, 2.7 × 10^-9, indicating the effectiveness of escalated levels of recovery. We varied the number of spare WAS instances (which is k − 1) and computed the total downtime as well as the OS only, hardware only, proxy only and application server only downtimes; the numbers are shown in Table 5. As seen from the table, the OS only and hardware only downtimes for k = 2i − 1 (i = 1, 2, 3) are the same as those for k = 2i, because an OS failure or a hardware failure brings down an even number (at least two) of application servers. When k is small (k = 1, 2), OS only downtime is dominant because the OS has a higher failure rate. As k increases to 5 and 6, there are more redundant blade servers hosting the application servers, and 3 or more OS failures are needed to bring down the system; on the other hand, there still exist single points of system failure for hardware faults (e.g., a chassis failure brings down 6 application servers), so hardware only downtime becomes dominant for k = 5, 6. When k = 7, there is no longer a single point of system failure for hardware faults, and OS only downtime becomes dominant again. The proxy only downtime is invariant to k because a proxy failure does not cause additional application server failures. The application server only downtime decreases rapidly as k increases because of the additional redundant application servers in the system.

k   OS only    hardware only  proxy only  app server only  total downtime
1   1.155e+3   7.365e+1       3.631e-4    1.657e+2         1.394e+3
2   1.155e+3   7.365e+1       3.631e-4    2.396e-2         1.228e+3
3   1.129      1.131          3.631e-4    5.248e-7         2.7350
4   1.129      1.131          3.631e-4    0                2.413
5   7.113e-2   1.127          3.631e-4    0                1.219
6   7.113e-2   1.127          3.631e-4    0                1.218
7   7.061e-2   3.823e-4       3.631e-4    0                9.283e-2

Table 5. Downtime by various components
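A quick unit check (our calculation; the assumption that Table 5 is reported in minutes per year is ours, inferred from the magnitudes) ties the k = 6 row of Table 5 back to the quoted unavailability:

```python
# Downtime in minutes/year <-> steady-state unavailability (our conversion check,
# assuming Table 5 is in minutes per year).
MIN_PER_YEAR = 365.25 * 24 * 60             # ~525,960 minutes in a year
downtime_k6 = 1.218                         # total downtime for k = 6 from Table 5
print(f"{downtime_k6 / MIN_PER_YEAR:.2e}")  # ~2.3e-6, close to the reported 2.2e-6
```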
The sensitivity analysis for the various parameters is shown in Figures 12 through 16. Figure 12 shows system unavailability vs. the mean time to OS failure (1/λOS in Table 4). As seen from the figure, the system unavailability drops as the mean time to OS failure increases, but beyond about 1000 hours a further increase in the mean time to OS failure does not reduce the system unavailability much. Figure 13 shows system unavailability vs. the mean time to WAS server failure (1/γ in Table 3). As in Figure 12, the system unavailability drops as the mean time to WAS server failure increases, and beyond about 100 hours the system unavailability is less affected by the mean time to WAS server failure. Since the main portion of the repair time for an application server failure comes from the manual detection delay and the manual repair, varying the WLM detection delay does not affect the overall repair time very much. Figure 14 shows the system unavailability vs. the manual detection delay (1/δm in Table 3); a longer manual detection delay lengthens the repair time for the application servers and proxy servers, and from the figure we can see that the system unavailability increases with the manual detection time, as expected. Figure 15 shows the system unavailability vs. the coverage factors; in this figure we set all coverage factors in Table 3 to the same value, shown on the x-axis. Larger coverage factor values imply faster recovery from the various failures, so the system unavailability drops as the coverage factors increase. Figure 16 shows the system unavailability vs. the manual repair time (1/μ in Table 3). As mentioned earlier, the manual repair time is one of the main contributors to the overall repair time for application server and proxy server failures; therefore the system unavailability goes up as the manual repair time becomes longer.
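The curves in Figures 12 through 16 are produced by re-solving the model over a range of values of one parameter while holding the others fixed; a generic helper for such a one-at-a-time sweep (our own sketch, with a stand-in model in place of the real SHARPE-based one) is simply:

```python
def sweep(unavailability_fn, name, values, fixed):
    """Re-evaluate a model while varying one named parameter, holding the rest fixed."""
    for v in values:
        params = dict(fixed, **{name: v})
        print(f"{name} = {v!r:>10} -> U = {unavailability_fn(**params):.3e}")

# Usage sketch with a stand-in model (a real run would call the SHARPE-based model):
toy = lambda mttf_os, mttr: mttr / (mttf_os + mttr)
sweep(toy, "mttf_os", [500, 1000, 2000, 3000], fixed={"mttr": 1.0})
```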
Figure 12. Unavailability vs. MTTF_OS (hours)
Figure 13. Unavailability vs. MTTF_WAS (hours)

Figure 14. Unavailability vs. manual detection delay (minutes)

Figure 15. Unavailability vs. coverage factors

Figure 16. Unavailability vs. manual repair time (hours)
4 Summary and Conclusions

System availability is an important consideration for telecommunication systems. In this paper we developed a detailed availability model of the IBM SIP Application Server cluster, which acts as a proxy for SIP traffic in the VoIP system. We modeled both software and hardware failures in the cluster, classified into 14 failure modes, and developed analytical models that capture the details of the failure/recovery behavior of each of the 14 failure modes and the failure dependencies among them. Different failure detection mechanisms are modeled, including WLM detection, node agent detection, and manual detection. We also modeled the various recovery procedures with imperfect coverage. Our modeling scheme allows different model parameters (such as detection delays, times and coverage factors for recoveries, failure rates, etc.) to be applied to different failure modes. Sensitivity analysis of availability was carried out for the important model parameters.

In a future paper, we will discuss the computation of a service-oriented measure, namely call losses due to failures. We also plan to obtain a closed-form solution for availability and call losses so as to gain greater insight; such a closed-form solution will allow us to obtain formal derivatives and hence find bottlenecks. We will remove the assumption of exponential distributions wherever needed. We will also assess the effect of dependence due to shared reboot for two application servers sharing the same blade, using fixed-point iteration. Yet another line of investigation is related to aging-related bugs and software rejuvenation: we will include aging-related bugs in addition to non-aging-related Mandelbugs in our model, as was done in [6].
References

[1] S. Garg, Y. Huang, C. Kintala, K. Trivedi, and S. Yagnik. Performance and reliability evaluation of passive replication schemes in application level fault tolerance. In Proc. Fault-Tolerant Computing Symposium (FTCS 1999), pages 322-329, June 1999.

[2] M. Grottke and K. Trivedi. Fighting bugs: remove, retry, replicate and rejuvenate. IEEE Computer, 40(2):107-109, 2007.

[3] SIP: Session Initiation Protocol. http://tools.ietf.org/html/rfc3261.

[4] BEA WebLogic Server. http://www.bea.com/content/products/weblogic/server/.

[5] IBM WebSphere Application Server. http://www.ibm.com/software/webservers/appserv/was/.

[6] Y. Liu, K. Trivedi, Y. Ma, J. Han, and H. Levendel. Modeling and analysis of software rejuvenation in cable modem termination systems. In Proc. Int. Symp. on Software Reliability Engineering, pages 159-170, 2002.

[7] V. Mendiratta. Reliability analysis of clustered computing systems. In Proc. Int. Symp. on Software Reliability Engineering, pages 268-272, 1999.

[8] R. Sahner, K. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers, 1995.

[9] W. E. Smith, K. Trivedi, L. Tomek, and J. Ackaret. Availability analysis of blade server systems. IBM Systems Journal, 47(4), 2008.