DFT and DRBD in Computing Systems Dependability Analysis

Salvatore Distefano and Antonio Puliafito
University of Messina, Engineering Faculty, Contrada di Dio, S. Agata, 98166 Messina, Italy
{salvatdi,apulia}@ingegneria.unime.it
Abstract. Many alternatives are available for modeling reliability aspects, such as reliability block diagrams (RBD), fault trees (FT) and reliability graphs (RG). Since they are easy to use and to understand, they are widely adopted. Often, however, the stochastic independence assumption significantly limits the applicability of such tools, particularly to complex systems such as computing systems. Even more advanced formalisms such as dynamic FT (DFT) may prove inadequate to model dependent and dynamic aspects. To overcome this limitation we developed a new formalism derived from RBD: the dynamic RBD (DRBD). In this paper we compare the DFT and the DRBD approaches in the evaluation of a multiprocessor distributed computing system. Particular attention is given to the analysis phase, in order to highlight the capabilities of DRBD.
1 Introduction
There are several approaches to represent and analyze system reliability. Among these, particular mention goes to the combinatorial models, i.e. high-level reliability/availability modeling formalisms such as reliability block diagrams (RBD) [1], fault trees (FT) [2] and reliability graphs (RG). Although RBD, RG and FT provide a view of the system close to the modeler, they rely on the assumption of stochastic independence among components. They do not provide any elements or capabilities to model reliability interactions among components or subsystems, or to represent system configuration changes, aspects conventionally identified as dynamic. These remarks especially concern computing systems: load sharing phenomena could affect the network availability; standby redundancy and maintenance policies could be considered in the management; interference or inter-dependence among components could arise (wireless devices, sensors, ...); common cause failures could group electric devices (power jumps, sudden changes of temperature, ...). These considerations made the scientific community aware of the need for new formalisms such as the dynamic fault trees (DFT) [3]. DFT extend static FT to enable the modeling of time-dependent failures, introducing new dynamic gates. However, using DFT it is hard to compose dependencies reflecting the characteristics of complex and/or hierarchical systems, to define customizable redundancy schemes or policies, to represent load sharing
and to adequately model reparability features. To overcome these gaps in reliability modeling, in [4,5] we defined a new reliability/availability modeling notation named dynamic reliability block diagrams (DRBD), extending the RBD formalism. In this paper we compare the two approaches in depth, evaluating the reliability of a computing system taken as a case study. More specifically, in section 2 the DRBD notation is briefly introduced. The motivating example and the corresponding DFT and DRBD models are described in section 3; the analysis of the two models is then reported in section 4. Lastly, section 5 provides some final considerations.
2 DRBD Overview
DRBD extend the RBD formalism to the representation of systems' dynamics. There are two key points in DRBD: the unit dynamics and the dependency concept. In a DRBD model each unit is characterized by a variable state identifying its operational condition at a given time. The evolution of a unit's state (the unit's dynamics) is driven by the events occurring to it, as depicted in Fig. 1. The states a generic DRBD unit can assume are: active, if the unit works without any problem; failed, if the unit is not operational, following its failure; and standby, if it is operable but not committable. The events represent transitions between states: a failure models changes from the active or standby states to the failed state, a wake-up switches from standby to active, a sleep from active to standby, a reparation from failed to active, an adep-switch represents transitions between two active states and an sdep-switch between two standby states.
Fig. 1. DRBD unit's state-event finite state automaton
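To make the unit dynamics concrete, the following sketch encodes the state-event automaton of Fig. 1 as a transition table. It is only an illustrative rendering of the notation; all class and function names are our own, not part of DRBD.

```python
from enum import Enum

class State(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"

# Allowed (state, event) -> next state transitions, read off Fig. 1.
TRANSITIONS = {
    (State.ACTIVE, "failure"): State.FAILED,
    (State.ACTIVE, "sleep"): State.STANDBY,
    (State.ACTIVE, "adep-switch"): State.ACTIVE,    # between two active states
    (State.STANDBY, "failure"): State.FAILED,
    (State.STANDBY, "wake-up"): State.ACTIVE,
    (State.STANDBY, "sdep-switch"): State.STANDBY,  # between two standby states
    (State.FAILED, "reparation"): State.ACTIVE,
}

class Unit:
    """A DRBD unit together with its current operational state."""
    def __init__(self, name, state=State.ACTIVE):
        self.name = name
        self.state = state

    def fire(self, event):
        """Apply an event, rejecting transitions Fig. 1 does not allow."""
        try:
            self.state = TRANSITIONS[(self.state, event)]
        except KeyError:
            raise ValueError(f"'{event}' not allowed in state {self.state.value}")

m3 = Unit("M3", State.STANDBY)
m3.fire("wake-up")              # e.g. a local memory failed: M3 becomes active
assert m3.state is State.ACTIVE
```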
The main enhancement introduced by DRBD is the capability to model dependencies among units concerning their reliability behaviours. A dependency establishes a reliability relationship between two units, a driver and a target. Informally, a dependency works as follows: when a specified event, named action or trigger, occurs to the driver, the dependency condition is applied to the target. This condition is associated with a specific target event, named reaction. When the satisfied dependency condition becomes unsatisfied, the target unit comes back
to the fully active state. The dependent state (standby or active) is characterized by the dependency rate β, weighting, in terms of reliability, the dependence of the target unit on the driver. This corresponds to the dormancy factor α of DFT (β = 1 − α), although β can assume values greater than one. A dependency is characterized by its action (trigger) and reaction events. Four types of trigger and reaction events can be identified: wake-up (W), reparation (R), sleep (S) and failure (F). Combining actions and reactions, 16 types of dependencies are identified. The concept of dependency is exploited in DRBD as the basis to represent dynamic reliability behaviors. Details on the dynamic modeling capabilities of DRBD can be found in [4,5].
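As an illustration of the trigger/reaction mechanism, a dependency can be seen as an observer of the driver's events; this sketch reuses the hypothetical Unit class above and is, again, our own formulation rather than part of the DRBD notation.

```python
class Dependency:
    """Driver-target dependency: when `trigger` occurs on the driver,
    `reaction` is fired on the target. E.g. a wake-up/standby (W/S)
    dependency has trigger="wake-up" and reaction="sleep"."""
    def __init__(self, driver, trigger, target, reaction, beta=1.0):
        self.driver, self.trigger = driver, trigger
        self.target, self.reaction = target, reaction
        self.beta = beta  # dependency rate (beta = 1 - alpha of DFT)

    def notify(self, unit, event):
        """To be called whenever `event` occurs on `unit`."""
        if unit is self.driver and event == self.trigger:
            self.target.fire(self.reaction)
```

For instance, Dependency(m1, "wake-up", m3, "sleep", beta=0.5) would put M3 in standby, with dependency rate β = 0.5, whenever M1 wakes up.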
3 The Multiprocessor Distributed Computing System
The scheme reported in Fig. 2 describes the multiprocessor computing system, taken from the literature [6,7], used for comparing the DFT and the DRBD approaches. It is composed of two computing modules: CM1 and CM2. Each of them contains one processor (P1 and P2 respectively), one memory (M1 and M2) and two hard disks: a primary (D11 and D21) and a backup disk (D12 and D22). Initially, the primary disk is used for storing data while the backup disk is accessed only periodically for updating operations. If the primary disk fails, it is replaced by the backup disk. The computing modules are connected by the bus N; moreover, P1 and P2 are energized by the power supply PS: the failure of PS forces P1 and P2 to fail. M3 is a spare memory replacing M1 or M2 in case of failure. If M1 and M2 are operational, M3 is just kept alive, but it is not accessed by the processors to load/store any data. When M1 or M2 fails, M3 substitutes the failed unit. In order to work properly, the multiprocessor computing system of Fig. 2 requires that at least one computing module (CM1 or CM2), the power supply PS and the bus N are operating correctly. A computing module is operational if its processor (P1 or P2), one memory between the local one (M1 or M2) and the shared memory M3, and one disk (D11 or D12 for CM1, D21 or D22 for CM2) are not failed. The DFT modeling the multiprocessor computing system is depicted in Fig. 3(a) as given in [6,7], while Fig. 3(b) reports the corresponding DRBD model.
Fig. 2. Schematic representation of the Multiprocessor Distributed Computing System
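The operability condition just described can be summarized by a boolean structure function. The following minimal sketch (our formulation, not part of the cited papers) returns True when the system is up, given the state of each component:

```python
def system_operational(ps, n, p1, p2, m1, m2, m3,
                       d11, d12, d21, d22):
    """Structure function of the system of Fig. 2; every argument is
    True when the corresponding component is working."""
    cm1 = p1 and (m1 or m3) and (d11 or d12)   # computing module 1
    cm2 = p2 and (m2 or m3) and (d21 or d22)   # computing module 2
    return ps and n and (cm1 or cm2)

# e.g. the loss of PS alone brings the whole system down:
assert not system_operational(ps=False, n=True, p1=True, p2=True,
                              m1=True, m2=True, m3=True,
                              d11=True, d12=True, d21=True, d22=True)
```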
Fig. 3. Multiprocessor Distributed Computing System models: (a) DFT; (b) DRBD
The DFT is composed of one FDEP gate and four WSP gates. The FDEP gate models the dependency between the power supply PS and the two processors P1 and P2. Since the power supply PS energizes the P1 and P2 processors, the failure of PS implies the failure of P1 and P2; we represent this behaviour in the DRBD by a series between each processor and PS. The backup disks D12 and D22 are considered as spare units of the primary disks D11 and D21 respectively; thus D11 and D21 drive the WSP1 and WSP4 DFT gates controlling D12 and D22 respectively. The disk management policy is represented in the DRBD by a wake-up/wake-up dependency: when the primary disks D11 and/or D21 are operational, the backup disks D12 and/or D22 respectively are partially active, maintaining the backup. The level of activity of the dependent components is numerically translated into the DFT by the dormancy factor α and into the DRBD by the dependency rate β, related to α by β = 1 − α. The partly-loaded standby redundancy policy applied to the M1, M2 and M3 memory units is represented by the WSP2 and WSP3 DFT gates: if M1 or M2 fails, M3 is activated. In the DRBD, wake-up/standby dependencies are instead exploited to model the redundancy policy managing the memories. Such a dependency must be applied to M3 if and only if both M1 and M2 are operational at the same time; when one of them fails, M3 must switch to the fully active state. To realize this condition, two wake-up/standby dependencies, from M1 to M3 and from M2 to M3, are series composed [5]: when both are simultaneously satisfied the component M3 is placed in standby, otherwise M3 is active. The other DFT gates are static: the internal events DISK1 and DISK2 represent the failure of the corresponding CM1 and CM2 storage blocks, while MEM1 and MEM2 represent the computing modules' memory block failures. The failure of the processor (P1 and P2), of the memory block (MEM1 and MEM2) or of the disk block (DISK1 and DISK2) leads to the failure of the corresponding computing module (the CM1 and CM2 internal events). Finally, if
both the computing modules fail, or the power supply PS goes down, or the bus N fails, the overall system fault occurs, represented in the DFT by the top event TE. In the DRBD of Fig. 3(b) it corresponds to the series among the parallel of the two computing modules, the power supply PS and the bus N.
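The series composition of the two wake-up/standby dependencies on M3 amounts to a logical AND of their conditions; a one-line sketch of this rule (our paraphrase of [5]):

```python
def m3_in_standby(m1_operational, m2_operational):
    # M3 stays in partly-loaded standby only while BOTH local memories
    # work; as soon as either fails, M3 must switch to fully active.
    return m1_operational and m2_operational
```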
4 The Analysis
The example described and modeled in section 3 has been studied in depth by analyzing the trend in time of the overall system reliability cdf, knowing the components' reliability cdfs or the corresponding failure rates. All the components have been modeled by a constant failure rate λ, as reported in Table 1, characterizing exponential reliability cdfs, i.e. memoryless behavior.

Table 1. Parameters of the multiprocessor computing system

Component               λ       α     β
N                       2
P1, P2                  500
PS                      6000
D11, D21, D12, D22      80000   0.5   0.5
M1, M2                  30
M3                      30      0.5   0.5
CM1, CM2                        0.9   0.1
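With constant rates, the component reliability is R(t) = e^(−λt). Table 1 gives the λ values without units; the magnitudes in Table 2 suggest they are FITs (10^−9 failures per hour), but this scaling is our assumption, not stated in the paper. A minimal sketch:

```python
import math

FIT = 1e-9  # assumed unit of the rates in Table 1 (failures per hour)

def reliability(lam_fit, t_hours):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-lam_fit * FIT * t_hours)

# Unreliability of the static PS-N series alone at t = 1000 h:
print(1 - reliability(6000, 1000) * reliability(2, 1000))  # ~0.006
```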
Fig. 4. GSPNs modeling the memory (a) and the disk (b) blocks
Initially, the multiprocessor computing system DRBD reported in Fig. 3(b) is subdivided into three independent subsystems: the first, static, is composed of the series of the power supply PS and the bus N, connected in series with the parallel of the two computing modules CM1 and CM2, which constitute the other two subsystems. Since these latter are identical, it is possible to study only one computing module subsystem and then apply the parallel structure equation to obtain the reliability of the two computing modules in parallel. A computing module subsystem is further subdivided into the series of three blocks: the processor,
the memory and the disk. The memory and the disk blocks are the dynamic parts. To study them, the generalized SPNs (GSPNs) reported in Fig. 4 are exploited. These are generated by applying the DRBD-GSPN mapping algorithm specified in [5]. Analyzing the two GSPNs through the WebSPN tool [8] and then composing the results by applying the RBD structure equations, the values shown in the last column of Table 2 are obtained. In the same way, the DFT model depicted in Fig. 3(a), corresponding to the motivating example discussed above, has been analyzed in [7] by exploiting three different tools: DBNet [7], DRPFTproc [9] and Galileo [3]. The first analyzes the DFT by translating it into a dynamic Bayesian network (DBN) and then solving the DBN. DRPFTproc is based on modularization and on the conversion of the dynamic gates into stochastic well-formed nets (SWN), tracing the problem back to a SWN solution. Galileo first modularizes the model, then solves the obtained modules by exploiting binary decision diagrams and CTMCs. The results of these analyses are also summarized in Table 2, where they are compared with the DRBD approach. Table 2 reports the system unreliability computed at specific time instants; the time is expressed in hours. These results demonstrate and validate the effectiveness of the DRBD approach, which provides consistent values for all the tests.

Table 2. Unreliability results of the multiprocessor computing system analysis

Time (h)   DBNet      DRPFTproc  Galileo    DRBD
1000       0.006009   0.006009   0.006009   0.006009
2000       0.012245   0.012245   0.012245   0.012245
3000       0.019182   0.019183   0.019183   0.019183
4000       0.027352   0.027355   0.027355   0.027354
5000       0.037238   0.037241   0.037241   0.037240
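To make the composition step explicit, the sketch below applies the RBD structure equations of the stated decomposition; the memory- and disk-block reliabilities are the outputs of the GSPN analysis and are passed in here as plain numbers (hypothetical inputs, not values from the paper):

```python
def system_reliability(r_ps, r_n, r_p, r_mem, r_disk):
    """Series/parallel composition following the decomposition above:
    CM = processor * memory block * disk block (series), the two
    identical CMs in parallel, then in series with PS and the bus N."""
    r_cm = r_p * r_mem * r_disk        # one computing module (series)
    r_pair = 1 - (1 - r_cm) ** 2       # parallel of the two modules
    return r_ps * r_n * r_pair         # series with PS and N

# e.g. with purely illustrative block reliabilities at some time t:
print(1 - system_reliability(0.994, 0.999998, 0.9995, 0.99997, 0.92))
```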
5 Conclusions
In this paper the effectiveness of the DRBD methodology in the evaluation of system dynamic reliability has been demonstrated. An in-depth comparison between the DRBD and the DFT methodologies has been carried out through a case study, reporting the modeling and the analysis of a multiprocessor computing system for which the overall system reliability is evaluated. The results obtained allow us to identify DRBD as a valid alternative in the dynamic reliability/availability evaluation scenario.
References

1. Rausand, M., Høyland, A.: System Reliability Theory: Models, Statistical Methods, and Applications, 3rd edn. Wiley-IEEE (2003)
2. Vesely, W.E., Goldberg, F.F., Roberts, N.H., Haasl, D.F.: Fault Tree Handbook. U.S. Nuclear Regulatory Commission, NUREG-0492, Washington D.C. (1981)
3. Sullivan, K.J., Dugan, J.B., Coppit, D.: The Galileo fault tree analysis tool. In: Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, pp. 232–235. IEEE, Madison, Wisconsin (1999)
4. Distefano, S., Puliafito, A.: System modeling with dynamic reliability block diagrams. In: Proceedings of the Safety and Reliability Conference (ESREL06), ESRA (2006)
5. Distefano, S.: System Dependability and Performances: Techniques, Methodologies and Tools. PhD thesis, University of Messina (2005)
6. Malhotra, M., Trivedi, K.S.: Dependability modeling using Petri nets. IEEE Transactions on Reliability 44(3), 428–440 (1995)
7. Montani, S., Portinale, L., Bobbio, A., Raiteri, D.C.: Automatically translating dynamic fault trees into dynamic Bayesian networks by means of a software tool. In: Proceedings of the First International Conference on Availability, Reliability and Security (ARES 2006), pp. 804–809. IEEE Computer Society Press, Los Alamitos (2006)
8. Scarpa, M., Puliafito, A., Distefano, S.: A parallel approach for the solution of non-Markovian Petri nets. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface. LNCS, vol. 2840, pp. 196–203. Springer, Heidelberg (2003)
9. Bobbio, A., Franceschinis, G., Gaeta, R., Portinale, L.: Parametric fault tree for the dependability analysis of redundant systems and its high-level Petri net semantics. IEEE Trans. Softw. Eng. 29(3), 270–287 (2003)