Computing Systems' Reliability Dynamics

Salvatore Distefano∗ University of Messina, Messina, Italy.

The size and complexity of computing systems have increased from a single processor to multiple distributed processors, from individual, separate systems to networked, integrated systems, from small-scale program execution to large-scale resource sharing, and from local-area computation to global-area collaboration. A computing system today may contain many processors and communication channels, and it may span a wide area across the world. Such systems combine software and hardware that have to function together to complete various tasks. They may incorporate multiple states, and their failures may be correlated with one another. These factors complicate the modeling and analysis of computing systems. More precise models are needed, carefully identifying and quantifying aspects that are usually approximated or ignored altogether. In this paper we address the problem of identifying and evaluating the most common dynamic behaviors and dependencies affecting computing systems. We propose some models to represent such aspects in terms of reliability/availability, based on dynamic reliability block diagrams (DRBD), a new formalism we developed by extending RBD. In this way we aim to provide guidelines for adequately evaluating computing systems’ reliability/availability.

Keywords: Computing Systems, Dependability, Dynamics, Dynamic Reliability Block Diagrams.

1 Introduction

There is no common approach to assessing computing systems. Reliability is a quantitative measure that can be broadly interpreted as the ability of a system to perform its intended function or mission. As the functionality of computing operations becomes more essential, the need for highly reliable computing systems grows. In fact, in order to increase the performance of computing systems and to improve the development process, a thorough analysis of their reliability is needed [16]. Sometimes, system disasters are caused by neglecting the principles of redundancy and failure independence, which are obvious in retrospect [14]. More often they are caused by over-simplistic models that coarsely approximate or cut off aspects that must instead be adequately represented. This is the case of dependent behaviors, in which a unit of the system depends on another unit that affects its behavior. The bigger question here is whether catastrophic situations could have been avoided if the system had been designed in an appropriate, reliable manner. It is necessary to adequately represent such aspects through detailed models, exploiting specific modeling techniques.

The most widespread reliability modeling techniques are combinatorial models/notations such as reliability block diagrams (RBD) [12] and fault/event trees (FT/ET) [15, 13]. All of them are useful modeling tools, but they cannot represent the dynamic behavior of a system in terms of availability/reliability. Generally speaking, it is quite possible that two or more units/subsystems composing a system influence each other. Examples of such behaviors are load sharing, standby redundancy, interferences, dependencies, and dependent, on-demand, cascade and common cause failures. Since combinatorial models cannot be used in these cases, lower-level techniques and formalisms are needed, such as state space methods (Markov models, Petri nets [2], Boolean logic driven Markov processes (BDMP) [3], etc.), hybrid (combinatorial/state space) techniques (dynamic fault trees (DFT) [7], dynamic reliability block diagrams (DRBD) [6, 5]) or simulation (Monte Carlo [10], discrete event, etc.).

The main aim of this work is to investigate the dynamics of computing systems from the reliability/availability point of view, providing guidelines for representing and evaluating them. Section 2 summarizes the state of the art of system reliability evaluation techniques. Then, in section 3, we investigate some specific dynamic reliability behaviors of computing systems, explaining how to represent them in a reliability model based on DRBD, a notation we developed by extending RBD. Finally, section 4 closes the paper with remarks and considerations on the methodology.

∗ Email: [email protected]

2 Computing System Reliability Evaluation Techniques

Reliability modeling uses abstract representations of systems as a means of assessing their reliability. Two basic approaches have been used: empirical and analytical [14, 1]. In the empirical approach, a set of N systems is operated over a long period of time and the number of systems that fail during that time is recorded. The percentage of failed systems among all operated systems is used as an indication of the reliability of the systems. The analytical approach uses the failure probabilities of the individual units of a given system to arrive at a measure of the failure probability of the overall system.

Analytical techniques are grouped into two modeling classes: state-based and combinatorial models. State-based models rest on the concepts of state and state transition. A system state is a combination of the faulty and fault-free states of its units. State transitions describe the changes of the system state as time progresses. So, in state-based models (Markov models, Poisson processes, Petri nets (PN), and variants), the states a system can assume, given those assumed by its units, must be uniquely identified and enumerated [13]. Such models can be used to evaluate different dependability metrics for a system. Unfortunately, analyzing state-based models accurately is often very complex, so this approach is not viable for many systems. Many works in the literature use state space based models to evaluate the reliability/availability of computing systems. Markov models are used to model both simple structural reliability relationships (series, parallel, repeated events or blocks, k-out-of-n voting policies, etc.) and dynamic behaviors (standby redundancy, common cause failure, fault coverage models, load sharing configurations, multiple mode operations, etc.).
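As a hedged illustration of the state-based approach described above, the following Python sketch treats a single repairable unit as a two-state Markov model (up/down). The constant failure rate `lam`, the constant repair rate `mu`, and the function names are assumptions introduced here for illustration; they do not come from the paper. The closed-form point availability is compared with a simple numerical integration of the same state equation.

```python
import math


def availability(t, lam, mu):
    """Closed-form point availability of a two-state (up/down) Markov unit.

    Assumes the unit starts in the up state, fails at constant rate lam
    and is repaired at constant rate mu (both hypothetical parameters).
    """
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def availability_numeric(t, lam, mu, steps=100_000):
    """Forward-Euler integration of dP_up/dt = -lam*P_up + mu*(1 - P_up)."""
    p_up = 1.0          # the unit is up at time 0
    dt = t / steps
    for _ in range(steps):
        p_up += dt * (-lam * p_up + mu * (1.0 - p_up))
    return p_up


if __name__ == "__main__":
    lam, mu = 0.1, 1.0  # illustrative rates, not taken from the paper
    for t in (0.0, 1.0, 10.0):
        print(t, availability(t, lam, mu), availability_numeric(t, lam, mu))
```

Even this smallest case needs one state per unit condition. With n independent two-state units the state space grows to 2^n states, which is the analysis complexity the text refers to.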
Some interesting applications of both Markov and Poisson models to the reliability/availability modeling and analysis of distributed, clustered and grid computing systems, also including several dynamic behaviors (standby redundancy, common cause failure, fault coverage models, load sharing configurations, multiple mode operations, etc.), can be found in [16]. Other examples of redundancy policies (k/n, N-modular redundancy, standby, etc.), software and network models, optimization techniques, and applications (RAID configurations, Tandem, Stratus, etc.) of Markov-based reliability/availability evaluation of computing systems can be found in [14]. In [13] the authors investigate the reliability of some specific examples of computing systems by exploiting both Markov models and Petri nets.

Combinatorial models enumerate the ways in which a system can continue to operate, given the probability of failure of its individual units. The system reliability is expressed as a combination of the reliabilities of its units, by exploiting the equations of the structural relationships [12]. While state-based models are more powerful and general than combinatorial models, the latter are more user friendly, a characteristic that explains their success. The most widely used combinatorial models are fault trees and reliability block diagrams. Static fault trees (FT) [15] use Boolean gates to represent how unit failures combine to produce system failure. Dynamic fault trees (DFT) [4, 7] add a temporal notion, since system failures can depend on the order of unit failures. They can model dynamic replacement of failed units from pools of spares (CSP, WSP and HSP gates); failures that occur only if others occur in certain orders (PAND gates); dependencies that propagate the failure of one unit to others (FDEP gates); and constraints on failure orders that simplify analysis computations (SEQ gates). In a reliability block diagram (RBD) [12, 13, 14], the logic diagram is arranged to indicate the combinations of properly working units that keep the system operational.

There are many applications of combinatorial models in computing systems reliability evaluation. Among them, the most significant involve DFT, owing to the limitations of static combinatorial models in representing redundancy policies. In [8] the authors combine DFT and Markov models to represent redundant fault tolerant systems. [7] deals with the specification of reliability and redundancy QoS requirements and with complex fault and error recovery techniques that have dynamic implications in computing systems. Other models and more details on the evaluation of computing systems reliability/availability by DFT can be found in [4].
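The structural relationships used by static combinatorial models such as RBD can be sketched in a few lines of Python. This is a minimal illustration assuming statistically independent units with known reliabilities; the function names are hypothetical, and the k-out-of-n case deliberately enumerates the sets of working units, in the spirit of "enumerating the ways in which a system can continue to operate".

```python
from itertools import combinations
from math import prod


def series(reliabilities):
    """Series structure: the system works only if every unit works."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Parallel structure: the system fails only if every unit fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def k_out_of_n(k, reliabilities):
    """k-out-of-n structure: at least k of the n units must work.

    Sums the probability of every subset of working units of size >= k.
    """
    n = len(reliabilities)
    total = 0.0
    for m in range(k, n + 1):
        for working in combinations(range(n), m):
            up = set(working)
            total += prod(
                r if i in up else 1.0 - r
                for i, r in enumerate(reliabilities)
            )
    return total


if __name__ == "__main__":
    units = [0.9, 0.9, 0.9]  # illustrative unit reliabilities
    print(series(units), parallel(units), k_out_of_n(2, units))
```

None of these functions can express the order of failures or a dependency between units, which is the limitation that DFT and DRBD address.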

2.1 DRBD

[Figure 1: DRBD States-Events Machine and Generic Dependency DRBD. (a) States-Events: ACTIVE, STANDBY and FAILED states linked by wake-up, sleep, failure and repair events. (b) Dependency: driver (Dr), target (Tg), propagation probability p, reaction label /W|S|F|R.]

RBD offer attractive features for reliability modeling, such as simplicity, versatility and expressive power. These characteristics are inherited by the DRBD notation, which moreover makes it possible to take the system dynamics into account. In a DRBD model each unit is characterized by a state variable identifying its operational condition at a given time. The evolution of a unit’s state (the unit’s dynamics) is driven by the events occurring to it. The states a generic DRBD unit can assume, depicted in Figure 1(a), are: active, if the unit works without any problem; failed, if the unit is not operational following its failure; and standby, if it is reliable but not available. Active units take part in performing the task of the system, while standby units do not contribute to it: as a consequence of the application of a dependency, they do not interact with the other units. At the same time, a unit in standby is not failed; it just performs its internal activities. A DRBD unit state is moreover characterized by its reliability or maintenance cdf, giving the probability that the unit fails from that state (for active or standby states) or that it is repaired (for the failed state), respectively. So three main classes of states are identified: active, standby and failed (see [5] for details). An event represents the transition from one unit state to another: the failure event models a state change from active or standby to the failed state; the wake-up event switches from standby to active states or among different active states; the sleep event from active to standby states or among different standby states; and the repair event from failed to the active or standby state. The main enhancement introduced by DRBD is the capability to model dependencies among subsystems or units concerning their reliability behaviors. A DRBD dependency, depicted in Figure 1(b), establishes a dependent reliability relationship between two units, a driver (DR) and a target (TG).
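The unit states and events described above can be sketched as a small transition table. This is only an illustrative reading of the text, not the authors’ implementation: the names `State`, `Event`, `TRANSITIONS` and `apply_event` are hypothetical, the three state classes are collapsed to one state each, and the reliability/maintenance cdfs attached to each state are left out.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"


class Event(Enum):
    WAKE_UP = "W"
    SLEEP = "S"
    FAILURE = "F"
    REPAIR = "R"


# (current state, event) -> allowed next states; the first entry is the default.
TRANSITIONS = {
    (State.ACTIVE, Event.FAILURE): (State.FAILED,),
    (State.STANDBY, Event.FAILURE): (State.FAILED,),
    (State.STANDBY, Event.WAKE_UP): (State.ACTIVE,),
    (State.ACTIVE, Event.WAKE_UP): (State.ACTIVE,),    # among active states
    (State.ACTIVE, Event.SLEEP): (State.STANDBY,),
    (State.STANDBY, Event.SLEEP): (State.STANDBY,),    # among standby states
    (State.FAILED, Event.REPAIR): (State.ACTIVE, State.STANDBY),
}


def apply_event(state, event, target=None):
    """Return the state reached when `event` occurs to a unit in `state`.

    `target` selects among several allowed next states (only repair has
    more than one); a ValueError is raised for transitions the text does
    not describe.
    """
    allowed = TRANSITIONS.get((state, event))
    if allowed is None:
        raise ValueError(f"{event.name} is not enabled in state {state.name}")
    if target is None:
        return allowed[0]
    if target not in allowed:
        raise ValueError(f"{event.name} cannot lead to {target.name}")
    return target


if __name__ == "__main__":
    s = apply_event(State.ACTIVE, Event.SLEEP)
    s = apply_event(s, Event.FAILURE)
    s = apply_event(s, Event.REPAIR, target=State.STANDBY)
    print(s)
```

Keeping the allowed transitions in one table makes it easy to check a model against the state-event machine of Figure 1(a) before any probabilities are attached.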
When the trigger event (tr) occurs to the driver, the reaction event (re ∈ {W, S, F, R}) is propagated/applied to the target with propagation probability p. When the propagated dependency condition is no longer satisfied, the target unit instantaneously returns to the fully active state, unless the dependency has a failure reaction. Four types of trigger and reaction events can be identified: wake-up (W), repair (R), sleep (S) and failure (F). Combining trigger and reaction, 16 types of simple dependencies are identified. They can also be composed into more complex dependencies, identifying complex/composed dependent behaviors. In the generic dependency shown in Figure 1(b), trigger and reaction events are identified by a string placed close to the circle marking the target side. In it, the simple events composing the trigger and the reaction are each indicated by a letter (W, R, S or F as above), separated by a slash (/). In case of composition among dependencies, the composed trigger event is represented as a condition on the simple triggers. Two main classes of trigger event composition operators are identified: temporal (>,
