[1] Mostafa Abd-El-Barr. Design and Analysis of Reliable and Fault-Tolerant Computer Systems. World Scientific Publishing Co., Dec. 2006.
[2] Gunter Bolch ...
Computing Systems’ Reliability Dynamics
Salvatore Distefano
University of Messina, Messina, Italy
The size and complexity of computing systems have increased from single processors to multiple distributed processors, from individual, separate systems to networked, integrated systems, from small-scale program execution to large-scale resource sharing, and from local-area computation to global-area collaboration. A computing system today may contain many processors and communication channels, and it may span a wide area across the world. Such systems combine software and hardware that have to function together to complete various tasks. They may incorporate multiple states, and their failures may be correlated with one another. These factors complicate the modeling and analysis of computing systems. It is necessary to specify more precise models, carefully identifying and quantifying aspects that are usually approximated or ignored altogether. In this paper we address the problem of identifying and evaluating the most common dynamic behaviors and dependencies affecting computing systems. We propose some models to represent such aspects in terms of reliability/availability, based on dynamic reliability block diagrams (DRBD), a new formalism we developed as an extension of RBD. In this way we aim to provide guidelines for adequately evaluating computing systems’ reliability/availability.
Keywords: Computing Systems, Dependability, Dynamics, Dynamic Reliability Block Diagrams.
1 Introduction
There is no common approach to assess computing systems. Reliability is a quantitative measure that can be broadly interpreted as the ability of a system to perform its intended function or mission. As computing operations become more essential, there is a greater need for highly reliable computing systems. In fact, in order to increase the performance of computing systems and to improve their development process, a thorough analysis of their reliability is needed [16]. Sometimes, system disasters are caused by neglecting the principles of redundancy and failure independence, which are obvious in retrospect [14]. More often they are caused by over-simplistic models that coarsely approximate or cut off aspects that must instead be adequately represented. This is the case of dependent behaviors, in which one unit of a system depends on another unit that affects its behavior. The bigger question here is whether catastrophic situations could have been avoided if the system had been designed in an appropriate, reliable manner. It is necessary to adequately represent such aspects through detailed models, exploiting specific modeling techniques. The most widespread reliability modeling techniques are combinatorial models/notations such as reliability block diagrams (RBD) [12] and fault/event trees (FT/ET) [15, 13]. All of them are useful modeling tools, but they cannot represent the dynamic behavior of the system in terms of availability/reliability. Generally speaking, it is quite possible that two or more units/subsystems composing a system influence each other. Examples of such behaviors are load sharing, standby redundancy, interferences, dependencies, and dependent, on-demand, cascade, and common cause failures.
Since combinatorial models cannot be used in these cases, lower-level techniques and formalisms are needed, such as state space methods (Markov models, Petri nets [2], Boolean logic driven Markov processes (BDMP) [3], etc.), hybrid (combinatorial/state space) techniques (dynamic fault trees (DFT) [7], dynamic reliability block diagrams (DRBD) [6, 5]) or simulation (Monte Carlo [10], discrete event, etc.). The main aim of this work is to investigate the dynamics of computing systems from the reliability/availability point of view, providing guidelines for representing and evaluating them. Section 2 summarizes the state of the art of system reliability evaluation techniques. Then, in Section 3, we investigate some specific dynamic reliability behaviors of computing systems, explaining how to represent them in a reliability model based on DRBD, a notation we developed by extending RBD. Finally, Section 4 closes the paper with remarks and considerations on the methodology.
2 Computing System Reliability Evaluation Techniques
Reliability modeling uses abstract representations of systems as a means of assessing their reliability. Two basic approaches have been used: empirical and analytical [14, 1]. In the empirical approach, a set of N systems is operated over a long period of time and the number of systems that fail during that time is recorded. The percentage of failed systems out of the total number of operated systems is used as an indication of the reliability of the systems. The analytical approach uses the failure probabilities of the individual units of a given system to arrive at a measure of the failure probability of the overall system. Analytical techniques are grouped into two modeling classes: state-based and combinatorial models. State-based models rely on the concepts of state and state transition. A system state is a combination of the faulty and fault-free states of its units. State transitions describe the changes in the system's state as time progresses. So, in state-based models (Markov models, Poisson processes, Petri nets (PN), and variants), the states a system can assume, given those assumed by its units, must be uniquely identified and numbered [13]. Such models can be used to evaluate different dependability metrics for a system. Unfortunately, it is often very complex to analyze state-based models accurately, so this approach is not viable for many systems. There are many works in the literature that use state space based models for evaluating the reliability/availability of computing systems. Markov models are used to model both simple structural reliability relationships (series, parallel, repeated events or blocks, k-out-of-n voting policies, etc.) and dynamic behaviors (standby redundancy, common cause failure, fault coverage models, load sharing configurations, multiple mode operations, etc.).
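As an illustration of the state-based approach described above, the following minimal sketch enumerates the states of a small system as combinations of faulty and fault-free unit states and computes its steady-state availability from a continuous-time Markov chain. It is not taken from the cited works: the function name, the assumption of independent identical units with constant failure rate `lam` and repair rate `mu`, the parallel structure, and the use of uniformization with power iteration are all illustrative choices.

```python
from itertools import product


def steady_state_availability(lam, mu, n_units=2, iters=5000):
    """Steady-state availability of n independent repairable units in parallel.

    Assumptions (illustrative): constant failure rate `lam` and repair rate
    `mu` per unit, independent repairs, system up if at least one unit works.
    """
    # A system state is a tuple of unit states: True = fault-free, False = faulty.
    states = list(product([True, False], repeat=n_units))
    index = {s: i for i, s in enumerate(states)}
    size = len(states)

    # Generator matrix Q: each transition flips the state of exactly one unit.
    q = [[0.0] * size for _ in range(size)]
    for s in states:
        i = index[s]
        for u in range(n_units):
            target = s[:u] + (not s[u],) + s[u + 1:]
            rate = lam if s[u] else mu  # a working unit fails, a failed unit is repaired
            q[i][index[target]] = rate
            q[i][i] -= rate

    # Uniformization: P = I + Q / big is a stochastic matrix with the same
    # stationary distribution as the continuous-time chain.
    big = 1.1 * max(-q[i][i] for i in range(size))
    p = [[(1.0 if i == j else 0.0) + q[i][j] / big for j in range(size)]
         for i in range(size)]

    # Power iteration on the row vector pi until it settles on the stationary one.
    pi = [1.0 / size] * size
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(size)) for j in range(size)]

    # Parallel structure: the system is available while any unit is fault-free.
    return sum(prob for s, prob in zip(states, pi) if any(s))
```

Note that the number of states grows as 2^n with the number of units, which hints at why accurate analysis of state-based models quickly becomes impractical for larger systems.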
Some interesting applications of both Markov and Poisson models to the reliability/availability modeling and analysis of distributed, clustered and grid computing systems, also including several dynamic behaviors (standby redundancy, common cause failure, fault coverage models, load sharing configurations, multiple mode operations, etc.), can be found in [16]. Other examples of redundancy policies (k/n, N-modular redundancy, standby, etc.), software and network models, optimization techniques, and applications (RAID configurations, Tandem, Stratus, etc.) of computing systems reliability/availability evaluation based on Markov models can be found in [14]. In [13] the authors investigate the reliability of some specific examples of computing systems by exploiting both Markov models and Petri nets.
Combinatorial models enumerate the number of ways in which a system can continue to operate, given the probability of failure of its individual units. The system reliability is expressed as a combination of the reliabilities of its units, by exploiting the equations of the structure's relationships [12]. While state-based models are more powerful and general than combinatorial models, the latter are more user friendly, which explains their success. The most widely used combinatorial models are fault trees and reliability block diagrams. Static fault trees (FT) [15] use Boolean gates to represent how units' failures combine to produce system failure. Dynamic fault trees (DFT) [4, 7] add a temporal notion, since system failures can depend on the order of unit failures. They can model dynamic replacement of failed units from pools of spares (CSP, WSP and HSP gates); failures that occur only if others occur in certain orders (PAND gates); dependencies that propagate the failure of one unit to others (FDEP gates); and constraints on failure orders that simplify analysis computations (SEQ gates). In a reliability block diagram (RBD) [12, 13, 14], the logic diagram is arranged to indicate the combinations of properly working units that keep the system operational.
There are many applications of combinatorial models in computing systems reliability evaluation. Among them, the most significant involve DFT, due to the limitations of static combinatorial models in representing redundancy policies. In [8] the authors combine DFT and Markov models to represent redundant fault tolerant systems. [7] deals with the specification of reliability and redundancy QoS requirements and with complex fault and error recovery techniques that have dynamic implications in computing systems. Other models and more details on the evaluation of computing systems reliability/availability by DFT can be found in [4].
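To make the combinatorial approach concrete, the sketch below evaluates the standard structure equations for series, parallel and k-out-of-n RBD configurations. It assumes statistically independent units (and identical units in the k-out-of-n case), which is exactly the assumption that breaks down for the dynamic behaviors discussed in this paper; the function names are illustrative and not part of any cited tool.

```python
from math import comb, prod


def series(reliabilities):
    """Series RBD: the system works only if every unit works."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Parallel RBD: the system fails only if every unit fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def k_out_of_n(k, n, r):
    """k-out-of-n RBD with identical units of reliability r:
    the system works if at least k of the n units work."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))
```

For instance, `parallel([series([0.9, 0.9]), 0.8])` evaluates a two-unit series branch placed in parallel with a single unit, showing how nested blocks compose.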
2.1 DRBD
Figure 1: DRBD States-Events Machine and Generic Dependency DRBD. (a) States-Events; (b) Dependency
RBD offer interesting features for reliability modeling, such as simplicity, versatility and expressive power. These characteristics are inherited by the DRBD notation, which moreover allows the system dynamics to be taken into account. In a DRBD model each unit is characterized by a variable state identifying its operational condition at a given time. The evolution of a unit's state (the unit's dynamics) is characterized by the events occurring to it. The states a generic DRBD unit can assume, depicted in Figure 1(a), are: active if the unit works without any problem, failed if the unit is not operational following its failure, and standby if it is reliable but not available. Active units take an active part in performing the task of the system, while standby units do not contribute to it: they do not interact with the other units, as a consequence of the application of a dependency. At the same time, a unit in standby is not failed; it just performs its internal activities. A DRBD unit state is moreover characterized by the state's reliability or maintenance cdf, giving the probability that the unit fails from the specific state (active or standby) or that it is repaired (if failed), respectively. Three main classes of states are thus identified: active, standby and failed (see [5] for details). An event represents the transition from one unit state to another: the failure event models a state change from active or standby to failed; the wake-up switches from standby to active states or among different active states; the sleep from active to standby states or among different standby states; the repair from failed to active or standby. The main enhancement introduced by DRBD is the capability to model dependencies among subsystems or units concerning their reliability behaviors. A DRBD dependency, depicted in Figure 1(b), establishes a dependent reliability relationship between two units, a driver (DR) and a target (TG).
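The unit states and events described above can be sketched as a small transition table. This is only an illustrative reading of the text at the level of the three state classes (active, standby, failed); the class names, the `apply_event` helper and the `repair_to` parameter are hypothetical and do not come from the DRBD papers or any DRBD tool.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"


class Event(Enum):
    WAKE_UP = "W"
    SLEEP = "S"
    FAILURE = "F"
    REPAIR = "R"


# Allowed transitions between state classes, as listed in the text.
# Wake-up and sleep may also move among different active or standby states,
# which at this level of abstraction keeps the unit in the same class.
TRANSITIONS = {
    (State.ACTIVE, Event.FAILURE): (State.FAILED,),
    (State.STANDBY, Event.FAILURE): (State.FAILED,),
    (State.STANDBY, Event.WAKE_UP): (State.ACTIVE,),
    (State.ACTIVE, Event.WAKE_UP): (State.ACTIVE,),
    (State.ACTIVE, Event.SLEEP): (State.STANDBY,),
    (State.STANDBY, Event.SLEEP): (State.STANDBY,),
    (State.FAILED, Event.REPAIR): (State.ACTIVE, State.STANDBY),
}


def apply_event(state, event, repair_to=State.ACTIVE):
    """Return the state class reached when `event` occurs in `state`.

    `repair_to` (hypothetical parameter) selects whether a repaired unit
    comes back as active or standby; it is ignored for other events.
    """
    targets = TRANSITIONS.get((state, event))
    if targets is None:
        raise ValueError(f"event {event.name} is not allowed in state {state.name}")
    if len(targets) == 1:
        return targets[0]
    if repair_to not in targets:
        raise ValueError(f"cannot repair to {repair_to.name}")
    return repair_to
```

A failed unit accepts only the repair event in this sketch, so `apply_event(State.FAILED, Event.WAKE_UP)` raises a `ValueError`.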
When the trigger event (tr) occurs to the driver, the reaction event (re ∈ {W, S, F, R}) is propagated/applied to the target with propagation probability p. When the propagated dependency condition is no longer satisfied, the target unit instantaneously returns to the fully active state, unless the dependency has a failure reaction. Four types of trigger and reaction events can be identified: wake-up (W), repair (R), sleep (S) and failure (F). Combining trigger and reaction events, 16 types of simple dependencies are identified. They can also be composed into more complex dependencies, representing complex/composed dependent behaviors. In the generic dependency shown in Figure 1(b), trigger and reaction events are identified by a string close to the circle marking the target side. In it, each simple event composing the trigger or the reaction is indicated by a letter (W, R, S or F, as above), with trigger and reaction separated by a slash (/). In case of composition among dependencies, the composed trigger event is represented as a condition on the simple triggers. Two main classes of trigger event composition operators are identified: temporal (>,