CS-1993-12

Specification and Solution of Dependability Models of Fault-Tolerant Systems
Manish Malhotra

Department of Computer Science
Duke University
Durham, North Carolina 27708-0129
May 14, 1993

Specification and Solution of Dependability Models of Fault-Tolerant Systems
Manish Malhotra
May 14, 1993

Supervised by Kishor S. Trivedi

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

This document is a reformatted version of the dissertation and is equivalent in content.

Copyright © 1992 by Manish Malhotra
All rights reserved

Abstract

The modeling and analysis methodology consists of three main phases: model specification, model generation, and model solution. We consider some specific problems in each of these areas.

First, we establish a hierarchy of dependability model types according to their modeling power. Algorithms to convert one model type into another are provided. We show that fault trees with repeated events (FTREs) are the most powerful combinatorial model type. Then we show how Petri-net based models can be used for dependability modeling. Algorithms to convert an FTRE model to equivalent generalized stochastic Petri net (GSPN) and stochastic reward net (SRN) models are presented. Our comparison reveals that SRNs permit a much more concise description of dependability models than GSPNs do. We then present a methodology for formal expression of hierarchy in model specification and solution that offers a unified view of various kinds of hierarchical modeling techniques, including iterative hierarchical modeling based on fixed-point iteration, non-iterative hierarchical modeling, reward-based performability modeling, behavioral decomposition, and approximate model decomposition.

Model generation consists of converting from the specification-model-type to the solution-model-type. We consider the conversion of semi-Markov models to Markov models using the technique of phase approximations. We describe a complete approach to phase approximations, including choice of phase-approximation class, estimation of the selected parameters, and implementation of the approximation approach in a modeling toolkit. We also describe a new hybrid approach for parameter estimation that combines moment matching with least-squares fitting.

Model solution is the next step after model generation. We describe an approach to design efficient methods for numerical transient solution of stiff Markov chains. Our approach uses a combination of explicit and implicit ODE methods.

Finally, we describe an application of dependability modeling and analysis. We model and analyze the dependability of disk array systems. We develop detailed models and introduce new measures to compare various RAID (Redundant Arrays of Inexpensive Disks) architectures. The coverage of disk failures is analytically computed based on the error detection and correction mechanism. Models that take into account placement strategies of support hardware are also developed.


Acknowledgements

This thesis would not have been possible without the support of many people. The constant support, encouragement, and love of my parents during the twenty years of my school education have finally led to this thesis. My gratitude for their understanding and support of my decision to leave my home town for undergraduate education and then leave my home country for graduate studies cannot be expressed in words.

I am greatly indebted to my advisor Prof. Kishor Trivedi. I have constantly learnt from his knowledge and invaluable experience in the last four years. His patience while I weekly tossed between the "Ph.D." or "No Ph.D." decision for a long while was remarkable. He provided me the moral support when I needed it most (after I decided to return from Vienna). Whether I failed or succeeded, words of inspiration and encouragement never ceased coming from him. Besides the academic support he provided, he has always been concerned with my welfare and any personal problems I had. For all that he did, I cannot thank him enough!

While my parents have been far away in India, it is my sister and her family who provided me the care and family support through the good and bad times I had in the USA. She always encourages me to do better and sets goals for me higher than I usually achieve!

Jogesh helped me integrate with the FTSS group and provided invaluable help when I was about to enter the world of modeling. He introduced me to the modeling tool SPNP, which I have used extensively ever since. Words cannot express my gratitude for Ricardo Pantazis. Rarely have I seen such an extraordinary willingness to help others and share knowledge. Extensive discussions with him on various numerical techniques aided my research significantly. Another friend whose influence I shall always cherish is Pankaj. He provided me the motivation I needed and urged me to do independent research, which eventually led to an extremely fruitful summer of 92.

I would like to thank members of the FTSS group including Dimitris, Hoon, Lorrie, Ramesh, Steve, Varsha, Wei and others for useful discussions. Many thanks to Hemant, Ramesh, Sandeep, and Srinivas for helping me through the difficult process of settling down in a new country. My officemates (Apratim, Boyce, and Thomas) deserve special thanks for making our office a wonderful and lively place to spend hours at a stretch. Ingrid deserves special mention for pleasant surprises (home-made chocolate cakes left on my desk)! Boyce and Jon deserve special thanks for respectively bringing the dart and chess boards into the office. Thanks also to two special friends, Brock and Dave, who made my stay at Duke a memorable one. No matter how much, I cannot thank Subro enough for his help on several occasions. I am grateful to him and Pankaj for providing their help and support when I had to defend this thesis despite my sickness.

I am thankful to AT&T Bell Laboratories, Holmdel for offering me two summer internships. During these two summers, I learnt a lot from working with Michele Carey, Michael Luvalle and Andrew Reibman. I am grateful to Andrew Reibman for providing critical review of some of my research. I would also like to acknowledge the support of the National Science Foundation and the Naval Surface Warfare Center for a research assistantship under grants CCR-9108114 and N60921-92-C0161 respectively. Last but not least, I am very grateful to Eric Smith and Dianne Himler at the engineering library for always lending me a helping hand in my search for books and journals.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Contributions
  1.2 Organization

2 Literature Survey
  2.1 Model Specification
  2.2 Model Generation
  2.3 Model Solution
  2.4 Dependability Modeling and Analysis of Disk Array Systems

3 Power-Hierarchy of Dependability Model Types
  3.1 Fault-Tolerant Multiprocessor System
  3.2 Combinatorial Model Types
    3.2.1 Reliability Block Diagrams (RBDs)
    3.2.2 Fault Trees Without Repeated Events (FTs)
    3.2.3 Fault Trees with Repeated Events (FTREs)
    3.2.4 Reliability Graphs (RGs)
  3.3 Hierarchy Among Combinatorial Model Types
    3.3.1 Fault Trees to Reliability Block Diagrams
    3.3.2 Reliability Block Diagrams to Fault Trees
    3.3.3 Fault Trees to Reliability Graphs
    3.3.4 Reliability Graphs to Fault Trees with Repeated Events
    3.3.5 Fault Trees with Repeated Events to Reliability Graphs
  3.4 Markovian Model Types
  3.5 Conclusions

4 Dependability Modeling Using Petri-Net Based Models
  4.1 Generalized Stochastic Petri-Nets
    4.1.1 FTREs to GSPNs
  4.2 Stochastic Reward Nets
    4.2.1 FTREs to SRNs
  4.3 Modeling Repair (Without Dependency)
    4.3.1 Modeling Repair in GSPN Models
    4.3.2 Modeling Repair in SRN Models
  4.4 Modeling Repair Dependencies
    4.4.1 FCFS Repair Discipline
    4.4.2 Pre-emptive Resume Priority Repair Discipline
    4.4.3 Non-pre-emptive Priority Repair Discipline
    4.4.4 Processor-sharing Repair Discipline
  4.5 Conclusions

5 Formal Expression of Hierarchy in Model Solution
  5.1 Hierarchical Modeling
  5.2 Formal Expression of Hierarchy in Model Solution
  5.3 Examples
    5.3.1 Hierarchical Reliability Model of Disk Arrays
    5.3.2 Hierarchical Reliability Model Based on Behavioral Decomposition
    5.3.3 Performance Model of an Interactive System
    5.3.4 Performability Model of a Mirrored Disk System
    5.3.5 An Availability Model
  5.4 Conclusion

6 Phase-Approximations for semi-Markov Models
  6.1 Mathematical Formalism
  6.2 Phase Approximation Methods
    6.2.1 Phase Approximations to Various Distributions
    6.2.2 Fitting Parameters of Phase Approximations
  6.3 Software Description
    6.3.1 User Interface
    6.3.2 Conversion Process/Algorithm
  6.4 Numerical Example
  6.5 Conclusions

7 Efficient Transient Analysis of Stiff Markov Models
  7.1 Mathematical Formalism
  7.2 Stiff Markov Chains and ODE Solution Methods
    7.2.1 Non-stiff methods
    7.2.2 Stiff Methods
  7.3 Basic Approach
    7.3.1 TR-BDF2 Method
    7.3.2 Third Order Implicit Runge-Kutta Method
  7.4 Implementation Aspects
  7.5 Numerical Results
    7.5.1 Models Used
    7.5.2 Numerical Results
  7.6 Conclusions

8 Dependability Modeling of Disk Array Systems
  8.1 Disk Errors and Failures
  8.2 Failure Modes of Individual Disks
    8.2.1 Disk Arrays
  8.3 Dependability Analysis of Disk Arrays
    8.3.1 Dependability Measures of Disk Arrays
    8.3.2 Dependability Models of Disk Arrays
  8.4 Disk Array Organizations
    8.4.1 RAID-1 (Duplexed data)
    8.4.2 RAID-2 (Hamming coded ECC)
    8.4.3 RAID-3 (Bit-interleaved data)
    8.4.4 RAID-4 (Block-interleaved data)
    8.4.5 RAID-5 (Block-interleaved data and rotated parity)
  8.5 Disk Arrays With Support Hardware
    8.5.1 Serial Placement of Support Hardware
    8.5.2 Orthogonal Placement of Support Hardware
  8.6 Numerical Results
    8.6.1 How reliable should each disk be?
    8.6.2 How does probability of byte error affect?
    8.6.3 How low should the mean recovery time be?
    8.6.4 How does transient error rate affect disk array reliability?
    8.6.5 Is RAID reliability scalable?
    8.6.6 Are disk arrays reliable for mission-critical systems?
    8.6.7 How much does orthogonal placement of support hardware help?
  8.7 Conclusions

9 Future Work
  9.1 Model Specification
  9.2 Model Generation
  9.3 Model Solution
  9.4 Dependability Modeling of Disk Arrays

Biography

List of Figures

3.1 A fault-tolerant multiprocessor system
3.2 Reliability Block Diagram of the multiprocessor system
3.3 Fault tree model of the multiprocessor system
3.4 Multiprocessor system with shared memory
3.5 FTRE model of the multiprocessor system with shared memory
3.6 Reliability graph for the multiprocessor system with shared memory
3.7 Conversion algorithm for FT to RBD
3.8 Conversion algorithm for RBD to FT
3.9 Conversion algorithm for FT to RG
3.10 Conversion of the FT model of multiprocessor system to RG model
3.11 Conversion algorithm for RG to FTRE
3.12 Converting a RG model to FTRE model
3.13 An equivalent FT without repeated nodes for an RG with a shared edge
3.14 FTRE Model of a TMR System
3.15 Converting a CTMC to a GSPN
3.16 Model hierarchies among dependability model types

4.1 Conversion algorithm for FTRE to GSPN
4.2 GSPN subnets for converting a FTRE model to a GSPN model
4.3 GSPN model of the multiprocessor system with shared memory
4.4 GSPN subnet when failure-time distribution has mass at zero
4.5 GSPN subnet when failure probability of a component is specified
4.6 Conversion algorithm for FTRE to SRN
4.7 SRN Model of the multiprocessor system with Shared Memory
4.8 GSPN model of the multiprocessor with shared memory (with repair)
4.9 SRN Model of the multiprocessor system with Shared Memory (with repair)
4.10 SRN subnet for modeling FCFS repair discipline
4.11 GSPN subnet for modeling FCFS repair discipline
4.12 SRN subnet for modeling pre-emptive resume priority repair discipline
4.13 GSPN subnet for modeling pre-emptive resume priority repair discipline
4.14 GSPN/SRN subnet for modeling non-pre-emptive priority repair discipline
4.15 GSPN/SRN subnet for modeling processor-sharing repair discipline

5.1 Reliability block diagram for RAID
5.2 Markov reliability model of a group of disks
5.3 Organization of the overall disk array reliability model
5.4 A more compact organization of the overall model
5.5 An even more compact organization of the overall model
5.6 Reliability model of a three-component system
5.7 Fault-error handling (FEH) submodel
5.8 Semi-Markov NCF competition (NCFC) submodel
5.9 Fault-occurrence (FO) submodel
5.10 Overall model for N-component system
5.11 Queuing network model of an interactive computer system
5.12 SRN model of the interactive computer system
5.13 SRN submodel of the CPU-I/O system
5.14 Aggregated SRN of the interactive computer system
5.15 Overall aggregated model
5.16 Markov dependability model of a mirrored disk system
5.17 Markov dependability model of a single disk
5.18 Overall performability model of a mirrored disk system
5.19 Availability submodel of a subsystem (shared repair person)
5.20 Series RBD for the overall system
5.21 Overall availability model of the system with shared repair person among subsystems

6.1 Mixture of two Erlangs approximation to a lognormal pdf
6.2 Mixture of three Erlangs approximation to a lognormal pdf
6.3 Mixture of four Erlangs approximation to a lognormal pdf
6.4 Transition diagram for Markov chain of parallel system
6.5 SHARPE file for Markov chain of parallel system
6.6 Semi-Markov chain for two-component parallel system
6.7 GSHARPE file for parallel system with constant-time repair
6.8 Approximate chain for parallel system with constant-time repair
6.9 SHARPE file for approximation of parallel system with constant-time Repair
6.10 Approximate Markov chain for parallel system with lognormal repair
6.11 SHARPE file for approximation to parallel system with lognormal repair
6.12 Semi-Markov model for parallel system with competing Weibulls
6.13 GSHARPE file for parallel system with competing Weibulls
6.14 SHARPE file for approximation of parallel system with competing Weibulls
6.15 Approximate Markov chain of the parallel system with competing Weibulls
6.16 Results for the parallel system

7.1 Stiff Markov chain solver algorithm
7.2 Upper bound reliability model of SEN+ network
7.3 Reliability model of a subsystem of two SEs in parallel
7.4 Reliability model of m SEs in series
7.5 C.mmp system
7.6 Markov model of M/M/1/K queue
7.7 Accuracy versus error tolerance - SEN+ model
7.8 Accuracy versus error tolerance - C.mmp model
7.9 CPU time versus error tolerance - SEN+ model
7.10 CPU time versus error tolerance - C.mmp model
7.11 CPU time versus mission time - SEN+ model
7.12 CPU time versus mission time - C.mmp model
7.13 CPU time versus repair rate - SEN+ model
7.14 CPU time versus repair rate - C.mmp model
7.15 CPU time versus size - M/M/1/K model

8.1 Reliability block diagram for RAID
8.2 Dependability model for a group of disks
8.3 Dependability model for a group of disks with finite number of spares
8.4 Reliability block diagram of RAID (1,2,3,4,5) with serial placement of support hardware
8.5 RAID organization with orthogonal placement of support hardware
8.6 Approximate reliability model for orthogonal RAID (1,2,3,4,5)
8.7 MTTDL versus mean time to hard disk failure
8.8 Data integrity versus probability of byte error
8.9 Data integrity versus mean recovery time
8.10 DCT (in seconds per year) versus mean recovery time
8.11 MTTDL (in hours) versus byte error rate
8.12 MTCE (in hours) versus byte error rate
8.13 MTCE (in hours) versus storage capacity
8.14 Data loss reliability versus mission time
8.15 Data loss reliability versus time (in hours)

List of Tables

5.1 Interconnection matrix for RAID model organization
5.2 Interconnection matrix for a compact RAID model organization
5.3 Interconnection matrix for hierarchical reliability model of non-repairable N-component system
5.4 Interconnection matrix for hierarchical performance model of an interactive computer system
5.5 Interconnection matrix for a performability model of mirrored disk system
5.6 Partial interconnection matrix for availability model of a system with shared repair person
5.7 Partial interconnection matrix for availability model of a system with shared repair person

Chapter 1

Introduction

Fault-tolerant computer systems are used in a variety of applications that require high reliability or availability. For instance, computer systems used for flight control in aircraft and spacecraft must provide service without failing until the end of the mission time. Such systems have a high reliability requirement. On the other hand, computer systems used in database applications and communication networks are required to be operational for as high a fraction of time as possible (there is no critical mission time in this case). Such systems are required to possess high availability. Gray and Siewiorek [71] classified systems into availability classes 1 to 7 according to the number of leading nines in their availability. For example, a system with 99.999 percent availability is a class 5 system.

Laprie [101] coined the term dependability as a qualitative measure of the quality, correctness, and continuity of service delivered by a system. The term "dependability" encompasses various measures such as reliability, availability, safety, etc. Dependability models are widely used to evaluate alternative system architectures, develop maintenance strategies, predict warranty costs, and generally guide system design [140, 157].
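The availability classes mentioned above can be illustrated with a small Python sketch. It assumes the class is simply the count of leading nines in the availability value, and it uses the standard steady-state formula MTTF / (MTTF + MTTR), which is not stated in this chapter. The function names and the sample numbers are hypothetical, chosen only for illustration.

```python
import math


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def availability_class(availability: float) -> int:
    """Count the leading nines of an availability value strictly between 0 and 1."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must lie strictly between 0 and 1")
    unavailability = 1.0 - availability
    # The small epsilon guards against binary round-off in 1 - availability,
    # which would otherwise push exact powers of ten just below the boundary.
    return math.floor(-math.log10(unavailability) + 1e-9)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical sample values: two fixed availabilities and one derived
    # from an assumed MTTF of 10,000 hours and MTTR of 1 hour.
    samples = (0.999, 0.99999, steady_state_availability(10_000.0, 1.0))
    for a in samples:
        print(f"A = {a:.8f}  class = {availability_class(a)}  "
              f"downtime = {downtime_minutes_per_year(a):.2f} min/year")
```

The sketch treats availability as a single steady-state number; the reliability requirement described in the same paragraph (no failure before the end of a mission) is a different measure and is not captured here.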

1.1 Contributions

This thesis presents some new results in the specification, generation, and solution methodology of dependability models. Although the emphasis is on dependability models and the illustrative examples are dependability models of fault-tolerant systems, some of the results are applicable to performance models as well.

We first establish a hierarchy among the most commonly used types of dependability models according to their modeling power. By modeling power, we mean the kinds of systems and behaviors that can be modeled. Among the combinatorial (non-state-space) model types, we show that fault trees with repeated events (FTREs) are the most powerful in terms of the kinds of dependencies among the components of a system that can be modeled (which is one metric of modeling power). Reliability graphs are less powerful than fault trees with repeated events but more powerful than reliability block diagrams and fault trees without repeated events. By virtue of the constructive nature of our proofs, we also provide algorithms for converting from one model type to another.

Among the Markovian (state-space) model types, we consider continuous-time Markov chains (CTMCs), generalized stochastic Petri nets (GSPNs) [3], and stochastic reward nets (SRNs) [29]. These are more powerful than combinatorial model types in that they can capture dependencies such as a shared repair facility between system components. We provide algorithms to convert an FTRE model into equivalent GSPN and SRN models. We also illustrate how such dependencies and various scheduling disciplines (for the repair queue) such as FCFS, processor-sharing, pre-emptive resume priority, etc. can be modeled by GSPNs and SRNs. Thus, if the operational dependence of a system on its components is specified by means of a fault tree and a description of repair dependence is specified in some (other) form, then our methodology provides an automatic way to generate GSPN and SRN models of the dependability of the system. GSPNs and SRNs are known to be isomorphic to CTMCs [3] and Markov reward models (MRMs) [28] respectively. However, it is the conciseness of model description that forms another basis for comparing state-space model types. Our comparative evaluation reveals that SRNs permit a much more concise description of dependability models than GSPNs do because of reward rate specifications and several structural extensions.

To model a complex system, an overall system model may be composed of several submodels of possibly different model types (such as RBDs, FTs, GSPNs, CTMCs, etc.) [142]. This is known as hierarchical modeling. Hierarchical modeling alleviates the problems of model specification, generation, storage, and solution. We present a methodology for formal specification of hierarchy both in model specification and solution. This methodology offers a unified view of various modeling techniques including iterative hierarchical modeling, non-iterative hierarchical modeling, reward-based performability modeling, behavioral decomposition, and approximate model decomposition.

A central assumption in Markov dependability analysis is that failure and repair time distributions are exponential. In many real-life applications, assuming an exponential distribution can be a significant over-simplification. Semi-Markov chains provide a simple mathematical structure for including general distributions in a state-space framework. One of the ways to solve a semi-Markov model is to convert it to a Markov model using the phase approximation approach [38, 43, 81].
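To make the idea of a phase approximation concrete, the following Python sketch replaces a single non-exponential holding time by a chain of exponential stages (an Erlang distribution) chosen by matching the first two moments. This is only the simplest moment-matching variant, written under stated assumptions: the function names are hypothetical, the stage cap `max_stages` is an arbitrary choice, and the sketch does not implement the hybrid moment-matching and least-squares fitting developed in the thesis.

```python
def fit_erlang(mean: float, variance: float, max_stages: int = 50):
    """Choose an Erlang(k, rate) with the given mean and a variance as close as possible.

    An Erlang with k stages of rate r has mean k/r and variance k/r**2, so its
    squared coefficient of variation is 1/k. Distributions that are more
    variable than an exponential fall back to k = 1 and are not matched in
    variance; a different phase class would be needed for those.
    """
    if mean <= 0.0 or variance < 0.0:
        raise ValueError("mean must be positive and variance non-negative")
    if variance == 0.0:
        # A constant (deterministic) time can only be approached as k grows.
        stages = max_stages
    else:
        stages = min(max_stages, max(1, round(mean * mean / variance)))
    return stages, stages / mean


def erlang_moments(stages: int, rate: float):
    """Mean and variance of an Erlang with the given stage count and stage rate."""
    return stages / rate, stages / (rate * rate)


def expand_transition(stages: int, rate: float):
    """Generator matrix of the Markov chain that replaces one general transition.

    Rows 0..stages-1 are the exponential phases; the last row is the state
    reached when the original (non-exponential) transition completes.
    """
    size = stages + 1
    generator = [[0.0] * size for _ in range(size)]
    for phase in range(stages):
        generator[phase][phase] = -rate
        generator[phase][phase + 1] = rate
    return generator


if __name__ == "__main__":
    # Hypothetical repair time with mean 2 hours and variance 1 hour^2.
    k, r = fit_erlang(2.0, 1.0)
    print("stages:", k, "stage rate:", r, "moments:", erlang_moments(k, r))
    for row in expand_transition(k, r):
        print(row)
```

Each general transition expanded this way adds states to the resulting Markov chain, so the number of stages trades approximation accuracy against model size.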
The phase-approximation conversion can be viewed as going from a specification-model-type to a solution-model-type. We discuss a complete approach to phase approximations, including choice of phase approximation class, estimation of the selected parameters, and implementation of the approximation approach in a modeling toolkit. We describe a new hybrid approach for parameter fitting that combines moment matching with least-squares fitting. This approach allows matching of a few moments as well as fitting of the shape of the distribution (or density) function.

Markov dependability models are usually stiff [20, 35, 50, 59, 102, 108, 137] due to the extreme disparity between failure rates and repair rates. We propose a technique to design efficient methods that use a combination of explicit (non-stiff) and implicit (stiff) ODE methods for numerical transient analysis of stiff Markov models. For an initial phase of the solution interval, a non-stiff ODE method is used to advance the solution in time. After that, a stiff ODE method is used to advance the solution until the end of the solution interval. Two specific methods based on this approach have been implemented. Both methods use the Runge-Kutta-Fehlberg method [53] as the non-stiff method. One uses the TR-BDF2 method [9] as the stiff method, while the other uses an implicit Runge-Kutta method [57] as the stiff method. Numerical results obtained from solving dependability models of a multiprocessor system and an interconnection network are presented. These results show that the methods obtained using this approach are much more efficient than the corresponding stiff methods that have been proposed to solve stiff Markov models.

Finally, we describe an application of dependability modeling and analysis to high-performance disk array systems. We develop hierarchical dependability models for various organizations of redundant arrays of inexpensive disks (RAID-1, 2, 3, 4, 5) [132] and use these models to answer important questions arising in the design of these architectures.
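As a baseline for the disk array models mentioned above, the following Python sketch computes the mean time to data loss of one single-parity group from a three-state Markov chain that considers hard disk failures only. It is a textbook-style illustration, not the hierarchical model developed in the thesis: the state structure, the function names, and the numeric rates are assumptions made for this example.

```python
import math


def mttdl_closed_form(disks: int, failure_rate: float, repair_rate: float) -> float:
    """Mean time to absorption for the chain: 0 failed -> 1 failed -> data loss.

    Assumed transition rates: disks * failure_rate from state 0 to state 1,
    repair_rate from state 1 back to state 0, and (disks - 1) * failure_rate
    from state 1 to the absorbing data-loss state.
    """
    n, lam, mu = disks, failure_rate, repair_rate
    return ((2 * n - 1) * lam + mu) / (n * (n - 1) * lam * lam)


def mttdl_from_generator(disks: int, failure_rate: float, repair_rate: float) -> float:
    """Same quantity, obtained by solving Q_T * tau = -1 over the transient states."""
    n, lam, mu = disks, failure_rate, repair_rate
    # Transient block of the generator matrix (states: 0 failed, 1 failed).
    a00, a01 = -n * lam, n * lam
    a10, a11 = mu, -((n - 1) * lam + mu)
    # Cramer's rule for the 2x2 system with right-hand side (-1, -1).
    determinant = a00 * a11 - a01 * a10
    return (-a11 + a01) / determinant


if __name__ == "__main__":
    # Hypothetical parameters: 5 disks, disk MTTF 100,000 h, mean repair time 10 h.
    n, lam, mu = 5, 1.0e-5, 0.1
    exact = mttdl_closed_form(n, lam, mu)
    numeric = mttdl_from_generator(n, lam, mu)
    rule_of_thumb = (1.0 / lam) ** 2 / (n * (n - 1) * (1.0 / mu))
    print("closed form  :", exact)
    print("linear solve :", numeric)
    print("MTTF^2 / (N (N-1) MTTR):", rule_of_thumb)
    print("agree:", math.isclose(exact, numeric, rel_tol=1e-6))
```

With a repair rate several orders of magnitude larger than the failure rate, the determinant is a small difference of two nearly equal products, which hints at the numerical difficulty that widely separated rates cause in larger models.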
Traditionally, only hard disk failures have been modeled and data loss is the only failure mode considered to analyze the reliability of disk arrays. However, transient faults could lead to a catastrophic error such as incorrect data being passed to the user. Our models consider both the transient errors and hard failures. This leads to analytic computation of the probability of catastrophic disk failures based on several factors, including the byte error rate of a disk and the error correcting code (ECC). We introduce several new measures such as data integrity (probability of no data corruption until mission time) and mean time to catastrophic error (MTCE) (mean time to data corruption). To make the models realistic, we also take into account several factors which have hitherto not been considered; mainly, byte error rate, predictive disk failures, and the type of spares (cold or hot). Traditionally used measures like mean time to data loss (MTTDL) and data availability remain virtually unchanged with change in mean recovery time. A new dependability measure known as degraded capacity time is analyzed to bring out the effect of mean recovery time on dependability of disk arrays.

Our analysis reveals that the mirrored disk organization has higher MTTDL than other disk array organizations if the only failure mode considered is data loss while catastrophic errors are ignored. However, if catastrophic errors are taken into account, then RAID-3, 4, and 5 organizations have higher data integrity than other disk array schemes. We also develop models that take into account the reliability of support hardware components and different placement schemes for arranging support hardware such as power supply. In serial placement of support hardware, the support hardware components are placed in series with the disk array and form a single point of failure in the disk array system. In orthogonal placement of support hardware, the support hardware components are placed orthogonal to the parity groups of disks such that failure of a support hardware component can be tolerated. Our analysis reveals that RAID-1 benefits the most from orthogonal placement of support hardware.

1.2 Organization

The rest of this thesis is organized as follows. In Chapter 2, we present the literature survey of the various topics addressed in the thesis. In Chapter 3, we establish the power-hierarchy among various dependability model types including the combinatorial model types (RBDs, RGs, FTs, and FTREs) and state-space model types (GSPNs, SRNs, and CTMCs). In Chapter 4, we show how Petri-net based model types (GSPNs and SRNs) can be used for dependability modeling. By various subnet constructions, we illustrate how repair dependency such as shared repair facility and scheduling disciplines such as FCFS, processor sharing, and priority repair can be modeled. A methodology for formal specification of hierarchy in model solution is described in Chapter 5. In Chapter 6, we describe a complete approach to phase approximations used for converting a semi-Markov model to a Markov model. This includes a description of parameter estimation techniques, choice of phase approximation class, and generation of the Markov model. This entire approach is implemented as a modeling toolkit called GSHARPE which is also described. In Chapter 7, a computationally efficient approach for numerical transient analysis of stiff Markov models is described. Two methods based on this approach are implemented and numerical results are described. An application of dependability modeling and analysis is described in Chapter 8. Detailed dependability models of disk arrays are presented. The dependence of various dependability measures such as data integrity and mean time to data loss on different parameters of a disk array is characterized. A study of these characteristics reveals the implications of several design issues of a disk array on its dependability. Finally, we discuss the possibilities for future work in Chapter 9.


Chapter 2

Literature Survey

Modeling and analysis of a system consist of three main phases: model specification, model generation, and model solution. We discuss each of these phases as addressed in this thesis.

2.1 Model Specification

Dependability and performance models of complex computer and communications systems are extremely large. A comprehensive account of the specification and generation of large models can be found in Haverkort and Trivedi [74]. Dependability models are also very stiff because of extreme disparity in the failure and repair rates of system components. Combinatorial model types for dependability analysis such as fault trees, reliability block diagrams, and reliability graphs do not suffer from state-space explosion [86, 140]. However, these model types cannot model various dependencies such as repair dependency (shared repair between two system components). Among performance model types, product-form queuing networks and series-parallel directed acyclic graphs could be used for compact system representation. However, there still remain many instances where non-state-space model types do not suffice to model certain kinds of system behavior. For instance, product-form queuing networks do not permit internal concurrency within jobs, while series-parallel directed acyclic graphs do not allow resource contention. In such cases, state-space model types must be used.

An overview of various dependability model types appears in [60, 86, 106, 140, 144]. These studies informally discuss various model types, the kinds of dependencies that can be modeled by different model types, and various dependability measures that can be evaluated using these model types. However, to the best of our knowledge, there has been no formal comparative evaluation of the modeling power of various model types, apart from the following few studies. By modeling power, we mean the kinds of system behavior that can be modeled and the measures that can be computed. Using probabilistic arguments, Shooman [149] showed the equivalence between RBDs and FTs (without repeated events), i.e., any system that can be modeled by RBDs can also be modeled by FTs and vice-versa.
Hura and Atwood [82] showed how Petri net models can be used to represent coherent fault trees. They showed that an equivalent Petri net representation allows study of the dynamic behavior of the model and offers more insightful treatment of fault detection and propagation.

Another important issue is the conciseness of model specification that a model type offers. Markov models of realistic systems may have a hundred thousand states or more. It is virtually impossible to manually specify large Markov models with so many states. To alleviate the problem of model specification, higher level model types such as Petri-net model types (generalized stochastic Petri nets (GSPNs) [3] and stochastic reward nets (SRNs) [29]) are used. These model types offer a much more concise description of system behavior. Stochastic Petri net based models have been extensively used for performance and performability modeling in the analysis of computer and communication systems [2, 3, 32, 114, 118, 121, 146]. However, in the dependability modeling community, Petri net based models have received considerably less attention [13, 84, 91, 124, 122]. A GSPN or an SRN model is automatically converted to a Markov model which is then solved. This is possible since Markov models (Markov reward models) are known to be isomorphic to GSPNs [3] (SRNs [29]). However, such higher level model types do not alleviate the problems of storing and solving the large Markov model that results after conversion.

To overcome the problem of largeness in model solution, several techniques have been proposed that decompose the generator matrix of the Markov model and yield an approximate solution [20, 39]. This alleviates the problem of largeness in model solution but the problem of large model generation and storage still persists. Therefore the need arises for model decomposition approaches that work at the model level rather than at the matrix level. Examples of such techniques include the flow-equivalent server approximation introduced by Chandy et al. [25], the model decomposition technique used in the software tool HARP [51], and the approach used by Balbo et al. [8]. In these approaches, the overall model is not generated but instead smaller models are generated whose solutions are combined to yield the overall model solution. These are examples of hierarchical composition. It holds a lot of promise since it alleviates all the problems including model specification, generation, storage, and solution. Depending upon the kind of system being modeled, this approach could result in an approximate solution or an exact solution.
Hierarchy in modeling manifests itself in a wide variety of modeling and solution techniques including fixed-point iteration schemes for solving models [27, 33, 75, 117, 155], reward-based performability analysis (decomposition of the overall model into structural and performance models) for both specification and solution of performability models [113, 159], and non-iterative hierarchical (hierarchical composition or decomposition) modeling [15, 105, 142].

Several attempts have been made in the past to formalize the specification of system models. Berson et al. [12] proposed an object-oriented paradigm for model specification and generation. Hillston [77] proposed a modular approach to modeling which aims to make models more accessible to non-experts. These efforts concentrate on formalizing model specification and model generation. A more general hierarchical modeling environment is allowed in the modeling toolkit SHARPE [142, 144]. It allows hierarchy among models of different types (such as fault trees, Markov chains, stochastic Petri nets, queuing networks, etc.). Contrary to other approaches which utilize hierarchy only for model generation and specification, this approach uses hierarchy for model solution as well. In SHARPE, the hierarchy among various submodels is implicit. The order of solution of submodels is specified in the input file and parameter passing between submodels is performed by passing the results from one submodel to another.

2.2 Model Generation

In many cases, the model type used to specify the model is converted to another model type for solution of the model. In such cases, the next phase consists of model generation (generation of the model used for solution), i.e., conversion from one model type (usually higher level) to another model type (usually lower level). Examples of this include several modeling toolkits such as MARCA [152] (converts a balls and buckets description of a system into a Markov model for solution), HARP [51] (converts a fault tree model into a Markov model for solution), TANGRAM [12] (converts an object-oriented model description into a Markov model), SPNP [31] (converts a stochastic reward net model into a Markov reward model for solution), UltraSAN [41] (converts a stochastic activity network [114] model into a Markov reward model), ESP [43], SURE [24] and ASSIST [23] (ASSIST converts a high-level description of a model to a semi-Markov model which is solved by SURE), etc.

We address the problem where a semi-Markov model is the model type used for specification while the model type used for solution is a Markov model. Semi-Markov models are important since some critical aspects of system behavior cannot be easily captured in a Markov model. In particular, analyses of field failure and repair times indicate that many failure and repair-time distributions are simply not exponential. For example, electronic component failure times often follow a Weibull or lognormal distribution [97]. Repair and outage times often show a great deal of statistical variability, and can be approximated fairly well by a lognormal or other distribution with a "heavy tail" [6, 136]. The effect of different repair-time distributions on system mean time to failure, reliability, and availability has been studied for some time. For example, in a two-component parallel system where the components have exponential failure-time distributions and the mean of the unit repair time is fixed, changing the distribution of the repair times or their variance can cause significant changes in system reliability and availability [56].

A simple extension to Markov chains is to allow holding time distributions to be general, but independent of the history of the process. The holding time thus depends only on the current state of the process and the elapsed time spent in the current state. The resulting process is a semi-Markov chain. Solving semi-Markov chains is more complicated than solving Markov chains.
There are three basic approaches to semi-Markov chain solution [30]: numerical solution of the set of backward Kolmogorov integral equations [58], discrete-event simulation, and phase approximations [38, 43, 81, 157]. The most common, simplest approach is discrete-event simulation [22]. Some efforts have been made on numerical solution in the CARE III package [59, 158]. Numerical methods based on Laplace-transform inversion have also been investigated [30, 90]. Numerical approximations for general distributions were used in the SURE package [24], although they were not specifically phase approximations.

Our focus is on phase approximations. The idea behind phase approximation (or the Coxian method of stages) was described in [42]. Basically, any probability distribution with a rational Laplace-Stieltjes transform (LST) can be represented by the time to absorption of a Markov chain with absorbing states. However, this Markov chain may have complex and negative parameters. Distributions without rational LSTs can be approximated by distributions having rational LSTs, although arbitrarily close approximations may require a Markov chain with a large state space. Still, many distributions can be accurately approximated using Markov chains with a few states. Neuts [125] used phase approximations to represent a wide class of arrival and service-time distributions while avoiding the use of parameters that have no physical interpretation (e.g., complex probabilities). We use the term "phase approximations" as defined by Neuts.

Phase approximations have been discussed in the context of reliability by Singh et al. [151], and Bobbio et al. [18, 16, 17]. Other examples of phase approximations include [81, 157]. Phase approximations were used in the SURF package [38, 59], although SURF was intended only for a restricted class of reliability models.
Johnson and Taaffe consider the accuracy of classes of approximations based on mixtures of Erlangs [88, 89], which are similar to some of the approximations we use. Cumani [43] has designed a software package ESP for evaluation of stochastic Petri nets with phase-type distributed transition times. However, to the best of our knowledge, there is no software package that offers a practical implementation of phase approximations for semi-Markov models. We have designed a front end for the modeling toolkit SHARPE [142]. It takes as input a semi-Markov model and converts it into a Markov model in SHARPE syntax after applying phase approximations for non-exponential distributions.
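As a rough illustration of the idea behind phase approximations, the sketch below replaces a Weibull distribution by an Erlang distribution chosen from its first two moments. This is a plain moment match with hypothetical helper names (`weibull_moments`, `fit_erlang`); it is not the hybrid moment-matching and least-squares approach developed in this thesis, and it only handles the case where the squared coefficient of variation is at most 1.

```python
import math

def weibull_moments(shape, scale):
    """Mean and variance of a Weibull distribution (standard formulas)."""
    mean = scale * math.gamma(1.0 + 1.0 / shape)
    second = scale ** 2 * math.gamma(1.0 + 2.0 / shape)
    return mean, second - mean ** 2

def fit_erlang(mean, variance):
    """Pick an Erlang(k, rate): k stages in series, each exponential(rate).

    The mean is matched exactly; the squared coefficient of variation
    (1/k for an Erlang) is matched only as closely as an integer k allows.
    """
    cv2 = variance / mean ** 2
    k = max(1, round(1.0 / cv2))   # k = 1 falls back to a single exponential
    rate = k / mean
    return k, rate

# Illustrative parameters only (not taken from the thesis).
mean, var = weibull_moments(shape=2.0, scale=1.0)
k, rate = fit_erlang(mean, var)
print("stages:", k, "stage rate:", rate)
print("target cv^2:", var / mean ** 2, "Erlang cv^2:", 1.0 / k)
```

Each Erlang stage becomes one extra state in the generated Markov chain, which is why a closer fit generally costs a larger state space.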

2.3 Model Solution

After model generation, the next phase is model solution. For dependability [59, 66] and performability [44] models, transient solution is of more interest than steady-state solution. Closed-form solution of transient measures of Markov models of real systems is infeasible because of the largeness of the state space. Therefore numerical solution methods are used. Our focus is on numerical transient analysis of finite-state continuous-time time-homogeneous Markov chains (CTMCs). The performance of numerical solution methods is plagued by computational problems like largeness and stiffness [20, 35, 102, 108, 137, 138] of the Markov chain. Numerical methods based on Jensen's method (also known as uniformization or randomization) and on the solution of ordinary differential equations (ODEs) have been proposed and implemented in the past. An overview of numerical methods for transient analysis of Markov and Markov reward models appears in Reibman et al. [139].

Jensen [85] originally proposed the method. Grassmann [67, 68, 69], Keilson [94], and Gross and Miller [72] have analyzed this method. De Souza e Silva and Gail [44] have specialized this method to compute performability measures. Marie [108] has suggested a modification to speed up Jensen's method, but it requires additional matrix products and therefore it is not suitable for large Markov chains. Muppala and Trivedi [121, 123] proposed a technique to detect steady-state of the underlying discrete-time Markov chain (DTMC) which saves a lot of computation time in some cases. Van Dijk [46] has extended this method to solve non-exponential service stochastic networks. A variant of Jensen's method that incorporates steady-state detection and uses Fox and Glynn's method [55] for Poisson probability computation has been implemented and analyzed in [104].
It is shown that significant gains in computation cost can be realized in the cases when the mission time (time at which the model solution is desired) is greater than the time at which the underlying DTMC reaches steady-state. Since it is not known a priori when the Markov chain reaches steady-state, the above variant of Jensen's method with steady-state detection is more advantageous to use than the original Jensen's method. Use of Fox and Glynn's method avoids underflow problems that mostly occur for stiff Markov chains and yields highly accurate results.

The other approach to solving a Markov chain is based on solving a system of ordinary differential equations (ODEs). Grassmann [67] has shown that explicit Runge-Kutta methods can be used for non-stiff Markov models. Clarotti [35] proposed an implicit A-stable method (similar to the trapezoidal rule) for stiff Markov chains. A comprehensive effort to comparatively evaluate different methods is the work by Reibman and Trivedi [137]. They numerically compared Jensen's method, the Runge-Kutta-Fehlberg method (an explicit method), and the TR-BDF2 method (an implicit method). They concluded that TR-BDF2 is the method of choice for stiff Markov models and Jensen's method is the most efficient for non-stiff models. Higher order stiff ODE methods were implemented in [102]. A third-order A-stable generalized Runge-Kutta method and a third-order L-stable implicit Runge-Kutta method were numerically compared with TR-BDF2 and Jensen's method. It was shown that the third-order implicit Runge-Kutta method achieves higher accuracy at much less cost than the TR-BDF2 method for stiff Markov models. Thus, for error tolerances tighter than $10^{-8}$, the implicit Runge-Kutta method is recommended for stiff Markov models. For error tolerances looser than $10^{-8}$, the TR-BDF2 method is recommended for stiff models. For non-stiff models, Jensen's method is recommended.
ODE methods have several advantages over Jensen's method:

• Explicit steady-state detection is not required by ODE methods because if the state probability vector does not change within the tolerance limits over successive time steps, then these methods automatically increase step sizes, thereby causing little extra computational effort.
• The solution is obtained on a set of grid points (instants of time spread over the solution interval) at no additional cost. This makes it possible to examine the solution as a function of time.

Jensen's method has one main advantage over ODE methods. The truncation error incurred by using a finite number of terms of an infinite series can be bounded a priori. The error control in ODE methods is poor since it is not easy to compute the global error from the local errors that occur at each time step of an ODE method. Thus, Jensen's method has better error control than ODE methods.
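The a priori truncation bound mentioned above can be seen in a minimal sketch of Jensen's method (uniformization) for a small CTMC. The function name `uniformization`, the dense list-of-lists generator, and the naive Poisson recursion are assumptions made for illustration; in particular this is not Fox and Glynn's method, so it is only suitable when the product of the uniformization rate and the mission time is moderate.

```python
import math

def uniformization(Q, pi0, t, eps=1e-10):
    """Transient state probabilities pi(t) of a CTMC with generator Q.

    The Poisson series is truncated once the neglected Poisson mass,
    and hence the truncation error, is at most eps.
    """
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))           # uniformization rate
    if q == 0.0:
        return list(pi0)                          # no transitions at all
    # DTMC obtained by uniformization: P = I + Q/q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                     # Poisson probability of 0 jumps
    if weight == 0.0:
        raise ValueError("q*t too large for this naive Poisson recursion")
    v = list(pi0)                                 # holds pi0 * P^k
    result = [weight * x for x in v]
    cumulative, k = weight, 0
    while 1.0 - cumulative > eps:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        cumulative += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Two-state availability model with illustrative rates (not from the thesis):
# state 0 = up, state 1 = down.
lam, mu = 0.001, 1.0
Q = [[-lam, lam], [mu, -mu]]
pi = uniformization(Q, [1.0, 0.0], t=10.0)
print(pi)
```

For this two-state chain the result can be compared with the closed-form availability mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t).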

2.4 Dependability Modeling and Analysis of Disk Array Systems

Finally we consider an application of dependability modeling and analysis. High performance disk array architectures have been proposed in the past few years to match the high performance of CPUs and memories. However, an array of disks is more fault-prone than a single large drive [132] and has low reliability. Failure of a disk may result in loss or corruption of critical data which may have been accumulated through years of research effort and extensive experimentation. This could lead to financial loss or even loss of life. Thus, the demand is to design cost-effective disk systems which not only deliver high performance but also provide high reliability.

To enhance the reliability of disk array systems, several fault-tolerant disk array architectures have been introduced by different researchers using varying degrees of hardware redundancy [14, 70, 96, 110, 131, 132, 145]. Patterson et al. [132] coined the term RAID (Redundant Array of Inexpensive Disks) for fault-tolerant disk array systems with redundancy. They unified different disk system architectures as different levels of RAID (levels 1, 2, 3, 4, and 5).

Some studies have been done in the past to analyze disk array reliability. Gibson [62] and Patterson et al. [132] analyzed the reliability of RAID in terms of mean time to data loss (MTTDL). Bitton and Gray [14] analyzed MTTF for mirrored disks (RAID-1). Gibson and Patterson [64] provided a comprehensive analysis of RAID-1 and RAID-5 reliability and showed that a RAID-5 organization with a few spares yields higher reliability than RAID-1 at lower cost. However, their results are based on a simple approximation. Assuming that the time to failure of a group of disks is exponentially distributed, they compute the reliability of disk arrays using the approximate value of MTTDL. The usefulness of these approximations is that MTTDL and reliability are expressible in closed form.
However, even when the individual disk failure times are exponentially distributed, the time to failure of a group of disks is in general not exponentially distributed. Ng [127] analyzed the effect of sparing (number of spares and whether spares are shared among groups or not) on the reliability of disk arrays. He concluded that nearly all the improvement in disk array reliability is achieved by adding one spare. Schulze et al. [148] showed that the reliability of support hardware components such as the power supply and cooling equipment could severely degrade the reliability of a disk array. They proposed an organization scheme in which support hardware components are placed orthogonal to parity groups of disks.
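The remark that the time to failure of a group of disks is not exponential can be checked on the smallest case: a mirrored pair with per-disk failure rate `lam` and repair rate `mu`, modeled as a three-state Markov chain (both disks up, one up, data loss). The sketch below is an illustration under that textbook model, not one of the disk array models of this thesis; the function names are hypothetical. It compares the exact reliability with the exponential approximation exp(-t/MTTDL).

```python
import math

def mttdl_mirrored(lam, mu):
    """Mean time to data loss of a repairable mirrored pair."""
    return (3.0 * lam + mu) / (2.0 * lam ** 2)

def reliability_exact(t, lam, mu):
    """Exact R(t); s1, s2 are the roots of s^2 + (3*lam + mu)*s + 2*lam^2."""
    a = 3.0 * lam + mu
    s2 = -(a + math.sqrt(a * a - 8.0 * lam ** 2)) / 2.0
    s1 = 2.0 * lam ** 2 / s2      # product of roots is 2*lam^2 (avoids cancellation)
    return (s1 * math.exp(s2 * t) - s2 * math.exp(s1 * t)) / (s1 - s2)

def reliability_exponential(t, lam, mu):
    """Approximation that treats time to data loss as exponential."""
    return math.exp(-t / mttdl_mirrored(lam, mu))

# Illustrative rates only (per hour), not taken from the thesis.
for lam, mu, t in [(1e-5, 0.1, 1e8), (1.0, 1.0, 1.0)]:
    print(lam, mu, reliability_exact(t, lam, mu), reliability_exponential(t, lam, mu))
```

When repair is much faster than failure the two curves are nearly indistinguishable, which is why the approximation is attractive; when the rates are comparable the gap becomes visible.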


Chapter 3

Power-Hierarchy of Dependability Model Types

Model specification is the first step in the process of modeling and analysis of systems. Over the years, several different model types such as reliability block diagrams (RBDs), fault trees (FTs), Markov chains, etc. have been used to model fault-tolerant systems and evaluate various dependability measures. These model types differ from one another not only in the ease of use in a particular application but also in terms of modeling power. For instance, a series-parallel system is naturally modeled by a series-parallel reliability block diagram. Similarly, fault trees are more intuitive in modeling how failure of a component affects the reliability of other subsystems. Thus some model types lend themselves easily to modeling certain kinds of system behavior. The modeling power of a model type is determined by the kinds of dependencies within subsystems that can be modeled and the kinds of dependability measures that can be computed. For instance, if various components of a system share a repair person (repair dependency among components), then FTs or RBDs cannot be used to model the availability of this system. Markov chains or stochastic Petri nets can model such repair dependency.

From among a variety of different model types, a particular model type is chosen to specify a model. The choice of a suitable model type is determined by various factors such as:

• Familiarity of the user with the model type.
• The model type supported by the available modeling toolkit.
• Ease of use in a particular application.
• The kind of system and system behavior to be modeled.
• The measure of system behavior to be computed.
• Conciseness and ease of model specification.

Assume that the first two factors do not play a role (i.e., the modeler is free to choose from different model types).
The decision making process of the modeler can be greatly simplified by a comparative evaluation of various model types according to their modeling power and the conciseness in model specification that they offer. In this chapter, we focus on the modeling power of various dependability model types. The model types we look at are reliability block diagrams (RBDs), fault trees without repeated events (FTs), fault trees with repeated events (FTREs), reliability graphs (RGs), continuous-time Markov chains (CTMCs), generalized stochastic Petri nets (GSPNs), and stochastic reward nets (SRNs). We comparatively evaluate different model types and establish a hierarchy of model types on the basis of their modeling power. For example, to compare model type A with model type B, we either provide an algorithm that converts any instance of model type A to an equivalent instance of model type B (and vice-versa) or we prove that not every instance of model type A can be converted to an equivalent instance of model type B. Some of the relationships that are revealed by our study are obvious and some are not so obvious. Our aim is to provide a dependability engineer with a power-hierarchy of dependability model types which would enable him/her to select from a variety of model types for a given problem.

The rest of this chapter is organized as follows. In the next section, we describe the fault-tolerant multiprocessor system that has been used as the illustrative example in this chapter. In Section 3.2, we describe combinatorial model types. In Section 3.3, we establish the power-hierarchy among combinatorial model types. In Section 3.4, we briefly consider Markovian model types and compare them to combinatorial model types.

[Figure: block diagram with processors P1, P2; memories M1, M2; disks D11, D12, D21, D22; interconnection network N]
Figure 3.1: A fault-tolerant multiprocessor system

3.1 Fault-Tolerant Multiprocessor System

We consider a fault-tolerant multiprocessor system as a running example in this chapter. The basic multiprocessor architecture is shown in Figure 3.1. It consists of two processors P1 and P2, with private memories M1 and M2 respectively. A processor and a memory form a processing unit. Each processing unit is connected to a mirrored disk system. This forms a processing subsystem. Both the processing units are connected via an interconnection network N. The system is functional as long as the interconnection network N is functional and at least one of the processing subsystems is functional. For a processing subsystem to be functional, the processor, the memory module, and at least one of the two disks should be functional. For simplicity and sake of illustration, we restrict ourselves to this two-processor system. This architecture and the corresponding models are easily scaled to a large number of processors.


[Figure: block diagram with blocks N, P1, M1, D11, D12, P2, M2, D21, D22]
Figure 3.2: Reliability Block Diagram of the multiprocessor system

3.2 Combinatorial Model Types

3.2.1 Reliability Block Diagrams (RBDs)

Reliability block diagrams fall into the category of combinatorial (also known as non-state-space) model types [86, 140, 106]. They map the operational dependency of a system on its components and not the actual physical structure. In Shooman's [149] words, RBDs represent the probability of success approach to system modeling. The subsystem representing components in series implies that failure of any component results in failure of that subsystem. The subsystem representing components in parallel implies that only the failure of all the components results in failure of that subsystem. The RBD model for the fault-tolerant multiprocessor is shown in Figure 3.2.
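The series and parallel rules above translate directly into a reliability computation when component failures are independent. The sketch below applies them to the structure described in Section 3.1 (network N in series with two parallel processing subsystems, each a processor, a memory, and a mirrored disk pair in series). The function names and the numerical values are assumptions for illustration only.

```python
from math import prod

def series(*rs):
    """Reliability of blocks in series: every block must work."""
    return prod(rs)

def parallel(*rs):
    """Reliability of blocks in parallel: at least one block must work."""
    return 1.0 - prod(1.0 - r for r in rs)

def multiprocessor_reliability(r_n, r_p, r_m, r_d):
    """System reliability, assuming identical and independent components."""
    subsystem = series(r_p, r_m, parallel(r_d, r_d))
    return series(r_n, parallel(subsystem, subsystem))

# Illustrative component reliabilities (not from the thesis).
print(multiprocessor_reliability(r_n=0.99, r_p=0.95, r_m=0.95, r_d=0.9))
```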

3.2.2 Fault Trees Without Repeated Events (FTs)

Like RBDs, a FT is also a combinatorial model type and maps the operational dependency of a system on its components. However, unlike RBDs, FTs represent the probability of failure approach to system modeling [149]. The phrase "without repeated events" means that inputs to all the gates are distinct. The FT model for the fault-tolerant multiprocessor is shown in Figure 3.3. Failure of a component implies that the corresponding input to the gate becomes TRUE. If any input to an OR gate becomes TRUE, then the output becomes TRUE. If all the inputs to an AND gate become TRUE, only then does the output of the gate become TRUE. The output of the top gate in the FT tells whether the system is operational or not. Many authors have proposed extensions to FTs. Fault trees in which repeated events are allowed are one such extension. Some researchers have included a variety of different gates such as NOT, EXOR, PAND (priority AND gate), kOfn gate [144], and some special gates such as cold spare gates, functional dependency gates, and sequence enforcing gates [49]. Some of these extensions enhance the modeling power of fault trees and some simply increase the ease of use. We consider fault trees with repeated events next.
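The gate semantics just described can be sketched as a small recursive evaluator. The nested-tuple encoding, the helper names, and the exact grouping of gates are assumptions for illustration; the tree is built from the failure conditions stated in Section 3.1 and may not match the gate layout of Figure 3.3 node for node.

```python
def evaluate(node, failed):
    """Return True if the event at `node` occurs, given the set of failed components."""
    if isinstance(node, str):              # leaf: basic component failure event
        return node in failed
    gate, children = node
    results = [evaluate(child, failed) for child in children]
    return any(results) if gate == "OR" else all(results)

def subsystem(p, m, d1, d2):
    """A processing subsystem fails if its processor, its memory, or both disks fail."""
    return ("OR", [p, m, ("AND", [d1, d2])])

# Top event: the network fails, or both processing subsystems fail.
FT = ("OR", ["N", ("AND", [subsystem("P1", "M1", "D11", "D12"),
                           subsystem("P2", "M2", "D21", "D22")])])

print(evaluate(FT, {"P1"}))          # one subsystem lost
print(evaluate(FT, {"P1", "M2"}))    # both subsystems lost
```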


[Figure: fault tree with gates G1 to G6 and basic events N, P1, M1, D11, D12, P2, M2, D21, D22]
Figure 3.3: Fault tree model of the multiprocessor system

3.2.3 Fault Trees with Repeated Events (FTREs)

Fault trees in which different gates are allowed to share inputs are known as fault trees with repeated events. It is obvious that an FTRE is more general than an FT. Let us consider a simple variation on the base multiprocessor system shown in Figure 3.1. Suppose that in this system there is also a shared memory M3 between the two processors P1 and P2. The new system is shown in Figure 3.4. If memory module M1 (M2) fails, then processor P1 (P2) uses memory module M3 and continues to work. We also assume that M3 can be shared by both P1 and P2 in case both M1 and M2 fail. Neither RBDs nor FTs as we have described them can model the dependability of this system because of the inherent non-series-parallel nature of the dependence due to the shared memory module M3. FTREs can model the dependability of this system as shown in Figure 3.5. Since any RBD or FT model can also be expressed as an FTRE, this example establishes that FTREs possess higher modeling power than FTs or RBDs.
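One way to see why the repeated event M3 breaks the simple series-parallel calculation is to compute the system unreliability twice: by brute-force enumeration of all component states, and by conditioning on the state of M3, after which the remaining events are distinct and probabilities multiply. The sketch below assumes independent component failures and the failure conditions stated in the text; the function names and probabilities are hypothetical.

```python
from itertools import product

COMPONENTS = ["N", "P1", "P2", "M1", "M2", "M3", "D11", "D12", "D21", "D22"]

def system_failed(f):
    """Structure function; f maps component name -> True if that component failed."""
    sub1 = f["P1"] or (f["M1"] and f["M3"]) or (f["D11"] and f["D12"])
    sub2 = f["P2"] or (f["M2"] and f["M3"]) or (f["D21"] and f["D22"])
    return f["N"] or (sub1 and sub2)

def unreliability_by_enumeration(q):
    """Sum the probabilities of all 2^10 component states in which the system fails."""
    total = 0.0
    for states in product([False, True], repeat=len(COMPONENTS)):
        f = dict(zip(COMPONENTS, states))
        if system_failed(f):
            p = 1.0
            for name, failed in f.items():
                p *= q[name] if failed else 1.0 - q[name]
            total += p
    return total

def unreliability_by_conditioning(q):
    """Condition on the repeated event M3; given its state, no event is repeated."""
    def given(m3_failed):
        def sub(p, m, d1, d2):
            mem = q[m] if m3_failed else 0.0      # memory is lost only if M3 is also down
            return 1.0 - (1.0 - q[p]) * (1.0 - mem) * (1.0 - q[d1] * q[d2])
        both = sub("P1", "M1", "D11", "D12") * sub("P2", "M2", "D21", "D22")
        return 1.0 - (1.0 - q["N"]) * (1.0 - both)
    return q["M3"] * given(True) + (1.0 - q["M3"]) * given(False)

q = {name: 0.1 for name in COMPONENTS}   # illustrative failure probabilities
print(unreliability_by_enumeration(q), unreliability_by_conditioning(q))
```

Enumeration grows exponentially with the number of components, whereas conditioning only doubles the work per repeated event, which is the usual motivation for factoring on repeated events.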

3.2.4 Reliability Graphs (RGs)

Graphs have been extensively used as model types to model network reliability [36]. We consider a special class of digraphs that is used as a model type in the software tool SHARPE [144]. A reliability graph is an acyclic digraph G = (U, V), where U is the set of nodes and V is the set of directed edges. There are two special nodes in U, labeled source and sink. The source node has no incoming edges and the sink node has no outgoing edges. V consists of two kinds of edges: component-edges and 1-edges. Component-edges are labeled by component names and 1-edges are labeled 1. The remaining nodes (other than the source and sink nodes) do not represent anything. For each component in the system, there is at most one edge in the RG, i.e., repeated edges are not allowed. Failure of a component is indicated by failure of an edge. 1-edges and the nodes do not fail. The system modeled by an RG is considered operational as long as there is at least one path with no failed edge from the source node to the sink node. The reliability graph of the multiprocessor system with shared memory is shown in Figure 3.6.

Figure 3.4: Multiprocessor system with shared memory

Figure 3.5: FTRE model of the multiprocessor system with shared memory

Figure 3.6: Reliability graph for the multiprocessor system with shared memory
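The operational criterion of a reliability graph (at least one source-to-sink path with no failed edge) amounts to a reachability test. The sketch below is a minimal illustration in Python; the edge list is an assumed reading of Figure 3.6, and the intermediate node names (a1, b1, c, m, and so on) are hypothetical, since the text does not name them.

```python
# Assumed reliability graph for Figure 3.6, as (tail, head, label) triples.
# The label "1" marks a 1-edge, which never fails.
RG_EDGES = [
    ("src", "a1", "D11"), ("src", "a1", "D12"), ("a1", "b1", "P1"),
    ("src", "a2", "D21"), ("src", "a2", "D22"), ("a2", "b2", "P2"),
    ("b1", "m", "M1"), ("b2", "m", "M2"),
    ("b1", "c", "1"), ("b2", "c", "1"), ("c", "m", "M3"),
    ("m", "sink", "N"),
]

def is_operational(edges, failed, source="src", sink="sink"):
    """True if some source-to-sink path uses no failed component-edge."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == sink:
            return True
        for tail, head, label in edges:
            usable = label == "1" or label not in failed
            if tail == u and usable and head not in seen:
                seen.add(head)
                stack.append(head)
    return False
```

For instance, `is_operational(RG_EDGES, {"M1", "M2"})` holds because both processors can still reach M3, whereas adding M3 to the failed set disconnects the sink.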

3.3 Hierarchy Among Combinatorial Model Types

In this section, we establish a power hierarchy among the four combinatorial model types considered so far: RBDs, FTs, FTREs, and RGs. This is accomplished either by proving that every instance of one model type can be converted into an instance of another model type, or by proving the counter-assertion. Proofs in the former case are constructive in nature, i.e., we provide an algorithm for converting an instance of one model type into an instance of another model type. Proofs in the latter case are given by counter-example, i.e., by showing the existence of an instance of one model type for which no equivalent instance of the other model type exists.

3.3.1 Fault Trees to Reliability Block Diagrams

Shooman [149] has previously shown that a RBD is equivalent to a FT without repeated events. He did not provide a conversion algorithm, but it is a simple matter to do so. A careful comparison of the RBD shown in Figure 3.2 and the FT shown in Figure 3.3 reveals the similarities between the two model types. These similarities also provide the algorithm for converting a FT model to an equivalent RBD model. The algorithm is given in Pascal-like pseudo-code in Figure 3.7. The FT is converted to the equivalent RBD by invoking the algorithm as FT_to_RBD(root), where root is the root node of the FT. This algorithm is simply a preorder traversal of the FT. If the node encountered is a gate (AND or OR), then it is converted to the appropriate construct (PARALLEL or SERIES, respectively). If the node is a component (i.e., a leaf), then nothing is done. This yields the RBD. The number of leaves in the FT is n and the number of gates is at most n − 1. At every step of the algorithm, O(1) time is spent. Hence, the time complexity of this algorithm is O(n), where n is the number of components in the system.

3.3.2 Reliability Block Diagrams to Fault Trees

An RBD can be similarly converted to a FT in a reciprocal fashion. The algorithm is given in Figure 3.8. The time-complexity of this algorithm is O(n) as well.

Algorithm FT_to_RBD(x)
begin
    if (x ≠ NIL) then
        if (x == AND gate) then
            x ← PARALLEL construct
        else if (x == OR gate) then
            x ← SERIES construct
        foreach y ∈ child[x] do
            FT_to_RBD(y)
end

Figure 3.7: Conversion algorithm for FT to RBD

Algorithm RBD_to_FT(x)
begin
    if (x ≠ NIL) then
        if (x == PARALLEL construct) then
            x ← AND gate
        else if (x == SERIES construct) then
            x ← OR gate
        foreach y ∈ child[x] do
            RBD_to_FT(y)
end

Figure 3.8: Conversion algorithm for RBD to FT
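The two traversals of Figures 3.7 and 3.8 can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the `Node` class and the `convert` helper are hypothetical names, and the sketch returns a relabeled copy of the tree instead of relabeling nodes in place.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                     # "AND", "OR", "PARALLEL", "SERIES", or "COMPONENT"
    name: str = ""
    children: list = field(default_factory=list)

FT_TO_RBD = {"AND": "PARALLEL", "OR": "SERIES"}
RBD_TO_FT = {v: k for k, v in FT_TO_RBD.items()}

def convert(x, mapping):
    """Preorder traversal: relabel the current node, then recurse into its children."""
    if x is None:
        return None
    kind = mapping.get(x.kind, x.kind)      # components (leaves) are left unchanged
    return Node(kind, x.name, [convert(y, mapping) for y in x.children])

def ft_to_rbd(root):
    return convert(root, FT_TO_RBD)

def rbd_to_ft(root):
    return convert(root, RBD_TO_FT)
```

Each node is visited once and relabeled in constant time, which matches the O(n) bound stated above; applying `rbd_to_ft` to the result of `ft_to_rbd` returns a tree equal to the original.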

Algorithm FT_to_RG(x)
begin
    if (x ≠ NIL) then
        if (x == root) then
            insert edge (source, sink) labeled root in V_rg
        if (x == AND gate) then
            delete directed edge (u, v) labeled x from V_rg
            foreach y ∈ child[x] do
                insert directed edge (u, v) labeled y in V_rg
        else if (x == OR gate) then
            delete directed edge (u, v) labeled x from V_rg
            foreach y ∈ child[x] do
                if (y is the last remaining child of x) then
                    w ← v
                else
                    w ← new node
                insert directed edge (u, w) labeled y in V_rg
                u ← w
        foreach y ∈ child[x] do
            FT_to_RG(y)
end

Figure 3.9: Conversion algorithm for FT to RG

3.3.3 Fault Trees to Reliability Graphs

We now show that any FT can be converted to a RG. We give a constructive proof of this claim by presenting an algorithm that converts a FT to an equivalent RG. Assume that in a FT, a node is either a gate or a component. The algorithm is given in pseudo-code in Figure 3.9. We briefly describe it below.

Initialize the RG to consist of only the source and sink nodes, connected by an edge labeled with the name of the root node (a gate) of the FT. Now perform a preorder traversal of the FT starting from the root node. The node visited at any step of the tree traversal is the current node. Let the current node be a gate labeled G with k inputs. If it is an AND gate, then replace the directed edge (u, v) (directed from u to v) labeled G in the partially constructed RG by k edges e_1, e_2, ..., e_k between u and v with the same direction as the original edge. Assign to these edges the labels of the nodes (gates or components) that supply the inputs of gate G, one label per edge. If it is an OR gate with k inputs, then replace the edge (u, v) labeled G in the partially constructed RG by a path p = (e_1, e_2, ..., e_k) of k edges, where e_1 = (u, w_1), e_2 = (w_1, w_2), ..., and e_k = (w_{k−1}, v). Again, assign to these edges the labels of the nodes (gates or components) that supply the inputs of gate G. If the current node is a component, then do nothing. Continue with the preorder tree traversal until all the nodes have been traversed. It is easy to see that this algorithm yields an equivalent RG.

Let n be the number of components in the FT. The number of gates (internal nodes) is at most n − 1. For each internal node, an edge in the RG is inserted once and deleted once. For each external (leaf) node, an edge is inserted in the RG once. Assume that it is possible to identify an edge in the partially constructed RG by its label in O(1) time. This is possible by maintaining an array of pointers indexed by the label of the gate/component. Hence deletion and insertion of each edge in the partially constructed RG can be done in O(1) time. This implies that for each node encountered in the preorder traversal of the FT, we do O(1) amount of work. The time complexity of this algorithm is therefore O(n), since the FT consists of at most 2n − 1 nodes.

For the FT model of the multiprocessor system shown in Figure 3.3, we show the steps taken by this algorithm in Figure 3.10. Due to the preorder traversal of this FT, the nodes are visited in the following order:

(G1, G2, G3, G5, D21, D22, M2, P2, G4, G6, D11, D12, M1, P1, N).    (3.1)

Omitting the nodes (D21, D22, M2, P2, D11, D12, M1, P1, N) corresponding to the components (since we do nothing at these nodes), we are left with the gates (G1, G2, G3, G5, G4, G6). The first figure shows the initialization of the RG. Each subsequent figure shows the partially constructed RG at the end of each step, when the gates are encountered in the order (G1, G2, G3, G5, G4, G6).
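The conversion described above can be sketched in Python. This is an illustration under stated assumptions: the `Node` class and helper names are hypothetical, a dictionary keyed by label plays the role of the array of pointers mentioned in the text, and the gate types chosen for the FT of Figure 3.3 (G1, G3, G4 as OR gates; G2, G5, G6 as AND gates) are inferred from the traversal order (3.1) and the system description, not stated in this section.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    label: str
    kind: str                     # "AND", "OR", or "BASIC"
    children: list = field(default_factory=list)

def ft_to_rg(root):
    """Return {label: (u, v)}: one directed component-edge per basic event."""
    edges = {root.label: ("source", "sink")}    # initial edge labeled by the root
    fresh = count(1)                            # generator of new node names

    def visit(x):                               # preorder traversal
        if x.kind == "BASIC":
            return                              # components: nothing to do
        u, v = edges.pop(x.label)               # O(1) lookup and deletion by label
        if x.kind == "AND":                     # k parallel edges between u and v
            for y in x.children:
                edges[y.label] = (u, v)
        else:                                   # OR: a path of k edges from u to v
            for i, y in enumerate(x.children):
                last = i == len(x.children) - 1
                w = v if last else f"w{next(fresh)}"
                edges[y.label] = (u, w)
                u = w
        for y in x.children:
            visit(y)

    visit(root)
    return edges

def ft_failed(x, failed):
    """Evaluate the fault tree: True if the event at node x occurs."""
    if x.kind == "BASIC":
        return x.label in failed
    results = [ft_failed(y, failed) for y in x.children]
    return all(results) if x.kind == "AND" else any(results)

def rg_operational(edges, failed):
    """True if the sink is reachable from the source over non-failed edges."""
    reach, frontier = {"source"}, ["source"]
    while frontier:
        u = frontier.pop()
        for label, (a, b) in edges.items():
            if a == u and label not in failed and b not in reach:
                reach.add(b)
                frontier.append(b)
    return "sink" in reach

def leaf(name):
    return Node(name, "BASIC")

# Assumed gate types for the FT of Figure 3.3.
g5 = Node("G5", "AND", [leaf("D21"), leaf("D22")])
g6 = Node("G6", "AND", [leaf("D11"), leaf("D12")])
g3 = Node("G3", "OR", [g5, leaf("M2"), leaf("P2")])
g4 = Node("G4", "OR", [g6, leaf("M1"), leaf("P1")])
ft = Node("G1", "OR", [Node("G2", "AND", [g3, g4]), leaf("N")])
rg = ft_to_rg(ft)
```

The resulting `rg` contains exactly one edge per component, and for every subset of failed components the fault tree reports a system failure precisely when the sink of `rg` is unreachable.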

3.3.4 Reliability Graphs to Fault Trees with Repeated Events

The basic idea behind this algorithm is to enumerate all the simple paths from the source node to the sink node. This is easily done using a breadth-first search [37]. For every path, construct an OR gate with inputs from all the components which appear on this path (1-edges are ignored). Then construct an AND gate (the root gate) such that the output of each OR gate constructed in the previous step is an input to this gate. It is easy to see that this is the equivalent FTRE for the RG. Events are repeated if different paths are not edge-disjoint, i.e., different paths have edges in common. Enumeration of all the paths can be done using a naive algorithm which takes exponential (O(2^e)) time, where e is the number of edges in the RG, since there could be as many as O(2^e) distinct paths. More sophisticated algorithms, which yield more compact FTREs, can be derived based on the minpaths of the RG, which can be computed using Tarjan's algorithm [153]. The pseudo-code of a generic algorithm that converts a reliability graph into an FTRE using some path enumeration algorithm is given in Figure 3.11. Our aim presently is simply to establish that a RG can be converted to a FTRE, not to provide optimal algorithms for the conversion.

The correctness of this algorithm can be argued as follows. A system modeled by a RG is considered operational as long as there is a path from the source node to the sink node. A failed edge in a RG blocks all the paths it appears on. In other words, a system is operational if there is at least one path which has no failed edge. This is precisely what the FTRE constructed above captures: the system fails only if every path contains at least one failed component. For the RG of the base architecture with shared memory (Figure 3.6), the possible paths from the src to the sink are:

D11, P1, M1, N
D11, P1, 1, M3, N
D12, P1, M1, N
D12, P1, 1, M3, N
D21, P2, M2, N
D21, P2, 1, M3, N
D22, P2, M2, N
D22, P2, 1, M3, N

The equivalent FTRE as constructed by this algorithm is shown in Figure 3.12. It must be realized, however, that a shared edge among different paths in a RG does not always imply that an equivalent FT (without repeated events) does not exist. An example of this appears in Figure 3.13. Here the edge labeled C appears on both the paths from src to sink.

Figure 3.10: Conversion of the FT model of multiprocessor system to RG model

Algorithm RG_to_FTRE(RG)
begin
    enumerate all the paths from source to sink in RG
    let the paths be P_1, P_2, ..., P_K
    let P_i = {e_i1, e_i2, ..., e_iN_i}
    foreach path P_i do
        construct gate G_i ← OR(e_i1, e_i2, ..., e_iN_i)
    construct gate G_sys ← AND(G_1, G_2, ..., G_K)
end

Figure 3.11: Conversion algorithm for RG to FTRE
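The generic algorithm of Figure 3.11 can be sketched in Python as follows. The sketch makes several assumptions: the edge list is an assumed reading of Figure 3.6 with hypothetical intermediate node names, paths are enumerated by a simple depth-first recursion (used here in place of the breadth-first search mentioned in the text, purely for brevity), and the FTRE is represented as a nested tuple.

```python
# Assumed reliability graph for Figure 3.6, as (tail, head, label) triples.
# The label "1" marks a 1-edge, which never fails.
RG_EDGES = [
    ("src", "a1", "D11"), ("src", "a1", "D12"), ("a1", "b1", "P1"),
    ("src", "a2", "D21"), ("src", "a2", "D22"), ("a2", "b2", "P2"),
    ("b1", "m", "M1"), ("b2", "m", "M2"),
    ("b1", "c", "1"), ("b2", "c", "1"), ("c", "m", "M3"),
    ("m", "sink", "N"),
]

def enumerate_paths(edges, source="src", sink="sink"):
    """All source-to-sink paths of an acyclic RG, each as a list of edge labels."""
    paths = []

    def extend(node, labels):
        if node == sink:
            paths.append(labels)
            return
        for tail, head, label in edges:
            if tail == node:
                extend(head, labels + [label])

    extend(source, [])
    return paths

def rg_to_ftre(edges):
    """One OR gate per path (1-edges dropped); the root is an AND over these gates."""
    or_gates = [[lab for lab in path if lab != "1"]
                for path in enumerate_paths(edges)]
    return ("AND", [("OR", inputs) for inputs in or_gates])

def ftre_failed(ftre, failed):
    """The system fails only if every path contains a failed component."""
    _, or_gates = ftre
    return all(any(c in failed for c in inputs) for _, inputs in or_gates)
```

For the assumed graph, `rg_to_ftre(RG_EDGES)` produces eight OR gates, one per path listed in the text, with M3 and N appearing as repeated events.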

Figure 3.12: Converting a RG model to FTRE model

Figure 3.13: An equivalent FT without repeated nodes for an RG with a shared edge

Figure 3.14: FTRE Model of a TMR System

3.3.5 Fault Trees with Repeated Events to Reliability Graphs

In the previous section, we gave a constructive proof that any RG can be converted to an equivalent FTRE. In this section, we show that the converse is not true, i.e., there does not always exist an equivalent RG for every FTRE. We prove this claim by means of a counter-example. Consider a TMR system [157] with three components A, B, and C. The system is operational as long as at least two components are operational. A FTRE model of this system is shown in Figure 3.14. We claim that this FTRE cannot be converted to a RG.

To prove this claim, let us assume that an equivalent RG exists. Let the component-edges representing components A, B, and C in this RG be e_a = (u_a, v_a), e_b = (u_b, v_b), and e_c = (u_c, v_c) respectively. Let p_{a,b}, p_{b,c}, and p_{a,c} be the paths from the source to the sink which include the edges (e_a, e_b), (e_b, e_c), and (e_a, e_c) respectively. Thus,

p_{a,b} = (p_1^{ab}, e_a, p_2^{ab}, e_b, p_3^{ab}),
p_{a,c} = (p_1^{ac}, e_a, p_2^{ac}, e_c, p_3^{ac}),
p_{b,c} = (p_1^{bc}, e_b, p_2^{bc}, e_c, p_3^{bc}),

where p_i^{ab}, p_i^{bc}, p_i^{ac} (i = 1, 2, 3) are paths consisting only of 1-edges. Note that the paths p_{a,b}, p_{b,c}, and p_{a,c} are not necessarily edge-disjoint, i.e., there could be edges common to more than one path. For example, path p_{a,c} could possibly contain edge e_b. Given these three paths, there exists another path from the source to the sink:

p_b = (p_1^{bc}, e_b, p_3^{ab}).    (3.2)

This path p_b consists of only one component-edge, e_b; the rest are 1-edges. This implies that there exists a path from the source to the sink which contains only one component-edge. This in turn implies that the system is operational if B is operational, even though both A and C may have failed. This contradicts our assumption about the operational dependency of the system on A, B, and C. Therefore, the assumption that an equivalent RG exists for this FTRE is incorrect. QED.

In fact, it can be similarly shown that any system which has a kOFn gate where k > ⌈n/2⌉ (i.e., at least k out of n components must be operational for the system to be operational) cannot be modeled by a RG. As another example, consider the multiprocessor system with shared memory. If we impose the constraint that memory module M3 can be used by only one processor, P1 or P2, then the failure of only one memory module, M1 or M2, could be tolerated, i.e., failure of both the modules M1 and M2 would lead to system failure. It is easy to construct the FTRE model for such a system. However, we claim that a RG model for this system cannot be constructed.

From our discussion so far, we have established the hierarchy among combinatorial model types. We have shown that FTREs are the most powerful model type, followed by RGs, which are followed by RBDs and FTs, which are equivalent to each other.
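The TMR counter-example can be checked empirically on small graphs. The following Python sketch is only a sanity check, not a substitute for the proof: it exhaustively searches reliability graphs with at most two intermediate nodes (an arbitrary bound chosen for this illustration) for one that behaves like a k-out-of-3 system. Node numbers and function names are hypothetical.

```python
from itertools import combinations, product

NODES = 4                                    # node 0 = source, node 3 = sink
# Only edges (i, j) with i < j are allowed, which keeps every graph acyclic.
PAIRS = list(combinations(range(NODES), 2))

def operational(comp_edges, one_edges, up):
    """Is the sink reachable over 1-edges and edges of components that are up?"""
    usable = set(one_edges) | {e for e, ok in zip(comp_edges, up) if ok}
    reach = {0}
    for i, j in PAIRS:                       # PAIRS is sorted by tail: topological order
        if i in reach and (i, j) in usable:
            reach.add(j)
    return NODES - 1 in reach

def find_rg(k):
    """Search for an RG equivalent to a k-out-of-3 system; None if none is found."""
    states = list(product([False, True], repeat=3))
    for comp_edges in product(PAIRS, repeat=3):          # one edge per component
        for r in range(len(PAIRS) + 1):
            for one_edges in combinations(PAIRS, r):     # any set of 1-edges
                if all(operational(comp_edges, one_edges, up) == (sum(up) >= k)
                       for up in states):
                    return comp_edges, one_edges
    return None
```

Within this bounded search space, `find_rg(2)` returns `None`, in agreement with the argument above, while `find_rg(1)` (parallel system) and `find_rg(3)` (series system) each return a graph.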

3.4 Markovian Model Types

In this section, we consider Markovian model types used for dependability modeling. The model types we look at are continuous-time Markov chains (CTMCs), generalized stochastic Petri nets (GSPNs), and stochastic reward nets (SRNs). It is well known that Markovian model types can model certain kinds of dependencies in a system which combinatorial model types cannot [86, 140, 106]. For example, consider the FTRE model of the multiprocessor system. If we wished to model the availability of the system when there is a single repair person for all the components, then we could not do so with an FTRE. Such a repair dependency cannot be modeled by any combinatorial model type, but a Markov chain or a GSPN could easily model it. Dugan et al [51] have shown that an FTRE model can be converted to a CTMC. Together with the repair-dependency example, this establishes that CTMCs are more powerful than FTREs.

Most of the algorithms for converting Markovian model types into each other are known, since GSPNs (SRNs) are solved by converting them into CTMCs (Markov reward models (MRMs)) and solving the latter. Ajmone-Marsan et al [3] showed that CTMCs are equivalent to GSPNs, i.e., for every GSPN model, an equivalent continuous-time Markov chain exists and vice-versa. They also provide an algorithm which converts a GSPN to a CTMC.

Converting a CTMC to a GSPN model is straightforward. For each state s_i of the CTMC, construct a place p_i. Replace each arc (s_i, s_j) with rate r_ij of the Markov model by a transition t_ij of rate r_ij with an incoming arc from place p_i and an outgoing arc to place p_j. If there is a single initial state s_init of the CTMC, then a single token must be put in the place p_init. If the CTMC has several initial states s_k, s_{k+1}, ..., s_m with probabilities ω_k, ω_{k+1}, ..., ω_m, then a new place p_0 is created with a single token, and this place is connected to the places p_k, p_{k+1}, ..., p_m via immediate transitions t_{0k}, t_{0(k+1)}, ..., t_{0m} with probabilities ω_k, ω_{k+1}, ..., ω_m.
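The CTMC-to-GSPN construction just described can be sketched in Python. This is an illustrative data-structure sketch, not a tool interface: the dictionary layout, the function name, and the example rates are all hypothetical, and states are assumed to be numbered from 1 so that the extra place p0 does not clash with a state place.

```python
def ctmc_to_gspn(rates, initial):
    """Build a GSPN description from a CTMC.

    rates:   {(i, j): r_ij} for each arc (s_i, s_j) of the CTMC
    initial: {state: probability} for the possible initial states
    """
    states = sorted({s for arc in rates for s in arc} | set(initial))
    gspn = {
        "places": [f"p{s}" for s in states],   # one place per state
        "timed": [],        # (name, rate, input place, output place)
        "immediate": [],    # (name, probability, input place, output place)
        "marking": {},      # initial number of tokens per marked place
    }
    for (i, j), r in rates.items():            # one timed transition per arc
        gspn["timed"].append((f"t{i}_{j}", r, f"p{i}", f"p{j}"))
    if len(initial) == 1:                      # single initial state: mark its place
        (s,) = initial
        gspn["marking"][f"p{s}"] = 1
    else:                                      # several initial states: add place p0
        gspn["places"].append("p0")
        gspn["marking"]["p0"] = 1
        for s, w in initial.items():
            gspn["immediate"].append((f"t0_{s}", w, "p0", f"p{s}"))
    return gspn
```

For a hypothetical three-state chain with two possible initial states, the result has four places (three state places plus p0), one timed transition per arc, and two immediate transitions leaving p0.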

Figure 3.15: Converting a CTMC to a GSPN. (a) CTMC; (b) equivalent GSPN

In Figure 3.15 (a), the CTMC of a system that consists of two components C1 and C2 sharing a repair facility with a priority repair discipline is shown. The failure rate of C1 (C2) is λ_1 (λ_2) and the repair rate is μ_1 (μ_2). C1 has repair priority over C2, i.e., if C1 fails while C2 is being repaired, then the repair of C2 is preempted, the repair of C1 begins, and the repair of C2 resumes after C1 is repaired. Let us suppose that this subsystem could be in any of the states 1, 2, and 3 at time zero with probabilities ω_1, ω_2, and ω_3 respectively. The equivalent GSPN model is shown in Figure 3.15 (b).

Ciardo et al [29] formalize SRNs and provide an algorithm to convert an SRN into an equivalent Markov reward model. SRNs are equivalent to GSPNs in terms of modeling power. Following this discussion, we have established the overall hierarchy of dependability model types. This is shown in Figure 3.16.

We must remark here that in comparing combinatorial model types with Markovian model types, we have implicitly assumed that the time-to-failure and time-to-repair of the various components in the system are exponentially distributed. Whereas analysis of combinatorial model types does not put any restrictions on the nature of the distributions, the Markovian model types can be extended but usually become intractable under non-exponential distributions. Semi-Markov chains have often been used in reliability modeling to allow for non-exponential distributions, but only in a restrictive manner. Therefore a trade-off exists between Markovian model types and combinatorial model types.

Figure 3.16: Model hierarchies among dependability model types

3.5 Conclusions

We have formally established a hierarchy among the most commonly used types of dependability models according to their modeling power. Among the combinatorial (non-state-space) model types, we showed that fault trees with repeated events (FTREs) are the most powerful in terms of the kinds of dependencies among the components of a system that can be modeled (which is one metric of modeling power). Reliability graphs are less powerful than fault trees with repeated events but more powerful than reliability block diagrams and fault trees without repeated events. By virtue of the constructive nature of our proofs, we also provided algorithms for converting from one model type to another.

Among the Markovian (state-space) model types, we considered continuous-time Markov chains (CTMCs), generalized stochastic Petri nets (GSPNs), and stochastic reward nets (SRNs). These are more powerful than combinatorial model types in that they can capture dependencies such as a shared repair facility between system components. However, they are analytically tractable only under certain distributional assumptions, such as exponential failure-time and repair-time distributions. Furthermore, they are subject to an exponentially large state space. The equivalence among the various Markovian model types is well known and was only briefly discussed.


Chapter 4

Dependability Modeling Using Petri-Net Based Models

Petri-net based models have been extensively used for performance and performability modeling in the analysis of computer and communication systems [2, 3, 34, 32, 114, 118, 121, 146]. However, in dependability modeling, Petri-net based models have received considerably less attention. In this chapter, we describe a methodology to construct dependability models using Petri-net based models. Among the Petri-net based model types, we consider generalized stochastic Petri-nets (GSPNs) [3] and stochastic reward nets (SRNs) [29].

In the previous chapter, we showed that the fault tree with repeated events (FTRE) is the most powerful combinatorial model type. In this chapter, we first show how to convert a FTRE model to an equivalent GSPN or SRN model. We provide algorithms for these conversions. These algorithms can be easily modified to convert a series-parallel reliability block diagram into GSPN or SRN models. The subnet constructions involved in these conversions change depending on the kind of distribution assigned to the components of a system. For instance, in a FTRE model, it is common to assign failure probabilities (a distribution with mass at time zero and mass at infinity which sum to one) to each component, or to assign a failure-time distribution in which a component can be faulty from the very start of system operation (mass at time zero). We illustrate subnet constructions for such commonly occurring cases.

The major handicap of combinatorial model types is that they cannot model certain kinds of dependencies, the most common one being the repair dependency among components. A commonly occurring repair dependency is the one where several components share a repair facility. Failed components queue up for repair if the repair facility is busy. Such a dependency can be modeled by model types such as queuing networks, Markov chains, and Petri-net based model types. The repair requests in the queue can be serviced according to some scheduling discipline such as FCFS, processor-sharing, or pre-emptive resume priority.

In this chapter, we show how to incorporate repair in GSPN and SRN models. We provide constructive algorithms to convert an FTRE availability model with no repair dependency (i.e., each component has its own independent repair facility) to equivalent GSPN and SRN models. We then illustrate how repair dependency can be introduced in these GSPN and SRN models. Net constructions differ for different scheduling disciplines of the repair queue. We provide GSPN and SRN constructions for four scheduling disciplines: FCFS, processor-sharing, pre-emptive resume priority, and non-pre-emptive priority. Thus, given a system whose operational dependency on components is specified by a FTRE model (or any combinatorial model type), our methodology provides a direct way to construct a GSPN or SRN model for this system and allows us to incorporate repair dependencies among components of the system.

Following these constructions and algorithms, we comparatively evaluate the merits of SRNs and GSPNs. SRNs are an extension of GSPNs, i.e., they include all the features of GSPNs and many more, such as guards (earlier known as enabling functions), timed transition priorities, variable cardinality arcs, halting conditions, and rewards. None of these extensions enhances the modeling power, since every SRN model can be converted to a Markov chain, which can be converted to an equivalent GSPN model [3] (although SRNs allow calculation of some reward-based measures which are not possible through GSPNs [31]). However, by converting FTRE models (with and without repair dependencies) to equivalent GSPN and SRN models, we illustrate that these extensions greatly simplify the model construction and reduce the model size in the case of dependability modeling of repairable systems. Thus, more compact representations of dependability models are possible with SRNs than with GSPNs.

The rest of this chapter is organized as follows. In Sections 4.1 and 4.2, we describe how a FTRE reliability model without repair can be converted into equivalent GSPN and SRN models respectively. In Section 4.3, we describe how to introduce repair (without any repair dependency) in GSPN and SRN models. In Section 4.4, we describe how to incorporate repair dependency (i.e., a shared repair facility among components) and queuing disciplines in GSPN and SRN models. Our conclusions are presented in Section 4.5. We use the fault-tolerant multiprocessor system with shared memory described in Section 3.2.3 as a running example in this chapter. The basic multiprocessor architecture is as shown in Figure 3.4. An FTRE model of the system is shown in Figure 3.5.

4.1 Generalized Stochastic Petri-Nets

GSPNs were introduced by Ajmone-Marsan et al [3] and have been extensively used in performance modeling of computer and communication systems [2]. However, they have been rarely, if ever, used in dependability modeling. To begin with, we illustrate how a FTRE model can be converted to an equivalent GSPN model. Unless otherwise stated, we assume that the time-to-failure and time-to-repair distributions of any component are exponential.

4.1.1 FTREs to GSPNs

Let us consider how a FTRE model can be converted to a GSPN model. In principle this can be done by converting a FTRE to a Markov chain [51] and then converting the Markov chain into a GSPN model [107]. However, the GSPN model obtained in this way could be totally non-intuitive and significantly less compact than one obtained by careful design. To generate the equivalent GSPN model, we must consider what input is associated with the basic events in the FTRE. This could be a time-to-failure distribution, a failure probability, or the instantaneous availability of the component. Time-to-failure distributions can be further classified into two kinds: non-defective (no mass at time zero or infinity) and defective (mass at time zero or infinity). We consider each of these cases. The first one is discussed in detail. The remaining cases are discussed briefly since they can be handled similarly.

Non-defective Failure-time Distribution

Assume that the time-to-failure distribution of each component is exponential and that there is no mass at time zero or infinity. Let the time to failure of a component be a random variable X. In this case:

P(X = 0) = 0,
P(X ≤ t) = 1 − e^{−λt},
P(X < ∞) = 1,

where λ is the failure rate of the component. A pseudo-code of the algorithm to convert a FTRE model of a non-repairable system into an equivalent GSPN model appears in Figure 4.1. The equivalent GSPN model is obtained by calling this procedure as FTRE_to_GSPN(root), where root is the top gate of the FTRE. A brief description of the algorithm is as follows.

Assume that basic events as well as outputs of gates may be shared (repeated). In the latter case, we say that a gate is repeated. To begin with, perform a preprocessing step to count the number of times an event (basic or output of a gate) is repeated and set RC[x] equal to that number, where x is the label identifying the event (RC ≡ Repeated Count). At the end of this step, RC[x] is at least 1 for each x, unless there is some error in the specification of the FTRE. For each node x, set DONE[x] ← FALSE to indicate that the subnet for this component/gate has not been generated. This step can be carried out in O(n) time, where n is the number of events in the FTRE.

The remaining steps are carried out by a postorder traversal of the FTRE starting from the root. If the current node x is a basic event and DONE[x] == FALSE, then construct a subnet as shown in Figure 4.2 (a) and label each place as shown. Label the output places as shown so that each place is recognized by the component name. If node x is an AND gate with inputs (children) c_1, ..., c_x, then perform postorder(c_i) for i = 1, ..., x. After the postorder traversal for each of the inputs is completed, construct the subnet as shown in Figure 4.2 (b). The input places in this subnet are identified by the corresponding node names of the children, c_1.dn, ..., c_x.dn. Set DONE[x] ← TRUE. If node x is an OR gate with children c_1, ..., c_x, then perform postorder(c_i) for i = 1, ..., x. After the postorder traversal for each of the inputs is completed, construct the subnet as shown in Figure 4.2 (c). The input places in this subnet are similarly identified by the corresponding node names of the children, c_1.dn, ..., c_x.dn. If RC[x] == 1, then we do not need the part of the subnet enclosed in the rectangular box. Set DONE[x] ← TRUE. In all three cases above, the multiplicity of the output arc from the transition in the subnet is set equal to RC[x]. Set DONE[x] ← TRUE to indicate that the subnet for node x has been constructed.

In the end, after the postorder walk is completed, construct inhibitor arcs from the place root.dn (the dn place for the top gate in the FTRE) to all the timed transitions. This is done so that after the system fails, failure of operational components is disallowed, which prevents generation of unnecessary markings. This reduces the number of states of the underlying Markov chain. Thus, both storage and time are saved, since a smaller Markov chain needs to be generated and stored.

The complexity of the GSPN model can be expressed in terms of the number of places and transitions. Following our algorithm, it is easy to see that:

#(places) = 2 × #(components) + #(gates)
#(timed transitions) = #(components)
#(immediate transitions) ≥ #(AND gates) + Σ #(inputs to OR gates)

The number of immediate transitions could be more than the sum above if any of the OR gates are repeated, since an extra place and immediate transition (see the dashed rectangular box in Figure 4.2 (c)) are needed in that case. In Figure 4.3, the GSPN model obtained by converting the FTRE model of the multiprocessor system with shared memory (Figure 3.5) is shown.
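The preprocessing step and the size formulas above can be illustrated with a short Python sketch. The gate structure below is an assumption read off Figure 3.5 (in particular the gate types and the numbering of G5 to G8), and the function names are hypothetical; the root is given a repeat count of 1 by convention, since its output feeds no other gate.

```python
from collections import Counter

# Assumed gate structure of the FTRE in Figure 3.5: gate -> (type, inputs).
FTRE = {
    "G1": ("OR",  ["N", "G2"]),
    "G2": ("AND", ["G3", "G4"]),
    "G3": ("OR",  ["G5", "G6", "P2"]),
    "G4": ("OR",  ["G7", "G8", "P1"]),
    "G5": ("AND", ["D21", "D22"]),
    "G6": ("AND", ["M2", "M3"]),
    "G7": ("AND", ["D11", "D12"]),
    "G8": ("AND", ["M1", "M3"]),
}
ROOT = "G1"

def repeat_counts(ftre, root):
    """RC[x]: number of gate inputs fed by event x (the root counts once)."""
    rc = Counter(x for _, inputs in ftre.values() for x in inputs)
    rc[root] += 1
    return rc

def gspn_size(ftre, root):
    """Place and transition counts of the GSPN, following the formulas in the text."""
    rc = repeat_counts(ftre, root)
    components = [x for x in rc if x not in ftre]    # events that are not gates
    and_gates = sum(1 for kind, _ in ftre.values() if kind == "AND")
    or_inputs = sum(len(inp) for kind, inp in ftre.values() if kind == "OR")
    return {
        "places": 2 * len(components) + len(ftre),
        "timed": len(components),
        "immediate_lower_bound": and_gates + or_inputs,
    }
```

Under the assumed structure, M3 is the only repeated event (RC[M3] = 2), and the formulas give 2 × 10 + 8 = 28 places, 10 timed transitions, and at least 5 + 8 = 13 immediate transitions.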

Algorithm FTRE_to_GSPN(x)
begin
    if (x ≠ NIL) then
        if (x is a basic event) and (DONE[x] == FALSE) then
            construct the subnet shown in Figure 4.2 (a) and label each place
        else if (x is an AND gate) then
            let c_1, ..., c_x be the inputs (children) of x
            foreach c_i, i ← 1, ..., x do
                FTRE_to_GSPN(c_i)
            construct the subnet shown in Figure 4.2 (b)
        else if (x is an OR gate) then
            let c_1, ..., c_x be the inputs (children) of x
            foreach c_i, i ← 1, ..., x do
                FTRE_to_GSPN(c_i)
            construct the subnet shown in Figure 4.2 (c)
        DONE[x] ← TRUE
end
construct inhibitor arcs from root.dn to all the timed transitions.

Figure 4.1: Conversion algorithm for FTRE to GSPN

Figure 4.2: GSPN subnets for converting a FTRE model to a GSPN model: (a) basic event, (b) AND gate, (c) OR gate

Figure 4.3: GSPN model of the multiprocessor system with shared memory

Figure 4.4: GSPN subnet when failure-time distribution has mass at zero

Failure-time Distribution with Mass at Zero

Suppose a defective distribution with a mass at zero equal to 1 − p is assigned to each component [141]. In this case:

P(X = 0) = 1 − p,
P(X ≤ t) = 1 − p + p(1 − e^{−λt}),
P(X < ∞) = 1.

A common example of this scenario occurs when a component could be faulty at the beginning (time zero) and, if it is not, its failure-time distribution is specified. To model such a scenario, we need to modify Figure 4.2 (a) as shown in Figure 4.4. Another way to look at this is to compute a new initial state distribution. The initial distribution is specified as follows. With probability p, there is one token in place x.up at the start, and with probability 1 − p, there is a token in place x.dn at the start. Figures 4.2 (b) and 4.2 (c) remain the same, and so does the algorithm shown in Figure 4.1.

Failure-time Distribution with Mass at Zero and Infinity

In this case, a constant failure probability 1 − p is assigned to each component. Viewed as a defective distribution function, this implies that there is a mass at time zero equal to 1 − p and a mass at time infinity equal to p. In this case:

P(X = 0) = 1 − p,
P(X = ∞) = p.

An example of this scenario is when a component is either faulty or fault-free (i.e., does not fail as time progresses) from the very start of system operation. This can also be interpreted as the numerical value of the reliability of a component at a given instant of time. This value can be computed from some failure-time distribution at a given instant of time. To model this scenario, we need to modify Figure 4.2 (a) as shown in Figure 4.5.

In all the above cases, the algorithm to construct the overall GSPN model remains the same. Only the subnet for each component changes, depending upon the kind of distribution assigned to that component. By virtue of our constructions, this methodology extends to the case where a defective failure-time distribution is specified for some components while a non-defective failure-time distribution is specified for the others. The important thing is the proper labeling of the places x.dn and x.up in the subnet for a component x. Once these places have been generated for each component x, the construction of the rest of the net remains the same regardless of what kind of distribution was assigned to each component.

Figure 4.5: GSPN subnet when failure probability of a component is specified
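The three kinds of failure-time distribution considered in this section can be summarized in a small Python sketch. It is meant only as a numerical illustration of the formulas above: the function names are hypothetical, `lam` stands for the exponential failure rate, and the third function is valid for finite t only.

```python
import math

def cdf_nondefective(t, lam):
    """Exponential time to failure: no mass at zero or at infinity."""
    return 1.0 - math.exp(-lam * t)

def cdf_mass_at_zero(t, lam, p):
    """Faulty at time zero with probability 1 - p, otherwise exponential."""
    return (1.0 - p) + p * (1.0 - math.exp(-lam * t))

def cdf_mass_at_zero_and_infinity(t, p):
    """Faulty from the start with probability 1 - p, otherwise never fails.

    For every finite t >= 0 the probability of having failed is 1 - p.
    """
    return 1.0 - p

def initial_distribution(p):
    """Initial token placement for component x: x.up with probability p, x.dn otherwise."""
    return {"x.up": p, "x.dn": 1.0 - p}
```

For example, with p = 0.9 the second distribution starts at 0.1 at time zero and approaches 1 as t grows, whereas the third stays at 0.1 for all finite t.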

4.2 Stochastic Reward Nets

SRNs were introduced by Ciardo et al. [31] and have been extensively used in performance modeling and analysis [32, 124]. To begin with, we illustrate how a FTRE reliability model (no repair) can be converted to an equivalent SRN model. SRNs can be considered extensions of GSPNs. Besides structural extensions such as guards (enabling functions), variable cardinality arcs, and halting conditions, reward rates can be associated with the markings of the net. This helps reduce the size of the net, since many aspects of a system that are modeled explicitly by places and transitions in a GSPN can be expressed by arithmetic and boolean expressions involving reward rates. Similarly, conditions for removal of tokens from a place can be captured by associating enabling functions with appropriate transitions. In modeling the dependability of a system, the above distinction becomes quite clear.

4.2.1 FTREs to SRNs

As with GSPNs, we discuss different cases based on the kind of failure-time distribution assigned to the basic events in the FTRE model.

Non-defective Failure-time Distribution

The algorithm to convert a FTRE to a SRN is based on a postorder traversal of the FTRE similar to the one used for converting a FTRE to a GSPN. The difference is in the actions taken when a node is encountered. Every time a gate is encountered, instead of constructing a subnet of immediate transitions and places, a reward rate function is constructed.

To begin with, perform a preprocessing step to count the number of times an event (basic or output of a gate) is repeated and set RC[x] equal to that number, where x is the label identifying the event (RC ≡ Repeated Count). At the end of this step, RC[x] is at least 1 for each x, unless there is some error in the specification of the FTRE. For each node x, set DONE[x] ← FALSE. This step can be carried out in O(n) time, where n is the number of events (gates and basic events) in the FTRE. The remaining steps are carried out by a postorder traversal of the FTRE starting from the root. Every time a node (a basic event or a gate) is encountered, a specific action is performed. The algorithm for the postorder tree traversal is given in pseudo-code in Figure 4.6 and is described below.

Algorithm FTRE_to_SRN(x)
begin
    if (x ≠ NIL) then
        if (x is a basic event) and (DONE[x] == FALSE) then
            construct the subnet shown in Figure 4.2 (a) and label each place
            bool(x) ← (#tokens(x.dn) == 1)
        else if (x is an AND gate) then
            let c_1, ..., c_x be the inputs (children) of x
            foreach c_i, i ← 1, ..., x do FTRE_to_SRN(c_i)
            bool(x) ← bool(c_1) ∧ bool(c_2) ∧ ... ∧ bool(c_x)
        else if (x is an OR gate) then
            let c_1, ..., c_x be the inputs (children) of x
            foreach c_i, i ← 1, ..., x do FTRE_to_SRN(c_i)
            bool(x) ← bool(c_1) ∨ bool(c_2) ∨ ... ∨ bool(c_x)
        DONE[x] ← TRUE
end

Figure 4.6: Conversion algorithm for FTRE to SRN

• If the current node x is a basic event and DONE[x] == FALSE, then construct a subnet as shown in Figure 4.2 (a) and label each place as shown. Construct the following boolean function: bool(x) ← (#tokens(x.dn) == 1). Set DONE[x] ← TRUE to indicate that the subnet for node x has been constructed.

• If node x is an AND gate with children c_1, ..., c_x, then for each child c_i of this node, perform FTRE_to_SRN(c_i). Construct the function: bool(x) ← bool(c_1) ∧ bool(c_2) ∧ ... ∧ bool(c_x). Set DONE[x] ← TRUE.

• If node x is an OR gate with children c_1, ..., c_x, then for each child c_i of this node, perform FTRE_to_SRN(c_i). Construct the function: bool(x) ← bool(c_1) ∨ bool(c_2) ∨ ... ∨ bool(c_x). Set DONE[x] ← TRUE.

In the end, the halting condition is specified as: if (bool(root) == 1) then disable all the transitions in the net. The idea behind the halting condition is to prevent generation of unnecessary markings. Suppose that the system fails due to the failure of some components and it is shut down. This shutdown implies that no more activity takes place in the system, i.e., operational components can no longer fail. Thus, all the transitions within the system must be disabled. If we do not disable all the transitions, then many more markings will be generated, each of which represents a system failure state. The halting condition disables all the transitions after the system fails and therefore prevents generation of these unnecessary markings.
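The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the dictionary-based marking, and the function names are hypothetical, markings are represented as a mapping from place names to token counts, and (unlike the pseudo-code) already-visited gates are skipped as well as already-visited basic events. The example tree is transcribed from the gate table of Figure 4.7.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Marking = Dict[str, int]  # place name -> token count


@dataclass
class Node:
    label: str
    kind: str  # "basic", "and", or "or"
    children: List["Node"] = field(default_factory=list)


def ftre_to_srn(root: Node) -> Tuple[List[str], Dict[str, Callable[[Marking], bool]]]:
    """Postorder traversal: one up/dn place pair per basic event and one
    boolean function per node (no immediate transitions are created)."""
    places: List[str] = []
    bool_fn: Dict[str, Callable[[Marking], bool]] = {}
    done = set()

    def visit(x: Node) -> None:
        if x.label in done:  # repeated event: already handled
            return
        if x.kind == "basic":
            dn = f"{x.label}.dn"
            places.extend([f"{x.label}.up", dn])
            bool_fn[x.label] = lambda m, dn=dn: m.get(dn, 0) == 1
        elif x.kind in ("and", "or"):
            for c in x.children:
                visit(c)
            parts = [bool_fn[c.label] for c in x.children]
            combine = all if x.kind == "and" else any
            bool_fn[x.label] = (
                lambda m, parts=parts, combine=combine: combine(f(m) for f in parts)
            )
        else:
            raise ValueError(f"unknown node kind: {x.kind}")
        done.add(x.label)

    visit(root)
    return places, bool_fn


def reliability_reward(bool_root: Callable[[Marking], bool], marking: Marking) -> int:
    """Reward rate 1 while the system is operational, 0 once it has failed."""
    return 0 if bool_root(marking) else 1


def multiprocessor_ftre() -> Node:
    """Fault tree of the multiprocessor example, following Figure 4.7."""
    names = ["P1", "P2", "M1", "M2", "M3", "D11", "D12", "D21", "D22", "N"]
    b = {n: Node(n, "basic") for n in names}
    g8 = Node("G8", "and", [b["M1"], b["M3"]])
    g7 = Node("G7", "and", [b["D11"], b["D12"]])
    g6 = Node("G6", "and", [b["M2"], b["M3"]])
    g5 = Node("G5", "and", [b["D21"], b["D22"]])
    g4 = Node("G4", "or", [g7, g8, b["P1"]])
    g3 = Node("G3", "or", [g5, g6, b["P2"]])
    g2 = Node("G2", "and", [g3, g4])
    return Node("G1", "or", [g2, b["N"]])
```

Calling `ftre_to_srn(multiprocessor_ftre())` yields two places per component and a boolean function for every gate; the halting condition then amounts to disabling all transitions in any marking for which the function of the root evaluates to true.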

The SRN model is obtained by calling the above procedure as FTRE_to_SRN(root).

The reliability of the system is specified by the reward function:

if (bool(root) == 0) then r = 1 (system is operational) else r = 0 (system is failed).

In Figure 4.7, the SRN model obtained from converting the FTRE model of the multiprocessor system (Figure 3.5) is shown. bool(G1) is used to specify the reliability of the system. Compare it with the equivalent GSPN model shown in Figure 4.3. First, we do not need the mesh of immediate transitions and places. Second, the use of the halting condition avoids the need for inhibitor arcs to prevent generation of unnecessary markings. The complexity of the SRN model is:

#(places) = 2 × #(components)
#(timed transitions) = #(components)
#(immediate transitions) = 0

Failure-time Distribution with Mass at Zero

The SRN subnet for each component in this case will be the same as the GSPN subnet shown in Figure 4.4.

Failure-time Distribution with Mass at Zero and Infinity

The SRN subnet for each component in this case will be the same as the GSPN subnet shown in Figure 4.5. In all these cases, the overall SRN model simply consists of such subnets, one for each component, unlike the GSPN models, which needed the mesh of immediate transitions, places, and inhibitor arcs.

4.3 Modeling Repair (Without Dependency)

In the previous constructions, the components were assumed to be unrepairable. We now consider how to model repair of components. In practice, repairing a component could consist of calling the repair person, removing bugs, purchasing a new component, replacing the faulty component, installing the new component, reconfiguring the new component, testing the new component, etc. For simplicity and for the sake of illustration, we group all these steps together into a collective action called repair. Combinatorial model types such as reliability block diagrams, FTREs, reliability graphs, etc. can model only the case where each component of the multiprocessor system has an independent repair person. We first consider this simple no-repair-dependency case.

4.3.1 Modeling Repair in GSPN Models

The GSPN model of the multiprocessor system where each component has its own independent repair facility is shown in Figure 4.8. For the sake of clarity, this model does not show the inhibitor arcs that disable failure transitions after system failure, but they are present just as in the model shown in Figure 4.3. Comparing this model with the GSPN model shown in Figure 4.3, we find that we need to introduce a complementary mirrored subnet that removes the tokens from the places which indicate different subsystem failures, in order to reflect the repair of components. We call this subnet complementary since the AND and OR dependencies of subsystems on components are complemented (i.e., AND becomes OR and vice versa).

For example, consider the place A in the GSPN shown in Figure 4.8. If processor P2 fails, or if both the disks D21 and D22 have failed, or if both the memory modules M2 and M3 have failed, then there will be a token in place A. If, after repair, none of the above conditions holds, then we must remove the token from A, i.e., if P2 is up, and either of the disks D21 or D22 is up, and either of M2 or M3 is up, then we must remove the token from A. These conditions are complementary to the conditions which led to the deposit of a token in A. The complementary subnet modeling these conditions is shown in the dashed rectangular box in Figure 4.8.

Another modification in this case is the need for several inhibitor arcs on the immediate transitions (both in the original subnet and in its complementary subnet) to prevent these transitions from firing continuously: at least one of the input places to each of these transitions is also one of its output places, so unless the inhibitor arcs are used, these transitions will fire indefinitely.

Places (one up/dn pair per component): P1.up, P1.dn; P2.up, P2.dn; M1.up, M1.dn; M2.up, M2.dn; M3.up, M3.dn; D11.up, D11.dn; D12.up, D12.dn; D21.up, D21.dn; D22.up, D22.dn; N.up, N.dn

Gate        Boolean Function
bool(G8)    (#tokens(M1.dn) == 1) ∧ (#tokens(M3.dn) == 1)
bool(G7)    (#tokens(D11.dn) == 1) ∧ (#tokens(D12.dn) == 1)
bool(G6)    (#tokens(M2.dn) == 1) ∧ (#tokens(M3.dn) == 1)
bool(G5)    (#tokens(D21.dn) == 1) ∧ (#tokens(D22.dn) == 1)
bool(G4)    bool(G7) ∨ bool(G8) ∨ (#tokens(P1.dn) == 1)
bool(G3)    bool(G5) ∨ bool(G6) ∨ (#tokens(P2.dn) == 1)
bool(G2)    bool(G3) ∧ bool(G4)
bool(G1)    bool(G2) ∨ (#tokens(N.dn) == 1)

Halting Condition: if (bool(G1) == 1) then disable all the transitions

Figure 4.7: SRN Model of the multiprocessor system with Shared Memory

Figure 4.8: GSPN model of the multiprocessor with shared memory (with repair) (figure labels: P1, P2, M1, M2, M3, D11, D12, D21, D22, N, A, PF)

An algorithm to convert a FTRE model where each component has its own repair facility into a GSPN model can be derived from the arguments above, along the lines of the algorithm shown in Figure 4.1. Various output measures can be computed from this model. The steady-state probability of a token in place PF gives the steady-state unavailability of the system. The transient probability of a token in place PF gives the instantaneous unavailability of the system. Similarly, instantaneous and steady-state availabilities of components and subsystems can be computed. The complexity of the GSPN model with repair is:

#(places) = 2 × #(components) + 2 × #(gates)
#(timed transitions) = 2 × #(components)
$$\#(\text{immediate transitions}) \le \sum_{g \in \text{gates}} \bigl(\#(\text{inputs to gate } g) + 1\bigr)$$

The complexity of the complementary net (number of places and transitions) is nearly the same as the complexity of the original net (modeling the FTRE with no repair). Thus, the size of the GSPN nearly doubles to incorporate repair.

4.3.2 Modeling Repair in SRN Models

The SRN model of the multiprocessor system where each component has an independent repair facility is shown in Figure 4.9. Comparing this model with the GSPN model shown in Figure 4.8, the advantage of SRNs over GSPNs becomes obvious. The only modification we needed to make to the SRN model shown in Figure 4.7 was to add transitions for the repair of components. The boolean function bool(G1), which was used to specify the reward function for the reliability of the multiprocessor system (Figure 4.7), can also be used to specify the reward function for the availability of the system. However, there is no halting condition in this case since the system is repairable (and repair transitions must not be "disabled" after system failure). Instead, there are guards for each failure transition. These guards disable the failure transitions while the system is down (i.e., components do not fail while the system is down). The failure transitions are enabled once the system is operational again. Contrast this SRN model with the equivalent GSPN model, where a complementary subnet of about the same size as the original subnet must be constructed to model the repair of components.

Besides the standard output measures such as availability (instantaneous and steady-state), we can also compute the cumulative up (or down) time of the system until some time t. This is done by computing the accumulated reward in system "up" (or "down") states. The complexity of the SRN model with repair is:

#(places) = 2 × #(components)
#(timed transitions) = 2 × #(components)
#(immediate transitions) = 0
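To make the size comparison concrete, the complexity summaries for the models with repair (Section 4.3.1 for the GSPN, above for the SRN) can be evaluated for the multiprocessor example. This is a rough sketch: the function names are hypothetical, the gate input counts are read off the gate table of Figure 4.7, and the immediate-transition entry simply evaluates the summation over gates without assuming whether it is exact or a bound.

```python
def gspn_repair_size(n_components, gate_inputs):
    """Size of the GSPN model with independent repair.
    gate_inputs lists the number of inputs of each gate."""
    return {
        "places": 2 * n_components + 2 * len(gate_inputs),
        "timed": 2 * n_components,
        # value of the summation over gates of (inputs + 1)
        "immediate_expr": sum(k + 1 for k in gate_inputs),
    }


def srn_repair_size(n_components):
    """Size of the SRN model with independent repair."""
    return {"places": 2 * n_components, "timed": 2 * n_components, "immediate": 0}


# Multiprocessor example: 10 components; gates G1..G8 have
# 2, 2, 3, 3, 2, 2, 2, 2 inputs respectively.
gspn = gspn_repair_size(10, [2, 2, 3, 3, 2, 2, 2, 2])
srn = srn_repair_size(10)
```

For this example the GSPN has 36 places against 20 for the SRN, and the SRN needs no immediate transitions at all.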

4.4 Modeling Repair Dependencies

In the previous section, we considered the simple case where each component of a system has its own independent repair facility. In practice, this is not the case. Usually, repair facilities are shared between various components of a system. If a component fails while the repair facility is busy, then it has to queue for service. Components that queue up for service are served according to some scheduling policy. In this section, we show how to model such repair dependency and various scheduling disciplines using GSPN and SRN models.

Figure 4.9: SRN Model of the multiprocessor system with Shared Memory (with repair) (figure labels: places X.up and X.dn for each of the components P1, P2, M1, M2, M3, D11, D12, D21, D22, N)

4.4.1 FCFS Repair Discipline

According to this discipline, components that arrive for repair at a repair facility are served in order of arrival. To illustrate how this discipline is modeled using SRNs, consider a parallel system of three components C1, C2, and C3 that share a repair facility R. The SRN subnet for this discipline is shown in Figure 4.10. The component that arrives for repair first grabs the token from place R (i.e., grabs the repair facility) and releases it only after repair is completed. To model the FCFS discipline, a queue is modeled as shown [83]. Each component Ci is identified by i tokens in any of the places Q1, Q2, or Q3. These places and the immediate transitions between them model the FCFS queue. The arcs with a `Z'-like sign are variable cardinality arcs, a special feature of SRNs. Each time transition tq1 (tq2) fires, it removes as many tokens as are present in Q1 (Q2) and places them in Q2 (Q3). The reward rate r for the availability of this system is assigned as:

if ((#tokens(C1.up) == 1) ∨ (#tokens(C2.up) == 1) ∨ (#tokens(C3.up) == 1)) then r = 1 (system is up) else r = 0 (system is down)

The GSPN subnet for the FCFS discipline is shown in Figure 4.11. It differs from the SRN subnet only in the modeling of the FCFS queue. Since GSPNs do not allow variable cardinality arcs, we need to model that behavior explicitly [31]. The net becomes significantly more complicated in this case.

Figure 4.10: SRN subnet for modeling FCFS repair discipline (figure labels: C1.up, C1.dn1, C1.dn2, C2.up, C2.dn1, C2.dn2, C3.up, C3.dn1, C3.dn2, R, Q1, Q2, Q3, tq1, tq2)

Figure 4.11: GSPN subnet for modeling FCFS repair discipline (figure labels: C1.up, C1.dn1, C1.dn2, C2.up, C2.dn1, C2.dn2, C3.up, C3.dn1, C3.dn2, R)

Figure 4.12: SRN subnet for modeling pre-emptive resume priority repair discipline (figure labels: C1.up, C1.dn, C2.up, C2.dn, C3.up, C3.dn, tr1, tr2, tr3)

4.4.2 Pre-emptive Resume Priority Repair Discipline

In this case, a priority is attached to each component. If a high priority component needs repair while a low priority component is being repaired, then the repair of the low priority component is pre-empted and resumed after the repair of the high priority component is completed. By virtue of the memoryless property of the exponential distribution, the remaining repair time has the same distribution as the original repair time. Let us consider the same example as before. The SRN model for this system is shown in Figure 4.12. Assign priorities x1, x2, and x3 (x1 > x2 > x3) respectively to the timed transitions tr1, tr2, and tr3. An enabled timed transition is disabled if another timed transition with higher priority becomes enabled before this transition fires. This models the pre-emptive resume priority discipline. Note that although the repair facility is a shared resource which is in contention when more than one component has failed, it is not explicitly modeled. The equivalent GSPN subnet is shown in Figure 4.13.
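The enabling rule for the prioritized repair transitions can be illustrated with a small sketch. It assumes distinct priorities and a single shared repair facility; the function name and the dictionary representation are hypothetical and not part of any SRN tool.

```python
def enabled_repair(down, priority):
    """Return the failed component whose repair transition remains enabled
    under the pre-emptive resume priority discipline, or None if no
    component is down. Larger priority values win."""
    if not down:
        return None
    return max(down, key=lambda c: priority[c])


priority = {"C1": 3, "C2": 2, "C3": 1}  # x1 > x2 > x3
```

For instance, if C3 is under repair and C1 fails, `enabled_repair({"C1", "C3"}, priority)` returns `"C1"`: the repair of C3 is pre-empted and, by the memoryless property, simply restarts with the same distribution once C1 is repaired.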

4.4.3 Non-pre-emptive Priority Repair Discipline

In this case, the component which is currently being repaired is not pre-empted if a higher priority component arrives for repair. However, after the current repair completes, the highest priority component among the queued components is selected for repair. To illustrate this, let us consider a system of three components C1, C2, and C3 in decreasing order of priority. The GSPN model for this system is shown in Figure 4.14. The priority in this case is modeled by inhibitor arcs. For instance, these arcs guarantee that if C1 and C2 (or C3) are waiting in the queue for repair, then C1 will begin repair first and C2 (or C3) will be repaired after C1 finishes repair. This could also be modeled by simply assigning priorities x1, x2, and x3 (x1 > x2 > x3) respectively to the immediate transitions t1, t2, and t3.

4.4.4 Processor-sharing Repair Discipline

According to this discipline, no queuing takes place at the repair facility. Instead, each failed component perceives the repair facility to be slowed down by a factor of k if there are k failed components waiting to be repaired at any instant. This is easily modeled by GSPNs and SRNs by assigning marking-dependent transition rates to transitions tr1, tr2, and tr3 as shown in Figure 4.15. Here $1/\mu_i$ is the mean repair time for component $C_i$.

Figure 4.13: GSPN subnet for modeling pre-emptive resume priority repair discipline (figure labels: C1.up, C2.up, C3.up, R)

Figure 4.14: GSPN/SRN subnet for modeling non-pre-emptive priority repair discipline (figure labels: C1.up, C1.dn1, C1.dn2, C2.up, C2.dn1, C2.dn2, C3.up, C3.dn1, C3.dn2, R, t1, t2, t3)

Transition   Rate Function
tr1          μ_1 / (#tokens(C1.dn) + #tokens(C2.dn) + #tokens(C3.dn))
tr2          μ_2 / (#tokens(C1.dn) + #tokens(C2.dn) + #tokens(C3.dn))
tr3          μ_3 / (#tokens(C1.dn) + #tokens(C2.dn) + #tokens(C3.dn))

Figure 4.15: GSPN/SRN subnet for modeling processor-sharing repair discipline (figure labels: C1.up, C1.dn, C2.up, C2.dn, C3.up, C3.dn, tr1, tr2, tr3)

The overall GSPN models in all the above cases will contain the subnets shown in the figures above and the mesh of immediate transitions and places which models the operational dependency of the system on its components. However, compared to the no-repair-dependency case (Figure 4.8), this mesh will be more complicated since there is now more than one place per component where a token indicates failure of that component. For example, a token in place C1.dn1 or C1.dn2 (Figures 4.11 and 4.14) indicates that component C1 is down.
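The marking-dependent rates of the processor-sharing discipline (Section 4.4.4) can be sketched as a small function. The names used here are hypothetical; `mu` holds the repair rate of each component, taken to be the reciprocal of its mean repair time, and `tokens_dn` holds the number of tokens in each Ci.dn place.

```python
def processor_sharing_rates(mu, tokens_dn):
    """Rate of each repair transition when the facility is shared equally.
    A transition whose Ci.dn place is empty is not enabled (rate 0)."""
    k = sum(tokens_dn.values())  # number of failed components
    rates = {}
    for comp, mu_i in mu.items():
        if tokens_dn.get(comp, 0) > 0:
            rates[comp] = mu_i / k  # facility slowed down by a factor of k
        else:
            rates[comp] = 0.0
    return rates


mu = {"C1": 1.0, "C2": 2.0, "C3": 4.0}
two_down = processor_sharing_rates(mu, {"C1": 1, "C2": 1, "C3": 0})
```

With C1 and C2 down, each repair proceeds at half its nominal rate; when only one component is down, its repair runs at the full rate.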

4.5 Conclusions

We have described a methodology to construct GSPN and SRN models for dependability modeling of systems. We have presented algorithms to convert a fault tree model into GSPNs and SRNs. We considered various kinds of distributions assigned to components in the fault tree model, such as defective failure-time distributions, non-defective failure-time distributions, failure probabilities, instantaneous unavailabilities, etc. By subnet constructions, we have illustrated how GSPNs and SRNs can model these different cases.

We then showed how to incorporate repair in these models. First, we considered the case where each component has an independent repair facility. Then, we considered the case where several components share a repair facility. Such repair dependency cannot be modeled by combinatorial model types. Now suppose that we are given a combinatorial model (such as a fault tree) of a system that specifies the operational dependence of the system on its components, and we wish to model repair dependency among the components. In this case, our methodology provides a direct way to generate a GSPN or a SRN model which models the operational dependence of the system as well as the repair dependency among the components.

In the case of a shared repair facility, the failed components must queue up for repair. We considered several scheduling disciplines to serve requests in the queue: FCFS, pre-emptive resume priority, non-pre-emptive priority, and processor sharing. We provided GSPN and SRN subnet constructions for each of these disciplines. These subnet constructions allow us to compare SRNs with GSPNs as dependability model types. We find that for dependability models of repairable systems, the complexity of GSPN models is significantly higher than the complexity of equivalent SRN models. Since SRNs include all the features of GSPNs, we conclude that the added features of SRNs, such as rewards, variable cardinality arcs, etc., greatly simplify model construction and significantly reduce the model size.

Chapter 5

A Methodology for Formal Expression of Hierarchy in Model Solution

In this chapter, we describe a methodology for the formal specification of hierarchy both in model specification and in model solution. We consider a unified view of the various approaches in which hierarchical modeling manifests itself, as described in Chapter 2. The overall system model consists of one or more submodels of possibly different types (this is known as hybrid hierarchical modeling [142]) which may interact with each other in some manner. The overall model solution is obtained by solving the submodels and combining the submodel solutions in some fashion. In the modeling toolkit SHARPE [142, 144], the hierarchy among the various submodels is implicit. The order of solution of the submodels is specified in the input file, and parameter passing between submodels is performed by passing the results from one submodel to another. The formal specifications of our methodology capture how the submodels interact with each other and how the overall model solution is obtained by combining submodel solutions. Our methodology brings the hierarchy in model solution to the fore and gives the user a better understanding of the submodel interactions. To the best of our knowledge, no such effort has been made in the past.

The rest of this chapter is organized as follows. In Section 5.1, we describe our notion of hierarchical modeling. In Section 5.2, we present a methodology for the formal specification of hierarchy in model specification and solution. In Section 5.3, we present several examples to illustrate the power of our methodology. These examples differ in the kind of parameters passed among the submodels (distributions, reward rates, real numbers, etc.) and in the modeling techniques employed (non-iterative hierarchical modeling, iterative hierarchical modeling, reward-based performability modeling, approximate model decomposition, state-space aggregation, etc.). Our conclusions are presented in Section 5.4.

5.1 Hierarchical Modeling

We view the overall system model as a collection of various submodels which interact with each other. Let the system model consist of k submodels:

$$M_1, M_2, \ldots, M_k . \qquad (5.1)$$

These submodels could be of different types, i.e., some of these submodels could be fault trees, some could be Markov models, some could be stochastic Petri nets, etc. The submodel types that are commonly used in dependability, performance, and performability modeling are allowed. These include (but are not restricted to):

• Reliability block diagrams
• Fault trees (with and without repeated events)
• Reliability graphs
• Product-form queuing networks
• Series-parallel directed acyclic graphs
• Generalized stochastic Petri nets
• Stochastic reward nets
• Stochastic activity networks
• Discrete-time Markov chains
• Continuous-time Markov chains
• Markov reward models
• Semi-Markov (reward) models

However, for the purpose of specification, we treat each submodel as a black box which has some inputs and some outputs. Specialized solution techniques are used within each black box to compute the outputs from the inputs. Thus each submodel can simply be considered a mapping from the inputs to the outputs. The outputs could be expressed in symbolic, semi-numeric, or numeric form. The same holds for the inputs to any black box. The types of the inputs and outputs are specified by the modeler. The actual values of the inputs may be computed by other submodels or may be supplied by the user.

There could be many possible outputs from a submodel. Consider, for example, a Markov chain with a single absorbing state, and suppose it models the reliability of a (sub)system. The set of inputs to this model consists of the initial state probabilities for the Markov chain and the transition rates between the various states. The possible outputs that can be computed from this model are the mean time to absorption (usually the MTTF, mean time to failure), reliability, unreliability, and the individual probabilities of being in different states at some time t. The user can specify that one or more of these outputs be computed. Likewise, the user can also specify the solution method depending upon the type of solution desired. For example, a semi-symbolic method may yield the reliability of a system as a closed-form expression in t (the time variable), whereas a numerical method will yield a numerical value of reliability at a given instant of time t.

The solution of the overall system model is computed by interactions among the various submodels. By interactions we mean that parameters are passed between submodels. These parameters could be a numerical value (of mean, variance, transition rates, coverage probabilities, reward rates, steady-state availability, etc.), a closed-form expression (e.g., time-dependent coverage probability), or a semi-symbolic distribution function (e.g., (un)reliability at time t, instantaneous (un)availability at time t, etc.). If parameters are passed from submodel $M_i$ to submodel $M_j$, then we write $M_i \prec M_j$. The relation $\prec$ defines an order among the submodels $M_1, M_2, \ldots, M_k$. This order can be represented by a directed graph G in which each node represents a submodel and there is an edge directed from node i to node j iff $M_i \prec M_j$. Such a graph has been named an import graph [33]. Let the transitive closure of the relation $\prec$ be denoted by $\prec^{*}$. To compute the solution of the overall model, two cases exist:

• G is acyclic. This implies that the transitive closure $\prec^{*}$ of the relation $\prec$ defines a partial order among the submodels and provides one or more linear orderings in which the submodels may be solved. If $M_i \prec M_j$, then $M_i$ must be solved before $M_j$ since $M_j$ receives input from the outputs of $M_i$. If neither $M_i \prec^{*} M_j$ nor $M_j \prec^{*} M_i$ holds, then $M_i$ and $M_j$ can be solved in any order since the solution of one does not affect the solution of the other. We carry out a topological sort to linearize the evaluation of the submodels.

• G is cyclic. This implies that there exists at least one pair of submodels $M_i$ and $M_j$ such that $M_i \prec^{*} M_j$ and $M_j \prec^{*} M_i$. This case presents a problem regarding the solution of the overall model, and we must resort to an iterative solution. A typical approach is to break the cyclicity among the submodels. One of the submodels $M_x$ is chosen and some starting values (an initial guess) for the input parameters of this submodel are supplied. Using these starting values, the submodel is solved and its results are used to solve the rest of the submodels in the cycle. The outputs of these submodels determine the next input parameters for submodel $M_x$. One such pass through the cycle constitutes an iteration. Iterations are continued until convergence is achieved. Such a solution method based on fixed-point iteration has been successfully used by many researchers [26, 27, 33, 45, 75, 117, 155]. Under certain conditions, existence and uniqueness of the solution can be proved. However, the actual solution method is not our focus. We are more interested in how the user can specify such interactions among the various submodels.
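The two cases above can be sketched in Python. This is a minimal illustration under stated assumptions rather than the solution procedure of any particular toolkit: the import graph is given as a dictionary mapping each submodel name to the set of submodels whose outputs it uses, the standard-library `graphlib` module (Python 3.9 or later) performs the topological sort, and the fixed-point loop handles a single scalar parameter exchanged around a cycle. The function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def solution_order(imports):
    """Return one linear order in which the submodels can be solved,
    or None if the import graph contains a cycle.
    imports[m] is the set of submodels that pass parameters to m."""
    try:
        return list(TopologicalSorter(imports).static_order())
    except CycleError:
        return None


def fixed_point(step, x0, tol=1e-10, max_iter=1000):
    """Iterate x <- step(x) from the initial guess x0 until successive
    values differ by less than tol. step stands for one pass through all
    submodels of the cycle, returning the next input for the chosen one."""
    x = x0
    for _ in range(max_iter):
        x_new = step(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")
```

For example, `solution_order({"M2": {"M1"}, "M3": {"M1", "M2"}})` gives the order M1, M2, M3, whereas a graph in which two submodels feed each other yields `None`, signalling that an iteration such as `fixed_point` is needed. Convergence is not guaranteed in general, which is why the sketch caps the number of iterations.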

5.2 Formal Expression of Hierarchy in Model Solution

We define an interconnection matrix A to specify the interactions among the submodels. Each submodel $M_j$ has inputs denoted by $I_{j1}, I_{j2}, \ldots, I_{jn_j}$ and outputs denoted by $O_{j1}, O_{j2}, \ldots, O_{jm_j}$. Let $E_1, E_2, \ldots, E_l$ be the set of external inputs to the overall model. Let $L_1, L_2, \ldots, L_p$ be the set of internal variables. Each input to a submodel is determined by the external inputs, the internal variables if any, and the outputs from other submodels. This relationship is expressed by a subroutine which provides the mapping from the set of external inputs, internal variables, and submodel outputs to submodel inputs.

The interconnection matrix A is an $R \times C$ matrix where

$$R = \sum_{j=1}^{k} n_j \qquad \text{and} \qquad C = l + p + \sum_{j=1}^{k} m_j .$$

For each entry $a_{ij}$ in this matrix, the index i denotes a specific input to a submodel and the index j denotes a specific output of a submodel, an external input, or an internal variable. Assuming the columns of matrix A are indexed $1, 2, \ldots, C$: if $1 \le j \le l$, then j denotes an external input; if $l < j \le l + p$, then j denotes an internal variable; otherwise j denotes an output from a submodel. If $a_{ij} = 1$, then the output (or external input or internal variable) j contributes to the input i. If $a_{ij} = 0$, then the output (or external input or internal variable) j does not contribute to the input i.

The matrix A captures all the possible interactions between the submodels. It is possible that some outputs of a submodel are input to some submodels while others are input to some other submodels. It is also possible that a single output contributes to the inputs of several different submodels. It is easy to see that all possible scenarios are captured by A. Typically, the interconnection matrix A is very sparse (this will become more obvious in the next section, where we discuss several examples). The user needs to specify only the non-zero entries of this matrix. A modeling toolkit may provide either a high-level software interface or a graphical user interface. The user will then specify the model either using a high-level language or graphically. The model specifications by a user include the various submodels, the interactions between the submodels, the types of inputs and outputs from each submodel, and the types of external inputs and outputs. The interconnection matrix can be automatically generated from these specifications.
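Since only the non-zero entries of A need to be specified, a sparse representation is natural. The following sketch assumes 1-based row and column indices as in the text and stores the non-zero entries as a set of (i, j) pairs; the function names and the small example are hypothetical and are not taken from any of the models discussed in this chapter.

```python
from typing import List, Set, Tuple


def matrix_shape(n_inputs: List[int], m_outputs: List[int],
                 n_external: int, n_internal: int) -> Tuple[int, int]:
    """(R, C): R sums the submodel inputs; C counts external inputs,
    internal variables, and all submodel outputs."""
    return sum(n_inputs), n_external + n_internal + sum(m_outputs)


def column_kind(j: int, n_external: int, n_internal: int) -> str:
    """Classify a 1-based column index of A."""
    if 1 <= j <= n_external:
        return "external input"
    if n_external < j <= n_external + n_internal:
        return "internal variable"
    return "submodel output"


def contributors(nonzero: Set[Tuple[int, int]], i: int) -> List[int]:
    """Columns j with a_ij = 1, i.e. everything that feeds input i."""
    return sorted(j for (row, j) in nonzero if row == i)


# Hypothetical model: two submodels with 2 and 1 inputs, 1 and 2 outputs,
# 2 external inputs and 1 internal variable.
shape = matrix_shape([2, 1], [1, 2], n_external=2, n_internal=1)
nonzero = {(1, 1), (2, 2), (2, 3), (3, 4)}
```

Here input 3 (the single input of the second submodel) is fed by column 4, which is the first submodel output, so the second submodel must be solved after the first.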

Figure 5.1: Reliability block diagram for RAID (figure labels: blocks G1, G2, ..., GN)

It is possible that several different submodel outputs (possibly from different submodels) contribute to a single input of some submodel. To express the relationship from external inputs and submodel outputs to submodel inputs, we require a mapping. We represent this mapping by a subroutine. Let $u_1, u_2, \ldots, u_p$ be p different submodel outputs, $e_1, e_2, \ldots, e_q$ be q different external inputs, and $l_1, l_2, \ldots, l_r$ be r different internal variables from which a single value of some submodel input v is computed. Let $\sigma$ be the subroutine which computes v. This mapping is denoted as:

$$\sigma(e_1, e_2, \ldots, e_q, u_1, u_2, \ldots, u_p, l_1, l_2, \ldots, l_r) = v . \qquad (5.2)$$

In practice it is expressed in subroutine form, as the examples in the next section will show. If a subroutine has only a single input parameter u and the output v is equal to the parameter u, then the subroutine $\sigma$ is an identity subroutine. In this case we denote it by I. In general, we have a vector of subroutines S of length R (R is equal to the total number of submodel inputs), denoted as:

$$S = [\sigma_1, \sigma_2, \ldots, \sigma_R] . \qquad (5.3)$$

Some (or all) of these $\sigma_i$ could be identity subroutines. Suppose now that we are given a complete specification of a system model in the form described above. From the interconnection matrix A, it is easy to determine which external inputs and submodel outputs determine a given submodel input. This fixes the input parameters of the subroutine for each submodel input. The subroutine specifications completely determine the interactions between the various submodels. We now provide several examples which illustrate exactly how this specification of hierarchy works. The usefulness of the formal specification of hierarchy will also become clear.

5.3 Examples

5.3.1 Hierarchical Reliability Model of Disk Arrays

Let us consider a reliability model of disk array systems [132]. A disk array consists of N groups of disks. Each group consists of G disks. Failure of one disk in each group is tolerable and data of the failed disk can be reconstructed using the parity (or redundant) and data stored on other disks. After a disk fails, a recovery action is initiated which consists of replacing a failed disk by a spare disk and reconstructing the data on the spare disk from the data and parity stored on the other disks. If another disk in the same group fails before recovery is complete, then data is lost. The disk array is considered failed if data is lost in any of the N groups. The reliability of each group is modeled using the Markov model shown in Figure 5.2. The reliability of the disk array is modeled by a reliability block diagram (RBD) shown in Figure 5.1. Thus, this is an example of hierarchical composition, i.e., the overall model is composed of several Markov models at lower level and a RBD model at the higher level.

Figure 5.2: Markov reliability model of a group of disks

The parameters of each group submodel are $G$, three further parameters that appear in the transition rates of Figure 5.2, $p$, and $P(0)$ (the initial state probability vector). The outputs of the group reliability submodels are inputs to the RBD. Suppose we wish to compute numerical values of the reliability and the mean time to data loss (MTDL) of a disk array. The time $t$ at which each Markov submodel must be solved then also becomes an input to each Markov submodel. A numerical value of reliability from each Markov submodel suffices to compute the numerical reliability of the disk array. To compute the MTDL of the disk array, however, a semi-numerical value of reliability (an expression in the time variable $t$) is needed from each Markov submodel. The semi-numerical reliability is therefore chosen as the output of each Markov submodel. These outputs are all inputs to the RBD submodel, which has two outputs: the reliability of the disk array and the MTDL of the disk array. The RBD submodel computes the reliability of the disk array from the reliabilities of the individual groups. The overall model thus consists of $K = N + 1$ submodels.

Figure 5.3 depicts the interactions between the submodels; the overall model is enclosed within the dashed box. Such a model usually carries a great deal of superfluous detail, however, because the submodels $M_1, M_2, \ldots, M_N$ are identical. When this property holds, the overall model needs to be decomposed into only two submodels, as shown in Figure 5.4, which gives a much more compact organization. The output of the Markov submodel is replicated $N$ times and supplied as $N$ inputs to the RBD submodel, which computes the reliability of the disk array as the product of the reliabilities input to it. The interconnection matrix for this model organization is shown in Table 5.1.

The model can be made more compact still, since all inputs to the RBD submodel are identical. The new model organization is shown in Figure 5.5. The RBD submodel now has two inputs: the reliability output by the Markov submodel and the number of groups $N$. The reliability of the disk array is computed as the $N$th power of the reliability of a group. The interconnection matrix for this model organization is shown in Table 5.2. The external inputs $E_1, E_2, \ldots, E_8$ denote, in order, the six parameters of the Markov submodel listed above ($G$, the three transition-rate parameters, $p$, and $P(0)$), the time $t$, and the number of groups $N$. The subroutine for each of the inputs is an identity subroutine, i.e., $I_{11} = E_1$, $I_{12} = E_2$, ..., $I_{21} = N$, $I_{22} = O_{11}$. The outputs of the RBD submodel, $O_{21}$ and $O_{22}$, denote the reliability and the MTDL of the disk array, respectively. These may be called the external outputs of the overall model.
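The RBD step of the compact organization can be sketched in a few lines of Python. The sketch below raises a group reliability to the Nth power and approximates the MTDL with the standard identity that a mean time to failure equals the integral of the reliability function over time. Note that it swaps in numerical quadrature (the trapezoid rule) for the semi-numerical expression described above, and that the exponential group reliability, its rate, the number of groups, and the integration horizon are hypothetical stand-ins, not the Markov model of Figure 5.2.

```python
import math
from typing import Callable


def array_reliability(group_reliability: float, n_groups: int) -> float:
    """Series RBD of identical groups: the product of N equal reliabilities."""
    return group_reliability ** n_groups


def array_mtdl(
    r_group: Callable[[float], float],
    n_groups: int,
    horizon: float,
    steps: int = 100_000,
) -> float:
    """Approximate the MTDL as the integral of the array reliability over
    [0, horizon], using the trapezoid rule with `steps` equal intervals."""
    h = horizon / steps
    total = 0.5 * (
        array_reliability(r_group(0.0), n_groups)
        + array_reliability(r_group(horizon), n_groups)
    )
    for k in range(1, steps):
        total += array_reliability(r_group(k * h), n_groups)
    return total * h


# Hypothetical stand-in for the group submodel output: R_group(t) = exp(-a t).
RATE = 1.0e-4  # assumed value, per hour
N_GROUPS = 10  # assumed number of groups


def r_group(t: float) -> float:
    return math.exp(-RATE * t)


reliability_at_1000h = array_reliability(r_group(1000.0), N_GROUPS)
mtdl = array_mtdl(r_group, N_GROUPS, horizon=50_000.0)
```

With this stand-in, the array reliability is itself exponential with rate N times the group rate, so the computed MTDL can be checked against the closed form 1/(N a). A real group submodel would return the reliability obtained from its Markov chain instead.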

      I11  I12  I13  I14  I15  I16  I17  I21  I22  ...  I2N
E1      1    0    0    0    0    0    0    0    0  ...    0
E2      0    1    0    0    0    0    0    0    0  ...    0
E3      0    0    1    0    0    0    0    0    0  ...    0
E4      0    0    0    1    0    0    0    0    0  ...    0
E5      0    0    0    0    1    0    0    0    0  ...    0
E6      0    0    0    0    0    1    0    0    0  ...    0
E7      0    0    0    0    0    0    1    0    0  ...    0
O11     0    0    0    0    0    0    0    1    1  ...    1
O21     0    0    0    0    0    0    0    0    0  ...    0
O22     0    0    0    0    0    0    0    0    0  ...    0

Table 5.1: Interconnection matrix for RAID model organization

      I11  I12  I13  I14  I15  I16  I17  I21  I22
E1      1    0    0    0    0    0    0    0    0
E2      0    1    0    0    0    0    0    0    0
E3      0    0    1    0    0    0    0    0    0
E4      0    0    0    1    0    0    0    0    0
E5      0    0    0    0    1    0    0    0    0
E6      0    0    0    0    0    1    0    0    0
E7      0    0    0    0    0    0    1    0    0
E8      0    0    0    0    0    0    0    1    0
O11     0    0    0    0    0    0    0    0    1
O21     0    0    0    0    0    0    0    0    0
O22     0    0    0    0    0    0    0    0    0

Table 5.2: Interconnection matrix for a compact RAID model organization

Figure 5.3: Organization of the overall disk array reliability model

Figure 5.4: A more compact organization of the overall model

Figure 5.5: An even more compact organization of the overall model

5.3.2 Hierarchical Reliability Model Based on Behavioral Decomposition

We now consider a hierarchical reliability model adapted from [60]. The reliability of an $N$-component system can be modeled by the stochastic reward net (SRN) [29] shown in Figure 5.6. All the components are assumed to be identical and statistically independent. Each component fails at the same rate, the component failure rate. After the first component failure, the system is reconfigured at the reconfiguration rate. If any of the other components fails while reconfiguration is underway, the system is considered failed. Such a failure is said to be caused by a near-coincident fault (NCF). Furthermore, a single point failure (SPF), which occurs at its own rate, may happen during reconfiguration and would also cause a system failure. Components are not repaired, so eventually a redundancy failure occurs, i.e., the last token in place $p_{up}$ is removed. The rates of the failure transition out of place $p_{up}$ and of transition $t_{ncf}$ depend on the number of tokens in place $p_{up}$. As soon as an NCF (a token in place $p_{ncf}$) or an SPF (a token in place $p_{spf}$) occurs, the system is considered failed and a halting condition is used to disable all the transitions in the net. In other words, the underlying Markov chain reaches an absorbing state. The halting condition is also enabled after a redundancy failure occurs. The Markov chain resulting from this SRN is extremely stiff because of the large disparity between the reconfiguration and SPF rates on the one hand and the component failure rate on the other.

This model can be decomposed into three submodels: a fault-error handling (FEH) submodel (shown in Figure 5.7 and termed submodel $M_1$), an NCF competition (NCFC) submodel (shown in Figure 5.8 and termed submodel $M_2$), and a fault-occurrence (FO) submodel (shown in Figure 5.9 and termed submodel $M_3$). The FEH submodel represents the competition between system reconfiguration after a failure and an SPF. This submodel is solved to obtain the exit probabilities $p_C$ (the probability of a token in place $p_C$) and $p_S$ (the probability of a token in place $p_S$), and the distribution of the time to reach a resolution between the competing reconfiguration and SPF processes. This distribution is input to the NCFC submodel, which makes that submodel a semi-Markov stochastic Petri net (SMSPN) [34]. The underlying model in this case is a semi-Markov model, which is solved to obtain the probability of avoiding an NCF.
Probabilities obtained from solving the NCFC and FEH submodels are input to the FO submodel, which is solved to obtain a reliability estimate of the system. In the SRN shown in Figure 5.9, the probabilities $c_{i+1,i}$ and $s_{i+1,i}$ are given by

$$c_{i+1,i} = p_{RS}(i) \cdot p_C, \qquad s_{i+1,i} = p_{RS}(i) \cdot p_S,$$

where $p_{RS}(i)$ is obtained from the NCFC submodel as the probability of a token in place $p_{RS}$ for a given $i = 1, \ldots, N - 1$. In this SRN model, as soon as a token reaches place $p_{spf}$ or $p_{ncf}$, the halting condition is enabled.

Figure 5.6: Reliability model of a three-component system

Figure 5.7: Fault-error handling (FEH) submodel

Figure 5.8: Semi-Markov NCF competition (NCFC) submodel

Figure 5.9: Fault-occurrence (FO) submodel ($i$ is the number of tokens in place $q_{up}$)

Submodel $M_1$ has two input parameters, $I_{11}$ and $I_{12}$, which correspond to the two transition rates of the FEH submodel. It has three outputs, $O_{11}$, $O_{12}$, and $O_{13}$, which represent cdf(single fault), $p_C$, and $p_S$, respectively. Submodel $M_2$ has three input parameters, $I_{21}$, $I_{22}$, and $I_{23}$, which respectively represent the component failure rate, cdf(single fault), and $i$. Its single output, $O_{21}$, represents $p_{RS}(i)$. Submodel $M_3$ has four inputs, $I_{31}$, $I_{32}$, $I_{33}$, and $I_{34}$, which respectively represent $N$, the component failure rate, $c_{i+1,i}$, and $s_{i+1,i}$. This submodel is solved for the system reliability, which is represented by the output $O_{31}$. The overall model has four external inputs, $E_1$, $E_2$, $E_3$, and $E_4$, which correspond respectively to the two FEH rates, the component failure rate, and $N$. An internal variable $i$, represented by $L_1$, takes values from 1 to $N - 1$. The interconnection matrix for this model is shown in Table 5.3.

      I11  I12  I21  I22  I23  I31  I32  I33  I34
E1      1    0    0    0    0    0    0    0    0
E2      0    1    0    0    0    0    0    0    0
E3      0    0    1    0    0    0    1    0    1
E4      0    0    0    0    0    1    0    0    0
L1      0    0    0    0    1    0    0    0    0
O11     0    0    0    1    0    0    0    0    0
O12     0    0    0    0    0    0    0    1    0
O13     0    0    0    0    0    0    0    0    1
O21     0    0    0    0    0    0    0    1    1
O31     0    0    0    0    0    0    0    0    0

Table 5.3: Interconnection matrix for hierarchical reliability model of non-repairable $N$-component system

The subroutine vector is $S = [\phi_1, \phi_2, \ldots, \phi_9]$. The non-identity subroutines are

$\phi_8(O_{12}, O_{21})$ begin return($O_{12} \cdot O_{21}$) end

and

$\phi_9(O_{13}, O_{21})$ begin return($O_{13} \cdot O_{21}$) end

Note that these two subroutines perform the same function. In an actual implementation there will therefore be a single subroutine to which different parameters are passed. All the remaining subroutines are identity subroutines. The overall model organization is shown pictorially in Figure 5.10.
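The flow of values through this hierarchy can be sketched in Python as follows. This is a minimal sketch under a strong assumption: both FEH transitions are taken to be exponential, so that the exit probabilities and the NCFC result have simple closed forms (a race between independent exponentials) and the cdf of the time to resolution can be represented by a single rate. The function names and all numerical rates are hypothetical, and the FO submodel itself is not solved; the sketch stops at the inputs it would receive.

```python
from typing import Dict, Tuple


def solve_feh(reconf_rate: float, spf_rate: float) -> Tuple[float, float, float]:
    """Stand-in for submodel M1 (FEH): two competing exponential transitions.

    Returns (resolution_rate, p_c, p_s). The cdf of the time to resolution is
    represented only by the rate of the assumed exponential distribution.
    """
    total = reconf_rate + spf_rate
    return total, reconf_rate / total, spf_rate / total


def solve_ncfc(failure_rate: float, resolution_rate: float, i: int) -> float:
    """Stand-in for submodel M2 (NCFC): probability that the resolution
    completes before one of i components fails (race of two exponentials)."""
    return resolution_rate / (resolution_rate + i * failure_rate)


def product(a: float, b: float) -> float:
    """Single shared body of the two non-identity subroutines."""
    return a * b


def fo_branch_probabilities(
    reconf_rate: float, spf_rate: float, failure_rate: float, n: int
) -> Dict[int, Tuple[float, float]]:
    """Return {i: (c, s)} for i = 1..n-1, the values fed to inputs I33 and I34."""
    resolution_rate, p_c, p_s = solve_feh(reconf_rate, spf_rate)  # O11, O12, O13
    branches = {}
    for i in range(1, n):  # internal variable L1
        p_rs = solve_ncfc(failure_rate, resolution_rate, i)  # O21
        branches[i] = (product(p_c, p_rs), product(p_s, p_rs))
    return branches


# Hypothetical rates: fast recovery processes, slow component failures.
branches = fo_branch_probabilities(
    reconf_rate=100.0, spf_rate=1.0, failure_rate=0.01, n=4
)
```

For each i, the two returned values sum to the probability of avoiding an NCF, so one minus their sum is the probability of the NCF branch in the FO submodel. Using one `product` function for both inputs mirrors the remark that a single subroutine can serve both with different parameters.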

Figure 5.10: Overall model for $N$-component system

5.3.3 Performance Model of an Interactive System

We now present an example of hierarchical performance modeling based on aggregation. This example is adapted from [19]. Consider the queuing network model of an interactive computer system shown in Figure 5.11. There are $N_t$ terminals, each of which submits requests at a given rate while in the think state. After a request is submitted, it may have to wait in the memory queue before it joins the CPU queue, because at most $k$ jobs can be in the CPU-I/O subsystem at any given time. The computer subsystem consists of a CPU and an I/O device, whose service rates carry the subscripts 0 and 1, respectively. A request submitted by a terminal is queued at the CPU queue. After completing service at the CPU, the request either is completed, with probability $v_0$, or requests I/O, with probability $v_1$ (and is scheduled in the I/O queue). After completing I/O service, the request rejoins the CPU queue. Both the CPU queue and the I/O queue are served in FCFS order. The SRN model for the system is shown in Figure 5.12. Assume that $v_0$
