Reliability Modelling of Whole RAID Storage Subsystems

A Thesis Submitted For the Degree of

Master of Science (Engineering) in the Faculty of Engineering

by

Prasenjit Karmakar

Computer Science and Automation
Indian Institute of Science
BANGALORE - 560 012

March 2012


©Prasenjit Karmakar March 2012 All rights reserved

TO

My Parents

Acknowledgements

There are many people who have assisted me in this journey and I would like to express my gratitude to each one of them. I would like to first thank my supervising professor, Prof. K. Gopinath, whose expertise, guidance and constant encouragement were integral to my thesis. I am also grateful to the Computer Science department in particular and the Indian Institute of Science in general for providing an excellent environment conducive to conducting research. I would like to thank the Chairman, Prof. Y. Narahari, for his wonderful words of wisdom and encouragement. I would also like to thank all the non-teaching staff members of our department, whose hard work gives us a smooth academic life. I want to thank Mr. Kishore Sampathkumar from LSI LOGIC for sharing information regarding large storage systems.

I want to thank Dr. Dave Parker, Dr. Hakan Younes and some members of the PRISM team at Oxford University for sharing a lot of information regarding the PRISM tool. Moreover, the conversations with Jon Elerath from NetApp regarding Monte-Carlo simulation algorithms for RAID subsystems were very useful. Finally, the cheerful assistance I received from my friends (Arun, Santanu, Pankaj, Spigel and many others) made my stay pleasant as ever.


Abstract

Reliability modelling of RAID storage systems with their various components, such as RAID controllers, enclosures, expanders, interconnects and disks, is important from a storage system designer's point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost-reliability trade-offs and conduct sensitivity analyses.

We present a reliability model for RAID storage systems where we try to model all the components as accurately as possible. We use several state-space reduction techniques, such as aggregating all in-series components and hierarchical decomposition, to reduce the size of our model. To automate computation of reliability, we use the PRISM model checker as a CTMC solver where appropriate.

Initially, we assume a simple 3-state disk reliability model with independent disk failures. Later, we assume a Weibull model for the disks; we also consider a correlated disk failure model to check correspondence with the field data available. For all other components in the system, we assume an exponential failure distribution. To use the CTMC solver, we approximate the Weibull distribution for a disk using a sum of exponentials and we first confirm that this model gives results that are in reasonably good agreement with those from sequential Monte Carlo simulation methods for RAID disk subsystems.

Next, our model for whole RAID storage systems (that includes, for example, disks, expanders and enclosures) uses Weibull distributions and, where appropriate, correlated failure modes for disks, and exponential distributions with independent failure modes for all other components. Since the CTMC solver cannot handle the size of the resulting models, we solve such models using a hierarchical decomposition technique. We are able to model fairly large configurations with up to 600 disks using this model.

We can use such reasonably complete models to conduct several "what-if" analyses for many RAID storage systems of interest. Our results show that, depending on the configuration, spanning a RAID group across enclosures may increase or decrease reliability. Another key finding from our model results is that a redundancy mechanism such as multipathing is beneficial only if a single failure of some other component does not cause data inaccessibility for a whole RAID group.

Contents

Acknowledgements
Abstract
1 Introduction
  1.1 Motivation
  1.2 Previous Work
  1.3 Contributions of the Thesis
  1.4 Outline of the Thesis
2 Storage System Architecture
  2.1 Main Components Present in Storage Systems
  2.2 Some Storage System Configurations
3 Background on Disk Failure
  3.1 Disk Failure Mechanisms
    3.1.1 Operational Failures
    3.1.2 Latent Defects
  3.2 Disk Reliability Models
    3.2.1 Bathtub Curve
    3.2.2 Increasing Hazard Rate Model
4 Reliability Related Definitions
  4.1 Reliability
  4.2 Hazard Rate
5 Reliability Measures, Model Inputs and Model Assumptions
  5.1 Reliability Measures
  5.2 Modelling Assumptions
  5.3 Model Inputs
6 Modelling Techniques
  6.1 PRISM
    6.1.1 PRISM Model Checker
    6.1.2 PRISM Discrete-event Simulator
    6.1.3 Selection of PRISM: Comparison with Other Tools
    6.1.4 PRISM Tool Settings
  6.2 PRISM Model of Small Systems
    6.2.1 Model Each Component Separately
    6.2.2 State Space Reduction Techniques
  6.3 Modelling Systems Using PRISM Discrete-event Simulator
  6.4 Hierarchical Decomposition
    6.4.1 Definition
    6.4.2 Correctness of Hierarchical Decomposition
    6.4.3 Methodology
    6.4.4 Discussion and Explanation of the Results
    6.4.5 Hierarchical Decomposition for Spanned RAID groups
    6.4.6 3-level Modelling using Hierarchical Decomposition
  6.5 Modelling of Some Known Field Configurations
    6.5.1 Modelling Single Controller-pair Systems
    6.5.2 Modelling Systems with Multiple Controller-pairs
7 Detailed Model of Disk Subsystems
  7.1 Detailed Disk Model: Modelling Weibull Distribution
    7.1.1 Weibull Approximation by a 3-state Markov Model
    7.1.2 Model Parameters for TTOP
    7.1.3 Model Parameters for TTR and TTscr
    7.1.4 Detailed Model for a Single Disk
    7.1.5 Validation of the Detailed Model for Disk Subsystems against Monte Carlo Simulation and DDF(t) Equation Results
    7.1.6 Whole System Modelling with Detailed Disk Model
  7.2 Memorylessness property of a Markov Model
  7.3 Modelling Correlated Failure For Disks
    7.3.1 RAID5 Model assuming Correlated Failure
    7.3.2 Validation of Model against Field Data
    7.3.3 Spanning a RAID group Across Enclosures
8 Conclusions and Future Work
A Reliability Model of a RAID Controller
  A.1 Stiff Models
B Standard RAID Levels
  B.1 RAID0
  B.2 RAID1
  B.3 RAID5
  B.4 RAID6
C Kolmogorov-Smirnov Test
Bibliography

List of Figures

2.1 Diagram of an external storage subsystem; this is provided by a storage vendor
2.2 Diagram of a DAS subsystem; this is provided by a storage vendor
2.3 4 disk RAID5
3.1 Bathtub Curve [WikiBathtub]
5.1 Simple RAID5 model by Rao et al. [KKR06]
5.2 3-state model for disk failure; σ is the burn-in rate, α is the pre burn-in failure rate and β is the post burn-in failure rate [Xin05diskinfant]; X axis shows time in hrs
5.3 Hazard rate function for disk failure assuming post burn-in failure rate < pre burn-in failure rate; X axis shows time in hrs
6.1 Composing in-series components: com1 and com2 are in series and can be replaced by the series equivalent com, as the Markov model at right is equivalent to the model at left. x/x'/x'' stands for the transitions corresponding to other components, i.e. those that are not connected in series with com1 and com2. State 1' is the merged equivalent of states 1 and 2.
6.2 An example of using the series-composition technique
6.3 Composition of all the final states into a single final state
6.4 Enclosure failure rate vs. number of disks inside it
6.5 MTTDIL of a 2 disk RAID1
6.6 MTTDIL of a 4 disk RAID5
6.7 4 enclosures, 24 disks as 4 RAID5 groups
6.8 Neglecting a data inaccessible scenario in hierarchical decomposition
6.9 Decomposition phase for the systems of Figure 6.7
6.10 Aggregation phase for the systems of Figure 6.7
6.11 2 RAID groups, each distributed across multiple enclosures
6.12 4 enclosures, 96 disks as 4 RAID10 groups
6.13 RAID10
6.14 One independent subsystem for the 96 disk configuration of Figure 6.12
6.15 Decomposition and aggregation phase for the subsystem of Figure 6.12(b)
6.16 Final model after applying the decomposition and aggregation from Figure 6.15
6.17 Markov model for a 2-component redundant system; λ and µ are the failure rate and repair rate of a component respectively; State 0: both components working, 1: one component fails, 2: both components fail
6.18 Single controller pair systems
6.19 Aggregation phase: C1 and C2 are controllers; E1 and E2 are expanders; MTTF of RAID-equiv = MTTF of RAID-G1/10
6.20 20 controller pair, 20 enclosure systems
6.21 Discretization approach
7.1 Approximating an increasing function by a series of constant functions
7.2 Approximate vs. Weibull; X axis shows time in hrs
7.3 CDF differences between approximate and Weibull; X axis shows time in hrs
7.4 Hazard rates; approximate vs. Weibull; X axis shows time in hrs
7.5 Approximate disk model based on Gopinath et al. [Gopi10]: one difference is that we consider here a more accurate model that has a transition from the Disk(LSE1) state to the Disk(LSE2) state rather than a transition from Disk(LSE1) to the Disk(Burnt-in) state
7.6 State diagram for a RAID group [Elerath07]
7.7 Timing diagram for Monte-Carlo simulation [Elerath07]
7.8 4-state model for disks
7.9 Approximate vs. Weibull; X axis shows time in hrs
7.10 Approximate vs. Weibull; X axis shows time in hrs
7.11 Traditional m disk fault-tolerant Markov model [Greenan]
7.12 Illustrative multi-disk fault tolerant Markov model with sector errors [Greenan]
7.13 Critical region of first failed disk susceptible to data loss due to latent sector errors [Greenan]
7.14 Comparison of PRISM results with Greenan's simulation results
7.15 Model of an n disk RAID5 in an enclosure assuming correlated disk failure; State 0: all disks working, 1: one disk fails, 2: data loss; λ: disk failure rate, h: probability of unrecoverable error during rebuild, µ: rebuild rate
7.16 A single disk failure model. State 0: working, 1: burnt-in, 2: non-correlated disk failure, 3: correlated disk failure, i.e. two disks fail
7.17 MTTDIL (hr) for a RAID group of 8 disks using the correlated disk failure model; en: enclosure
A.1 Diagram of a single controller; this is provided by a storage vendor
A.2 CTMC model for a controller
A.3 Failover model for a RAID controller; State 0: primary and backup both working, 1: backup fails and primary working, 2: primary fails and backup working, 3: both fail. Here λ is the failure rate of a controller, µ is the repair rate of a controller and ρ is the switchover rate to the backup controller when the primary controller fails
A.4 Transient probability of failure for a RAID controller
B.1 RAID0
B.2 RAID1
B.3 RAID5
B.4 RAID6

List of Tables

1.1 Absolute failures over 18 months of operation for the 3.2 TB storage system [ERR99]
1.2 Relative frequency of hardware replacements for large systems [SchroederG07]
5.1 Disk reliability parameters
5.2 MTTF of other components
6.1 PRISM parameters for our model
6.2 Parameters for the PRISM Simulator
6.3 Model results for some small configurations
6.4 MTTDIL (hr) of an 8 disk RAID5; m: number of enclosures; OOM: Out of Memory Error; indep.: independent
6.5 MTTDIL (hr) of an 8 disk RAID5; m: number of enclosures; t is the threshold of the number of disks in an enclosure after which its MTTF decreases; dep.: dependent
6.6 Cost-reliability trade-offs; enc.: enclosure; MP: multi-pathing and SP: single-pathing; Gain: reliability gain using multi-pathing; Extra Cost: extra cost due to multi-pathing
6.7 Simulation results for the system of Figure 6.7; simulation widths are almost 1% of the point estimator; SMTTDIL: System MTTDIL
6.8 Results of hierarchical decomposition; Dev: deviation of hierarchical decomposition from simulation results, i.e. 100(H − S)/S % where H and S are the results of hierarchical decomposition and simulation respectively; SMTTDIL: System MTTDIL
6.9 Reliability increase of the whole system with respect to a 10% reliability increase of an independent subsystem
6.10 Reliability increase of the whole system with respect to a 10% reliability increase of an enclosure
6.11 Results of hierarchical decomposition; SMTTDIL: System MTTDIL
6.12 MTTF of the components for large storage configurations
6.13 Real-world field data for some large storage configurations
6.14 Model results vs. real-world field data for 480 disk configurations
6.15 Model results vs. real-world field data for 600 disk configurations
6.16 Model results (M) vs. real-world field data (F) for some large storage configurations; Dev.: deviation of model results from field value
7.1 DDF(t) per 1000 RAID groups for a 6 disk RAID5: PRISM model (PRISM DDF(t)) vs. simulation (sDDF(t)) vs. DDF(t) equation (eqDDF(t)) results; sDev = deviation of PRISM results from simulation results; eDev = deviation of PRISM results from DDF(t) equation results; time taken for model checking = 37 sec while time for simulation = 8 min; both PRISM and simulation errors are 1%
7.2 DDF(t) per 1000 RAID groups for an 8 disk RAID5: PRISM model (PRISM DDF(t)) vs. simulation (sDDF(t)) vs. DDF(t) equation (eqDDF(t)) results; sDev = deviation of PRISM results from simulation results; eDev = deviation of PRISM results from DDF(t) equation results; time taken for model checking using symmetry reduction = 3.2 min while time for simulation = 7 min; both PRISM and simulation errors are 1%
7.3 DDF(t) per 1000000 RAID groups for an 8 disk RAID6: PRISM model (PRISM DDF(t)) vs. simulation (sDDF(t)) results; sDev = deviation of PRISM results from simulation results; time taken for model checking using symmetry reduction = 12.6 min while time for simulation = 26 hr; PRISM error is 1% and simulation error is 4%
7.4 DDF(t) per 1000 RAID groups for a 6 disk RAID5: PRISM model (PRISM DDF(t)) vs. simulation (sDDF(t)) vs. DDF(t) equation (eqDDF(t)) results; sDev = deviation of PRISM results from simulation results; eDev = deviation of PRISM results from DDF(t) equation results; time taken for model checking = 4.3 min, almost 8 times higher than the time using the 3-state model; both PRISM and simulation errors are 1%
7.5 DDF(t) per 1000 RAID groups for an 8 disk RAID5: PRISM model (PRISM DDF(t)) vs. simulation (sDDF(t)) vs. DDF(t) equation (eqDDF(t)) results; sDev = deviation of PRISM results from simulation results; eDev = deviation of PRISM results from DDF(t) equation results; time taken for model checking using symmetry reduction = 50 min, almost 16 times higher than the time using the 3-state model; both PRISM and simulation errors are 1%
7.6 Detailed model results vs. field values for the large storage configurations
7.7 Model results vs. real-world field data for 480 disk configurations
7.8 Model results vs. real-world field data for 600 disk configurations

Chapter 1

Introduction

1.1 Motivation

Despite major efforts, both in industry and in academia, achieving high reliability remains a major challenge in large-scale IT systems. It has been found that a large fraction of the total cost of ownership goes to disaster prevention and the cost of actual disasters. With ever larger server clusters, reliability and availability are a growing problem for many sites, including high-performance computing systems and Internet service providers. A particularly big concern is the reliability of storage systems, for several reasons:

- Failure of storage can not only cause temporary data unavailability, but in the worst case lead to permanent data loss.

- Technology trends and market forces may combine to make storage system failures occur more frequently in the future [SOSP05].

- Finally, the size of storage systems in modern, large-scale IT installations has grown to an unprecedented scale with thousands of storage devices, making component failures the norm rather than the exception. For example, the EMC Symmetrix DMX-4 can be configured with up to 2400 disks [EMC], the Google File System cluster is composed of 1000 storage nodes [GFS], and the NetApp FAS6000 series can support more than 1000 disks per node, with up to 24 nodes in a system [NetappFAS].

Due to the large number of failures in storage systems, a number of different redundancy schemes have been developed:

- RAID: RAID, an acronym for Redundant Array of Independent Disks (changed from its original term Redundant Array of Inexpensive Disks), is a technology that provides increased storage functions and reliability through redundancy. This is achieved by combining multiple disk drive components into a logical unit, where data is distributed across the drives in one of several ways called RAID levels. The concept was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987 as Redundant Arrays of Inexpensive Disks [RAIDGibPat]. Marketers representing industry RAID manufacturers later attempted to reinvent the term to describe a redundant array of independent disks as a means of dissociating a low-cost expectation from RAID technology. RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple disk drives. The schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). The various designs of RAID systems involve two key goals: increase data reliability and increase input/output performance. When multiple physical disks are set up to use RAID technology, they are said to be in a RAID array. This array distributes data across multiple disks, but the array is addressed by the operating system as one single disk. A detailed description of the standard RAID levels is given in Appendix B.

- Redundant components: To increase the reliability of the storage system, redundant components such as redundant controllers and redundant expanders are used.

External storage systems are RAID storage systems connected to multiple servers over a storage area network (SAN). They are more efficient than direct attached storage (DAS) systems, where the storage is attached to a single dedicated server. Hence external storage systems are mainly used in large data centres. When used in data centres or in critical server applications, their reliability needs to be guaranteed. These systems consist of many components such as RAID controllers, enclosures, expanders, interconnects and, of course, disks. A failure in any of these components can lead to downtime, data loss, or both. Hence redundant components such as dual controllers and dual expanders are used to make these systems highly available.

To design and build a reliable storage system, it is important to have a model of the storage system that expresses all of its failure characteristics. If we have such a model, then we can perform "what-if" analyses for the storage system. For example:

- We can accurately estimate the storage failure rate, which can help system designers decide how many resources should be used to tolerate failures and to meet certain service-level agreement (SLA) metrics (e.g., data availability).

- We can determine the factors that most impact storage system reliability. This can guide designers to select more reliable components or build redundancy into unreliable components.

- We can evaluate existing resiliency mechanisms and develop better fault-tolerance mechanisms.

1.2 Previous Work

While several studies have been conducted on understanding and modelling disk failures [KKR06, Gopi10, Elerath07, BarrosoFAST07] (we discuss the various disk failure mechanisms and reliability models in Chapter 3), there seems to be little work on analyzing the reliability of overall RAID storage systems.

Patterson et al. [ERR99] analysed the error behavior of a 368 disk, 3.2 TB storage

system. They considered both error instances due to degraded-mode operation (which were recovered by node restarts in most cases) and absolute failures (an absolute failure is when a component needs to be replaced; this event is usually preceded by many error instances). Table 1.1 shows the absolute failures of several components in the storage system they studied. It shows that, of all the absolute hardware failures,

Component            Total in System    Total Failed (Absolute Failures)    % Failed
SCSI Controller      44                 1                                   2.3%
SCSI Cable           39                 1                                   2.6%
SCSI Disk            368                7                                   1.9%
IDE Disk             24                 6                                   25.0%
Disk Enclosure       46                 13                                  28.3%
Enclosure Power      92                 3                                   3.26%
Ethernet Controller  20                 1                                   9.8%
Ethernet Switch      1                  1                                   50%
Ethernet cable       42                 1                                   2.3%

Table 1.1: Absolute failures over 18 months of operation for the 3.2 TB storage system [ERR99]

disks were the most reliable even though there were more data disks in the system than any other component. The enclosures that house these disks were among the least reliable in the system. The reason enclosures failed frequently was backplane/midplane failure. On the other hand, components such as the motherboard, power supply and memory modules did not fail at all. This study gives some information about the failure of components in a storage system. The enclosure being the least reliable of all the components is supported by our model inputs from a storage vendor, as the enclosure has the lowest MTTF of all the components.

Schroeder and Gibson [SchroederG06, SchroederG07] analysed failures in high-performance computing systems in two separate studies:

- The first study considers failures of any kind (not necessarily storage system failures) in an HPC system. A key finding of the study was that hardware failures were the largest contributor to system failures, the main causes of hardware failure being memory and CPU failures.

- The second study is mainly about disk failures. By disk failure they mean disk replacement in the field as experienced by customers. The study also gives a relative comparison of disk replacement frequency with that of other hardware components. Table 1.2 shows the relative frequency of hardware component replacements for the ten most frequently replaced components in three different systems: HPC1, COM1 and COM2. HPC1 is an HPC cluster; COM1 and COM2 are Internet service providers.

Both studies provide an overall picture of component failures in a large system.

Jiang et al. [JiangHZK08] presented an analysis of NetApp AutoSupport logs collected from about 39,000 storage systems commercially deployed at various customer sites. The dataset covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. This report classifies storage subsystem failures into four failure types based on their symptoms and root causes:

- Disk failure: This type of failure is triggered by the failure mechanisms of the disks themselves. Imperfect media, media scratches caused by loose particles, rotational vibration, and many other factors internal to a disk can lead to this type of failure. Sometimes the storage layer proactively fails disks based on statistics collected by on-disk health monitoring mechanisms (e.g., a disk has experienced too many sector errors); these incidents are also counted as disk failures.

- Physical interconnect failure: This type of failure is triggered by errors in the networks connecting disks and storage heads. It can be caused by host adapter failures, broken cables, shelf enclosure power outages, shelf backplane errors, and/or errors in shelf FC drivers. When physical interconnect failures happen, the affected disks appear to be missing from the system.

- Protocol failure: This type of failure is caused by incompatibility between the protocols in disk drivers or shelf enclosures and the storage heads, and by software bugs in the disk drivers. When this type of failure happens, disks are visible to the storage layer but I/O requests are not correctly responded to by the disks.

- Performance failure: This type of failure happens when the storage layer detects that a disk cannot serve I/O requests in a timely manner while none of the previous three types of failure is detected. It is mainly caused by partial failures, such as unstable connectivity or disks heavily loaded with disk-level recovery (e.g., broken sector remapping).

System   Component          %
HPC1     Hard drive         30.6
         Memory             28.5
         Misc/Unk           14.4
         CPU                12.4
         PCI motherboard    4.9
         Controller         2.9
         QSW                1.7
         Power Supply       1.6
         MLB                1.0
         SCSI BP            0.3
COM1     Power Supply       34.8
         Memory             20.1
         Hard drive         18.1
         Case               11.4
         Fan                8.0
         CPU                2.0
         SCSI Board         0.6
         NIC Card           1.2
         LV Power Board     0.6
         CPU heatsink       0.6
COM2     Hard drive         49.1
         Motherboard        23.4
         Power Supply       10.1
         RAID card          4.1
         Memory             3.4
         SCSI cable         2.2
         Fan                2.2
         CPU                2.2
         CD-ROM             0.6
         Raid Controller    0.6

Table 1.2: Relative frequency of hardware replacements for large systems [SchroederG07]

The important findings of the study were:

- Physical interconnect failures, i.e. failures of components other than disks, make up the largest part (27-68%) of storage subsystem failures. Disk failures make up the second largest part (20-55%), whereas protocol failures and performance failures contribute 5-10% and 4-8% of storage subsystem failures respectively. Choices of disk types, shelf enclosure models and other components of storage subsystems contribute to the variability.

- Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlation. In addition, these failures exhibit bursty patterns.

- Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect.

- Spanning the disks of a RAID group across multiple enclosures provides a more resilient solution for storage subsystems than keeping them within a single enclosure.

Components other than disks being the largest contributor to storage subsystem failures is the main motivation for our work.

Another recent study by Ford et al. [FordLPSTBGQ10] characterizes the end-to-end data availability properties of cloud storage systems at the distributed file system level, based on an extensive one-year study of Google's main storage infrastructure. This study also presents statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. A key point of this report is the importance of modelling correlated failures when predicting availability; it shows their impact under a variety of replication schemes and placement policies. The basic building blocks of these systems are commodity servers, which are organized into racks (around 40 servers) that in turn are interconnected into clusters. The report considers failures of CPUs, DRAM and disks at the server level, shared networking or power failures at the rack level, and cluster interconnect failures and failures of the automated management system (which takes care of upgrades) at the cluster level. In this study, one file system instance is referred to as a cell, and the storage server programs running on physical machines in a data centre, managing local disk storage on behalf of the distributed storage cluster, are called nodes. The MTTF of a node was found to be very low (4.3 months), and planned machine reboots (e.g. kernel version upgrades) were found to be the major cause of node unavailability.

In these systems, resilience to failure is increased by using replication or erasure encoding across nodes. In both cases data is divided into a set of stripes, each of which comprises fixed-size data/code blocks called chunks. The Markov model presented in the study takes the chunk failure rate (which includes the node failure rate and disk failure rate) and the chunk recovery rate as model inputs and calculates the stripe MTTF, which is a complex function of the individual node availability, the encoding scheme used, the distribution of correlated node failures, chunk placement, and recovery times. One important finding of the Markov model analysis was that policies where data is spread across sources of correlation, such as rack-aware placement policies (no two chunks in a stripe are placed on nodes in the same rack) and multi-cell replication schemes, increase the stripe MTTF significantly. The study shows that components below the server (node) level (disk failure rate or latent sector error rate) do not contribute much to data availability. The existence of correlated failures in a large system is the most important finding of this study. In our work we also assume correlated failures and model them accordingly.


While previous work provides a good understanding of storage failure characteristics (such as hardware failures being the most important contributor to the failure of a storage system), it is not enough: to design a reliable RAID storage system we need to model all the RAID components, and previous work, to the best of our knowledge, has not considered such models. We also need to consider manual component repairs in the model, which were absent in the statistical models presented by Ford et al. [FordLPSTBGQ10].

1.3 Contributions of the Thesis

A storage system designer needs to answer questions such as the following:

- Given some components, what are the most reliable configurations that can be built from them?

- Given two configurations, which is the more reliable, and how much more reliable is it than the other?

- How can the reliability of a given system be increased by a certain amount? Which component's reliability should be increased, and by how much, to obtain that increase for the whole system?

We show that all these questions can be answered by building and analysing an accurate model of the whole system.

In this work, we model all the components of a RAID storage system (controllers, enclosures, expanders, interconnects and disks) with a failure rate and repair rate. We assume an exponential failure distribution for all the components in the system except disks. For disks, we initially assume a simple 3-state model [Xin05diskinfant] and later use a Weibull model [Elerath07]. We approximate the Weibull model using exponentials and show that this model gives almost the same results as sequential Monte-Carlo simulation methods [Elerath07] for disk subsystems.


However, the results of using this detailed Weibull model do not agree well with the field data for the RAID configurations we use for validation. Hence we infer correlated failures in such configurations and revise our models by estimating and including correlated failures. Since such models are computationally difficult, we use hierarchical decomposition techniques; we are then able to model large RAID configurations with up to 600 disks and 40 controllers.

The hierarchical decomposition technique assumes that a subsystem, which is decomposed from the main system to be modelled separately, has a constant failure rate. For some special kinds of systems, where there are subsystems not connected to each other (i.e. they do not share any common components), we use two other techniques to relax the constant failure rate assumption of hierarchical decomposition:

- Calculating the mean time to failure of the whole system by simulating a single subsystem at a time rather than simulating the whole system.

- Calculating the mean time to failure of the whole system from the transient probability of failure of each individual subsystem using a discretization approach (a sketch of this idea follows below).
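As a minimal sketch of the second idea (the function name, time grid and trapezoidal integration step here are illustrative, not the exact procedure used later in the thesis), assuming the subsystems are independent and that the whole system is considered failed as soon as any one subsystem fails:

    import numpy as np

    def system_mttf(times, subsystem_fail_probs):
        """Estimate whole-system MTTF from per-subsystem transient failure
        probabilities F_i(t) (e.g. obtained from PRISM transient analysis),
        assuming independent subsystems and that the system fails as soon as
        any one subsystem fails."""
        # Whole-system survival = product of the subsystem survival functions.
        system_survival = np.ones_like(times, dtype=float)
        for F in subsystem_fail_probs:
            system_survival *= (1.0 - np.asarray(F))
        # MTTF = integral of the survival function, discretized (trapezoid rule).
        return np.trapz(system_survival, times)

    # Illustrative use: two exponential subsystems with MTTFs of 1e5 and 2e5 hours.
    t = np.linspace(0.0, 2e6, 20001)
    F1 = 1.0 - np.exp(-t / 1e5)
    F2 = 1.0 - np.exp(-t / 2e5)
    print(system_mttf(t, [F1, F2]))   # ~ 1/(1/1e5 + 1/2e5) = 66666.7 hours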

We perform the following analyses using our models:

- Sensitivity analysis: From the model of a RAID system, we can determine the components that contribute most to its failure. Our results show that, in most cases, the enclosure is the main contributing component in the failure of a storage system. To increase the reliability of a storage system, we can increase the reliability of the critical components appropriately to raise the reliability of the whole system to the level the user requires. If we have an accurate model of an existing system, then we know this appropriate amount, i.e. how much the reliability of a critical component needs to be increased to obtain a whole-system reliability that satisfies the user's requirements.

- Designing reliable configurations: Our results show that spanning a RAID group across enclosures may increase or decrease reliability. We therefore present an algorithm for optimally designing an n disk RAID group that tolerates f faults across m enclosures, given that the enclosure is the main contributing component in the failure of RAID systems (a rough sketch of such a placement rule is given after this list). Moreover, when enclosure MTTF depends on the number of disks, our results show that this algorithm is not useful and we need model results to predict the reliable configuration. Where correlated disk failures exist, our results show that the model is sensitive to two parameters: the correlated failure probability and the enclosure failure rate. We show that the decision to span a RAID group across enclosures depends on the enclosure failure rate as well as the correlated failure probability. For example, for a certain enclosure failure rate and correlated failure probability, spanning, i.e. using a larger number of enclosures, yields the more reliable configuration, whereas for some other enclosure failure rate and correlated failure probability a configuration with fewer enclosures is predicted to be more reliable.

- Cost-reliability trade-offs: While redundancy increases the reliability of a storage system, the size of the increase depends on the configuration. For example, redundancy increases reliability by a factor of 1000 or more in some systems, whereas in others the increase is much smaller. Such factors are important for cost-reliability trade-off analyses. Note that in many scenarios it may be possible to predict which of two configurations is more reliable just by looking at them; but if we have a model for the two configurations, then we can also predict the difference between their reliabilities, which is very important for cost-reliability trade-off analyses.
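As a rough illustration of the kind of enclosure-aware placement rule referred to in the "Designing reliable configurations" item above (this is a hypothetical rendering under the stated assumption, not necessarily the algorithm presented later in the thesis): if a RAID group tolerates f disk faults and the enclosure is the dominant failure source, no enclosure should hold more than f of the group's n disks, which requires at least ceil(n/f) enclosures.

    import math

    def spread_raid_group(n_disks, fault_tolerance, n_enclosures):
        """Hypothetical placement rule: spread an n-disk RAID group that
        tolerates `fault_tolerance` disk faults across enclosures so that a
        single enclosure failure never removes more disks than the group can
        tolerate. Returns disks per enclosure, or None if infeasible."""
        if n_enclosures < math.ceil(n_disks / fault_tolerance):
            return None  # some enclosure would have to hold more than f disks
        per_enclosure = [n_disks // n_enclosures] * n_enclosures
        for i in range(n_disks % n_enclosures):  # spread the remainder evenly
            per_enclosure[i] += 1
        return per_enclosure

    # Example: an 8-disk group tolerating 2 faults needs at least 4 enclosures.
    print(spread_raid_group(8, 2, 4))   # [2, 2, 2, 2]
    print(spread_raid_group(8, 2, 3))   # None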

1.4 Outline of the Thesis

The rest of the thesis is organized as follows:

Chapter 2 describes the components present in storage systems and presents some configurations.

Chapter 3 provides some background on disk failure mechanisms and disk reliability models.

Chapter 4 provides some reliability-related definitions.

Chapter 5 lists reliability measures, model inputs and modelling assumptions.

Chapter 6 presents several modelling techniques. It shows the modelling of RAID storage systems using a simple disk model and the "what-if" analyses that we perform using our models. It also explains the hierarchical decomposition technique, and the simulation and discretization approaches used to calculate the reliability measures of some large storage configurations.

Chapter 7 presents detailed modelling of RAID disk subsystems assuming a Weibull model and correlated failures. It also discusses validation of our model against field data.

Chapter 8 presents conclusions and possible future work.

Chapter 2

Storage System Architecture

In this chapter, we detail the typical architecture of external storage systems. An external storage system differs from direct attached storage (DAS) in that it is connected to multiple hosts through a storage area network (SAN) rather than being dedicated to only one host. Figures 2.1 and 2.2 show diagrams of an external storage system and a DAS storage system respectively.

2.1 Main Components Present in Storage Systems

The storage systems we consider here consist of the following components:

Figure 2.1: Diagram of an external storage subsystem; this is provided by a storage vendor

Figure 2.2: Diagram of a DAS subsystem; this is provided by a storage vendor

- RAID Controller: The RAID controller is usually a pair of controllers, one acting as the primary and the other acting as the backup. If the primary controller fails, a switchover occurs to the backup controller. For load balancing, one controller is the primary for half the disks in the system while also acting as the secondary for the rest of the disks, and vice versa for the other controller. We use the term controller pair (or RAID controller) to denote both controllers together, and the term controller to denote a single controller. Each controller consists of several components: XOR ASIC, control processor, internal control, processor memory, host interface cards (providing the front end), SAS controller (providing the back end), DDR SDRAM data cache, battery backup, etc. Of these components, battery failure and cache failure do not cause the controller to fail. The system continues to work without the battery; in such a case, the user is advised to disable write-back caching and switch to write-through caching. If the user still continues with the write-back cache, it is made known that they risk data loss in case of a power outage. If the SSD cache fails, then caching is disabled. Moreover, fatal errors in the firmware code (TLB exceptions) are edge cases: a restart of the controller fixes them.

- Enclosure: All the disks reside in a component called an external storage enclosure, for expandability and portability. In each enclosure there are several components, such as redundant power supplies/cooling fans and a midplane, shared by all the disks. An enclosure fails if both power supplies fail, both cooling fans fail, or the midplane fails. As enclosure components are shared by all the disks inside, an enclosure's reliability depends on the number of disks inside it. This is true not only for the enclosures of the storage vendor from which we obtained field data but also for enclosures from other vendors such as Wiebetech [Wiebe].

- Expander: An expander gives large storage environments the ability to connect multiple targets and initiators through a switched device, providing scalability and fault-tolerant path redundancy to improve system reliability, which is ideal for today's data centres and storage subsystems.

- Interconnects: The interconnects are generally SAS cables connecting the components in the system.

- Disks: The disks in the system are SAS (Serial Attached SCSI) or SATA disks. The SAS controllers used in this system support both SAS and SATA disks.

A brief note on the SAS (Serial Attached SCSI) interface: Serial Attached SCSI (SAS) is a computer bus used to move data to and from computer storage devices such as hard drives and tape drives. SAS depends on a point-to-point serial protocol that replaces the parallel SCSI bus technology that first appeared in the mid 1980s in data centres and workstations, and it uses the standard SCSI command set. SAS offers backwards compatibility with second-generation SATA drives.

2.2 Some Storage System Configurations

Figures 2.3(a) and 2.3(b) show a 4 disk RAID5 group in one enclosure and across 2 enclosures respectively. There are multiple types of redundancy present:

- redundant controllers, interconnects and expanders

- redundant disks (RAID)

- redundant enclosures (with RAID groups spanned across them)

Figure 2.3: 4 disk RAID5; (a) 1 enclosure, (b) 2 enclosures

While the first two redundancy mechanisms clearly increase the reliability of the system, analysis is needed to know when spanning is beneficial.

Chapter 3

Background on Disk Failure

3.1 Disk Failure Mechanisms

In computing, a hard-disk failure occurs when a hard disk drive malfunctions and the stored information cannot be accessed with a properly configured computer. Such failures can be classified as follows [Elerath07]:

3.1.1 Operational Failures

The inability to find data is most often caused by operational failures, which can occur any time the HDD's disks are spinning and the heads are staying on track. They can have the following causes:

- Head crash: The most notorious cause of hard-disk failure is a head crash, where the internal read-and-write head of the device, usually just hovering above the surface, touches a platter or scratches the magnetic data-storage surface. A head crash usually incurs severe data loss, and data recovery attempts may cause further damage if not done by a specialist with proper equipment. Hard-drive platters are coated with an extremely thin layer of non-electrostatic lubricant, so that the read-and-write head will simply glance off the surface of the platter should a collision occur. However, this head hovers mere nanometers from the platter's surface, which makes a collision an acknowledged risk.

- Bad servo-track: Heads must read servo wedges that are permanently recorded onto the media during the manufacturing process and cannot be reconstructed with RAID if they are destroyed. These segments contain no user data, but provide information used solely to control the positioning of the read/write heads for all movements. If servo-track data is destroyed or corrupted, the head cannot correctly position itself, resulting in loss of access to user data even though the user's data is uncorrupted. Servo tracks can be damaged by scratches or thermal asperities.

- Can't stay on track: Tracks on an HDD are never perfectly circular. The present head position is continuously measured and compared to where it should be, and a position error signal is used to properly reposition the head over the track. This repeatable run-out is all part of normal HDD head-positioning control. Non-repeatable run-out caused by mechanical tolerances from the motor bearings, excessive wear, actuator arm bearings, noise, vibration and servo-loop response errors can cause the head positioning to take too long to lock onto a track and ultimately produce an error. High rotational speeds exacerbate this mechanism in both ball and fluid-dynamic bearings.

- SMART limit exceeded: HDDs use self-monitoring, analysis and reporting technology (SMART) to predict impending failure based on performance data. For example, data reallocations are expected and many spare sectors are available on each HDD, but an excessive number in a specific time interval will exceed the SMART threshold, resulting in a SMART trip.

- Bad electronics: Currently, most head failures are due to changes in magnetic properties. Electrostatic discharge (ESD), physical impact (contamination), and high temperatures can accelerate magnetic degradation. ESD-induced degradation is difficult to detect and can propagate to full failure when exposed to localized heat from thermal asperities (T/As). The HDD electronics are attached to the outside of the HDD. DRAM and cracked chip capacitors have also been known to cause failure.

3.1.2 Latent Defects

Data is sometimes written poorly initially, but it can also be corrupted after being written. Unless corrected, missing and corrupted data become latent defects.

Error During Writing

- Bad media: Writing on scratched, smeared, or pitted media can result in corrupted data. Scratches can be caused by loose hard particles (TiW, Si2O3, C) becoming lodged between the head and the media surface. Smears, caused by soft particles such as stainless steel and aluminum, will also corrupt data. Pits and voids are caused by particles that were originally embedded in the media during the sputtering process and subsequently dislodged during the final processing steps, the polishing process to remove embedded contaminants, or field use. Hydrocarbon contamination (machine oil) on the disk surface can result in write errors as well.

- Inherent bit-error rate: The bit-error rate (BER) is a statistical measure of the effectiveness of all the electrical, mechanical, magnetic, and firmware control systems working together to write (or read) data. Most bit-errors occur on a read command and are corrected, but since written data is rarely checked immediately after writing, bit-errors can also occur during writes. BER accounts for some fraction of defective data written to the HDD, but a greater source of errors is the magnetic recording media coating the disks.

- High-fly writes: A common cause of poorly written data is the high-fly write. The heads are aerodynamically designed to have a negative pressure and maintain a small, fixed distance above the disk surface at all times. If the aerodynamics are perturbed, the head can fly too high, resulting in weakly (magnetically) written data that cannot be read. All disks have a very thin film of lubricant on them as protection from head-disk contact, but lubrication build-up on the head can increase the flying height.

Data Written but Destroyed

Most RAID reliability models assume that data will remain undestroyed except by degradation of the magnetic properties of the media (bit-rot). While it is correct that media can degrade, this failure mechanism is not a significant cause. Data can become corrupted any time the disks are spinning, even when data is not being written to or read from the disk. Three common causes of erasure are thermal asperities, corrosion, and scratches/smears.

- Thermal asperities: Thermal asperities are instances of high heat for a short duration caused by head-disk contact. This is usually the result of heads hitting small bumps created by particles embedded in the media surface during the manufacturing process. The heat generated by a single contact may not be sufficient to thermally erase data, but may be sufficient after many contacts.

- Corrosion and scratched media: Heads are designed to push particles away, but contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD, such as Al2O3, TiW, and C, can cause surface scratches and data erasure any time the disk is rotating. Other soft materials such as stainless steel can come from assembly tooling. Soft particles tend to smear across the surface of the media, rendering the data unreadable. Corrosion, although carefully controlled, can also cause data erasure and may be accelerated by T/A-generated heat.

Scrubbing: Latent defects can be reduced by a technique called data scrubbing [SchwarzMASCOTS]. During scrubbing, data on the HDD is read and checked against its parity bits even though the data is not being requested by the user. The corrupt data is corrected, bad spots on the media are mapped out, and the data is saved to good locations on the HDD. Since this is a background activity, it may be rather slow so that it does not impede performance. Depending on the foreground I/O demand, the scrub time may be as short as the maximum HDD and data-bus transfer rates permit, or may be as long as weeks.

3.2 Disk Reliability Models

Disk failure modelling is an area that has not been understood well enough in past years. This is clear from the several contradictory findings regarding disk failure models in the previous literature. In the following we describe several disk reliability models.

3.2.1 Bathtub Curve

According to this model, hard-drive failures tend to follow the concept of the bathtub curve (Figure 3.1). It describes a particular form of the hazard function that comprises three parts:

- The first part (0-1 yr) is a decreasing failure rate, known as early failures, due to defects in the manufacturing process (infant mortality).

- The second part (2-5 yr) is a constant failure rate, known as random failures.

- The third part (> 5 yr) is an increasing failure rate, known as wear-out failures.

Figure 3.1: Bathtub Curve [WikiBathtub]

This has led the International Disk Drive Equipment and Materials Association (IDEMA) to propose a more sophisticated way to measure disk drive reliability, using four different MTBF values for disks aged 0-3 months, 3-6 months, 6-12 months, and one year to the end of the design life span.

3.2.2 Increasing Hazard Rate Model

The study by Schroeder and Gibson [SchroederG07] shows that disk failure rates rise significantly over the years, even during the early years of the lifecycle. Failure rates nearly double when moving from year 2 to 3 or from year 3 to 4. This observation suggests that wear-out may start much earlier than expected, leading to steadily increasing failure rates during most of a system's useful life. This is interesting because it does not agree with the common assumption that, after the first year of operation, failure rates reach a steady state for a few years, forming the bottom of the bathtub. Hence the under-representation of the early onset of wear-out is a much more serious factor than the under-representation of infant mortality, and the authors recommend including this in new IDEMA standards. This study also shows that the Weibull distribution provides the best fit for the time between disk replacements.

The study by Elerath et al. [Elerath07] also shows that the time to operational failure for disks can be modelled by a Weibull distribution with an increasing failure rate.
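For reference, a common parametrization of the Weibull hazard rate that such studies fit (the symbols below are generic, not necessarily the exact notation of [SchroederG07] or [Elerath07]) is

    h(t) = (β/η) · (t/η)^(β−1)

with scale η and shape β: the hazard rate is increasing in t when β > 1 (wear-out behaviour), constant when β = 1 (the exponential special case), and decreasing when β < 1 (infant mortality).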


The study by Jiang et al. [JiangHZK08] shows that the disk failure distribution is better represented by a Gamma distribution with an increasing hazard rate.

Chapter 4

Reliability Related Definitions

4.1 Reliability

Let the random variable X be the lifetime or the time to failure of a component. The probability that the component survives until some time t is called the reliability R(t) of the component. Thus

R(t) = P(X > t) = 1 − F(t)

where F is the cumulative distribution function (CDF) of the component lifetime X. The component is normally (but not always) assumed to be working properly at time t = 0 [i.e. R(0) = 1], and no component can work forever without failure [i.e. lim_{t→+∞} R(t) = 0]. Also, R(t) is a monotone decreasing function of t. For t less than zero, reliability has no meaning, but we let R(t) = 1 for t < 0. F(t) is often called the unreliability.

If f(t) is the probability density function for the time to failure of a component, then f(t)∆t is the (unconditional) probability that the component will fail in the interval (t, t + ∆t]. However, if we have observed the component functioning up to some time t, we expect the (conditional) probability of its failure to be different from f(t)∆t. This leads us to the notion of the instantaneous failure rate (or hazard rate).


4.2 Hazard Rate

Notice that the conditional probability that the component does not survive for an (additional) interval of duration x, given that it has survived until time t, can be written as:

G_Y(x|t) = P(t < X ≤ t + x) / P(X > t) = (F(t + x) − F(t)) / R(t)

The instantaneous failure rate or hazard rate h(t) at time t is defined to be

h(t) = lim_{x→0} (F(t + x) − F(t)) / (x R(t)) = lim_{x→0} (R(t) − R(t + x)) / (x R(t))

so that

h(t) = f(t) / R(t)

Thus h(t)∆t represents the conditional probability that a component having survived to age t will fail in the interval (t, t + ∆t).
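As a quick worked example (ours, for illustration): for an exponentially distributed lifetime with rate λ, f(t) = λe^(−λt) and R(t) = e^(−λt), so h(t) = f(t)/R(t) = λ for all t. This constant hazard is exactly the assumption made later for the non-disk components, which are specified only by an MTTF = 1/λ.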

Chapter 5 Reliability Measures, Model Inputs and Model Assumptions

Definition 1. [RAID5 reliability] A RAID5 group will experience data inaccessibility or data loss (DIL) if
1. data in any two disks in the RAID5 group are inaccessible (the data in a disk is said to be inaccessible if some component in its access path fails or the disk itself fails), or
2. the data of one disk is inaccessible and an unrecoverable error is discovered during rebuild of the data.

Most of our studies consider RAID5 groups, for which we use a simple RAID5 model similar to the one given by Rao et al. [KKR06] (Figure 5.1). Note that it may appear from the definition of h that it can be > 1, but h = 1 − (1 − HER)^((d−1)C) ≈ (d−1)·C·HER because HER ≪ 1. Hence h is always less than 1.
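For a sense of scale (our arithmetic, with d = 8 chosen purely as an illustrative group size): using the Table 5.1 values HER = 8E-06 per GB and C = 500 GB, h ≈ (d − 1)·C·HER = 7 × 500 × 8E-06 = 0.028, i.e. roughly a 2.8% chance of encountering an unrecoverable error during a rebuild, comfortably below 1.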

Figure 5.1: Simple RAID5 model by Rao et al. [KKR06]

We assume the use of hot spare drives: a drive physically installed in the array that remains inactive until an active drive fails, at which point the system automatically replaces the failed drive with the spare and rebuilds the array with the spare drive included (in the case of RAID5, rebuild involves reading the rest of the drives, calculating the data of the failed disk using the XOR operation and copying it to the spare disk). This reduces the mean time to recovery, though it does not eliminate it completely. Subsequent additional failure(s) in the same RAID redundancy group before the array is fully rebuilt can result in loss of data; rebuilding can take several hours, especially on busy systems. Rapid replacement of failed drives is important because the drives of an array will all have had the same amount of use and may tend to fail at about the same time rather than randomly.

We can extend the above definition to RAID6 and other RAID systems by modifying the number of data disks that need to be inaccessible.


5.1 Reliability Measures

For a system with only one RAID group there is only one reliability metric, MTTDIL, defined as the mean time before the RAID group experiences data inaccessibility or data loss (DIL). Generally, a RAID5 group consists of 6-8 disks because a larger number of disks increases the chance of an unrecoverable sector error during rebuild. Hence one RAID5 group may not be sufficient to store a huge amount of data, and multiple RAID5 groups are needed in a storage system. For systems with multiple RAID groups accessed by multiple users we define three reliability metrics:

Definition 2. [MTTDIL] Mean time before any of the RAID groups experiences data inaccessibility or data loss (DIL). This denotes the average time before at least one user of the system experiences data unavailability.

Definition 3. [k% System MTTDIL+R] Mean time before k% of all the RAID groups experience data inaccessibility or data loss even with repair. With k = 50, this is the system half-time when repair is possible from the data-inaccessible state.

Definition 4. [k% System MTTDIL-R] Mean time before k% of all the RAID groups experience data inaccessibility or data loss without any repair. With k = 50, this is the system half-time without repair.

In a multiple-RAID system, the first measure (MTTDIL) is not sufficiently informative. Consider two users A and B where the disks for A are in very highly unreliable enclosures while those for B are not. Although the MTTDIL of the system is very low, B experiences good reliability (as the RAID groups used by B experience high MTTDIL). The k% System MTTDIL-R metrics can reflect this information because they consider failures beyond the first data inaccessibility or data loss event. Additionally, the 100% System MTTDIL-R metric provides an upper bound on the MTTDIL experienced by any user in the system when there is no component repair facility.


Generally, in the recent storage reliability literature, the transient probability of data loss and the number of double disk failures by time t (DDF(t)) [Elerath07] are considered better reliability metrics than MTTDIL [GreenanHTSRG10]. In spite of this, we have chosen the mean as our reliability measure because the field data available to us use mean values. We calculate the transient probability of failure only for the reliability analysis of a RAID controller (Appendix A). Note that the detailed model of a RAID controller has not been used in the rest of the models for the whole system; we mention it in the thesis because of the stiffness problem we faced when modelling a RAID controller and how we recovered from it.

5.2 Modelling Assumptions

Due to the lack of all relevant disk failure information from field data and access to only the MTTF values of all the other components in our study, our modelling assumptions are:

• Initially, we start with the assumption that each component has failure modes independent of other components. Later, we consider correlated failures for disks in our model and show that the model results match the field data available for some storage systems.

• For a disk, we assume at the start a simple 3-state Markov model given by Qin et al. [Xin05diskinfant] with a burn-in rate, a pre-burn-in failure rate and a post-burn-in failure rate (Figure 5.2). This model incorporates the bathtub curve in some sense because it reflects the higher failure rates of new disk drives and a lower, constant failure rate during the remainder of the design life span. It does not consider the wear-out phase, assuming that a typical disk is obsolete before its failure probability starts to climb, in part because newer drives have much higher capacity and performance; thus the tail of the bathtub curve is of little practical importance. Later we consider more detailed models, such as Weibull models. We assume a constant failure rate for all other components.

• We also assume a constant repair rate for all the components.

Figure 5.2: 3-state model for disk failure; σ is the burn-in rate, α is the pre-burn-in failure rate and β is the post-burn-in failure rate [Xin05diskinfant]; X axis shows time in hrs

5.3 Model Inputs

Table 5.1 shows the disk reliability parameters taken from the previous literature [KKR06, SchroederG07]. Table 5.2 shows the MTTF values of other components obtained from storage vendors. These data have been generously given to us by a storage vendor on the condition that they not be attributed to them. The enclosure type we consider here can contain at most 24 disks. We have used a Mean Time to Repair (MTTR) of 30 min for a non-critical component based on some inputs from industry. However, only a single repair person is present for the whole system. We do not consider any switchover time from a primary component to the backup component for a multipathing system, assuming that switchover is very fast.

By disk failure we mean disk replacement based on field-replacement data, which is different from the vendor-quoted MTTF [SchroederG07].

Parameters                     Value
MTTF                           33 yr
burn-in rate                   0.00054668 /hr
pre-burn-in failure rate       1.05849359E-5 /hr
post-burn-in failure rate      3.3917198E-6 /hr
Mean Time To Reorganization    30 hr
Hard Error Rate                8E-06 /GB
HDD size                       500 GB

Table 5.1: Disk reliability parameters

Components      MTTF value
Controller      604440 hr
Expander        2560000 hr
Enclosure       28400 hr if ≤ 50% full, else 11100 hr
Interconnect    200000 hr

Table 5.2: MTTF of other components

Vendor-specified AFR or MTTF is calculated by constantly running samples of the drive for a short amount of time (accelerated life and stress tests), analyzing the resultant wear and tear upon the physical components of the drive, and extrapolating to provide a reasonable estimate of its lifespan. Since this fails to account for phenomena such as head crashes, external trauma (dropping or collision), power surges, and so forth, the MTTF number is not generally regarded as an accurate estimate of a drive's lifespan. On the other hand, field-replacement data is based on the disk conditions that lead a customer to treat a disk as permanently failed and to replace it. For example, a common way for a customer to test a drive is to read all of its sectors to see if any reads experience problems, and to decide that the drive is faulty if any one operation takes longer than a certain threshold. The outcome of such a test will depend on how the thresholds are chosen. As a result, in most cases a customer may declare a disk faulty while its manufacturer sees it as healthy. Additionally, the failure rate of a drive depends on the operating conditions, i.e. environmental factors such as temperature and humidity, data center handling procedures, workloads, and duty cycles or powered-on-hours patterns.

Figure 5.3: Hazard rate function for disk failure assuming post-burn-in failure rate < pre-burn-in failure rate; X axis shows time in hrs

Hence we assume an annual disk failure rate of 3.01%, which was the average ARR (annual replacement rate) over all data sets in the study by Garth et al. [SchroederG07], whereas the vendor-quoted AFR (annual failure rate) ranges from 0.58% to 0.88%. We calculate a mean of almost 33 yr (approximately 300000 hr) from the AFR of 3.01%. Then we find the disk failure rate parameters of the 3-state model from the model parameters used by Qin et al. after appropriate scaling; these are shown in Table 5.1. Figure 5.3 shows the hazard rate function for the fail state using the above input parameters.
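As a sanity check (our arithmetic, not from the thesis): an AFR of 3.01% per year corresponds to an MTTF of about 8760/0.0301 ≈ 291000 hr ≈ 33 yr. The scaled 3-state parameters reproduce this mean: assuming the fail state is absorbing and the disk starts in the pre-burn-in state, MTTF = 1/(σ + α) + (σ/(σ + α))·(1/β) ≈ 1794 + 0.981 × 294837 ≈ 291000 hr with the rates of Table 5.1.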

The above information is consistent with the little information we have obtained from CERN. According to them, disk MTTF is about 350000 hr based on some 60000 installed HDDs. These are SATA drives of enterprise quality (24 x 7 x 52 uptime) with a vendor-quoted MTTF of greater than 1 million hours in their environment. The disks are spinning all the time and have moderate I/O usage with regular peaks. Some failure analysis points strongly to vibration as the main failure reason. In addition they of course have batches of bad disks (manufacturing problems, only partly accounted for in the quoted MTTF).

CFDR: The computer failure data repository (CFDR), under the USENIX Association, aims at accelerating research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems. Unfortunately, the repository contains very little data on the failure of storage subsystem components other than disks. Moreover, neither a system diagram nor the MTTF of the components of the systems for which failure data is recorded is available. Hence we cannot use these data for modelling purposes.

For all the computational results reported in this thesis, we have used a machine with a 2.8 GHz processor, 8 GB RAM and 16 GB of swap space.

Chapter 6 Modelling Techniques

The systems we consider have some kind of combinatorial structure, i.e. failure of the system can be described in terms of the structural relationships between the system components. But the existing combinatorial models in the literature (such as reliability block diagrams, reliability graphs and fault trees) are not useful here for the following reasons:

• These models assume independent failure and independent repair of components. Our systems have a shared repair person, and later we model correlated failures.

• We need state-space models to model a RAID group.

• An enclosure is a component that hosts expanders, disks and interconnects. The connection of an enclosure to other components is not representable by any graph of the system in which each node (or edge) corresponds to one component.

Therefore we need to use state-space models to analyse these kinds of systems.

Continuous Time Markov Chains (CTMC) have been used widely to build reliability models of systems. Here, we use a model checking tool, PRISM (Probabilistic Symbolic Model Checker) [PRISM], to build and analyse the CTMC models. The tool also has a discrete-event simulator engine.


6.1 PRISM

6.1.1 PRISM Model Checker

PRISM is a probabilistic model checker, a tool for the modelling and analysis of systems that exhibit probabilistic behaviour. Probabilistic model checking is a formal verification technique. It is based on the construction of a precise mathematical model of the system to be analysed. Properties of this system are then expressed formally in temporal logic and automatically analysed against the constructed model.

Models are supplied to the tool by writing descriptions in the PRISM language, a simple, high-level modelling language. The fundamental components of the PRISM language are modules and variables. A model is composed of a number of modules that can interact with each other. A module contains a number of local variables, whose values at any given time constitute the state of the module. The global state of the whole model is determined by the local states of all modules. The behaviour of each module is described by a set of commands. A command takes the form:

[] guard -> prob_1 : update_1 + ... + prob_n : update_n;

The guard is a predicate over all the variables in the model (including those belonging to other modules). Each update describes a transition which the module can make if the guard is true. A transition is specified by giving the new values of the variables in the module, possibly as a function of other variables. Each update is also assigned a probability (or, in some cases, a rate) that will be assigned to the corresponding transition.
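As a concrete illustration, a minimal PRISM sketch of the 3-state disk model of Section 5.2, using the rates of Table 5.1, might look as follows (this is our own sketch; the state encoding, label and reward names are illustrative and not the thesis's actual model code):

    ctmc

    // rates per hour, taken from Table 5.1
    const double sigma = 0.00054668;       // burn-in rate
    const double alpha = 0.0000105849359;  // pre-burn-in failure rate
    const double beta  = 0.0000033917198;  // post-burn-in failure rate

    module disk
        // 0 = pre-burn-in, 1 = post-burn-in, 2 = failed
        s : [0..2] init 0;
        [] s=0 -> sigma : (s'=1) + alpha : (s'=2);  // burn in, or fail early
        [] s=1 -> beta : (s'=2);                    // fail after burn-in
    endmodule

    label "failed" = s=2;

    rewards "time"
        true : 1;   // accumulate 1 per hour in every state
    endrewards

With this, the query R{"time"}=? [ F "failed" ] returns the mean time to reach the failed state, and P=? [ true U<=T "failed" ] the probability that the disk has failed by time T.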

Using these inputs, the PRISM model checker constructs a model of the system, typically a labelled state-transition system in which each state represents a possible configuration and each transition represents an evolution of the system from one configuration to another over time. This is typically done by exhaustive exploration (the state space is explicitly generated using sparse matrices) or by symbolic methods (the state space is represented implicitly using a formula in propositional logic, often encoded in space-efficient data structures such as binary decision diagrams (BDDs) or multi-terminal BDDs (MTBDDs)). It is then possible to automatically verify whether or not each property is satisfied, based on a systematic and exhaustive exploration of the constructed state-transition system. Properties of these models are written in the PRISM property specification language, which is based on temporal logic.

It incorporates several well-known probabilistic temporal logics: PCTL (probabilistic computation tree logic), CSL (continuous stochastic logic) and LTL (linear time logic), plus support for costs/rewards, quantitative properties and several other custom features and extensions. PRISM performs probabilistic model checking, based on exhaustive search and numerical solution, to automatically analyse such properties. Two kinds of property formula are of interest:

• P=? [true U ...

If all n disks of a RAID group are placed in a single enclosure, correlated failures cause data loss at a rate proportional to nC2·λp. Now, if the n disks are spanned across m enclosures (m > 1 and, for simplicity, n/m is a whole number of disks), then the rate at which data loss occurs due to correlated failures is m·[(n/m)C2]·λp = (n((n/m) − 1)/2)·λp, which is less than nC2·λp. Hence, with respect to only correlated failures, spanning is a good option. But whether spanning will increase the chance of overall data inaccessibility or data loss will depend on the enclosure failure rate also (Fig. 7.17). In Fig. 7.17, for p = 0.4, spanning is beneficial when the enclosure MTTF is 60000 hr but not useful if the enclosure MTTF is 28400 hr. Similarly, for a given enclosure MTTF, say 60000 hr, spanning is beneficial when p = 0.4 but not useful when p = 0.2. Hence, with correlated failure, the spanning decision depends on both the enclosure MTTF and p, and we need to model the system to predict the optimal configuration (a greedy algorithm is not applicable).
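For example (our numbers, using the formula above): with n = 8 disks in a single enclosure there are 8C2 = 28 co-located disk pairs, giving a correlated data-loss rate of 28λp; spanning the group across m = 2 enclosures of 4 disks each leaves only 2·(4C2) = 12 co-located pairs, i.e. 12λp, less than half the single-enclosure rate.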


Figure 7.17: MTTDIL (hr) for a RAID group of 8 disks using the correlated disk failure model. en: enclosure

Chapter 8 Conclusions and Future Work

We have presented several approaches for modelling and simulating storage systems with up to 600 disks. Using these models we are able to perform sensitivity analyses and cost-reliability trade-offs and to choose configurations with better reliability. Due to the paucity of failure information (for example, the value of p for correlated failure and the exact disk failure model), we have used some indirect techniques to estimate some of these missing parameters assuming a particular disk failure model, and we attempt validation in other configurations. To the best of our knowledge, there has been no comparable work in the open literature.

All these modelling algorithms can be used to design a storage-system modelling tool. For that we need to automate all the modelling algorithms and optimizations we have used and, finally, generate PRISM code from a graph structure describing the system. Currently, we are able to automate only the series composition of components, as follows. We represent the whole system as a directed graph in which the nodes are the components of the system and the edges represent connections between the components. Each node also stores the failure rate of that component. Enclosures cannot be represented in the graph structure; their details (i.e. the expanders they contain) have to be represented separately. This graph structure is useful for identifying the components in series.


For example, if v1, v2, v3, ..., vn are consecutive nodes in a path in the graph and, for all i (i = 1 to n−1), both the indegree and outdegree of vi is 1 and the indegree of vn is 1 (the outdegree of vn can be more than 1), then we declare that v1, v2, ..., vn are in series and can be replaced by an equivalent component v0.
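One natural way to assign a failure rate to the equivalent component v0 (a standard result, stated here under the assumption of independent, exponentially distributed failures, with the series group failed as soon as any member fails) is λ(v0) = λ(v1) + λ(v2) + ... + λ(vn), i.e. MTTF(v0) = 1 / (1/MTTF(v1) + ... + 1/MTTF(vn)).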

Another area of future research is the problem of spanning disks across enclosures for a non-MDS erasure-coded disk subsystem where the disks have independent failure modes. In this scenario, the enclosure will be the dominant factor in storage reliability. Hence the problem is how to span the disks of a RAID group across enclosures. The problem can be defined as follows:

Suppose we have n data blocks and k parity blocks distributed across m enclosures, where each enclosure can contain at most b blocks. Each parity block is the Ex-or of some data blocks (for ease of understanding we can assume each data/parity block is a single bit) and each data block is involved in at least one Ex-or operation to get some parity block. Each data block can be involved in more than one parity-block operation; in fact it usually will be, as that increases the fault tolerance of the code. So we have k parity equations as follows:

P1 = D11 ⊕ D12 ⊕ · · ·
. . .
Pk = Dk1 ⊕ Dk2 ⊕ · · ·

Now if some data or parity blocks fail and we cannot recover the lost data blocks from the non-failed data/parity blocks, we call this a dataloss situation. The following example shows such a scenario. Suppose we have 4 data blocks and 3 parity blocks, where the parity equations are as follows:

P1 = D1 ⊕ D3 ⊕ D4
P2 = D1 ⊕ D2 ⊕ D3
P3 = D2 ⊕ D3 ⊕ D4


We represent them by a matrix called the generator matrix, a 4 × 7 matrix, which is:

G =
1 0 0 0 1 1 0
0 1 0 0 0 1 1
0 0 1 0 1 1 1
0 0 0 1 1 0 1

Now, suppose D1, D2, D3 fail and we want to check whether that causes dataloss. So, we remove the columns corresponding to D1, D2, D3 from G and call the matrix we obtain the recovery matrix, which here is:

G' =
0 1 1 0
0 0 1 1
0 1 1 1
1 1 0 1

Now, we check the rank of G'. If it is less than n then we have dataloss, otherwise not. Here rank(G') = 4 = n and we do not have any dataloss. Basically, if we Ex-or P1, P2, P3 then we can recover D3, and from that we can recover D1 and D2.

Given

n, k , m

and the

k

parity equations (we are not allowed to choose the data

blocks which are xor-ed to form a check block, we are given an erasure coded system) we want an assignment of the disk and parity blocks across

m

enclosures such that we can

minimize the number of cases where a enclosure failure causes dataloss; i.e we want to m X minimize the function

G=

g(i),

i=0

where

Chapter 8. Conclusions and Future Work

g(i) =

  1

if failure of enclosure

 0

otherwise

121

i

causes dataloss
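As an illustration (our own example, not from the thesis): take the 4-data/3-parity code above with m = 3 enclosures and b = 3, and the assignment E1 = {D1, D2, P1}, E2 = {D3, D4, P2}, E3 = {P3}. If E1 fails, the surviving columns {D3, D4, P2, P3} have rank 4, so g(1) = 0; if E2 fails, the surviving columns {D1, D2, P1, P3} have rank 3 < n, so g(2) = 1; if E3 fails, all data blocks survive, so g(3) = 0. Hence G = 1 for this assignment, and the optimization problem asks whether some other assignment of the seven blocks does better.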

The problem can be defined in a theoretical sense as follows: we are given an n × (n + k) matrix A, with entries in GF(2), of the form A = [In B], where In is the n × n identity matrix and B has no zero rows or columns. The problem is to partition the columns of A into at most m subsets, each of size at most b, such that the number of critical subsets is minimized (with minimum m), where a critical subset is a subset of the set of columns such that if we remove it from A the reduced matrix has rank less than n. By the phrase minimum m we mean that we want to partition using the minimum number of subsets (at most we can use m subsets) without sacrificing the number of critical subsets.

For example, suppose we are getting the optimum configuration (i.e. the number of critical subsets is minimum) with m1 subsets (m1 less than m). In the real world these ...

subsets (m1

Suggest Documents