MDS Disk Array Reliability

Walt Burkhard
Gemini Storage Systems Laboratory
Computer Science and Engineering Department
University of California, San Diego
9500 Gilman Drive, La Jolla CA 92093-0114

Jai Menon
Storage Architectures
IBM Almaden Research Center
650 Harry Road, San Jose CA 95129-6099

November 5, 1992

1 Introduction

Disk array storage systems, especially those with redundant array of inexpensive disks (RAID) Level 5 data organization [14], provide excellent cost and run-time performance as well as reliability, and will meet the needs of computing systems for the immediate future. We project the needs of computing systems to the year 2000 and beyond and find that even greater reliability than provided by RAID Level 5 may be needed. Goldstein has collected data from moderate to large commercial installations and he projects that the average installation in 1995 will contain two terabytes (TB) of data [8]. We use his projection techniques to obtain results for the year 2000 and conclude the average commercial installation will have ten TB of data. We expect this projection is pessimistic since it does not account for the proliferation of data stored for video images of multimedia applications. Between now and the year 2000, we expect disk form factors to shrink to 1.3 inches with capacity of 200 MB. Accordingly the average installation will have 50,000 drives. If we assume highly reliable disk drives with mean-time-between-failures (MTBF) of 500,000 hours and a repair rate of 1.1 per hour, the mean-time-to-data-loss (MTTDL) within a (10+P) RAID Level 5 array is 2.5 × 10^9 hours. Then the average ten TB installation, constituted of 5,000 such RAID Level 5 arrays, has a 500,000 hour MTTDL. That is, the installation MTTDL is identical to its constituent disk MTBF. Next assume that we have 1000 such typical installations. Then the MTTDL of this group of installations is approximately 500 hours. That is, every 500 hours, less than 21 days, some installation within this 1000 will lose data. We note that as the number of drives per installation becomes large and as the number of such installations grows, even the excellent reliabilities provided by RAID Level 5 disk organizations will not suffice.
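The scaling argument above is easy to reproduce. The following sketch assumes the single-check-disk MTTDL closed form derived later in Section 3, and the fact that independent groups fail at additive rates, so an ensemble's MTTDL is the per-group MTTDL divided by the group count:

```python
mtbf = 500_000.0          # disk MTBF, hours
lam = 1.0 / mtbf          # failure rate per disk per hour
mu = 1.1                  # repair rate per hour
m = 10                    # data disks in a (10+P) RAID Level 5 group

# MTTDL of one (m+P) RAID Level 5 group (single check disk), per Section 3.
mttdl_array = (mu + (2 * m + 1) * lam) / (m * (m + 1) * lam ** 2)

# Independent groups fail at additive rates, so MTTDL divides by the count.
mttdl_installation = mttdl_array / 5000     # 5,000 arrays per 10 TB installation
mttdl_fleet = mttdl_installation / 1000     # 1,000 installations

print(round(mttdl_array / 1e9, 2))   # ≈ 2.5e9 hours
print(round(mttdl_installation))     # ≈ 500,000 hours
print(round(mttdl_fleet))            # ≈ 500 hours
```

The three printed figures recover the 2.5 × 10^9 hour, 500,000 hour and 500 hour values quoted above.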
While the RAID Level 5 data organization is sufficient for now, we need to consider schemes that provide higher reliabilities. Our goal within this paper is to present and analyze a data organization scheme for disk arrays that provides better reliabilities than RAID Level 5. The data organization, derived from maximal distance separable (MDS) codes, retains some of the advantages of RAID Level 5 while providing additional check disks. The scheme generalizes the RAID data organization and is the most efficient data organization in the sense that for a given number of data disks it can maintain data integrity through c concurrent disk failures with exactly c check disks. Several schemes for providing better reliability than RAID Level 5 have been proposed. Gibson et al. present the "multidimensional" parity schemes [7]. Blaum et al. present a scheme that accommodates two concurrent disk failures [2] that is also based on a novel variety of MDS codes [3]. Cheung and Kumar study "multi-parity" schemes that accommodate a fixed number of concurrent failures [5]. Neither the multidimensional nor the multi-parity schemes are MDS organizations; both require considerably more than c check disks to withstand c concurrent disk failures.


Our paper is organized as follows. Section 2 contains our presentation of two MDS data organization schemes. Within Section 3 we present a derivation of the mean-time-to-data-loss for the MDS schemes. We present approximations for the reliability functions for the MDS scheme with one, two and three check disks. Next we present the data loss probability as a function of the number of installed disk arrays. We refine this result by determining the number of disk arrays that remain operational from the start until time t. Since we can interpret the reliability function as the fraction of the disk arrays that remain operational from the start until time t, we will have immediate use for our approximations. We conclude with the observation that if we have several thousand disk arrays operational, we will want the additional fault-tolerance provided by the two check disk MDS organizations. In Section 4 we present a parameter sensitivity analysis. Some of our results are presented in figures and others are expressions quantifying the incremental changes. The figures indicate that without major technological improvements (decreased failure rates or increased repair rates) the only way to improve the fault-tolerant capabilities of disk arrays is to include additional check disks. The last section of the paper is the Conclusion. The Appendix contains the details of the derivation of our reliability function approximations.

2 MDS Disk Organization

We begin our discussion of MDS disk organizations with a bit of interesting history. The organization scheme is derived from the so-called maximal distance separable (MDS) codes that have been known for years in coding theory [10]. Independently, similar schemes have been reported by Karnin, Greene, and Hellman [9] who study the sharing of information within a "faulty environment." Rabin [16] also reports an identical scheme in the context of message transmission. Abdel-Ghaffar and El Abbadi report a file comparison scheme that uses this approach [1]. Preparata [15] was one of the first to observe the connection between Rabin's approach and the MDS codes. Recently Blaum et al. [2] have created an MDS array coding scheme which we present here.

MDS organizes a file into n equal-sized pieces, referred to as fragments, with each stored as an ordinary file on a different disk. Another MDS parameter m specifies the number of fragments such that any m suffice to reconstruct the file. The fragments are approximately 1/m-th the size of the file. We can construct the file if we have access to all n fragments, but this approach need not provide any fault tolerance. We could store, for example, every nth file character within the same fragment and require all n fragments to construct the file. However if the fragments are each identical to the file itself, we have (re-named) file replication; this of course will provide excellent fault tolerance but with extravagant disk storage space requirements. The process of creating the n fragments is referred to as "file dispersal" and the process of recreating the contents of the file from m fragments is referred to as "file reconstruction." The dispersal transformation maps a fixed number of data character values to n fragment character values. We present two MDS organizations, one mapping m data byte values and the other mapping m × (n − 1) data byte values to the n fragments.
As examples, we implement an n = 5, m = 3 configuration using both schemes; each fragment will be approximately one-third the size of the file. The first MDS organization utilizes an m × n dispersal matrix D to specify the dispersal operation. The dispersal operation maps m data byte values to n fragment byte values. The dispersal matrix must have the property that any m columns are linearly independent; this condition assures the MDS property will hold and we refer to it as the independence property. An example dispersal matrix is

    D = [ 1 0 0 1 1 ]
        [ 0 1 0 1 2 ]
        [ 0 0 1 1 4 ]

There are many such dispersal matrices even for the n = 5, m = 3 configuration. File dispersal is


accomplished using D such that three file characters are mapped to five fragment characters.

    (p_{3j}, p_{3j+1}, p_{3j+2}) D = (f_{0,j}, f_{1,j}, f_{2,j}, f_{3,j}, f_{4,j})

Fragment character f_{i,j} is stored as the j-th character within fragment F_i. Continuing our example, suppose that our file contains the ASCII encoded text:

    The old man and the sea.\n

Then the five fragments contain the following byte values expressed in octal notation:

    F0 : 124 040 144 141 141 040 145 145 012
    F1 : 150 157 040 156 156 164 040 141 000
    F2 : 145 154 155 040 144 150 163 056 000
    F3 : 131 043 051 057 153 074 066 052 012
    F4 : 141 077 341 075 134 031 230 037 012

Fragments F0, F1 and F2 consist of every third value in the file as demonstrated in Figure 1.

    F0 :  T   \40   d    a    a   \40   e    e   \n
    F1 :  h    o   \40   n    n    t   \40   a   \0
    F2 :  e    l    m   \40   d    h    s    .   \0

Figure 1: Fragments F0, F1, and F2

The contents of fragments F3 and F4 require more explanation. We desire to store the sums and products of byte values within a byte; all arithmetic is within the finite field GF(2^8). Addition and subtraction are just the exclusive-or operation. However the multiplication and division operations are more intricate; we can create a multiplication table using techniques described within MacWilliams and Sloane [10]. The reconstruction process is defined as a transformation that maps byte values from m fragments to m data byte values for the file. Since any m columns within D are linearly independent, we construct the m × m inverse matrix R for the columns associated with the m fragments participating within the reconstruction. The fragment entries are processed sequentially. Continuing our n = 5, m = 3 example, we can reconstruct a triple of data byte values from the original file by solving a triple of linear equations in three unknowns since any triple of columns within D is linearly independent. Suppose we have access to fragments i1, i2 and i3; then we will have

    (f_{i1,j}, f_{i2,j}, f_{i3,j}) R_{i1,i2,i3} = (p_{3j}, p_{3j+1}, p_{3j+2})

where R_{i1,i2,i3} is the 3 × 3 inverse matrix for columns i1, i2, i3. Within an n = 5, m = 3 configuration, there are 10 different inverse matrices to store or determine dynamically. The inverse matrix for fragments 2, 3 and 4 is

    R_{2,3,4} = (1/3) [ 6 5 3 ]   [ 0 1 1 ]^(-1)
                      [ 2 1 0 ] = [ 0 1 2 ]
                      [ 1 1 0 ]   [ 1 1 4 ]

The individual reconstruction computations are

    (p_{3j+2}, p_{3j} + p_{3j+1} + p_{3j+2}, p_{3j} + 2 p_{3j+1} + 4 p_{3j+2}) R_{2,3,4} = (p_{3j}, p_{3j+1}, p_{3j+2})

The sequence of computations to reconstruct the file from fragments F2, F3 and F4 would be

    (145, 131, 141) R_{2,3,4}, (154, 043, 077) R_{2,3,4}, (155, 051, 341) R_{2,3,4}, ...
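The dispersal and reconstruction arithmetic can be sketched directly. The sketch below assumes the common GF(2^8) modulus x^8 + x^4 + x^3 + x^2 + 1; the paper builds its multiplication table from MacWilliams and Sloane, possibly with a different polynomial, so the secondary fragment byte values need not match the octal values tabulated above, but the MDS mechanics are identical:

```python
# GF(2^8) arithmetic for MDS dispersal/reconstruction (n = 5, m = 3).
# Assumed modulus: x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
MOD = 0x11D

def gf_mul(a, b):
    """Carry-less multiplication modulo MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse by exhaustive search (fine for GF(2^8))."""
    for x in range(1, 256):
        if gf_mul(a, x) == 1:
            return x
    raise ZeroDivisionError("0 has no inverse")

# Dispersal matrix D from the text: systematic in columns 0-2.
D = [[1, 0, 0, 1, 1],
     [0, 1, 0, 1, 2],
     [0, 0, 1, 1, 4]]

def disperse(p):
    """Map m = 3 data bytes to n = 5 fragment bytes: f = p D."""
    f5 = []
    for j in range(5):
        v = 0
        for i in range(3):
            v ^= gf_mul(p[i], D[i][j])
        f5.append(v)
    return f5

def reconstruct(cols, vals):
    """Recover the data triple from any 3 surviving fragment bytes."""
    # Each surviving column c gives one equation sum_i p_i D[i][c] = val;
    # Gauss-Jordan elimination over GF(2^8) solves for (p0, p1, p2).
    A = [[D[i][c] for i in range(3)] + [vals[k]] for k, c in enumerate(cols)]
    for col in range(3):
        piv = next(r for r in range(col, 3) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, v) for v in A[col]]
        for r in range(3):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [A[r][c2] ^ gf_mul(f, A[col][c2]) for c2 in range(4)]
    return [A[i][3] for i in range(3)]
```

Dispersing a triple such as (0o124, 0o150, 0o145) leaves it intact in the three primary fragments, and reconstruction from any three fragments returns the original triple.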


yielding the ASCII character code data for the file.

    (124, 150, 145), (040, 157, 154), (144, 040, 155), ...
    ( T ,  h ,  e ), (\40,  o ,  l ), ( d , \40,  m ), ...

We observe that there are two varieties of fragments within our example; the three fragments F0, F1 and F2 allow very easy file reconstruction since merging them is all that is necessary. Any other combination of fragments will require "real" computation during reconstruction. While the presence of fragments allowing this simplistic reconstruction is not required by the MDS data organization scheme, it is certainly advantageous. Dispersal schemes containing these ease-of-reconstruction fragments are said to have the systematic property. More generally, we will require that our dispersal matrices always have this property; that is, the dispersal matrix contains m columns constituting an identity matrix. Selecting a dispersal matrix satisfying both the independence and systematic properties is relatively straightforward as shown in [17]. Reconstruction will be the very straightforward merging of files for the m fragments associated with these columns. We refer to these m fragments as primary fragments and the others as secondary fragments. In the absence of failed sites and other conditions being equal (such as site load) the reconstruction process will access only the primary sites. Finally we return to the issue of selecting a dispersal matrix. We have two requirements for these matrices: (1) the systematic property and (2) the independence property. The independence property ensures that any m fragments suffice to reconstruct the file from the fragments. The systematic property ensures that for one combination of m fragments, reconstruction is accomplished by merging the fragments. The Vandermonde matrix [12] provides an n × n matrix having nonzero determinant thereby assuring linear independence of its columns. We can construct dispersal matrices by truncating the bottom n − m rows of the Vandermonde matrix.
Since any m shortened columns constitute an m × m Vandermonde matrix, any combination of m columns will be linearly independent as required. We then utilize elementary matrix transformations to obtain a dispersal matrix in which the leftmost m columns form the identity matrix. We can construct such a matrix over GF(2^8) with as many as 256 columns. Going full circle, we utilize results from coding theory regarding twice-extended Reed-Solomon codes [10] to obtain suitable dispersal matrices over GF(2^8) with 257 columns. For practical applications this is probably more than adequate. We note that our previous dispersal matrix was not selected in this fashion. The Blaum et al. MDS scheme is based on array codes; we present an n = 5, m = 3 configuration here as well. Dispersal maps 12 data byte values to 20 fragment byte values. Codewords are 4 × 5 arrays
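The Vandermonde construction just described can be sketched as follows, again under the assumed modulus x^8 + x^4 + x^3 + x^2 + 1: build a 3 × 5 Vandermonde matrix with distinct evaluation points, then row-reduce so the leftmost three columns form an identity. Row operations are invertible, so every column triple stays linearly independent:

```python
MOD = 0x11D  # assumed GF(2^8) modulus, x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r

def gf_inv(a):
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

m, n = 3, 5
alphas = [0, 1, 2, 3, 4]                                  # distinct points
V = [[gf_pow(a, i) for a in alphas] for i in range(m)]    # rows a^0, a^1, a^2

# Row-reduce so columns 0..m-1 become the identity (systematic property).
for col in range(m):
    piv = next(r for r in range(col, m) if V[r][col])
    V[col], V[piv] = V[piv], V[col]
    inv = gf_inv(V[col][col])
    V[col] = [gf_mul(inv, v) for v in V[col]]
    for r in range(m):
        if r != col and V[r][col]:
            f = V[r][col]
            V[r] = [V[r][c] ^ gf_mul(f, V[col][c]) for c in range(n)]

def det3(cols):
    """3x3 determinant over GF(2^8); in characteristic 2 all signs are +."""
    a, b, c = ([V[i][j] for i in range(m)] for j in cols)
    return (gf_mul(a[0], gf_mul(b[1], c[2]) ^ gf_mul(b[2], c[1]))
            ^ gf_mul(a[1], gf_mul(b[0], c[2]) ^ gf_mul(b[2], c[0]))
            ^ gf_mul(a[2], gf_mul(b[0], c[1]) ^ gf_mul(b[1], c[0])))
```

After reduction, V is systematic and every one of the ten column triples has a nonzero determinant, so both required properties hold.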

in which each entry is a byte value. The byte values within each column are stored on a disk drive.

    [Two 4 × 5 arrays of marked entries, omitted here, geometrically depict the horizontal and diagonal parity constraints.]

Figure 2: Array Code Parity Matrices

The parity conditions (syndromes) are listed here for codeword C = {c_{i,j}} where 0 ≤ i ≤ 3 and 0 ≤ j ≤ 4:

    ⊕_{j=0..4} c_{i,j} = 0              for 0 ≤ i ≤ 3
    ⊕_{k=0..3} c_{k,(j+k) mod 5} = 0    for 0 ≤ j ≤ 4


The first equation defines the horizontal syndromes and the second the diagonal syndromes. The pair of arrays within Figure 2 geometrically define the parity constraints for our code. In our example, each codeword designates 20 byte values stored four per disk on five disks. Twelve data character values are positioned within the leftmost three columns of the matrix; this is the systematic property for these codes. The rightmost two columns contain the parity information. Equation 1 defines the parity values in terms of the data values. The exclusive-or operation is designated by the ⊕ mark.

    c_{0,4} = c_{1,0} ⊕ c_{2,1} ⊕ c_{3,2}        c_{0,3} = c_{0,0} ⊕ c_{0,1} ⊕ c_{0,2} ⊕ c_{0,4}
    c_{1,4} = c_{0,3} ⊕ c_{2,0} ⊕ c_{3,1}        c_{1,3} = c_{1,0} ⊕ c_{1,1} ⊕ c_{1,2} ⊕ c_{1,4}
    c_{2,4} = c_{1,3} ⊕ c_{0,2} ⊕ c_{3,0}        c_{2,3} = c_{2,0} ⊕ c_{2,1} ⊕ c_{2,2} ⊕ c_{2,4}    (1)
    c_{3,4} = c_{0,1} ⊕ c_{1,2} ⊕ c_{2,3}        c_{3,3} = c_{3,0} ⊕ c_{3,1} ⊕ c_{3,2} ⊕ c_{3,4}

We continue our example and suppose that our file contains the ASCII encoded text:

    The old man and the sea.\n
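Equation (1) is pure exclusive-or and can be checked directly. The sketch below encodes the first codeword of the example file (the data columns spread "The old man " every third byte) and fills in the two parity columns in the order of equation (1):

```python
def encode(c):
    """Fill parity columns 3 and 4 of a 4 x 5 array-code codeword in place,
    following the solution order of equation (1); ^ is exclusive-or."""
    c[0][4] = c[1][0] ^ c[2][1] ^ c[3][2]
    c[0][3] = c[0][0] ^ c[0][1] ^ c[0][2] ^ c[0][4]
    c[1][4] = c[0][3] ^ c[2][0] ^ c[3][1]
    c[1][3] = c[1][0] ^ c[1][1] ^ c[1][2] ^ c[1][4]
    c[2][4] = c[1][3] ^ c[0][2] ^ c[3][0]
    c[2][3] = c[2][0] ^ c[2][1] ^ c[2][2] ^ c[2][4]
    c[3][4] = c[0][1] ^ c[1][2] ^ c[2][3]
    c[3][3] = c[3][0] ^ c[3][1] ^ c[3][2] ^ c[3][4]
    return c

# First codeword: columns 0-2 hold the data bytes of fragments F0, F1, F2.
data = [[0o124, 0o150, 0o145],
        [0o040, 0o157, 0o154],
        [0o144, 0o040, 0o155],
        [0o141, 0o156, 0o040]]
cw = [row + [0, 0] for row in data]
encode(cw)
```

The computed parity columns reproduce the first four bytes of fragments F3 (171 120 175 126) and F4 (040 163 124 171) listed below.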

Then the five fragments contain the following byte values expressed in octal notation:

    F0 : 124 040 144 141 141 040 145 145 012 000 000 000
    F1 : 150 157 040 156 156 164 040 141 000 000 000 000
    F2 : 145 154 155 040 144 150 163 056 000 000 000 000
    F3 : 171 120 175 126 105 175 112 146 012 012 012 012
    F4 : 040 163 124 171 056 101 174 114 000 012 012 012

Fragments F0, F1 and F2 consist of every third value in the file as demonstrated in Figure 3. Fragments F3 and F4 contain the parity values according to the eight equations presented above.

    F0 :  T   \40   d    a    a   \40   e    e   \n  \0  \0  \0
    F1 :  h    o   \40   n    n    t   \40   a   \0  \0  \0  \0
    F2 :  e    l    m   \40   d    h    s    .   \0  \0  \0  \0

Figure 3: Fragments F0, F1, and F2

Reconstruction of the data, should one or two disks fail, proceeds as a two-phase process. First we calculate the horizontal and diagonal syndromes assuming two disks have failed; the associated columns within the codewords contain zero entries. The second phase determines the values in the two "erased" columns using equations similar to 1. The horizontal syndromes Sh and the diagonal syndromes Sd are

    Sh[i] = ⊕_{j=0..4} c_{i,j}              for 0 ≤ i ≤ 3
    Sd[j] = ⊕_{k=0..3} c_{k,(j+k) mod 5}    for 0 ≤ j ≤ 4

The second phase begins with selection of the initial row r to reconstruct. We let i and j designate the two failed disks that cannot access data where 0 ≤ i < j ≤ n − 1:


    c_{4,i} = 0
    for ( r = j − i − 1;  r ≠ 4;  r = (r + (j − i)) mod 5 ) {
        c_{r,j} = Sd[(j − r) mod 5] ⊕ c_{(r − (j − i)) mod 5, i}
        c_{r,i} = Sh[r] ⊕ c_{r,j}
    }

We continue our example and assume fragments F0 and F1 are not accessible. Here is the first codeword to "decode" with the first two columns set to zero.

    [ 000 000 145 171 040 ]
    [ 000 000 154 120 163 ]
    [ 000 000 155 175 124 ]
    [ 000 000 040 126 171 ]

For this codeword, the syndrome values are

    Sh[0] = 074   Sh[1] = 117   Sh[2] = 104   Sh[3] = 017
    Sd[0] = 073   Sd[1] = 150   Sd[2] = 141   Sd[3] = 012   Sd[4] = 000

The second phase of reconstruction proceeds as follows:

    c_{0,1} = Sd[1] ⊕ c_{4,0} = 150 ⊕ 000 = 150
    c_{0,0} = Sh[0] ⊕ c_{0,1} = 074 ⊕ 150 = 124
    c_{1,1} = Sd[0] ⊕ c_{0,0} = 073 ⊕ 124 = 157
    c_{1,0} = Sh[1] ⊕ c_{1,1} = 117 ⊕ 157 = 040
    c_{2,1} = Sd[4] ⊕ c_{1,0} = 000 ⊕ 040 = 040
    c_{2,0} = Sh[2] ⊕ c_{2,1} = 104 ⊕ 040 = 144
    c_{3,1} = Sd[3] ⊕ c_{2,0} = 012 ⊕ 144 = 156
    c_{3,0} = Sh[3] ⊕ c_{3,1} = 017 ⊕ 156 = 141

Thus the first array yields the codeword

    [ 124 150 145 171 040 ]
    [ 040 157 154 120 163 ]
    [ 144 040 155 175 124 ]
    [ 141 156 040 126 171 ]

The first twelve data byte values are now accessible. Each consecutive array is decoded in this fashion. This scheme utilizes only the exclusive-or operation and it is applicable for disk arrays containing a prime number of disks; it is discussed in greater generality within [2, 3]. One generalization is the number of concurrent faulty disks accommodated and another relaxes the restriction to only prime numbers of disks per array.
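The two-phase decode can be sketched as a single routine (`decode` is an assumed helper name) following the syndrome definitions and the recovery loop above; the imaginary all-zero row 4 seeds the diagonal chase:

```python
def decode(cw, i, j, n=5, rows=4):
    """Recover two erased columns i < j of a rows x n array codeword in place.
    Phase 1: syndromes with the erased columns zeroed; phase 2: chase the
    diagonals starting from the imaginary all-zero row 4."""
    for r in range(rows):
        cw[r][i] = cw[r][j] = 0
    Sh = [0] * rows                      # horizontal syndromes
    Sd = [0] * n                         # diagonal syndromes
    for r in range(rows):
        for col in range(n):
            Sh[r] ^= cw[r][col]
            Sd[(col - r) % n] ^= cw[r][col]   # entry (r, col) lies on diagonal (col - r) mod n
    prev_i = 0                           # c[4][i] = 0, the imaginary row
    r = (j - i - 1) % n
    while r != rows:
        cw[r][j] = Sd[(j - r) % n] ^ prev_i
        cw[r][i] = Sh[r] ^ cw[r][j]
        prev_i = cw[r][i]
        r = (r + (j - i)) % n
    return cw

# First codeword of the example file (columns 0-2 data, 3-4 parity).
CW = [[0o124, 0o150, 0o145, 0o171, 0o040],
      [0o040, 0o157, 0o154, 0o120, 0o163],
      [0o144, 0o040, 0o155, 0o175, 0o124],
      [0o141, 0o156, 0o040, 0o126, 0o171]]
```

Erasing any pair of columns of CW and running `decode` restores the codeword, matching the worked example above for the pair (0, 1).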

3 Failure Models

Gibson presents a characterization of disk failures [6] from which he concludes that an exponential lifetime distribution is a reasonable approximation especially for disk units with mean time to failure of at least 200,000 hours. He compares the exponential and Weibull distributions for disk units having smaller expected lifetime figures. In this situation, he concludes that neither distribution


adequately models these lifetimes and possibly a three-parameter distribution is necessary. However Gibson states that the exponential distribution is plausible throughout. The infant mortality phenomenon within disk units has been observed. We have anecdotal evidence stating that one-half of the failures, occurring within a nominal lifetime span for disk units, occur within the first four months of operation. We have modeled this effect by piecing together two exponential lifetime distributions; the difference between using a single and a pair of exponential distributions is very small within our results. Accordingly we present results for a single exponential distribution lifetime model. The Markov model depicted in Figure 4 is a simple model sufficient for reliability groups containing n = m + c disks and accommodating up to c concurrent disk failures without data loss.

    [States n, n−1, ..., i, ..., m, F. Failure transitions: state i to state i−1 at rate iλ, with state m entering F at rate mλ. Repair transitions: state i back to state n at rate (n−i)μ.]

Figure 4: MDS Markov Model

We assume disk failures are independent, the failure rate per disk is λ and the repair rate per disk is μ. The state labels designate the number of operational and accessible fragments. Associated with each state is probability P_i(t) designating the probability of being in state i at time t. The failure state F is absorbing in our model since it is entered only if fewer than m fragments are accessible. Other Markov models are possible here. We could repair a single disk at a time, but the repair process produces all fragments at once. To a first approximation, this variety of model change affects the reliability values only slightly and does not affect our conclusions.

3.1 Mean Time To Data Loss

The mean time to data loss MTTDL is the expected time to enter state F. A similar derivation obtaining approximate values is presented in the USENIX Proceedings [4]. The reliability function R(t) is the probability of being in one of the non-failure states at time t; that is

    R(t) = P_{m+c}(t) + P_{m+c-1}(t) + ... + P_{m+1}(t) + P_m(t).    (2)

The mean time to data loss for the configuration with c check disks is

    MTTDL_c = ∫_0^∞ R(t) dt.    (3)

We determine the MTTDL_c for the configuration with c check disks using the Laplace transformation since its "integral property" simplifies our task. Then

    MTTDL_c = ∫_0^∞ R(t) dt = R*(0) = Σ_{i=m}^{m+c} P*_i(0)

where R* and P*_i designate transformed functions. We present a specific case here; the general result is presented within the Appendix. The Laplace state transition equations for c = 2 are

    [ s + (m+2)λ        −μ            −2μ        0 ] [ P*_{m+2} ]   [ 1 ]
    [ −(m+2)λ      s + (m+1)λ + μ      0         0 ] [ P*_{m+1} ] = [ 0 ]    (4)
    [     0           −(m+1)λ     s + mλ + 2μ    0 ] [ P*_m     ]   [ 0 ]
    [     0              0            −mλ        s ] [ P*_F     ]   [ 0 ]


Then

    P*_{m+2}(s) = (s + (m+1)λ + μ)(s + mλ + 2μ) / Δ(s)
    P*_{m+1}(s) = (m+2)λ (s + mλ + 2μ) / Δ(s)
    P*_m(s)     = λ²(m+1)(m+2) / Δ(s)

where Δ(s) is

    (s + (m+2)λ)(s + (m+1)λ + μ)(s + mλ + 2μ) − 2λ²μ(m+1)(m+2) − λμ(m+2)(s + mλ + 2μ).

Finally we obtain the following

    MTTDL_2 = [ 2μ² + (5m+6)λμ + λ²(3m²+6m+2) ] / [ λ³ m(m+1)(m+2) ].    (5)

In a similar manner, we can obtain the expected lifetimes for the c = 0, 1 and 3 configurations, which we list here.

    MTTDL_0 = 1 / (mλ)
    MTTDL_1 = [ μ + (2m+1)λ ] / [ λ² m(m+1) ]    (6)
    MTTDL_3 = [ 6μ³ + (17m+33)λμ² + (14m²+47m+33)λ²μ + 2(2m³+9m²+11m+3)λ³ ] / [ λ⁴ m(m+1)(m+2)(m+3) ]    (7)

The MTTDL_1 expression has been presented previously by Ng [13]. For reliability modeling such as the above, the ratio μ/λ is very large; accordingly we can approximate these mean time to data loss expressions by the following [4]:

    MTTDL_c ≈ ω^c / [ λ m (m+c choose c) ]    for c = 0, 1, 2, ...

where ω = μ/λ. The improvement in MTTDL_c as a function of c is rather dramatic. The mean time to data loss values are given in Table 1.

    c    MTTDL (years)
    0    2.8
    1    2.6 × 10^5
    2    4.3 × 10^10
    3    1.0 × 10^16

    (m = 10 data disks per array; λ = 4 × 10^-6 failures per hour; μ = 4 repairs per hour)

Table 1: Mean Time to Data Loss Results

These results are very impressive; incrementing c increases the expected longevity by approximately ω, which is 10^6 in this example. However as we will observe in Section 3.2, these figures provide too rosy a view of the performance of an ensemble of reliability groups.
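The closed forms, the ω-approximation, and Table 1 are easy to cross-check numerically. The following sketch also recomputes MTTDL_2 as the expected absorption time of the Markov model, solving (−Q) t = 1 over the three transient states of equation (4):

```python
from math import comb

m, lam, mu = 10, 4e-6, 4.0
omega = mu / lam

mttdl = {
    0: 1 / (m * lam),
    1: (mu + (2*m + 1)*lam) / (lam**2 * m * (m + 1)),
    2: (2*mu**2 + (5*m + 6)*lam*mu + (3*m**2 + 6*m + 2)*lam**2)
       / (lam**3 * m * (m + 1) * (m + 2)),
    3: (6*mu**3 + (17*m + 33)*lam*mu**2 + (14*m**2 + 47*m + 33)*lam**2*mu
        + 2*(2*m**3 + 9*m**2 + 11*m + 3)*lam**3)
       / (lam**4 * m * (m + 1) * (m + 2) * (m + 3)),
}

approx = {c: omega**c / (lam * m * comb(m + c, c)) for c in range(4)}

# Expected absorption time for c = 2: solve (-Q) t = 1 over the transient
# states (m+2, m+1, m operational fragments), Q per the Markov model.
A = [[(m + 2)*lam, -(m + 2)*lam, 0.0],
     [-mu, (m + 1)*lam + mu, -(m + 1)*lam],
     [-2*mu, 0.0, m*lam + 2*mu]]
b = [1.0, 1.0, 1.0]
for col in range(3):                     # Gauss-Jordan with partial pivoting
    piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(3):
        if r != col:
            f = A[r][col] / A[col][col]
            A[r] = [A[r][k] - f*A[col][k] for k in range(3)]
            b[r] -= f*b[col]
t_top = b[0] / A[0][0]                   # MTTDL starting with all disks up
```

The absorption time agrees with the closed form (5), the ω-approximation is within a fraction of a percent, and dividing by 8766 hours per year recovers the Table 1 entries.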

3.2 Reliability

The reliability R(t) is the probability a disk array configuration remains operational from the start until time t. Within our reliability model, this is the probability of being in a non-failure state at time t. In the previous section, we avoided determining the reliability function explicitly but here we will obtain a symbolic representation for R(t). For c = 0, the reliability is

    R_0(t) = e^{-mλt} = e^{-t/MTTDL_0}    (8)

In the cases c > 0, we have Theorem 1 which contains approximations to the reliability function; it is verified within the Appendix.

Theorem 1. For each reliability group containing m data disks and c = 1, 2, or 3 check disks with failure rate λ and repair rate μ we have the following approximate reliability functions

    R_1(t) = C_{1,m,0} e^{-t/MTTDL_1} + C_{1,m,-μ} e^{-t(μ+(2m+1)λ)}    (9)

    R_2(t) = C_{2,m,0} e^{-t/MTTDL_2} + C_{2,m,-μ} e^{-t(μ+(2m+3)λ)} + C_{2,m,-2μ} e^{-t(2μ+mλ)}    (10)

    R_3(t) = C_{3,m,0} e^{-t/MTTDL_3} + C_{3,m,-μ} e^{-t(μ+(2m+5)λ)} + C_{3,m,-2μ} e^{-t(2μ+(m+1)λ)} + C_{3,m,-3μ} e^{-t(3μ+mλ)}    (11)

where MTTDL_1, MTTDL_2 and MTTDL_3 are given as equations 5, 6 and 7,

    C_{c,m,0} = 1 + H_c m (m+c choose c) (λ/μ)^{c+1}    for c = 1, 2, and 3,

and H_c is the c-th harmonic number. The other coefficients are

    C_{1,m,-μ}  = − (λ²/μ²) m(m+1)
    C_{2,m,-μ}  = − (λ³/μ³) m(m+1)(m+2)           C_{2,m,-2μ} = (λ³/(4μ³)) m(m+1)(m+2)
    C_{3,m,-μ}  = − (λ⁴/μ⁴) m(m+1)(m+2)(m+3)      C_{3,m,-2μ} = (3λ⁴/(4μ⁴)) m(m+1)(m+2)(m+3)
    C_{3,m,-3μ} = − (λ⁴/(18μ⁴)) m(m+1)(m+2)(m+3)

This approximation is valid provided λ/μ is very small. The coefficients are determined using the partial-fraction expansion of the transformed reliability function. The roots of the denominator are approximated using Newton's iteration scheme. Moreover, the largest discarded term in each coefficient is smaller, in magnitude, by a factor of λ/μ.
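The c = 2 approximation can be sanity-checked against a direct numerical integration of the Kolmogorov equations for the Markov model. This is a sketch only; λ/μ = 10^-3 here so the approximation premise holds, and a small-step RK4 handles the stiffness:

```python
from math import exp

m, lam, mu = 10, 1e-3, 1.0

M2 = (2*mu**2 + (5*m + 6)*lam*mu + (3*m**2 + 6*m + 2)*lam**2) \
     / (lam**3 * m * (m + 1) * (m + 2))
X = m * (m + 1) * (m + 2) * (lam / mu)**3

def R2(t):
    """Theorem 1 approximation for c = 2 (coefficients sum to 1 at t = 0)."""
    return ((1 + 0.75*X) * exp(-t / M2)
            - X * exp(-t * (mu + (2*m + 3)*lam))
            + 0.25*X * exp(-t * (2*mu + m*lam)))

def deriv(P):
    """Kolmogorov equations for states (m+2, m+1, m operational; failed)."""
    p1, p2, p3, pf = P
    return (-(m + 2)*lam*p1 + mu*p2 + 2*mu*p3,
            (m + 2)*lam*p1 - ((m + 1)*lam + mu)*p2,
            (m + 1)*lam*p2 - (m*lam + 2*mu)*p3,
            m*lam*p3)

def R2_ode(t_end, h=0.01):
    """Classical RK4 integration; returns 1 - P_F(t_end)."""
    P = (1.0, 0.0, 0.0, 0.0)
    for _ in range(int(t_end / h)):
        k1 = deriv(P)
        k2 = deriv(tuple(p + 0.5*h*k for p, k in zip(P, k1)))
        k3 = deriv(tuple(p + 0.5*h*k for p, k in zip(P, k2)))
        k4 = deriv(tuple(p + h*k for p, k in zip(P, k3)))
        P = tuple(p + h/6*(a + 2*b + 2*c + d)
                  for p, a, b, c, d in zip(P, k1, k2, k3, k4))
    return 1.0 - P[3]
```

At t = 200 hours the closed-form approximation and the integrated Markov model agree to well under 10^-5.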

3.3 Data Loss vs Number of Units

The reliability function provides a figure of merit for a disk array. A second figure of merit is obtained by considering a large ensemble of disk arrays. Of course we desire that the arrays not lose data during the typical lifetime of the constituent disk arrays, since loss of any data, however small, is highly traumatic for the users losing data. The probability of data loss PDL_{c,m}(n, t) depends on the number of disk arrays n within the ensemble, the number of check disks per array c, the number of data disks per array m as well as the time interval t. We consider only uniform ensembles of arrays in this analysis; moreover we assume that failures are independent. Then we have

    PDL_{c,m}(n, t) = 1 − R_{c,m}(t)^n    (12)

measuring the probability of at least one data loss within the interval from the start to time t. We plot PDL_{c,m}(n, t) for fixed t in Figures 5, 6, 7 and 8. The notation R_{c,m}(t) for the reliability function is adopted here to emphasize the uniformity assumed of the disk arrays; each array contains m data and c check disks. The three sets of curves in Figure 5 have the same parameters except for c; the individual disk failure rates are λ = 1/100000, 1/200000, 1/300000, 1/400000 and 1/500000 failures per hour.

    [Plot: data loss probability versus number of disk array subsystems (10^0 to 10^12), with five curves each for c = 0, c = 1 and c = 2; m = 10 data disks per array, λ = 10, 5, 3.33, 2.5 and 2 × 10^-6 failures per hour, μ = 4 repairs per hour, t = 7 years.]

Figure 5: Probability of Data Loss vs Number of Subsystems

The five curves, in the upper lefthand corner of the graph, immediately to the left of the phrase c = 0 designate no check disk within the arrays. The leftmost curve has probability of data loss greater than 0.9978 for one array; this curve appears almost as a horizontal line within Figure 5. The probability of failure is practically one for as few as ten disk arrays for any of our five failure rates when c is zero. The five curves immediately to the left of the phrase c = 1 designate single check disk probabilities and the curves to the right of c = 2 pertain to two check disks. Within each set, the curves from left to right have decreasing λ values. Thus for example, with n = 10000 arrays each configured with a single check disk, we expect at least one data loss with probability approximately 2/5 for λ = 1/200000. However utilizing two check disks per array, the probability of data loss is very close to zero for the same number of arrays.
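The claim for n = 10000, c = 1, λ = 1/200000 can be reproduced with the one-check-disk reliability approximation; this is a sketch in which the transient term of R_1 is negligible at t = 7 years:

```python
from math import exp

m, lam, mu = 10, 5e-6, 4.0
t = 7 * 8766.0                       # seven years in hours

mttdl1 = (mu + (2*m + 1)*lam) / (lam**2 * m * (m + 1))
c0 = 1 + m * (m + 1) * (lam / mu)**2          # C_{1,m,0} from Theorem 1
r1 = c0 * exp(-t / mttdl1)                    # transient term ~ e^{-mu t} ~ 0

n = 10000
pdl = 1 - r1**n
print(pdl)                            # roughly 0.34, i.e. approximately 2/5
```

The computed value is about 0.34, consistent with the "approximately 2/5" reading taken from the curve.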

3.4 Expected Number of Failures

The probability of data loss depends on whether our array configurations contain two, one or no check disks. We are interested in reducing the expected number of "unhappy" clients who lose data during the disk array lifetime. We propose to refine our measure by determining the expected number of arrays that will lose data during the observation interval. Let X_{c,m}(t, n) designate the number of arrays, each containing c check disks and m data disks, remaining operational at time t provided that n are initially placed in service. Then

    Prob{X_{c,m}(t, n) = i} = (n choose i) R_{c,m}(t)^i (1 − R_{c,m}(t))^{n−i}    for i = 0, 1, 2, ..., n.

Since X_{c,m} has a binomial distribution, the expected number of arrays remaining operational at time t is n R_{c,m}(t). Thus we can interpret R_{c,m}(t) as the expected fraction of the arrays remaining operational during the interval from the start until time t. The expected numbers of failures over a seven year period for an ensemble of 10,000 arrays each configured as a single reliability group are tabulated in Table 2. The expected number of failures becomes insignificant only when c is at least two for repair and failure rates typical of current technologies. We conclude that RAID Level 5 (c = 1) is not a satisfactory storage architecture for massively employed subsystems since the probability of at least one unhappy client is too large.


    c    Expected Failures
    0    9535
    1    0.42
    2    5 × 10^-6

    (n = 10000 disk arrays; m = 10 data disks per array; λ = 5 × 10^-6 failures per hour; μ = 4 repairs per hour; t = 7 years)

Table 2: Expected Failures
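Table 2's entries for c = 0 and c = 1 follow directly from n(1 − R_{c,m}(t)) with the approximations of Theorem 1. This is a sketch; the c = 2 entry is of order 10^-6 and is more sensitive to the approximation details, so only the first two rows are reproduced:

```python
from math import exp

n, m = 10000, 10
lam, mu = 5e-6, 4.0
t = 7 * 8766.0                      # seven years in hours

# c = 0: R_0(t) = e^{-m lam t}
ef0 = n * (1 - exp(-m * lam * t))

# c = 1: dominant term of Theorem 1's R_1(t)
mttdl1 = (mu + (2*m + 1)*lam) / (lam**2 * m * (m + 1))
ef1 = n * (1 - (1 + m*(m + 1)*(lam/mu)**2) * exp(-t / mttdl1))

print(round(ef0), round(ef1, 2))   # about 9535 and 0.42
```

The computed expectations match the tabulated 9535 and 0.42.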

4 Parameter Sensitivity

Having touched the issue of viability of one check disk array architectures at least for current technologies, we ask the question: what is the least expensive approach to providing suitable reliability? We explore the effects that incremental changes in λ, μ, m or c individually have on the reliability. We can calculate partial derivatives of R_{c,m} with respect to λ and μ as well as differences with respect to m and c. Moreover we can determine the reliability for an ensemble of identically configured arrays providing a constant number of data disks for a variety of check disks. We are left with the question of trade-offs among the reliability improvements and the costs of the incremental changes. We present curves demonstrating the sensitivity of the parameters and then conclude with some analytical results. Within Figure 5 we have already noted the significant improvement in the data loss probabilities when c is incremented. We begin with one check disk configurations. Figure 6 demonstrates the effect of various failure rates with other parameters constant. The failure rates are written diagonally across the curves.

    [Plot: data loss probability versus number of disk array subsystems (10^0 to 10^12); m = 10 data disks per array, λ = 2 × 10^-i failures per hour for i = 4, 5, 6, 7, 8, 9, μ = 4 repairs per hour, t = 7 years, c = 1.]

Figure 6: Probability of Data Loss vs Number of Subsystems: Various Failure Rates

A typical failure rate, within current technology, is 1/500000 per hour; this curve rapidly goes to one at approximately 10^4 disk arrays. We next consider various repair rates, very slow to "impossibly" fast, and present the results within Figure 7. The repair rates are written diagonally across the curves; e.g. 4 designates four repairs per hour. Repair rates much larger than four per hour are not possible within current technologies. The incremental improvement for the data loss probability is


    [Plot: data loss probability versus number of disk array subsystems (10^2 to 10^10) for various repair rates; m = 10 data disks per array, λ = 2 × 10^-6 failures per hour, c = 1, t = 7 years.]

Figure 7: Probability of Data Loss vs Number of Subsystems: Various Repair Rates

less for repair increments than for failure decrements. We conclude our presentation of one check configurations with the curves of Figure 8; the subsystems measured here are somewhat different from the previous arrays. Each subsystem contains exactly 120 data disks and these disks are logically partitioned into disjoint arrays each with a single check disk. For example, we might have one array with 120 message disks and one check disk, or we could have 2 arrays each containing 60 message disks and one check disk, and so forth. The notation i(m+1) designates i arrays each with m message disks; the product m × i is always 120. The probabilities of data loss specified within Figure 8 are for ensembles each containing 120 message disks comprised of a number of disk arrays. Each component array contains a single check disk. The 120(1+1) configuration designates data replication; there are 120 check disks within this configuration. The probability of data loss diminishes with increased numbers of check disks within the ensemble. Next we consider two check disk configurations. We present Figures 9, 10 and 11 which contain the data loss probabilities for various failure rates, repair rates, and the constant message capacity configurations respectively. We note within that the curves are spread further apart than for the single check disk curves. The extra check disk decreases the sensitivity to changes in the failure rate. We observe a similar behavior with respect to the repair rates in Figure 10. We observe again the improved sensitivity to increases in the repair rate; the curves are spread further apart than for the single check disk curves. Finally, we present the constant storage capacity curves in Figure 11. These curves spread over a very wide range providing improved sensitivity to increased numbers of check disks.
The notation i(m+2) within the figure designates a group of i disk arrays each containing m message disks and two check disks such that i × m is 120. We consider the improvement obtained due to an extra check disk per disk array versus partitioning the disks into smaller groups. For example, we consider within figure 8 the 1(120+1) and 2(60+1) curves. The pair of smaller groups has slightly lower probability of data loss values. However, if we consider the 1(120+2) configuration within figure 11, we note its probability of data loss curve dominates either of the previous curves. Adding two more disks to obtain a 2(60+2) configuration is even better. We note that all the one check disk curves are dominated by every two check disk curve. For one check disk, the probability of data loss is very close to one for 10^4 subsystems; for two check disks, the probability is very close to zero for otherwise identical parameters.
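The partitioning comparison above can be sketched numerically. The following Python fragment is our own illustration, not code from the paper: it evaluates the approximate single-check reliability R1(t) from the appendix (equation 15) for constant-capacity partitions i(m+1) with i × m = 120, and the ensemble data loss probability 1 − R1(t)^i. The function names and the seven-year mission time in hours are our assumptions; the rates are those quoted for figure 8.

```python
import math

# Illustrative sketch (not the paper's code): per-array reliability from the
# equation 15 approximation, then ensemble data loss for i(m+1) partitions.
def r1(t, m, lam, mu):
    k = lam * lam * m * (m + 1) / (mu * mu)          # lambda^2 m(m+1)/mu^2
    mttdl1 = (mu + (2 * m + 1) * lam) / (lam * lam * m * (m + 1))
    return (1 + k) * math.exp(-t / mttdl1) - k * math.exp(-t * (mu + (2 * m + 1) * lam))

def ensemble_loss(i, m, t, lam, mu):
    # i independent arrays; the ensemble loses data unless every array survives.
    return 1.0 - r1(t, m, lam, mu) ** i

t = 7 * 8760                      # seven years, in hours (assumed conversion)
lam, mu = 1 / 100000, 4.0         # rates used for figure 8
for i, m in [(1, 120), (2, 60), (12, 10), (120, 1)]:
    print(f"{i}({m}+1): data loss probability {ensemble_loss(i, m, t, lam, mu):.3e}")
```

Finer partitions carry more check disks and, as the figure 8 curves indicate, yield lower data loss probabilities.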


November 5, 1992

[Figure 8: Probability of Data Loss vs Number of Subsystems: Constant Data Disk Capacity. Curves 1(120+1), 2(60+1), 3(40+1), 4(30+1), 5(24+1), 6(20+1), 8(15+1), 10(12+1), 12(10+1), 15(8+1), 20(6+1), 24(5+1), 30(4+1), 40(3+1), 60(2+1) and 120(1+1); 120 data disks per array; c = 1; λ = 1/100000 failures per hour; μ = 4 repairs per hour; t = 7 years.]

The reliability functions of theorem 1 allow us to approximately determine the reliability values. We present evidence within tables 3, 4 and 5 of the validity of these approximations. The values within the tables are percentage intervals as specified by the following formula. We evaluate these approximations over a 20 year period, examining the results at one year increments. At each year, we calculate three values. The first is an estimate determined by explicitly evaluating the partial fraction expansion of the reliability function. The second uses an estimate provided by McGregor [11]. Finally we evaluate the formula specified within Theorem 1. For each configuration, the difference between the maximum_t and minimum_t values divided by the minimum_t value and multiplied by 100% is the interval determined. The value reported within the tables is the maximum interval determined over the 20 year period. For particular λ, μ, and c, the formula is

    \max_{t=1}^{20} ((maximum_t - minimum_t) / minimum_t) × 100%

There are m = 10 data disks within each reliability group.

                  μ = 0.5         μ = 1.0         μ = 4.0
    λ = 0.001     1.9 × 10^-3     1.2 × 10^-4     1.6 × 10^-6
    λ = 0.0001    7.9 × 10^-7     9.9 × 10^-8     1.5 × 10^-6
    λ = 0.00001   7.9 × 10^-10    9.9 × 10^-11    1.5 × 10^-12
    λ = 0.000002  6.3 × 10^-12    7.9 × 10^-13    1.2 × 10^-14

m = 10 data disks per array. c = 2. λ = 0.001, 0.0001, 0.00001 and 0.000002 failures per hour. μ = 0.5, 1.0 and 4.0 repairs per hour.

Table 3: Reliability: Percentage Intervals

The percentage intervals diminish as c increases; thus the reliability values are accessible.
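The percentage interval computation can be sketched for the c = 1 case, where the two roots of D_{1,m} are available in closed form. The fragment below is our own code, not the paper's: it compares only the exact partial fraction expansion against the Theorem 1 style approximation, omitting McGregor's estimate, so the interval it reports need not match the tables exactly. The 8760 hours-per-year conversion is our assumption.

```python
import math

# Sketch: exact c = 1 reliability via the quadratic roots of D_{1,m},
# versus the two-term approximation, reporting the percentage interval
# max_t (max_t - min_t)/min_t * 100% over 20 years.
def exact_r1(t, m, lam, mu):
    a1 = mu + (2 * m + 1) * lam                # s coefficient of D_{1,m}
    a2 = lam * lam * m * (m + 1)               # constant term of D_{1,m}
    disc = math.sqrt(a1 * a1 - 4 * a2)
    # Numerically stable quadratic roots (avoid cancellation in the small root).
    r_small = -2 * a2 / (a1 + disc)
    r_large = -(a1 + disc) / 2
    # Partial fractions: the residue at a root r is (r + a1)/(2r + a1).
    return sum((r + a1) / (2 * r + a1) * math.exp(r * t) for r in (r_small, r_large))

def approx_r1(t, m, lam, mu):
    k = lam * lam * m * (m + 1) / (mu * mu)
    mttdl1 = (mu + (2 * m + 1) * lam) / (lam * lam * m * (m + 1))
    return (1 + k) * math.exp(-t / mttdl1) - k * math.exp(-t * (mu + (2 * m + 1) * lam))

m, lam, mu = 10, 1e-5, 1.0                     # a Table 5 (c = 1) parameter point
interval = 0.0
for year in range(1, 21):
    t = year * 8760.0
    vals = (exact_r1(t, m, lam, mu), approx_r1(t, m, lam, mu))
    interval = max(interval, (max(vals) - min(vals)) / min(vals) * 100.0)
print(f"maximum percentage interval over 20 years: {interval:.2e}%")
```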


[Figure 9: Probability of Data Loss vs Number of Subsystems: Various Failure Rates. m = 10 data disks per array; λ = 2 × 10^-i failures per hour, i = 4, 5, 6; μ = 4 repairs per hour; t = 7 years; c = 2.]

                  μ = 0.5         μ = 1.0         μ = 4.0
    λ = 0.001     4.5 × 10^-4     5.8 × 10^-5     9.3 × 10^-7
    λ = 0.0001    4.7 × 10^-7     6.0 × 10^-8     9.4 × 10^-11
    λ = 0.00001   4.8 × 10^-10    6.0 × 10^-11    9.4 × 10^-13
    λ = 0.000002  3.8 × 10^-12    4.8 × 10^-13    7.5 × 10^-15

m = 10 data disks per array. c = 3. λ = 0.001, 0.0001, 0.00001 and 0.000002 failures per hour. μ = 0.5, 1.0 and 4.0 repairs per hour.

Table 4: Reliability: Percentage Intervals

5 Conclusions

Many commercial applications are projected to require 10 terabytes of on-line storage by the year 2000. Even with projected improvements in disk form factors and storage densities, the reliabilities provided by RAID Level 5 architectures are not sufficient for configurations with this number of disk drives. The storage sizing projections are very conservative, without including video image storage requirements. We have presented the MDS disk array architectures capable of providing reliabilities arbitrarily close to one. These schemes utilize the smallest number of additional disks c to ensure data integrity in spite of up to c disk failures. Two MDS schemes are outlined: the first based on Reed-Solomon error correcting codes and the second on Blaum-Roth error correcting codes. We considered various parameters in our reliability study: the failure and repair rates for disk drives and the number of check and data disks in a reliability group. Figure 5 indicates that incrementing c dramatically improves the probability of data loss. The probability of data loss depends exponentially on the reliability as specified in equation 12. Increasing the reliability lowers the probability of data loss. The ensemble of c = 1 probability of data loss curves, for RAID Level 5, are approximately one for approximately 10^6 disk arrays. We note the c = 2 curves are approximately zero for this number of disk arrays. We conclude, for current technologies, that the additional check disk is called for within large-sized storage systems.
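The exponential dependence on the number of arrays can be sketched as follows. This is our reading of the relationship (equation 12 itself is not reproduced in this excerpt), with the leading-order MTTDL_c ≈ c! μ^c / (m(m+1)···(m+c) λ^{c+1}) standing in for the exact expressions; the function name and mission length are our assumptions.

```python
from math import exp, factorial

# Sketch: an installation of N independent arrays loses data with
# probability 1 - R(t)^N, using R(t) ≈ exp(-t / MTTDL_c).
def loss(n_arrays, t, m, c, lam, mu):
    prod = 1
    for j in range(c + 1):                     # m(m+1)...(m+c)
        prod *= m + j
    mttdl = factorial(c) * mu**c / (prod * lam ** (c + 1))
    return 1 - exp(-t / mttdl) ** n_arrays

t, m, lam, mu = 7 * 8760, 10, 1e-5, 4.0
print("c=1, 10^6 arrays:", loss(10**6, t, m, 1, lam, mu))   # near one
print("c=2, 10^6 arrays:", loss(10**6, t, m, 2, lam, mu))   # near zero
```

This reproduces the qualitative claim above: at roughly 10^6 arrays the single-check-disk configuration is almost certain to lose data, while the two-check-disk configuration almost certainly is not.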


[Figure 10: Probability of Data Loss vs Number of Subsystems: Various Repair Rates. m = 10 data disks per array; λ = 2 × 10^-6 failures per hour; c = 2; t = 7 years; repair rates μ from 4 × 10^-4 to 4 per hour.]

                  μ = 0.5         μ = 1.0         μ = 4.0
    λ = 0.001     1.5 × 10^0      2.1 × 10^-1     3.9 × 10^-3
    λ = 0.0001    6.1 × 10^-4     1.3 × 10^-4     7.2 × 10^-6
    λ = 0.00001   4.1 × 10^-6     1.1 × 10^-6     6.9 × 10^-8
    λ = 0.000002  1.7 × 10^-7     4.4 × 10^-8     2.8 × 10^-9

m = 10 data disks per array. c = 1. λ = 0.001, 0.0001, 0.00001 and 0.000002 failures per hour. μ = 0.5, 1.0 and 4.0 repairs per hour.

Table 5: Reliability: Percentage Intervals

The figure 5 curves also provide insights regarding the sensitivity of the probability of data loss value to changes in the failure rate λ. Figures 6 and 9 elaborate this study; in figure 6 the failure rates are varied from 2 × 10^-4 to 2 × 10^-9. The range 2 × 10^-4 to 2 × 10^-6, in figure 9, provides equally good probabilities of data loss. The 2 × 10^-6 rate is typical of current technology; within the c = 2 configuration, the probability of data loss is approximately zero for up to 10^8 disk arrays, while for c = 1 the failure rate must be diminished to 2 × 10^-9 to obtain approximately the same performance. Figures 7 and 10 present the results of varying the repair rate μ while holding the failure rate constant. Within figure 7, the repair rate varies from 4 × 10^-4 to 4 × 10^4; rates above 4 are beyond current technologies. Within figure 10, the repair rate varies from 4 × 10^-4 to 4. Considering the repair rate 4, within the figure 7 c = 1 configurations, the probability of data loss is approximately zero for fewer than approximately 10^3 disk arrays. Within the figure 10 c = 2 configurations, the probability of data loss remains approximately zero for fewer than approximately 10^6 disk arrays. Figure 8 compares several constant storage configurations comprised of reliability groups each with c = 1 check disk. Figure 11 is similar except each group contains c = 2 check disks. All configurations contain 120 data disks; the number of reliability groups ranges over 1, 2, 3, 4, ..., 30, 40, 60, 120. Within figure 8, for approximately 10^4 ensembles of disk arrays, the probability of data loss is


[Figure 11: Probability of Data Loss vs Number of Subsystems: Constant Data Disk Capacity. Curves 1(120+2), 2(60+2), 3(40+2), 4(30+2), 5(24+2), 6(20+2), 8(15+2), 10(12+2), 12(10+2), 15(8+2), 20(6+2), 24(5+2), 30(4+2), 40(3+2), 60(2+2) and 120(1+2); 120 data disks per subsystem; c = 2; λ = 1/100000 failures per hour; μ = 4 repairs per hour; t = 7 years.]

approximately one. Within figure 11, for less than 10^3 subsystems, and up to 10^4 for all curves except one, the probability of data loss is approximately zero. In these configurations, the repair rate μ is 4 and the failure rate λ is 1 × 10^-5, a rate typical of current technology. All of our comparisons indicate the improvements possible with the inclusion of a pair of check disks. We have not considered the relative costs of these actions. Decreasing the failure rate represents an improvement in the manufacturing techniques for disks. Increasing the repair rate represents an increased maintenance cost as well as improved data transfer rates. Hot-spares cannot physically swap themselves out of the disk array cage. Changes in the number of data storage disks within a disk array have a moderate effect on the probability of data loss. We have not tried to optimize the reliability as a function of incremental costs. Another optimization we have not considered in this study is the run-time performance of the disk arrays as a function of m and c. We expect that increasing c will have a negative effect on run-time performance. This topic is the subject of our continuing research in disk arrays.

Acknowledgements

We wish to acknowledge conversations with Richard Mattson regarding the use of highly reliable disk arrays for applications requiring large numbers of storage systems, as well as discussions with Thomas Schwarz and Mario Blaum regarding maximum distance separable error-correcting codes.

References

[1] Khaled A.S. Abdel-Ghaffar and Amr El Abbadi. An Optimal Strategy for Comparing File Copies. Technical report, University of California, Davis and Santa Barbara, 1992.

[2] Mario Blaum, Hsieh Hao, Richard L. Mattson, and Jai Menon. A Coding Technique for Recovery Against Double Disk Failures in Disk Arrays, 1989. US Patent Docket #SA8890443 pending.


[3] Mario Blaum and Ronald M. Roth. New Array Codes for Multiple Phased Burst Correction. Technical Report RJ 8303 (75664), IBM Almaden Research Center, August 1991.

[4] Walter A. Burkhard and Petar D. Stojadinovic. Storage-Efficient Reliable Files. In Proceedings of the Winter 1992 USENIX Conference, pages 69-77, San Francisco, January 1992.

[5] Shun Yan Cheung and Akhil Kumar. Multi-Parity: A Low Overhead Method for Achieving High Data Availability. Technical report, Cornell and Emory Universities, 1992.

[6] Garth A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, University of California, Berkeley, 1990. Report UCB/CSD 91/613.

[7] Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson. Failure Correction Techniques for Large Disk Arrays. In Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 123-132, Boston, April 1989.

[8] Stephen Goldstein. Storage Performance - An Eight Year Outlook. In Proceedings of the Computer Measurement Group, pages 579-585, 1987.

[9] Ehud D. Karnin, Jonathan W. Greene, and Martin E. Hellman. On Secret Sharing Systems. IEEE Transactions on Information Theory, 29:35-41, 1983.

[10] Florence Jesse MacWilliams and Neil James Alexander Sloane. The Theory of Error-Correcting Codes. North Holland, 1978.

[11] Malcolm A. McGregor. Approximation Formula For Reliability With Repair. IEEE Transactions on Reliability, pages 64-92, December 1963.

[12] L. Mirsky. An Introduction to Linear Algebra. Dover Publishers, New York, 1982.

[13] Spencer W. Ng. Some Design Issues of Disk Arrays. In Proceedings of the COMPCOM Conference, pages 137-142, San Francisco, 1989.

[14] David A. Patterson, Garth A. Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings SIGMOD International Conference on Data Management, pages 109-116, Chicago, 1988.

[15] Franco P. Preparata. Holographic Dispersal and Recovery of Information. IEEE Transactions on Information Theory, 35:1123-1124, 1989.

[16] Michael O. Rabin. Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. Journal of the Association for Computing Machinery, 36:335-348, 1989.

[17] Thomas J.E. Schwarz and Walter A. Burkhard. RAID Organization and Performance. In Proceedings of the 12th International Conference on Distributed Computing Systems, pages 318-325, Yokohama, June 1992.

Appendix

We present a derivation of the reliability function R_c(t) for the c = 0, 1, 2 and 3 MDS organizations.


c = 0

In this case, the Markov model of Figure 1 consists of only two states labeled m and F. The state transition equation is

    dP_m/dt + mλ P_m = 0   with P_m(0) = 1.    (13)

We utilize the Laplace transform to solve this equation as well as the others in this appendix. Accordingly the Laplace transition equation is (s + mλ) P_m = 1 and the solution is

    P_m(t) = R_0(t) = e^{-tmλ} = e^{-t/MTTDL_0}.    (14)

The principal result of this appendix is contained in theorem 1. The remainder of the Appendix is devoted to presenting our derivation of the results presented here. We conclude with a conjecture regarding the general form of R_c(t) for any c.

Theorem 1  For each reliability group containing m data disks and c = 1, 2 or 3 check disks with failure rate λ and repair rate μ with λ/μ very small, we have the following approximate reliability functions:

    R_1(t) = C_{1,m,0} e^{-t/MTTDL_1} + C_{1,m,-μ} e^{-t(μ+(2m+1)λ)}
    R_2(t) = C_{2,m,0} e^{-t/MTTDL_2} + C_{2,m,-μ} e^{-t(μ+(2m+3)λ)} + C_{2,m,-2μ} e^{-t(2μ+mλ)}
    R_3(t) = C_{3,m,0} e^{-t/MTTDL_3} + C_{3,m,-μ} e^{-t(μ+(2m+5)λ)} + C_{3,m,-2μ} e^{-t(2μ+(m+1)λ)} + C_{3,m,-3μ} e^{-t(3μ+mλ)}

where MTTDL_1, MTTDL_2 and MTTDL_3 are given as equations 5, 6 and 7,

    C_{c,m,0} = 1 + H_c m \binom{m+c}{c} (λ/μ)^{c+1},   c = 1, 2 and 3,

and H_c is the cth harmonic number. The other coefficients are

    C_{1,m,-μ} = -λ^2 m(m+1)/μ^2
    C_{2,m,-μ} = -λ^3 m(m+1)(m+2)/μ^3
    C_{2,m,-2μ} = λ^3 m(m+1)(m+2)/(4μ^3)
    C_{3,m,-μ} = -λ^4 m(m+1)(m+2)(m+3)/(2μ^4)
    C_{3,m,-2μ} = λ^4 m(m+1)(m+2)(m+3)/(4μ^4)
    C_{3,m,-3μ} = -λ^4 m(m+1)(m+2)(m+3)/(18μ^4).
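As a sanity check on Theorem 1, the following sketch (our own Python, not the paper's code) evaluates the three approximate reliability functions. Since equations 5-7 are not reproduced in this excerpt, we substitute the leading-order value MTTDL_c ≈ c! μ^c / (m(m+1)···(m+c) λ^{c+1}), which is an assumption on our part; at t = 0 the coefficients of each R_c must sum to one.

```python
from math import comb, exp, factorial

# Sketch: Theorem 1's approximate reliability functions R_c(t), c = 1, 2, 3,
# with a leading-order stand-in for MTTDL_c.
def rising(m, c):
    """m(m+1)(m+2)...(m+c)."""
    p = 1
    for j in range(c + 1):
        p *= m + j
    return p

def approx_reliability(t, m, c, lam, mu):
    k = rising(m, c) * (lam / mu) ** (c + 1)           # common coefficient magnitude
    h = sum(1.0 / i for i in range(1, c + 1))          # harmonic number H_c
    mttdl = factorial(c) * mu ** c / (rising(m, c) * lam ** (c + 1))
    c0 = 1 + h * m * comb(m + c, c) * (lam / mu) ** (c + 1)
    tail = {                                           # (coefficient, decay rate)
        1: [(-k, mu + (2 * m + 1) * lam)],
        2: [(-k, mu + (2 * m + 3) * lam), (k / 4, 2 * mu + m * lam)],
        3: [(-k / 2, mu + (2 * m + 5) * lam),
            (k / 4, 2 * mu + (m + 1) * lam),
            (-k / 18, 3 * mu + m * lam)],
    }[c]
    return c0 * exp(-t / mttdl) + sum(a * exp(-r * t) for a, r in tail)

# Each R_c(0) equals 1, as a reliability function must.
for c in (1, 2, 3):
    print(c, approx_reliability(0.0, 10, c, 1e-5, 1.0))
```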

The Markov model of Figure 1 consists of c+2 states labeled m+c, m+c-1, ..., m, and F. The c+2 state differential equations are represented within the state transition matrix S:

    ( s+(m+c)λ    -μ             -2μ            ...   -cμ        0 )
    ( -(m+c)λ     s+(m+c-1)λ+μ   0              ...   0          0 )
    ( 0           -(m+c-1)λ      s+(m+c-2)λ+2μ  ...   0          0 )
    ( 0           0              -(m+c-2)λ      ...   0          0 )
    ( ...                                       ...              ... )
    ( 0           0        ...   -(m+1)λ        s+mλ+cμ          0 )
    ( 0           0        ...   0              -mλ              s )


The state transition equations are S P = I_0 where P^T = (P_{m+c}, P_{m+c-1}, P_{m+c-2}, P_{m+c-3}, ..., P_m, P_F) and the initial conditions I_0 = (1, 0, 0, ..., 0, 0)^T designate that all m+c disks are operational at the start. Equation 4 is an instance of these state transition equations for c = 2. The transform of the reliability function R_c = P_{m+c} + P_{m+c-1} + P_{m+c-2} + ... + P_{m+1} + P_m is determined from the state transition equations. It is a rational form whose denominator is the determinant of the upper-lefthand (c+1) × (c+1) submatrix of S. The numerator is obtained by summing c+1 terms, each obtained using Cramer's rule. We let D_{c,m} designate the determinant of the upper-lefthand c+1 rows and columns within S for all c > 0 and m > 0; we will refer to this as the reduced determinant. We have the following recurrence for the reduced determinant:

    D_{c,m} = D_{c-1,m+1} (s + mλ + cμ) - cμ λ^c (m+1)(m+2)···(m+c)

where

    D_{1,m} = s^2 + s(μ + (2m+1)λ) + λ^2 m(m+1).

Similarly we let P_{c,m} designate the numerator of the transformed reliability function. We have the recurrence for P_{c,m}:

    P_{c,m} = P_{c-1,m+1} (s + mλ + cμ) + λ^c (m+1)(m+2)···(m+c)

where

    P_{1,m} = s + μ + (2m+1)λ.

These two recurrences follow directly from S. Our first lemma demonstrates that P_{c,m} is the quotient obtained when D_{c,m} is divided by s, the remainder being the constant term A_{c+1,c,m}. Suppose for all positive m and c that D_{c,m} has the following form for constants A_{i,c,m}:

    D_{c,m} = Σ_{i=0}^{c+1} s^i A_{c+1-i,c,m}

where A_{0,c,m} = 1 and A_{c+1,c,m} = λ^{c+1} m(m+1)(m+2)···(m+c). We will verify this point in the process of verifying Lemma 1; we have already noted the point for c = 1 above.

Lemma 1  For all positive integers m and c,

    P_{c,m} = Σ_{i=0}^{c} s^i A_{c-i,c,m}.

We verify this lemma using mathematical induction on c. Obviously the result is true for c = 1 as we have demonstrated above. We assume the lemma holds for c < k and consider c = k. Then we expand the recurrence relations:

    D_{k,m} = D_{k-1,m+1} (s + mλ + kμ) - kμ λ^k (m+1)(m+2)···(m+k)
    P_{k,m} = P_{k-1,m+1} (s + mλ + kμ) + λ^k (m+1)(m+2)···(m+k)


Our inductive hypothesis states that

    D_{k-1,m+1} = Σ_{i=0}^{k} s^i A_{k-i,k-1,m+1}
    P_{k-1,m+1} = Σ_{i=0}^{k-1} s^i A_{k-i-1,k-1,m+1}

The term by term expansion is presented here; we note the identity of each coefficient in the rightmost column:

    D_{k,m} = s^{k+1} A_{0,k-1,m+1}                                          A_{0,k,m}
            + s^k (A_{1,k-1,m+1} + A_{0,k-1,m+1}(mλ + kμ))                   A_{1,k,m}
            + s^{k-1} (A_{2,k-1,m+1} + A_{1,k-1,m+1}(mλ + kμ))               A_{2,k,m}
              ...
            + s^2 (A_{k-1,k-1,m+1} + A_{k-2,k-1,m+1}(mλ + kμ))               A_{k-1,k,m}
            + s (A_{k,k-1,m+1} + A_{k-1,k-1,m+1}(mλ + kμ))                   A_{k,k,m}
            + A_{k,k-1,m+1}(mλ + kμ) - kμ λ^k (m+1)(m+2)···(m+k)             A_{k+1,k,m}

The final term is expanded here. Our inductive hypothesis indicates that A_{k,k-1,m+1} is λ^k (m+1)(m+2)···(m+k), and we observe above that A_{k+1,k,m} is λ^{k+1} m(m+1)(m+2)···(m+k) as we indicated. Furthermore, we note that A_{0,k,m} = A_{0,k-1,m+1} = 1 and the auxiliary point is verified. We compare the coefficients in this expansion with those obtained for P_{k,m} here:

    P_{k,m} = s^k A_{0,k-1,m+1}                                              A_{0,k,m}
            + s^{k-1} (A_{1,k-1,m+1} + A_{0,k-1,m+1}(mλ + kμ))               A_{1,k,m}
              ...
            + s (A_{k-1,k-1,m+1} + A_{k-2,k-1,m+1}(mλ + kμ))                 A_{k-1,k,m}
            + A_{k-1,k-1,m+1}(mλ + kμ) + λ^k (m+1)(m+2)···(m+k)              A_{k,k,m}

            = Σ_{i=0}^{k} s^i A_{k-i,k,m}

Our inductive hypothesis A_{k,k-1,m+1} = λ^k (m+1)(m+2)···(m+k) is used in the A_{k,k,m} term. We note that P_{k,m} has the format claimed and, since our argument is independent of m, it holds for any positive integer m. Accordingly Lemma 1 is verified. Ultimately we wish to find the inverse Laplace transform of the rational form

    R_c(s) = P_{c,m} / D_{c,m}

to obtain the reliability function R_c(t) for an m message and c check disk configuration. The roots of D_{c,m} are approximately 0, -μ, -2μ, ..., -cμ. The coefficients C_{c,m,r} for the partial fraction expansion of R_c(s) are

    C_{c,m,r} = [(s - r) P_{c,m} / D_{c,m}]_{s=r}

for each root r of D_{c,m}. Our next lemma provides the general structure of these coefficients.

Lemma 2  For all positive integers m and c and each root r of D_{c,m},

    C_{c,m,r} = [(s - r) P_{c,m} / D_{c,m}]_{s=r} = (Σ_{i=0}^{c} r^i A_{c-i,c,m}) / (Σ_{i=0}^{c} (i+1) r^i A_{c-i,c,m}).


The lemma is verified by first noting that the division has the following form for root r:

    D_{c,m}(s) ÷ (s - r) = Σ_{j=0}^{c} s^{c-j} Σ_{i=0}^{j} r^{j-i} A_{i,c,m}.

Setting s = r and manipulating the sums, we obtain

    Σ_{j=0}^{c} r^{c-j} Σ_{i=0}^{j} r^{j-i} A_{i,c,m} = Σ_{j=0}^{c} Σ_{i=0}^{j} r^{c-i} A_{i,c,m} = Σ_{k=0}^{c} (c+1-k) r^{c-k} A_{k,c,m}

which is the denominator claimed in Lemma 2; the numerator is P_{c,m} evaluated at r. We consider three situations in detail: c = 1, 2 and 3.

c = 1

The reduced determinant of the state transition equations is

    D_{1,m} = s^2 + A_{1,1,m} s + A_{2,1,m}

where

    A_{1,1,m} = μ + (2m+1)λ
    A_{2,1,m} = λ^2 m(m+1).

Newton's iterative formulae for the roots of D_{1,m} are

    rn ← rn^2 - A_{2,1,m} rd^2
    rd ← rd (2 rn + A_{1,1,m} rd)

where rn and rd designate the numerator and denominator respectively of the root. D_{1,m} is quadratic and has two roots. The approximate values of the roots are

    -λ^2 m(m+1) / (μ + (2m+1)λ)   and   -(μ^2 + 2μλ(2m+1) + λ^2 (3m^2+3m+1)) / (μ + (2m+1)λ).

Finally, the coefficients are given by

    C_{1,m,r} = (rn + A_{1,1,m} rd) / (2 rn + A_{1,1,m} rd).

We designate the roots here by their approximate values 0 and -μ:

    C_{1,m,0} ≈ (μ^2 + 2μλ(1+2m) + λ^2 (1+3m+3m^2)) / (μ^2 + 2μλ(1+2m) + λ^2 (1+2m+2m^2)) ≈ 1 + λ^2 m(1+m)/μ^2
    C_{1,m,-μ} ≈ -λ^2 m(1+m) / (μ^2 + 2μλ(1+2m) + λ^2 (1+2m+2m^2)) ≈ -λ^2 m(1+m)/μ^2.

We observe that the zero root approximation is the reciprocal and negative of the mean-time-to-data-loss for the configuration and the -μ root is approximately -μ - (2m+1)λ. Accordingly we have the excellent approximation

    R_1(t) ≈ (1 + λ^2 m(1+m)/μ^2) e^{-t/MTTDL_1} - (λ^2 m(1+m)/μ^2) e^{-t(μ+(2m+1)λ)}.    (15)
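The rn/rd update rules can be exercised directly in exact rational arithmetic. This sketch is our own code: it iterates from the two approximate starting values 0 and -μ and checks the resulting roots of D_{1,m} against Vieta's relations (sum = -A_{1,1,m}, product = A_{2,1,m}); the parameter values m = 10, λ = 10^-5, μ = 4 are taken from the study's typical configuration.

```python
from fractions import Fraction

# Sketch: the paper's simultaneous rn/rd Newton updates for the roots of
# D_{1,m}(s) = s^2 + A1*s + A2, carried out with exact rationals.
def newton_root(rn, rd, a1, a2, steps=6):
    for _ in range(steps):
        rn, rd = rn * rn - a2 * rd * rd, rd * (2 * rn + a1 * rd)
    return Fraction(rn, rd)

m = 10
lam, mu = Fraction(1, 100000), Fraction(4)
a1 = mu + (2 * m + 1) * lam          # A_{1,1,m}
a2 = lam * lam * m * (m + 1)         # A_{2,1,m}

r_small = newton_root(Fraction(0), Fraction(1), a1, a2)   # root near 0
r_large = newton_root(-mu, Fraction(1), a1, a2)           # root near -mu
print(float(r_small), float(r_large))
```

The root near zero comes out close to -λ^2 m(m+1)/(μ + (2m+1)λ), and the other close to -μ - (2m+1)λ, as stated above.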


c = 2

The reduced determinant of the state transition equations is

    D_{2,m} = s^3 + A_{1,2,m} s^2 + A_{2,2,m} s + A_{3,2,m}

where

    A_{1,2,m} = 3μ + 3(m+1)λ
    A_{2,2,m} = 2μ^2 + μλ(5m+6) + λ^2 (3m^2+6m+2)
    A_{3,2,m} = λ^3 m(m+1)(m+2).

Newton's iterative formulae for the roots of D_{2,m} are

    rn ← 2 rn^3 + rn^2 A_{1,2,m} rd - A_{3,2,m} rd^3
    rd ← rd (3 rn^2 + 2 rn A_{1,2,m} rd + A_{2,2,m} rd^2)

where rn and rd designate the numerator and denominator respectively of the root. D_{2,m} is cubic and has three roots. The approximate values of the roots are

    -λ^3 m(m+1)(m+2) / (2μ^2 + μλ(5m+6) + λ^2 (3m^2+6m+2)) ≈ -λ^3 m(m+1)(m+2) / (2μ^2),
    ≈ -μ - (2m+3)λ,   and   ≈ -2μ - mλ.

The coefficients are given by

    C_{2,m,r} = (rn^2 + rn A_{1,2,m} rd + A_{2,2,m} rd^2) / (3 rn^2 + 2 rn A_{1,2,m} rd + A_{2,2,m} rd^2).

We designate the roots here by their approximate values 0, -μ and -2μ:

    C_{2,m,0} ≈ (8μ^6 + 12μ^5 λ(5m+6) + 6μ^4 λ^2 (31m^2+72m+40) + μ^3 λ^3 (299m^3+1008m^2+1080m+360) + ···)
              / (8μ^6 + 12μ^5 λ(5m+6) + 6μ^4 λ^2 (31m^2+72m+40) + μ^3 λ^3 (293m^3+990m^2+1068m+360) + ···)
              ≈ 1 + (3/4) λ^3 m(m+1)(m+2) / μ^3
    C_{2,m,-μ} ≈ -λ^3 m(m+1)(m+2) / μ^3
    C_{2,m,-2μ} ≈ λ^3 m(m+1)(m+2) / (4μ^3).

Since the zero root is again approximately the reciprocal and negative of the mean-time-to-data-loss for the configuration, and using the other approximate roots listed previously, we have the following excellent approximation:

    R_2(t) ≈ C_{2,m,0} e^{-t/MTTDL_2} + C_{2,m,-μ} e^{-t(μ+(2m+3)λ)} + C_{2,m,-2μ} e^{-t(2μ+mλ)}.    (16)
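The stated root approximations can be checked numerically. The sketch below is our own code: it runs the standard Newton step on D_{2,m} in exact rational arithmetic from the three approximate starting values 0, -μ and -2μ, using the parameter point m = 10, λ = 10^-5, μ = 4.

```python
from fractions import Fraction as F

# Sketch: Newton's iteration on D_{2,m}(s) = s^3 + A1 s^2 + A2 s + A3 in
# exact rational arithmetic, from the three approximate roots in the text.
m = 10
lam, mu = F(1, 100000), F(4)
a1 = 3 * mu + 3 * (m + 1) * lam
a2 = 2 * mu**2 + mu * lam * (5 * m + 6) + lam**2 * (3 * m**2 + 6 * m + 2)
a3 = lam**3 * m * (m + 1) * (m + 2)

def refine(s, steps=5):
    for _ in range(steps):
        f = s**3 + a1 * s**2 + a2 * s + a3
        df = 3 * s**2 + 2 * a1 * s + a2
        s -= f / df
    return s

roots = [refine(s0) for s0 in (F(0), -mu, -2 * mu)]
print([float(r) for r in roots])
# Stated approximations for the two transient roots:
print(float(-mu - (2 * m + 3) * lam), float(-2 * mu - m * lam))
```

The refined roots agree with -μ - (2m+3)λ and -2μ - mλ to within terms of order λ^2/μ, consistent with the expansions above.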


c = 3

The reduced determinant of the state transition equations is

    D_{3,m} = s^4 + A_{1,3,m} s^3 + A_{2,3,m} s^2 + A_{3,3,m} s + A_{4,3,m}

where

    A_{1,3,m} = 2(3μ + (2m+3)λ)
    A_{2,3,m} = 11μ^2 + μλ(17m+29) + λ^2 (6m^2+18m+11)
    A_{3,3,m} = 6μ^3 + μ^2 λ(17m+33) + μλ^2 (14m^2+47m+33) + 2λ^3 (2m^3+9m^2+11m+3)
    A_{4,3,m} = λ^4 m(m+1)(m+2)(m+3).

Newton's iterative formulae for the roots of D_{3,m} are

    rn ← 3 rn^4 + 2 rn^3 A_{1,3,m} rd + rn^2 A_{2,3,m} rd^2 - A_{4,3,m} rd^4
    rd ← rd (4 rn^3 + 3 rn^2 A_{1,3,m} rd + 2 rn A_{2,3,m} rd^2 + A_{3,3,m} rd^3)

where rn and rd designate the numerator and denominator respectively of the root. There are four roots since D_{3,m} is quartic. The approximate values of the roots are

    -λ^4 m(m+1)(m+2)(m+3) / (6μ^3 + μ^2 λ(17m+33) + μλ^2 (14m^2+47m+33) + ···) ≈ -λ^4 m(m+1)(m+2)(m+3) / (6μ^3),
    ≈ -μ - (2m+5)λ,   ≈ -2μ - (m+1)λ,   and   ≈ -3μ - mλ.

Finally, the coefficients are given by

    C_{3,m,r} = (rn^3 + rn^2 A_{1,3,m} rd + rn A_{2,3,m} rd^2 + A_{3,3,m} rd^3) / (4 rn^3 + 3 rn^2 A_{1,3,m} rd + 2 rn A_{2,3,m} rd^2 + A_{3,3,m} rd^3).

Again designating the roots by 0, -μ, -2μ and -3μ, the coefficients are

    C_{3,m,0} ≈ 1 + (11/36) λ^4 m(m+1)(m+2)(m+3) / μ^4
    C_{3,m,-μ} ≈ -λ^4 m(m+1)(m+2)(m+3) / (2μ^4)
    C_{3,m,-2μ} ≈ λ^4 m(m+1)(m+2)(m+3) / (4μ^4)
    C_{3,m,-3μ} ≈ -λ^4 m(m+1)(m+2)(m+3) / (18μ^4).


The zero root is approximately the reciprocal and negative of the mean-time-to-data-loss for the configuration and, using the other roots listed previously, we have the following approximation for R_3(t):

    R_3(t) ≈ C_{3,m,0} e^{-t/MTTDL_3} + C_{3,m,-μ} e^{-t(μ+(2m+5)λ)} + C_{3,m,-2μ} e^{-t(2μ+(m+1)λ)} + C_{3,m,-3μ} e^{-t(3μ+mλ)}.    (17)

We demonstrate within Lemma 3 that C_{c,m,0} has the correct form in our approximations.

Lemma 3  For all positive integers c and m,

    C_{c,m,0} ≈ 1 + H_c m \binom{m+c}{c} (λ/μ)^{c+1}.

The coefficient C_{c,m,0} has the following form:

    C_{c,m,0} = (Σ_{i=0}^{c} r^i A_{c-i,c,m}) / (Σ_{i=0}^{c} (i+1) r^i A_{c-i,c,m})
              = 1 / (1 + r (Σ_{i=1}^{c} i r^{i-1} A_{c-i,c,m}) / (Σ_{i=0}^{c} r^i A_{c-i,c,m})).

Since r is very small, this is approximately

    1 / (1 + r A_{c-1,c,m}/A_{c,c,m}) ≈ 1 - r A_{c-1,c,m}/A_{c,c,m}.

Lemma 4 provides the approximations for these coefficients used to complete this lemma.

Lemma 4  For all positive integers c and m,

    A_{c+1,c,m} = λ^{c+1} m(m+1)(m+2)···(m+c)
    A_{c,c,m} ≈ c! μ^c
    A_{c-1,c,m} ≈ c! H_c μ^{c-1}

where H_c is the cth harmonic number. The first claim was verified previously. Now we verify the remaining two claims utilizing the recurrence relation for D_{c,m}. First we have

    A_{1,1,m} = μ + (2m+1)λ
    A_{c,c,m} = (mλ + cμ) A_{c-1,c-1,m+1} + λ^c (m+1)(m+2)···(m+c).

Immediately we have half of our result since, by the inductive hypothesis, A_{c-1,c-1,m+1} ≈ (c-1)! μ^{c-1}. For the other, we have

    A_{0,1,m} = 1
    A_{c-1,c,m} = (mλ + cμ) A_{c-2,c-1,m+1} + A_{c-1,c-1,m+1}

and by the inductive hypothesis, we have the following approximation for the right-hand side:

    cμ (c-1)! H_{c-1} μ^{c-2} + (c-1)! μ^{c-1} = (c! H_{c-1} + (c-1)!) μ^{c-1} = c! H_c μ^{c-1}.

Finally we need an approximation for the root closest to zero; we obtain this value using one step of Newton's iterative root finding process:

    root_0 ≈ -A_{c+1,c,m} / A_{c,c,m} ≈ -λ^{c+1} m(m+1)(m+2)···(m+c) / (c! μ^c) = -(m λ^{c+1} / μ^c) \binom{m+c}{c}.

Substituting the values for the two coefficients and the root into our approximation for C_{c,m,0}, we finally obtain the Lemma 3 result.
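Lemma 4's approximations are easy to check numerically by building D_{c,m} from its recurrence. The following sketch is our own code (the function name and the parameter point m = 10, λ = 10^-5, μ = 4 are our choices); it keeps the coefficients as exact rationals, verifies the constant term A_{c+1,c,m} exactly, and prints A_{c,c,m} beside c! μ^c.

```python
from fractions import Fraction
from math import factorial

# Sketch: build D_{c,m} from the recurrence
#   D_{c,m} = D_{c-1,m+1}(s + m*lam + c*mu) - c*mu*lam^c*(m+1)(m+2)...(m+c)
# with exact rational coefficients, and check Lemma 4's approximations.
def d_poly(c, m, lam, mu):
    """Coefficient list of D_{c,m}; list index = power of s."""
    if c == 0:
        return [m * lam, Fraction(1)]              # D_{0,m} = s + m*lam
    prev = d_poly(c - 1, m + 1, lam, mu)
    shift = m * lam + c * mu
    poly = [Fraction(0)] * (len(prev) + 1)
    for i, coef in enumerate(prev):                # multiply prev by (s + shift)
        poly[i + 1] += coef
        poly[i] += coef * shift
    prod = 1
    for j in range(1, c + 1):                      # (m+1)(m+2)...(m+c)
        prod *= m + j
    poly[0] -= c * mu * lam**c * prod
    return poly

m, lam, mu = 10, Fraction(1, 100000), Fraction(4)
for c in (1, 2, 3):
    d = d_poly(c, m, lam, mu)
    prod = m
    for j in range(1, c + 1):
        prod *= m + j
    assert d[0] == lam ** (c + 1) * prod           # A_{c+1,c,m} holds exactly
    print(c, float(d[1]), factorial(c) * float(mu) ** c)   # A_{c,c,m} vs c! mu^c
```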


Conjecture 1  For m message and c > 0 check disks, failure rate λ, and repair rate μ, such that λ/μ is very small, the reliability function coefficients C_{c,m,r} have magnitude proportional to

    m \binom{m+c}{c} (λ/μ)^{c+1}    (18)

except for C_{c,m,0}, which is approximately

    1 + H_c m \binom{m+c}{c} (λ/μ)^{c+1}.    (19)

The form of C_{c,m,0} was demonstrated in Lemma 3.
