The Journal of Systems and Software 79 (2006) 1599–1605 www.elsevier.com/locate/jss

Shortcut method for reliability comparisons in RAID

Alexander Thomasian *

Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA

Received 19 August 2005; received in revised form 15 February 2006; accepted 17 February 2006; available online 2 May 2006

Abstract

Given that the reliability of each disk in a disk array during its useful lifetime is given as r = 1 − ε with ε ≪ 1, we show that the reliability of a RAID disk array tolerating all possible n − 1 disk failures can be specified as R ≈ 1 − a_n ε^n, where a_n is the smallest nonzero coefficient in the corresponding asymptotic expansion, e.g., for n-way replication R = 1 − ε^n. We compare the reliability of several mirrored disk organizations, which provide tradeoffs between reliability and load balancedness (after disk failure), by comparing their a_2 values, which can be obtained via a partial reliability analysis taking into account a few disk failures. We next use asymptotic expansions to compare the reliability of hierarchical RAID disk arrays, which combine replication and rotated parity disk arrays (RAID5 and RAID6). Finally, we argue that the mean time to data loss in systems with repair is related to the reliability without repair. As part of this discussion we show how to estimate the mean time to data loss in RAID5 and RAID6 disk arrays without resorting to transient analysis.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Redundant Arrays of Independent Disks – RAID; Mirrored disks; Multilevel RAID arrays; Reliability modeling; Asymptotic expansions; Hierarchical RAID; RAID repair; Rebuild time

Abbreviations: BM, basic mirroring; CD, chained declustering; GRD, group rotated declustering; ID, interleaved declustering; JBOD, just a bunch of disks; kDFT, k disk failure tolerant; LSF, latent sector failure; MTTDL, mean time to data loss; MTTF, mean time to failure; MTTR, mean time to repair; RAIDi, RAID level i; RDP, rotated diagonal parity
* Tel.: +1 973 5966597; fax: +1 973 5965777. E-mail address: [email protected]
0164-1212/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2006.02.035

1. Introduction

The Redundant Arrays of Independent Disks – RAID classification has had a major influence on popularizing disk arrays utilizing coding techniques, rather than just mirroring (Patterson et al., 1988). There have been numerous reliability and performance evaluation studies of RAID since the publication of Patterson et al. (1988). This paper is concerned with RAID reliability, which was the main topic of several influential theses (Gibson, 1992; Malhotra and Trivedi, 1993; Schwarz, 1994; Kari, 1997). There is renewed interest in this area, which is motivated by the requirement for storage systems with thousands of disks. Storage bricks are one possible building block for configuring large disk arrays (Gray, 2002). Each brick consists of a disk array controller, interconnect capability, DRAM serving as cache, etc. The controller can support various RAID levels: RAID0, RAID1, RAID5, and RAID6. Most disk arrays support striping for load balancing purposes, but this is the only feature distinguishing RAID0, which provides no redundancy, from Just a Bunch of Disks – JBOD. Bricks are closed systems, so that there is no inherent advantage to designating some of the disks as dedicated spares, but providing excess disk capacity in the form of distributed sparing is a viable option, since the bandwidth of all disks will be utilized (Thomasian and Menon, 1997).

While this paper primarily deals with reliability analysis without taking into account the repair process, in Section 5 we discuss how more reliable disk arrays will have a higher mean time to data loss – MTTDL when repair is allowed.
A weakness of reliability analysis of disk arrays is that they postulate an infinite number of spares, while with a finite number of spares the MTTDL is just the sum of multiple time to failure intervals. Estimating the mean repair time through the rebuild process is discussed in the Appendix.

Replication in the form of disk mirroring, also known as RAID1, predates the RAID classification. Disk arrays utilizing coding techniques incur less redundancy than mirroring. RAID5 and RAID6 disk arrays utilize parity and Reed–Solomon codes to tolerate one and two disk failures at the minimal redundancy level of one and two disks to protect N disks, respectively (Chen et al., 1994). Parity codes such as EVENODD (Blaum et al., 1995) and RDP (rotated diagonal parity) (Corbett et al., 2004) can be used instead of Reed–Solomon codes to attain the same level of reliability at the same redundancy level as RAID6. In addition to space overhead, there is disk access bandwidth overhead for updating check blocks, which is addressed in Thomasian et al. (2004).

Hierarchical RAID is a new paradigm which combines multiple RAID levels to attain a higher reliability level and performance (Baek et al., 2001). Replicated RAID5 and RAID6 disk arrays are two such configurations, which are compared with RAID5 and RAID6 disk arrays where each (virtual) disk is a mirrored pair.

We are interested in comparing RAID reliabilities while disk reliabilities are at a high level, which is the reliability interval of interest, since disks are replaced due to obsolescence long before there is a significant drop in reliability. The reliability of the remainder of the system is a multiplicative factor, which is assumed to remain fixed across different RAID configurations.

The statistical analysis of data for returned failed disks led to the conclusion that the reliability of mature disk drives follows the exponential distribution r = R(t) = e^{−λt}, so that the mean time to disk failure is MTTF = 1/λ (Gibson, 1992). It is well known that disk failure rates are higher at the beginning of disk lifetime due to the "infant" mortality rate (Trivedi, 2002). This effect has been investigated in the context of RAID disk arrays in Xin et al. (2005). In this study we assume that all disks have the same reliability function over time, which need not be exponential; consequently the reduction in reliability over time will be the same for all disks. We briefly return to this topic in Section 6.

RAID system reliabilities can be compared by plotting R_RAID(t) versus time or by an algebraic comparison of the appropriate expressions. For example, a TMR – triple modular redundancy or 2-out-of-3 system with reliability R_TMR(t) = R^3(t) + 3R^2(t)(1 − R(t)) is more reliable than a single unit with R(t) = e^{−λt} for up to t = ln(2)/λ, which is expected to exceed the mission time of interest. A simpler method to compare RAID reliabilities is proposed here, which is based on setting the reliability of a disk to a value slightly smaller than one, i.e., r = 1 − ε with ε ≪ 1. The reliability of a disk with an MTTF of a million hours (114 years) at the conclusion of a useful lifetime (due to obsolescence) of three years is:

R_disk(3 years) ≈ 1 − 3/114 ≈ 0.975.

It follows that ε = 0.025 in this case. Generally the value of ε is expected to be small enough that relatively accurate results can be obtained by retaining only the smallest power of ε (see Eqs. (3) and (5) in Section 2).

The main contribution of this work is the derivation of simplified expressions for the reliabilities of various RAID configurations as functions of the unreliability of each disk, ε = 1 − r. These expressions, which are in the form of asymptotic expansions, provide a novel and quick method for comparing RAID reliabilities, as demonstrated in the context of RAID1 organizations, RAID5, RAID6, and multilevel RAID.

This paper is organized as follows. An asymptotic expression for RAID reliabilities is derived in the next section. In Section 3 we describe RAID1 organizations, which attain a more balanced disk load than basic mirroring (replicating the contents of one disk on another) after disk failure. In Section 4 we first describe RAID5 and RAID6 disk arrays, followed by two multilevel RAID arrays combining RAID1 with RAID5 and RAID6 paradigms. Disk array reliabilities and their asymptotic expansions are obtained in each section. In Section 5 we provide a justification for the reliability modeling method used in this paper. Conclusions appear in Section 6. Repair time in RAID5 via the rebuild process is described in the Appendix.

2. Asymptotic reliability expression

The reliability of a RAID disk array with N disks, as determined solely by disk failures, is given as:

R_RAID(t) = Σ_{i=0}^{n_max} A(N, i) (1 − R_disk(t))^i R_disk(t)^{N−i}.  (1)

A(N, i) denotes the number of cases in which i disk failures do not lead to data loss (A(N, 0) = 1), and n_max is the maximum number of disk failures that can be tolerated: n_max = 1, n_max = 2, and n_max = N/2 for RAID5, RAID6, and RAID1 disk arrays, respectively. Substituting R_disk(t) with r = 1 − ε in Eq. (1) we have:

R_RAID = Σ_{i=0}^{n_max} A(N, i)(1 − r)^i r^{N−i} = Σ_{i=0}^{n_max} A(N, i) ε^i (1 − ε)^{N−i}.  (2)

The above equation can be rewritten as the asymptotic expansion below. Since ε ≪ 1, it is the terms with the smallest powers of ε that most affect the unreliability:

R_RAID = 1 − Σ_{j≥1} a_j ε^j.  (3)
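As an illustration, a minimal Python sketch of this computation is given below; it evaluates the reliability polynomial of Eq. (2) from a table of A(N, i) values and numerically recovers the leading coefficient a_n of Eq. (3). The function names and the example parameters (N = 8, ε = 10^−4) are assumptions made for the illustration, not values prescribed by the analysis.

from math import comb

def raid_reliability(N, A, eps):
    # Eq. (2): R = sum_i A(N, i) * eps^i * (1 - eps)^(N - i), where A[i] is the
    # number of i-disk-failure patterns that do NOT lead to data loss.
    return sum(A[i] * eps**i * (1.0 - eps)**(N - i) for i in range(len(A)))

def leading_coefficient(N, A, n, eps=1e-4):
    # Recover a_n in R ~ 1 - a_n * eps^n (Eq. (3)) by dividing the unreliability
    # by eps^n for a small eps; higher-order terms are then negligible.
    return (1.0 - raid_reliability(N, A, eps)) / eps**n

# RAID5 with N disks tolerates a single failure: A(N, 0) = 1, A(N, 1) = N.
N = 8
print(leading_coefficient(N, [1, comb(N, 1)], n=2))  # close to N*(N-1)/2 = 28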

Lemma. The reliability of a RAID with n replicated disks, which can tolerate n − 1 disk failures, is given as:

R_{n-replicated} = 1 − (1 − r)^n = 1 − ε^n.  (4)

Note that the reliability drops with the nth power of ε, since n − 1 disk failures can be tolerated. In addition, lower powers of ε do not appear in the asymptotic expansion.

Theorem. The reliability of a disk array which tolerates all n − 1 disk failures can be expressed as follows:

R_RAID = 1 − Σ_{j≥n} a_j ε^j ≈ 1 − a_n ε^n.  (5)

Proof by contradiction. The presence of terms with j < n would imply that the reliability of the disk array decreases more rapidly than the nth power of ε, but this is not possible, since all combinations of fewer than n disk failures are tolerated by assumption. Given the small value of ε, powers greater than n can be ignored in comparing reliabilities. □

Asymptotic expressions for the reliability of the various RAID1 organizations, RAID5 and RAID6, and hierarchical RAID, which are obtained in Sections 3 and 4, serve as confirmations of Eq. (5).

3. RAID1 organizations and their reliability

In spite of its high redundancy level RAID1 is a popular RAID paradigm, because replicating data also doubles the bandwidth available for processing read requests. This feature is important since disk capacities are increasing at a 60% annual rate, while disk access time is improving only at the rate of 8% per year (Gray and Shenoy, 2000). Basic Mirroring – BM has the disadvantage that the read load on the surviving disk is doubled when a companion disk fails. There are several RAID1 organizations which balance the load increase caused by disk failures, but it is shown in this section that the balanced disk load is accompanied by a reduced reliability level. We describe four RAID1 organizations and in each case we provide a full or partial reliability expression, which is then used to derive an asymptotic expression.

3.1. Basic mirroring with N disks

Data on M = N/2 primary disks in BM is replicated on M secondary disks. Up to M disk failures can be tolerated, as long as one disk in each pair survives:

A(N, i) = 2^i C(N/2, i), 0 ≤ i ≤ N/2,  (6)

where C(n, k) denotes the binomial coefficient. Only the first three reliability terms are required to obtain the smallest nonzero coefficient:

R_BM ≈ r^N + 2 C(N/2, 1) r^{N−1}(1 − r) + 2^2 C(N/2, 2) r^{N−2}(1 − r)^2 ≈ 1 − (N/2) ε^2.  (7)

It follows that not all two-disk failures can be tolerated.
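As a numerical check, a short sketch (assuming N = 8 and taking ε = 0.025 from the three-year example in Section 1; the function name is ours) compares the exact BM reliability implied by Eq. (6) with the asymptotic form of Eq. (7).

from math import comb

def bm_reliability(N, eps):
    # Exact BM reliability from Eq. (6): A(N, i) = 2^i * C(N/2, i) failure
    # patterns with i failed disks leave at least one disk in every pair.
    M = N // 2
    return sum(2**i * comb(M, i) * eps**i * (1.0 - eps)**(N - i)
               for i in range(M + 1))

N, eps = 8, 0.025
print(bm_reliability(N, eps))     # exact value, equals (1 - eps^2)^(N/2)
print(1.0 - (N / 2) * eps**2)     # asymptotic form of Eq. (7)

The neglected terms are of order ε^4, so the two values agree to within a few parts per million for this ε.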

3.2. Group rotate declustering

The Group Rotate Declustering – GRD method (Chen and Towsley, 1996), similarly to BM, has M (=N/2) primary and M secondary disks. Data is striped (Chen et al., 1994) and the stripe units of each primary disk are allocated in a rotated manner on the secondary disks, so that if a primary disk fails, its read load can be distributed evenly among the secondary disks, and vice-versa. Routing of read requests in GRD can be used to balance all disk loads, as long as there is no data loss. With GRD up to N/2 disks may fail, as long as they are all primary or all secondary. A(N, i) is given as:

A(N, i) = 2 C(N/2, i), 1 ≤ i ≤ N/2.  (8)

The reliability expression can be written directly, since it is as if we have two logical disks, each comprising M = N/2 disks. The substitution ε = 1 − r yields:

R_GRD = 2 r^{N/2} − r^N ≈ 1 − (N^2/4) ε^2.  (9)

3.3. Interleaved declustering

The Interleaved Declustering – ID method implemented in the original Teradata database machine partitions N disks into c clusters with n = N/c disks per cluster (Teradata Corp., 1985). The primary data on each disk is distributed evenly among the other n − 1 disks in the cluster, so that if a single disk fails the load increase due to read requests at the surviving disks is 1/(n − 1). There can only be one disk failure in each cluster of n = N/c disks, so that at most c disk failures are possible:

A(N, i) = n^i C(c, i), 0 ≤ i ≤ c.  (10)

Proceeding similarly to BM we have:

R_ID ≈ r^N + C(c, 1) n r^{N−1}(1 − r) + C(c, 2) n^2 r^{N−2}(1 − r)^2 ≈ 1 − (N/2)(N/c − 1) ε^2.  (11)
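Since the minimum number of failures causing data loss in these mirrored organizations is two, their a_2 coefficients can also be obtained by brute-force enumeration of the two-disk failure patterns. The following sketch reproduces the GRD and ID coefficients derived above; the helper fatal_pairs and the parameter choices N = 12, c = 3 are assumptions made for the example.

from itertools import combinations

def fatal_pairs(N, loses_data):
    # a_2 equals the number of two-disk failure patterns that cause data loss.
    return sum(1 for pair in combinations(range(N), 2) if loses_data(set(pair)))

N, c = 12, 3
M, n = N // 2, N // c

# GRD: disks 0..M-1 are primaries, M..N-1 secondaries; data is lost once at
# least one primary and at least one secondary disk have failed.
a2_grd = fatal_pairs(N, lambda f: any(d < M for d in f) and any(d >= M for d in f))

# ID: data is lost when two failed disks fall into the same cluster of n disks.
a2_id = fatal_pairs(N, lambda f: len({d // n for d in f}) < 2)

print(a2_grd, N * N // 4)           # both 36, i.e., N^2/4
print(a2_id, (N // 2) * (n - 1))    # both 18, i.e., (N/2)(N/c - 1)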

3.4. Chained declustering

Chained Declustering – CD is an improvement over ID in that a larger number of disk failures than ID can be tolerated (Hsiao and DeWitt, 1993). Each disk in CD is partitioned into a primary and a secondary area, and the primary data on disk i is replicated on the secondary area of the (i + 1)st disk modulo N. When a single disk fails, routing can be used to attain a balanced read load at all surviving disks, which is N/(N − 1) of the original load (the write load is not affected by disk failures). For example, if disk 1 fails, disk 2 will receive all the read load due to the primary data on disk 1 and 1/N of its own primary read load, disk 3 will receive (N − 1)/N of the read load of its secondary data and 2/N of its own, etc. There is no data loss as long as there are no contiguous disk failures, so that at most M = N/2 disk failures are possible. With two disk failures there are N configurations leading to data loss:

A(N, 2) = C(N, 2) − N = N(N − 3)/2.

A closed form formula and an iterative procedure to compute A(N, i) are given in Thomasian et al. (submitted for publication), but these results are not pertinent here, since we only need the first three terms in the reliability expression to obtain the asymptotic expansion (this is because all further terms are multiplied by ε^i with i > 2):

R_CD ≈ r^N + N r^{N−1}(1 − r) + [N(N − 3)/2] r^{N−2}(1 − r)^2 ≈ 1 − N ε^2.  (12)

3.5. RAID1 reliability comparison

It is easy to see that BM has the highest reliability, followed by CD. The reliability of ID increases with the number of clusters (c), but is already higher than GRD for c = 2. These results are summarized in Table 1 and revisited in Section 6.

Table 1
Coefficients a_n for R_RAID ≈ 1 − a_n ε^n, where at most n − 1 disk failures can be tolerated

RAID0        a_1 = N
RAID1(BM)    a_2 = N/2
RAID1(GRD)   a_2 = N^2/4
RAID1(ID)    a_2 = (N/2)(N/c − 1)
RAID1(CD)    a_2 = N
RAID5        a_2 = N(N − 1)/2
RAID6        a_3 = N(N − 1)(N − 2)/6
RAID1/5      a_4 = N^2(N − 1)^2/4
RAID5/1      a_4 = N(N − 1)/2
RAID1/6      a_6 = N^2(N − 1)^2(N − 2)^2/36
RAID6/1      a_6 = N(N − 1)(N − 2)/6

N is the total number of disks in all cases, except that it is 2N for the hierarchical arrays (RAID1/5, RAID5/1, RAID1/6, RAID6/1). ε denotes the unreliability of each disk and c the number of clusters in interleaved declustering.

4. RAID5, RAID6, and multilevel RAID

A RAID5 (resp. RAID6) disk array with N disks can survive a single (resp. two) disk failures. The reliability of RAID5 and RAID6 disk arrays is given as follows:

R_RAID5 = r^N + C(N, 1) r^{N−1}(1 − r) ≈ 1 − (1/2) N(N − 1) ε^2.  (13)

R_RAID6 = r^N + C(N, 1) r^{N−1}(1 − r) + C(N, 2) r^{N−2}(1 − r)^2 ≈ 1 − (1/6) N(N − 1)(N − 2) ε^3.  (14)

It is clear that RAID6 is more reliable than RAID5.

Replication and coding techniques are applied recursively in hierarchical RAID arrays (Baek et al., 2001). We consider two instances combining RAID5 and RAID1: in one case we have two mirrored RAID5 disk arrays (RAID1/5), while in the other each (logical) disk in a RAID5 disk array is a mirrored pair (this is denoted as RAID5/1). A clarification is required for RAID1/5: each RAID5 is a standalone disk array, since there is no mechanism to combine disks from both sides. In other words, each RAID5 disk array fails after two disk failures, even when the combination of the two arrays has enough surviving disks. More specifically, let {a_1, a_2, . . ., a_N} and {b_1, b_2, . . ., b_N} denote the status of the disks in the two RAID5 disk arrays, so that a_i = 1 if the disk is working and zero otherwise. One might expect RAID1/5 to be operational if Σ_{i=1}^{N} a_i ∨ b_i ≥ N − 1, but in fact this condition only characterizes RAID5/1, where a_i and b_i denote the status of the left and right hand side disks of the ith mirrored pair. The reliability of RAID1/5 can also be expressed as:

R_RAID1/5 = Σ_{j=0}^{N+1} Σ_{i=0}^{j} x C(N, i) C(N, j − i) r^{2N−j}(1 − r)^j,

where x = 1 if i ≤ 1 or j − i ≤ 1, i.e., when at least one of the two RAID5 arrays has at most a single disk failure, and x = 0 otherwise. Given 2N disks in both cases, the corresponding reliabilities are:

R_RAID1/5 = 1 − [1 − R_RAID5]^2 = 1 − [1 − r^N − N r^{N−1}(1 − r)]^2.  (15)

R_RAID5/1 = N R_RAID1^{N−1} − (N − 1) R_RAID1^N = N [1 − (1 − r)^2]^{N−1} − (N − 1)[1 − (1 − r)^2]^N.  (16)

For small values of N it is easy to show using Eqs. (15) and (16) that R_RAID5/1 > R_RAID1/5, e.g., for N = 3 we have:

R_RAID5/1 − R_RAID1/5 = 6 r^2 (1 − r)^4 ≥ 0.

Showing the inequality in general is not an easy task, but resorting to the asymptotic expansion trivializes the comparison:

R_RAID1/5 = 1 − [1 − (1 − ε)^N − N ε(1 − ε)^{N−1}]^2 ≈ 1 − (1/4) N^2(N − 1)^2 ε^4.  (17)

R_RAID5/1 = N(1 − ε^2)^{N−1} − (N − 1)(1 − ε^2)^N ≈ 1 − (1/2) N(N − 1) ε^4.  (18)

It follows that R_RAID5/1 ≥ R_RAID1/5 holds for ε ≪ 1. Note that both RAID5/1 and RAID1/5 fail with a minimum of four disk failures. RAID1/5 will fail if there are two disk failures on each side: P_RAID1/5(2N, 4) = [C(N, 2)]^2 / C(2N, 4). RAID5/1 will fail if the four failed disks affect two out of the N mirrored pairs: P_RAID5/1(2N, 4) = C(N, 2) / C(2N, 4). The number of cases where the fourth disk failure leads to data loss is significantly higher for RAID1/5 than for RAID5/1.
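The comparison of the two hierarchical organizations can be reproduced numerically; the sketch below evaluates Eqs. (15)–(18) for an assumed N = 6 and ε = 0.025 (the function names are ours).

def raid15(N, eps):
    # Eq. (15): two standalone RAID5 arrays mirroring each other.
    r = 1.0 - eps
    r_raid5 = r**N + N * r**(N - 1) * eps
    return 1.0 - (1.0 - r_raid5)**2

def raid51(N, eps):
    # Eq. (16): a RAID5 array over N logical disks, each a mirrored pair.
    p = 1.0 - eps**2                  # reliability of one mirrored pair
    return N * p**(N - 1) - (N - 1) * p**N

N, eps = 6, 0.025
print(raid51(N, eps) - raid15(N, eps))                      # positive: RAID5/1 wins
print(raid15(N, eps), 1 - N**2 * (N - 1)**2 / 4 * eps**4)   # exact vs. Eq. (17)
print(raid51(N, eps), 1 - N * (N - 1) / 2 * eps**4)         # exact vs. Eq. (18)

For small ε the ratio of the two unreliabilities approaches a_4(RAID1/5)/a_4(RAID5/1) = N(N − 1)/2, consistent with Table 1.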

We can carry out the same analysis for RAID6/1 and RAID1/6 disk arrays, where each system fails with a minimum of six disk failures (the expression for R_RAID1/6 can be derived easily starting with Eq. (14)):

R_RAID1/6 ≈ 1 − (1/36) N^2(N − 1)^2(N − 2)^2 ε^6.  (19)

R_RAID6/1 ≈ 1 − (1/6) N(N − 1)(N − 2) ε^6.  (20)

It is obvious that RAID6/1 is more reliable than RAID1/6.

5. Reliability modeling

Most studies of RAID reliability modeling (see Section 1) are concerned with estimating the MTTDL – mean time to data loss in RAID disk arrays allowing repair following disk failures. In this section we show how our analysis is related to estimating the MTTDL. Modeling repair in RAID disk arrays explicitly requires the transient analysis of the corresponding Markov chain model, i.e., the numerical solution of a set of linear differential equations (Trivedi, 2002). Reliability modeling packages such as Sharpe (Sahner et al., 1996) generate semi-numerical results, i.e., exponential expressions with numerical values for failure and repair rates. Symbolic manipulation packages can be used to obtain explicit reliability expressions, which are too tedious to derive manually even for small problems (Baek et al., 2001).

A Markov chain model pertinent to RAID5 repair is given in Example 3.84 in Trivedi (2002), see also Gibson (1992). States S_i, N − 2 ≤ i ≤ N, of the Markov chain denote the number of functioning disks. The transition S_N → S_{N−1} corresponds to the failure of the first disk with rate Nλ, while the transition for disk repair, S_{N−1} → S_N, has rate μ = 1/MTTR, where the MTTR – mean time to repair pertains to repairing a single disk. A second disk failure, which occurs with rate (N − 1)λ, leads to data loss and the transition S_{N−1} → S_{N−2}. Estimating the MTTR in RAID5 disk arrays via the rebuild process is discussed in the Appendix.

At S_{N−1} we have two competing processes. If the repair process finishes first, then rebuild is successful and the system recovers with probability p_s = μ/(μ + (N − 1)λ) = 1/(1 + (N − 1)MTTR/MTTF). Since MTTR is much smaller than MTTF, p_s ≈ 1 and the system is expected to recover and return to the original state. The distribution of the number of successful rebuilds is given by the modified geometric distribution (Trivedi, 2002):

Prob[k successful rebuilds] = (1 − p_s) p_s^k, k ≥ 0.  (21)

The mean number of successful rebuilds, which also denotes the average number of spare disks required for rebuild, is:

1/(1 − p_s) − 1 = MTTF/((N − 1)MTTR).  (22)

Given that the time to a failure in RAID5 is MTTF/N:

MTTDL_RAID5 ≈ MTTF^2 / (N(N − 1)MTTR).  (23)

A derivation of the MTTDL in RAID5 which does not resort to transient analysis is also given in Gibson (1992) and is based on estimating the expected transition time from S_N to S_{N−2}. The mean time to data loss is dominated by the time spent at S_N, which is exponentially distributed. Since the number of visits to this state has a geometric distribution, the total time has an exponential distribution (Kleinrock, 1975). In other words, the reliability of a RAID5 system with repair is given as:

R_RAID5(t) = e^{−t/MTTDL}.  (24)

Our analysis can be easily extended to RAID6 disk arrays with four states: S_i, N − 3 ≤ i ≤ N, where S_{N−3} is the failed state with three disk failures. We assume that one disk is repaired at a time and that the repair rate at S_{N−1} and S_{N−2} is μ. The time spent in these two states is negligibly small and can be ignored. The mean number of transitions to S_N from S_{N−1} and (indirectly) from S_{N−2} is MTTF/((N − 1)MTTR) and MTTF/((N − 2)MTTR), respectively. The MTTDL for RAID6 is roughly twice the MTTDL for RAID5:

MTTDL_RAID6 ≈ (2N − 3)MTTF^2 / (N(N − 1)(N − 2)MTTR) ≈ 2 MTTF^2 / (N(N − 1)MTTR).  (25)

A weakness of the analysis of the MTTDL for RAID5, as given by Eq. (23), is that it is based on the assumption that an infinite number of spare disks is available. If a finite number of spares (S) is postulated, the average number of successful rebuilds is the minimum of S and the mean given by Eq. (22). The number of disk failures that can be tolerated equals the number of spare disks plus one and plus two in RAID5 and RAID6 disk arrays, respectively. Even higher reliability and more disk failures can be tolerated by adopting k ≥ 2 check disks (a k-disk failure tolerant, k-DFT, disk array). Increasing k has the disadvantage of increasing update processing overhead.

The MTTDL in RAID1 disk arrays is rather difficult to evaluate, since it is the position of the ith disk failure that determines whether data loss occurs or not. The probability that the ith disk failure leads to data loss is given as 1 − A(N, i)/C(N, i), where A(N, i), 1 ≤ i ≤ n_max, were given in Section 3. Given the large number of RAID1 organizations and repair possibilities, this task remains beyond the scope of this study. We however state the following conjecture: a RAID1 disk array which has a higher mean time to failure without repair is expected to have a higher MTTDL when spare disks for repair are available.
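As an illustration of Eqs. (23) and (25), the following sketch uses the million-hour MTTF of Section 1 together with an assumed MTTR of 10 h and N = 8; these parameter choices and the function names are ours.

def mttdl_raid5(N, mttf, mttr):
    # Eq. (23): time per visit to the fully operational state, MTTF/N,
    # times the mean number of successful rebuilds, MTTF/((N-1)*MTTR).
    return mttf**2 / (N * (N - 1) * mttr)

def mttdl_raid6(N, mttf, mttr):
    # Eq. (25): roughly twice the RAID5 value.
    return (2 * N - 3) * mttf**2 / (N * (N - 1) * (N - 2) * mttr)

N, mttf, mttr = 8, 1.0e6, 10.0               # hours
print(mttdl_raid5(N, mttf, mttr) / 8760.0)   # ~ 2.0e5 years
print(mttdl_raid6(N, mttf, mttr) / 8760.0)   # ~ 4.4e5 years, about twice RAID5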


6. Conclusions

An asymptotic expansion technique is introduced for a quick comparison of RAID reliabilities by using the expression R_RAID ≈ 1 − a_n ε^n, n ≥ 1. The reliability of each disk is expressed as r = 1 − ε, and n denotes the minimum number of disk failures leading to data loss. In fact the reliability expression is a polynomial in ε, but since ε ≪ 1, it is sufficient to retain the term with the lowest power. It is possible that more than one power of ε will be required in the asymptotic expansion to compare the reliability of two systems.

The asymptotic expansion method, in addition to its pedagogic value, is useful in simplifying the comparison of RAID reliabilities, which may become quite cumbersome even in the case of RAID5/1 versus RAID1/5. Furthermore, the asymptotic expansion can be derived from a reliability expression taking into account very few disk failures, e.g., in the case of CD (chained declustering), where the analysis in the general case is nontrivial (Thomasian et al., submitted for publication).

The conclusions of this study are summarized in Table 1. RAID6/1 is the most reliable and RAID0 the least reliable array considered in this study. Among the four mirrored disk organizations, BM and CD rank first and second with a_2 = N/2 and a_2 = N, respectively, while ID (with c = 2) and GRD rank third and fourth with a_2 = (N/2)(N/2 − 1) and a_2 = N^2/4, respectively.

Disks are known to have a higher failure rate during their burn-in period, so that the disks in an array need not have the same reliability. The reliability of RAID1 with r_1 = 1 − ε_1 and r_2 = 1 − ε_2 is R_RAID1 = 1 − ε_1 ε_2 = 1 − ε̄^2, where ε̄ = √(ε_1 ε_2) is the geometric mean. A more complex expression is obtained for Eq. (9): R_GRD = 1 − Σ_{i=1}^{M} Σ_{j=1}^{M} ε_i ε_j, where the indices i and j pertain to primary and secondary disks, respectively. Denoting the mean ε for primary and secondary disks as ε_p and ε_s and using ε̄ = √(ε_p ε_s) leads back to Eq. (9).

Acknowledgement

We acknowledge the support of NSF through Grant 0105485 in Computer Systems Architecture.

Appendix. Repair time in RAID5

The rebuild process in RAID5 systematically reconstructs the contents of a failed disk on a spare disk in dedicated sparing, and on spare disk areas in distributed sparing (Thomasian and Menon, 1997). Only dedicated sparing is considered here for brevity. Rebuild time is important since there will be data loss unless rebuild is completed before a second disk failure occurs. Rebuild time can be determined using simulation, e.g., Fu et al. (2004), or analysis, e.g., Thomasian and Menon (1997). To simplify the discussion we consider a RAID5 disk array without parity declustering (Chen et al., 1994), i.e., the parity group size is set equal to the number of

disks. This has the consequence that the spare disk does not constitute a bottleneck as rebuild progresses, and rebuild time is determined by the time to read any one of the surviving disks, since they have a balanced load (Thomasian and Menon, 1997).

Rebuild time of a RAID5 disk array at a disk utilization ρ in normal mode, i.e., before a disk failure occurred, is denoted by T_rebuild(ρ). For an idle disk, T_rebuild(0) roughly equals the number of disk tracks times the disk rotation time. The fraction of rebuild requests incurring a seek increases with ρ, and this results in an increase in rebuild time. A detailed analysis of rebuild is reported in Thomasian and Menon (1997), but for a typical zoned disk drive rebuild time may be estimated as:

T_rebuild(ρ) = T_rebuild(0) / (1 − αρ),

where curve-fitting against simulation results yields α ≈ 1.75 (Fu et al., 2004). RAID5 rebuild time is negligibly short compared to the disk MTTF, especially when an online (dedicated) spare disk is available.

Disk rebuild is successful unless a second disk failure occurs before rebuild is completed, or a latent sector failure – LSF is encountered while rebuild is in progress (Blaum et al., 1995). LSFs are unreadable disk sectors which prevent the rebuild process from progressing (Kari, 1997). Disk scrubbing is a technique which scans disks to detect and fix LSFs (Xin et al., 2005). RAID6 disk arrays, including EVENODD and RDP (Corbett et al., 2004), were introduced to cope with LSFs and multiple disk failures. One method to take into account the effect of LSFs on rebuild time is to adjust the probability of a successful rebuild as follows: p′_s = p_s · p_ℓ, where p_ℓ denotes the probability that no LSF is encountered during rebuild.
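A small sketch of the rebuild-time estimate follows; the 1.5 h idle-disk rebuild time and the 30% normal-mode utilization are assumed numbers for illustration, while α = 1.75 is the curve-fitted value quoted above.

def rebuild_time(t_idle, rho, alpha=1.75):
    # T_rebuild(rho) = T_rebuild(0) / (1 - alpha * rho); valid for rho < 1/alpha.
    return t_idle / (1.0 - alpha * rho)

print(rebuild_time(1.5, 0.30))   # about 3.2 hours at 30% normal-mode utilization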

References

Baek, S.H., Kim, B.W., Jeung, E., Park, C.W., 2001. Reliability and performance of hierarchical RAID with multiple controllers. In: Proc. 20th Annual ACM Symp. on Principles of Distributed Computing – PODC, Newport, RI, August 2001, pp. 246–254.
Blaum, M., Brady, J., Bruck, J., Menon, J., 1995. EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput. 44 (2), 192–202.
Chen, P.M., Lee, E.K., Gibson, G.A., Katz, R.H., Patterson, D.A., 1994. RAID: high-performance, reliable secondary storage. ACM Comput. Surv. 26 (2), 145–185.
Chen, S.-Z., Towsley, D.F., 1996. A performance evaluation of RAID architectures. IEEE Trans. Comput. 45 (10), 1116–1130.
Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J., Sankar, S., 2004. Row-diagonal parity for double disk failure correction. In: Proc. 3rd Conf. File and Storage Technologies – FAST'04, San Francisco, CA, March/April 2004.
Fu, G., Thomasian, A., Han, C., Ng, S.W., 2004. Rebuild strategies for redundant disk arrays. In: Proc. NASA/IEEE MSST'04: 12th NASA Goddard, 21st IEEE Conf. on Mass Storage and Technologies, April 2004. Available from: .
Gibson, G.A., 1992. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. The MIT Press.
Gray, J., Shenoy, P.J., 2000. Rules of thumb in data engineering. In: Proc. IEEE Int. Conf. Data Eng. – ICDE, San Diego, CA, March 2000, pp. 3–12.
Gray, J., 2002. Storage bricks – Keynote speech, USENIX Conf. on File and Storage Technologies – FAST'02, Monterey, CA, January/February 2002.
Hsiao, H.-I., DeWitt, D.J., 1993. A performance study of three high availability data replication strategies. J. Distributed Parallel Databases 1 (1), 53–80.
Kari, H.H., 1997. Latent Sector Faults and Reliability of Disk Arrays. Ph.D. Thesis, Helsinki University of Technology, Espoo, Finland, May 1997.
Kleinrock, L., 1975. Queueing Systems: Theory. Wiley-Interscience.
Malhotra, M., Trivedi, K.S., 1993. Reliability analysis of redundant arrays of inexpensive disks. J. Parallel Distributed Comput. 17 (1/2), 146–151.
Patterson, D.A., Gibson, G.A., Katz, R.H., 1988. A case for redundant arrays of inexpensive disks (RAID). In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Chicago, IL, June 1988, pp. 109–116.
Sahner, R.A., Trivedi, K.S., Puliafito, A., 1996. Performance and Reliability Analysis of Computer Systems. Kluwer Academic Publishers.
Schwarz, T.J.E., 1994. Reliability and Performance of Disk Arrays. Ph.D. Thesis, University of California at San Diego.
Teradata Corp., 1985. DBC/1012 Database Computer System Manual, Release 2, November 1985.
Thomasian, A., Menon, J., 1997. RAID5 performance with distributed sparing. IEEE Trans. Parallel Distributed Syst. 8 (6), 640–657.
Thomasian, A., Han, C., Fu, G., Liu, C., 2004. A performance tool for RAID disk arrays. In: Proc. Quantitative Evaluation of Systems – QEST'04, Enschede, The Netherlands, September 2004, pp. 8–17.
Thomasian, A., Blaum, M., submitted for publication. Mirrored disk reliability and performance. IEEE Trans. Comput.
Trivedi, K.S., 2002. Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Wiley.
Xin, Q., Schwarz, T.J.E., Miller, E.L., 2005. Disk infant mortality in large storage systems. In: Proc. 13th Annual Meeting of the IEEE Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems – MASCOTS'05, Atlanta, GA, September 2005, pp. 125–134.