
Clustered RAID Arrays and Their Access Costs

Alexander Thomasian
Department of Computer Science, New Jersey Institute of Technology—NJIT, Newark, NJ 07102, USA
Email: [email protected]
The Computer Journal, Vol. 48, No. 6, 2005; doi:10.1093/comjnl/bxh108. Advance Access published on 19 August 2005.

RAID5 (resp. RAID6) are two popular RAID designs, which can tolerate one (resp. two) disk failures, but the load of surviving disks doubles (resp. triples) when failures occur. Clustered RAID5 (resp. RAID6) disk arrays utilize a parity group size G, which is smaller than the number of disks N, so that the redundancy level is 1/G (resp. 2/G). This enables the array to sustain a peak throughput closer to normal mode operation; e.g. the load increase for RAID5 in processing read requests is given by α = (G − 1)/(N − 1). Three methods to realize clustered RAID are balanced incomplete block designs and nearly random permutations, which are applicable to RAID5 and RAID6, and RM2, where each data block is protected by two parity disks. We derive cost functions for the processing requirements of clustered RAID in normal and degraded modes of operation. For given disk characteristics, the cost functions can be translated into disk service times, which can be used for the performance analysis of disk arrays. Numerical results are used to quantify the level of load increase in order to determine the value of G which maintains an acceptable level of performance in degraded mode operation.

Received 3 December 2004; revised 26 May 2005

1. INTRODUCTION

Five RAID (redundant arrays of inexpensive/independent disks) levels (1–5) were originally proposed in [1] to cope with the high cost of mainframe disks of the time by replacing them with inexpensive disks developed to be included with PCs (personal computers). RAID levels 3–5 utilize parity blocks occupying the capacity equivalent of a single disk to tolerate single disk failures, and striping to balance disk loads. Two additional RAID levels were introduced in [2]. RAID0 implements striping only and provides no protection against disk failures, while RAID6 protects against two disk failures. RAID2 disk arrays, which are based on Hamming codes, are not viable [2].

Mirrored disks, labeled as RAID1, predate the RAID classification. In basic mirroring, data is duplicated on two disks to tolerate single disk failures. Advances in magnetic recording density have resulted in dramatic increases in disk capacities, but there has been limited improvement in disk access time. The doubling of data access bandwidth for read requests in RAID1 is a major advantage over other disk arrays. RAID1 is becoming more viable in view of the rapidly declining disk costs. Although both disk drives have to be updated, this is less costly than the updating required in RAID5 and RAID6 disk arrays. This is the basis for recommending mirroring rather than parity in [3]. Reliability and load balancing issues in RAID1 disk organizations are discussed in [4].

A weakness of RAID1 disk arrays is the lack of flexibility in load balancing when disk failures occur. For example, in basic mirroring, with one disk mirroring another, the read load on the surviving disk is doubled when the other disk fails. Interleaved declustering with n disks in a cluster distributes the read load of a failed disk over the remaining n − 1 disks [4]. Since at most one disk failure can be tolerated per cluster, the value of n should be kept small to increase the reliability of the disk array.

A block from a failed disk in RAID3–5 can be reconstructed via fork-join requests, which access the N − 1 corresponding blocks on the surviving disks and exclusive-OR (XOR) them to reconstruct the missing block [2, 5]. RAID6 disk arrays use check data based on Reed-Solomon codes [5]. The reconstruction of a data block in RAID6 requires the reading of the corresponding N − 3 data blocks and one of the two check blocks. Since N − 2 of the N − 1 surviving disks are involved in this process, the load increase in RAID6 is slightly lower than that in RAID5. When two disks fail in RAID6, the N − 2 surviving disks must be read and the load on the surviving disks is tripled. The load increase in RAID5 and RAID6 disk arrays operating in degraded mode results in elevated disk utilizations and hence higher disk response times. The clustered RAID paradigm is a solution to the load increase and the reduction in maximum sustainable throughput in RAID5 [6], but it is also applicable to disk arrays with k ≥ 1 check disks, although only RAID5 with k = 1 and RAID6 with k = 2 are considered in this study.


In clustered RAID the parity group (PG) size, denoted by G, is set to be smaller than N. The load increase of surviving disks for RAID5 as a result of read requests is given by the declustering ratio: α = (G − 1)/(N − 1) [6]. The PG size affects the fraction of redundant data being held; i.e. in RAID5 (resp. RAID6), 1/G (resp. 2/G) of the blocks hold redundant data. Clustering provides a continuum of redundancy levels to minimize the impact of disk failures on performance in degraded mode. In Appendix A we describe three clustered RAID layouts. Balanced incomplete block designs (BIBDs) [7, 8] and nearly random permutations (NRPs) [9] are applicable to RAID5 and RAID6. RM2, based on the redundancy matrix, is a specialized array designed for tolerating two disk failures [10] which also happens to be a clustered RAID, since its PG size is smaller than the number of disks.

This paper deals with the performance analysis of clustered RAID5 and RAID6 disk arrays. There have been several papers analyzing variations of RAID5 under Markovian assumptions, see e.g. [11]. An analytical model is used in [12] to compare RAID5 performance with parity striping [13]. RAID5 disk arrays in normal mode, degraded mode (with a single disk failure), and rebuild mode, which constitutes a systematic reconstruction of the contents of a failed disk on a spare disk, are analyzed in [14, 15]. This methodology is extended in [16, 17] to analyze the performance of RAID5 and RAID6 disk arrays in normal mode and degraded mode with one and two disk failures. The analysis of a clustered RAID5 in normal and rebuild mode is given in [9] (see Section 2).

In this paper we obtain the processing cost of a workload of read and write requests in clustered RAID5 and RAID6 disk arrays in normal and degraded mode. The cost functions can be used to estimate I/O bus bandwidth requirements for a given arrival rate of disk requests. These costs can also be used to obtain disk service times for given disk characteristics. In this study we limit ourselves to obtaining the maximum throughputs for different read to write ratios, but disk response times can also be obtained under favorable assumptions, as was done in [16]. It should be noted that when rebuild requests are processed at a lower priority than external requests, as in [14, 15], they will not have an effect on the maximum throughput.

This paper is organized as follows. Section 2 describes the operation of RAID5 and RAID6 disk arrays. Appendix A gives a description of BIBD, NRP and RM2 and can be skipped by readers not interested in the details of clustered RAID organizations. Section 3 specifies the disk array model and the notation used in this paper. Cost functions for clustered RAID5 and RAID6 disk arrays are given in Sections 4 and 5 respectively. The analysis of RAID6 without clustering in Appendix B is a special case of the analysis in Section 5. Numerical results to quantify the effect of disk failures on RAID5 and RAID6 performance are given in Section 6. Conclusions appear in Section 7.

2. RAID ARRAYS TOLERATING SINGLE AND DOUBLE DISK FAILURES

RAID5 (resp. RAID6) disk arrays utilize the capacity equivalent of one disk (resp. two disks) to tolerate a single (resp. double) disk failure. We describe RAID5 and RAID6 operation in normal, degraded and rebuild modes. Rebuild is a systematic reconstruction of the contents of a failed disk on a spare disk. Rebuild processing in RAID6 is similar to the process in RAID5 and is not discussed here.

2.1. RAID5 disk arrays

RAID disk arrays utilize striping for load balancing and check disks for fault tolerance. Striping partitions large files into stripe units (SUs), which are allocated in a round-robin manner across the N disks of a disk array in successive rows to constitute a stripe. Disks hold SUs from different files with different access rates, so that the disk loads are balanced. One out of N SUs in a RAID5 stripe (or row) is dedicated to parity, e.g. with N = 5 disks we have data blocks Di, 1 ≤ i ≤ 4, and P1:4 is the parity computed over them: P1:4 = D1 ⊕ D2 ⊕ D3 ⊕ D4.

The modification of a small data block requires the updating of the corresponding parity block, hence the small write penalty. The reconstruct write method reads the corresponding N − 2 data blocks to compute the parity, i.e. p1:4^new ← d1^new ⊕ d2 ⊕ d3 ⊕ d4. The read–modify write (RMW) method reads and overwrites data and parity blocks as an atomic disk access. A request to read d_old is issued to the data disk, d_diff = d_old ⊕ d_new is computed, and d_new is written to replace d_old after one disk rotation. d_diff is sent to the parity disk, which reads p_old, computes p_new = p_old ⊕ d_diff, and writes p_new after one disk rotation [15]. This ensures that d_diff is available by the time p_new is to be computed and written. The RMW method is adopted in this analysis, since it outperforms the reconstruct write method unless N, or rather G, is small [18].

If all parities resided on one disk, as in RAID4, and all disk requests were writes, the access rate for RMWs on that disk would be N − 1 times higher than on the other disks, so that the parity disk would constitute a bottleneck. Parity blocks are also held on one disk in RAID3, but since all disks are accessed together, the parity disk does not constitute a bottleneck. RAID5 balances disk loads in processing write requests by rotating the parities; e.g. the left symmetric organization assigns the parities according to right-to-left diagonals.

Full-stripe writes are another method of dealing with the small write penalty. Log-Structured Arrays (LSAs), rather than writing dirty blocks in place, accumulate a stripe's worth of dirty blocks and write them out with a full-stripe write [19]. The parity SU is computed on the fly as the data is being written out. Storage Technology's Iceberg, in addition to being a RAID6 disk array, uses the LSA paradigm (see Section 5.2 in [2]), so that the two check SUs can be computed efficiently. A background process is required in LSA to garbage collect old versions of blocks and prepare empty stripes for full-stripe writes.
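As a concrete illustration of the RMW parity update described above, the sketch below applies the d_diff technique to 4 KB blocks and checks it against recomputing the parity from scratch. The block size and helper names are ours, not from the paper.

```python
# Sketch of the read-modify-write (RMW) parity update for a RAID5 small write.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_small_write(d_old: bytes, d_new: bytes, p_old: bytes) -> bytes:
    """Return the new parity block given the old data, new data and old parity."""
    d_diff = xor_blocks(d_old, d_new)   # computed at the data disk
    p_new = xor_blocks(p_old, d_diff)   # computed at the parity disk
    return p_new

if __name__ == "__main__":
    import os
    BLOCK = 4096
    d1, d2, d3, d4 = (os.urandom(BLOCK) for _ in range(4))
    p = xor_blocks(xor_blocks(d1, d2), xor_blocks(d3, d4))   # p1:4 = d1 ^ d2 ^ d3 ^ d4
    d2_new = os.urandom(BLOCK)
    p_new = rmw_small_write(d2, d2_new, p)
    # The RMW result must equal the parity recomputed from scratch (reconstruct write).
    assert p_new == xor_blocks(xor_blocks(d1, d2_new), xor_blocks(d3, d4))
```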



When a single RAID5 disk, say disk 2, fails, its contents can be reconstructed by reading all the corresponding blocks in the stripe via a fork-join request and XORing them, e.g. d2 = d1 ⊕ d3 ⊕ d4 ⊕ p1:4. Each surviving disk in the disk array, in addition to processing its own requests, is subjected to fork-join requests to reconstruct blocks on the failed disk. Assuming that all disk requests are read requests, the arrival rates to surviving disks and their utilization factors are doubled.

For example, the updating of d2 can be carried out using the reconstruct write method: p1:4^new = d1 ⊕ d2^new ⊕ d3 ⊕ d4; i.e. N − 2 data blocks need to be read and the parity block written. If instead p1:4 is unavailable then only d2 needs to be updated. The increase in load in processing write requests therefore tends to be smaller.

We refer to fault-free operation as normal mode and operation with failed disks as degraded mode. To quantify performance in degraded mode, we assume that disk request arrivals are Poisson with rate λ, so that the processing at each disk can be analyzed using the M/G/1 queueing model [20]. The mean waiting time at the disk is

W_normal = λ x̄²_disk / [2(1 − ρ)],

where ρ = λ x̄_disk is the utilization factor of the disk and x̄^i_disk denotes the ith moment of the disk service time. In degraded mode both λ and ρ are doubled with all read requests, so that W_degraded / W_normal = 2(1 − ρ)/(1 − 2ρ), e.g. a 3-fold increase in mean waiting time for ρ = 0.25. In addition to its poor performance, RAID5 in degraded mode is vulnerable to data loss, since a second disk failure cannot be tolerated.

The rebuild process, initiated by the OS (with or without operator intervention) or the disk array controller, systematically reads successive rebuild units, e.g. tracks, from the surviving disks, XORs them to recreate the lost tracks, and writes the computed tracks on the spare disk. A second disk failure during rebuild is rare, unless disk failures are correlated, but a media failure manifesting as unreadable sectors may result in the failure of the rebuild process. Two-disk-failure-tolerant (2DFT) arrays are discussed in Section 2.2.

Two variations of rebuild that have been treated analytically are mentioned here. The vacationing server model (VSM) [21] processes rebuild requests at a lower priority than user requests [14]. The analysis of VSM in [14] for a RAID5 with dedicated sparing is extended to a distributed sparing system in [15]. The analysis in [14] is extended and simplified in [22]. The permanent customer model (PCM) [23] processes one rebuild request at a time, so that a completed rebuild request is immediately replaced by a new request (to the next rebuild unit) [9]. PCM processes rebuild (read) requests at the same priority as user requests, so that unlike VSM it affects the maximum throughput attainable by the system. The performance of clustered RAID5 in degraded mode is not considered in [9].
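The waiting-time ratio quoted above follows from the Pollaczek–Khinchine formula; the sketch below evaluates it numerically. The second moment of the service time used here is a placeholder, not a value from the paper.

```python
# M/G/1 mean waiting time in normal and degraded mode, assuming read-only
# traffic so that the arrival rate doubles when a disk fails.

def mg1_wait(lam: float, x1: float, x2: float) -> float:
    """Mean waiting time W = lam * x2 / (2 * (1 - rho)), with rho = lam * x1."""
    rho = lam * x1
    if rho >= 1.0:
        raise ValueError("unstable queue")
    return lam * x2 / (2.0 * (1.0 - rho))

x1, x2 = 11.54e-3, 1.6e-4          # mean (s) and an assumed second moment (s^2)
lam = 0.25 / x1                     # arrival rate chosen so that rho = 0.25
w_normal = mg1_wait(lam, x1, x2)
w_degraded = mg1_wait(2 * lam, x1, x2)
print(w_degraded / w_normal)        # equals 2*(1 - 0.25)/(1 - 0.5) = 3.0
```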

2.2. 2DFT Arrays

We first consider 2DFT arrays with a minimum level of redundancy, i.e. the capacity of two disks to tolerate as many disk failures. Each stripe in a 2DFT has two check SUs, P and Q, which are usually computed using Reed-Solomon codes, e.g. StorageTek’s Iceberg and the RAID 5DP (double parity) disk array from HP. A 2DFT can be based solely on parity, and EVENODD [24] and RDP (row-diagonal parity) from Network Appliance [25] are two such examples. Check SUs are laid out in parallel left symmetric diagonals, as in RAID5, to balance disk loads as a result of updating. Reed-Solomon codes, EVENODD and RDP incur different computational costs, but when the symbol size for EVENODD and RDP is small, they have the same disk access pattern as RAID6; i.e. the analysis for RAID6 at the disk access level is applicable to EVENODD and RDP. Clustering is applicable to RAID6 disk arrays, where to read a missing data block on a failed disk we need to access one of the two check blocks, so that the increase in disk load is β = (G − 2)/(N − 2) for a single disk failure and twice that for two disk failures.
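A small sketch of the declustering ratios quoted above (α for RAID5, β for RAID6); the function names are ours. It shows how shrinking G trades redundancy for a smaller load increase on the surviving disks.

```python
# Load increase on surviving disks for read requests in clustered RAID.

def alpha_raid5(G: int, N: int) -> float:
    """Declustering ratio for RAID5 with one failed disk."""
    return (G - 1) / (N - 1)

def beta_raid6(G: int, N: int, failures: int = 1) -> float:
    """Load increase for RAID6 with one or two failed disks."""
    assert failures in (1, 2)
    return failures * (G - 2) / (N - 2)

N = 21
for G in (5, 11, 21):
    print(G, round(alpha_raid5(G, N), 2), round(beta_raid6(G, N, 2), 2))
# With G = N the read load doubles for RAID5 (alpha = 1) and triples for RAID6
# with two failures (load increase = 2); smaller G reduces the increase.
```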

3. DISK ARRAY MODEL AND NOTATION

The analysis is concerned with the performance of clustered RAID5 and RAID6 disk arrays processing discrete requests, which originate from a large number of finite sources, e.g. concurrent transactions of an OLTP (online transaction processing) application being processed at a computer system. The cost of single read (SR), single write (SW) and RMW disk accesses is given as DSR, DSW and DRMW respectively. Disk writes are slightly more costly than disk reads owing to head settling time [26]. An RMW request is an SR followed by a disk rotation, so that no head settling time is required. The mean service times for these requests can be determined for given workload and disk characteristics. The mean service times for SR, SW and RMW requests are denoted by x̄_SR, x̄_SW and x̄_RMW respectively.

The cost functions derived in this paper are sums of appropriate multiples of these costs and can be used to estimate the volume of data being transferred for each (logical) read or write request. The volume of data is much higher in degraded mode than in normal mode owing to fork-join requests. Even higher data transfer rates are required for rebuild processing. In addition, the cost functions in combination with disk drive characteristics can be used to determine the mean disk service time and the maximum disk array throughputs for different operating modes.

We use D, P and Q to denote that the data and parity blocks in RAID5 and RAID6 are available, and D̄, P̄ and Q̄ to denote that they are not because they reside on a failed disk. We use C to denote that a cluster (or PG) is free of disk failures and C̄ that it contains a failed disk, and C0, C1 and C2 for the events that a cluster of RAID6 with two disk failures contains zero, one or two failed disks. In deriving the cost functions we use tree diagrams to enumerate the cases systematically and to compute the cost functions. Equation numbers associated with each branch are given in Figure 2. We will be using the convention of specifying the probability of an event, i.e. a certain array configuration, followed by ⇒ pointing to the cost.


4. RAID5 COST FUNCTIONS

4.1. RAID5 in normal mode

Each read request incurs a cost DSR. For each write request a disk receives two RMW requests, the first of which is due to the data block residing on that disk. The second RMW, to update the parity, is due to the fact that each disk holds 1/(N − 1) of the parity blocks of the remaining N − 1 disks. The cost of processing each read and write is as follows:

C^read_RAID5/F0 = DSR,    (1)

C^write_RAID5/F0 = 2DRMW.    (2)

In order to determine the maximum bandwidth, we need to obtain the mean disk service time. For this purpose we substitute the cost functions DSR, DSW and DRMW with the appropriate mean disk service times x̄_SR, x̄_SW and x̄_RMW respectively. In RAID0 the mean service time is obtained as a weighted average of the service times for read and write requests: x̄_RAID0 = fr x̄_SR + fw x̄_SW. The maximum throughput in RAID0 is given as follows:

F^max_RAID0 = N / x̄_RAID0.    (3)

In RAID5 and RAID6 we use i in Fi to denote the number of failed disks. The mean service time in RAID5 with no disk failures is x̄_RAID5/F0 = fr x̄_SR + 2 fw x̄_RMW. The maximum throughput for RAID5 in normal mode is N times the throughput of a single disk:

F^max_RAID5/F0 = N / x̄_RAID5/F0.    (4)

The maximum throughput for RAID5 with a single disk failure and RAID6 with zero, one and two disk failures can be obtained similarly, i.e. by dividing the number of surviving disks by the mean disk service time. These expressions are omitted here for brevity, but in Section 6 we plot the maximum throughputs as G is varied. The disk array throughput multiplied by the cost functions can be used to estimate the volume of data to be transferred. In the case of SR and SW accesses the volume of data transferred equals one block. For RMW requests the disk holding the data block receives the new version of the data block and it computes ddiff and sends it to the check disks involved.
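The sketch below shows how Equations (1)–(4) translate into mean service times and maximum throughputs; the numerical service times are the ones derived later in Section 6.1, and the read fractions match the workloads used in Section 6.2.

```python
# Mean disk service time and maximum throughput in normal mode (Equations (3)-(4)).
# Service times (ms) are those obtained for the IBM Ultrastar 18ES in Section 6.1.

x_SR, x_SW, x_RMW = 11.54, 12.34, 19.87   # ms

def x_raid0(fr: float) -> float:
    return fr * x_SR + (1 - fr) * x_SW

def x_raid5_f0(fr: float) -> float:
    # each logical write costs two RMWs spread over the array (Equation (2))
    return fr * x_SR + 2 * (1 - fr) * x_RMW

def max_throughput(n_disks: int, x_mean_ms: float) -> float:
    """Requests per second sustained by n_disks disks."""
    return n_disks * 1000.0 / x_mean_ms

N = 21
for fr in (1.0, 0.75, 0.5):
    print(fr, round(max_throughput(N, x_raid0(fr)), 1),
          round(max_throughput(N, x_raid5_f0(fr)), 1))
```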

4.2. RAID5 in degraded mode

4.2.1. Read requests
There are two cases: either the data block is available or it is not. We first specify the probability associated with an event and provide the cost following ⇒:

Pr[D] = (N − 1)/N ⇒ DSR,    Pr[D̄] = 1/N ⇒ (G − 1)DSR.

The overall cost for read requests is

C^read_RAID5/F1 = [(N + G − 2)/N] DSR.    (5)

4.2.2. Write requests
Pr[D] and Pr[D̄] are given as before. We next determine whether the PG to which D belongs contains the failed disk or not, and in the former case the probability that P is on the failed disk. We use C and C̄ to denote that the PG is free of failures or that it contains the failed disk, respectively:

Pr[C|D] = (N − G)/(N − 1),    Pr[C̄|D] = (G − 1)/(N − 1).

The probability that the parity disk has failed when there is a failed disk in the PG is

Pr[P̄|C̄ ∧ D] = 1/(G − 1),    Pr[P|C̄ ∧ D] = (G − 2)/(G − 1).

Data and parity blocks are both available in two cases: (i) the data block is not on a failed disk and the PG to which it belongs does not include the failed disk; (ii) the failed disk is one of the disks in the PG of the data block, but neither the data nor the parity block is on the failed disk. Using the law of total probability, e.g. Pr[A] = Pr[A|B]Pr[B] + Pr[A|B̄]Pr[B̄] [20], we have

Pr[D ∧ P] = Pr[P|C ∧ D] × Pr[C|D] × Pr[D] + Pr[P|C̄ ∧ D] × Pr[C̄|D] × Pr[D]
          = 1 × [(N − G)/(N − 1)] × [(N − 1)/N] + [(G − 2)/(G − 1)] × [(G − 1)/(N − 1)] × [(N − 1)/N]
          = (N − 2)/N ⇒ 2DRMW.

That Pr[D ∧ P] = (N − 2)/N is consistent with our expectations. When the data block is available and the parity block has failed, the data block can be updated by an SW. The probability of this event can be obtained by applying the law of conditional probability, Pr[A ∧ B] = Pr[A|B]Pr[B]. We have

Pr[D ∧ P̄] = Pr[P̄|C̄ ∧ D] × Pr[C̄|D] × Pr[D] = [1/(G − 1)] × [(G − 1)/(N − 1)] × [(N − 1)/N] = 1/N ⇒ DSW.

In the remaining case the data block is on the failed disk, so that P is certainly on a non-failed disk:

Pr[D̄] = 1/N ⇒ (G − 2)DSR + DSW.

Note that the three events cover all cases, so that the probabilities sum to one, i.e. Pr[D ∧ P] + Pr[D ∧ P̄] + Pr[D̄] = (N − 2)/N + 1/N + 1/N = 1. Although the three probabilities could have been written directly, the more detailed analysis ensures that it is the parity block corresponding to the data block that is being considered. The overall cost for write requests is

C^write_RAID5/F1 = [2(N − 2)/N] DRMW + [2/N] DSW + [(G − 2)/N] DSR.    (6)

Note that the PG size affects only the number of disk blocks accessed, not the probabilities of the various events.

5. RAID6 COST FUNCTIONS

The analysis of RAID6 without clustering has been given in [17] but is repeated in Appendix B using the simplified (as compared with [17]) tree diagram in Figure 1. The analysis of clustered RAID6 is reported in three subsections, dealing with the operating cost with zero, one and two disk failures. The equations are equivalent from the viewpoint of the probabilistic analysis, except that the number of disks to be read as part of reconstruct operations is G rather than N: it is shown in Appendix B that setting G = N in the equations obtained in this section yields the same result as the corresponding equation in the appendix.

FIGURE 1. Tree diagram for RAID6 with two disk failures.

5.1. Clustered RAID6 in normal mode

The cost for read and write requests is given as follows:

C^read_RAID6/F0 = DSR,    C^write_RAID6/F0 = 3DRMW.    (7)

5.2. Clustered RAID6 with a single disk failure

RAID6 operation with one disk failure is similar to RAID5, except that we need to read only one of the two parity blocks for recovery. The cost equations in this case are similar to those for RAID6 in Appendix B, with the difference that G − 2 blocks in a PG need to be read to recreate a block. The cost for read requests is as follows:

C^read_RAID6/F1 = [(N − 1)/N] DSR + [(G − 2)/N] DSR = [(N + G − 3)/N] DSR.    (8)

In the case of a write there are three cases, with probabilities Pr[D ∧ P ∧ Q] = (N − 3)/N, Pr[D ∧ P̄ ∧ Q] = Pr[D ∧ P ∧ Q̄] = 1/N and Pr[D̄] = 1/N. The corresponding costs are 3DRMW, 2DRMW and (G − 3)DSR + 2DSW, so that

C^write_RAID6/F1 = [(3N − 5)/N] DRMW + [(G − 3)/N] DSR + [2/N] DSW.    (9)

5.3. Clustered RAID6 with two disk failures

5.3.1. Read requests
This case is similar to operation with a single disk failure, except that the probability that D resides on a failed disk is doubled. Furthermore, when the data block is not available, the remaining G − 2 blocks in the PG, one or two of which are parities, should be read to reconstruct the missing block. We have the following two cases:

Pr[D] = (N − 2)/N ⇒ DSR,    Pr[D̄] = 2/N ⇒ (G − 2)DSR.

The overall cost for read requests is

C^read_RAID6/F2 = [(N + 2G − 6)/N] DSR.    (10)

5.3.2. Write requests
There are many cases, which are given in Figure 2 and enumerated below.

FIGURE 2. Tree diagram for clustered RAID6 with two disk failures. Equation numbers corresponding to the leaf nodes are enclosed in parentheses.

Data block D is available
The probability mass function (pmf) for the number of failed disks in the PG to which the data block belongs is given by the hypergeometric pmf [20]. This is the probability of selecting m faulty disks in a random sample of G disks from N disks, d = 2 of which are defective, which is given as

Pr[m; G, d, N] = (d choose m)(N − d choose G − m) / (N choose G),    0 ≤ m ≤ 2.

In what follows Ci denotes the event that there are i failed disks in the PG:

Pr[C0|D] = (N − G)(N − G − 1) / [(N − 1)(N − 2)],
Pr[C1|D] = 2(G − 1)(N − G) / [(N − 1)(N − 2)],
Pr[C2|D] = (G − 1)(G − 2) / [(N − 1)(N − 2)].

The probability and the cost with no disk failures in the PG can be obtained by multiplying Pr[C0|D] by the probability that the data block D is not among the two failed disks, i.e. (N − 2)/N:

Pr[D ∧ P ∧ Q ∧ C0] = (N − G)(N − G − 1) / [(N − 1)N] ⇒ 3DRMW.    (11)

With a single failed disk in the PG, the probability that P has failed or not is

Pr[P̄|D ∧ C1] = 1/(G − 1),    Pr[P|D ∧ C1] = (G − 2)/(G − 1).

Q cannot have failed when P has failed, since there is only a single failed disk in the PG. When P has not failed, there are two cases:

Pr[Q|D ∧ P ∧ C1] = (G − 3)/(G − 2),    Pr[Q̄|D ∧ P ∧ C1] = 1/(G − 2).

We have three cases with a single disk failure in the PG:

Pr[D ∧ P ∧ Q ∧ C1] = 2(N − G)(G − 3) / [(N − 1)N] ⇒ 3DRMW,    (12)

Pr[D ∧ P ∧ Q̄ ∧ C1] = 2(N − G) / [(N − 1)N] ⇒ 2DRMW,    (13)

Pr[D ∧ P̄ ∧ Q ∧ C1] = 2(N − G) / [(N − 1)N] ⇒ 2DRMW.    (14)

When two disks in the PG have failed and D is known not to have failed, the probability that P has failed or not is

Pr[P|D ∧ C2] = (G − 3)/(G − 1),    Pr[P̄|D ∧ C2] = 2/(G − 1).

When P has not failed, there are two cases for Q:

Pr[Q|D ∧ P ∧ C2] = (G − 4)/(G − 2),    Pr[Q̄|D ∧ P ∧ C2] = 2/(G − 2).

When P has failed, there are two cases for Q:

Pr[Q|D ∧ P̄ ∧ C2] = (G − 3)/(G − 2),    Pr[Q̄|D ∧ P̄ ∧ C2] = 1/(G − 2).

We have the following four cases:

Pr[D ∧ P ∧ Q ∧ C2] = (G − 3)(G − 4) / [(N − 1)N] ⇒ 3DRMW,    (15)

Pr[D ∧ P ∧ Q̄ ∧ C2] = 2(G − 3) / [(N − 1)N] ⇒ 2DRMW,    (16)

Pr[D ∧ P̄ ∧ Q ∧ C2] = 2(G − 3) / [(N − 1)N] ⇒ 2DRMW,    (17)

Pr[D ∧ P̄ ∧ Q̄ ∧ C2] = 2 / [(N − 1)N] ⇒ DSW.    (18)

Data block D is unavailable
The probability that D is unavailable is Pr[D̄] = 2/N. Let C denote the event that the other failed disk is not in the PG, and C̄ that it is. The probabilities of the two cases are

Pr[C] = (N − G)/(N − 1),    Pr[C̄] = (G − 1)/(N − 1).

The probability that in the second case P has failed or not is

Pr[P̄|C̄] = 1/(G − 1),    Pr[P|C̄] = (G − 2)/(G − 1).

The probability that Q has failed or not is given as follows:

Pr[Q̄|P ∧ C̄] = 1/(G − 2),    Pr[Q|P ∧ C̄] = (G − 3)/(G − 2).

Obviously, when P has failed Q cannot have failed, since we already have two failed disks in the system. The probabilities and costs in the four cases are

Pr[D̄ ∧ P ∧ Q ∧ C] = 2(N − G) / [(N − 1)N] ⇒ (G − 3)DSR + 2DSW,    (19)

Pr[D̄ ∧ P ∧ Q ∧ C̄] = 2(G − 3) / [(N − 1)N] ⇒ (G − 3)DSR + 2DSW,    (20)

Pr[D̄ ∧ P̄ ∧ Q ∧ C̄] = 2 / [(N − 1)N] ⇒ (G − 3)DSR + DSW,    (21)

Pr[D̄ ∧ P ∧ Q̄ ∧ C̄] = 2 / [(N − 1)N] ⇒ (G − 3)DSR + DSW.    (22)

The overall cost for writes is then

C^write_RAID6/F2 = [(N − 3)(3N − 4) / ((N − 1)N)] DRMW + [2(2N − 3) / ((N − 1)N)] DSW + [2(G − 3)/N] DSR.    (23)

It is interesting to note that the first two terms, for DRMW and DSW, are the same as for a non-clustered RAID6. We can make the stronger statement that clustered RAID can be analyzed by carrying out the probabilistic analysis for unclustered RAID6 and substituting the cost of processing reconstruction (with G rather than N blocks) in the last step.
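Equation (23) can be cross-checked by summing the leaf cases (11)–(22); the following sketch does this with exact arithmetic. Each case carries a tuple with its DRMW, DSW and DSR multipliers; the function name is ours.

```python
from fractions import Fraction as F

def raid6_f2_write_cost(N: int, G: int):
    """Sum the leaf cases (11)-(22); returns the multipliers of (D_RMW, D_SW, D_SR)."""
    den = F(1, (N - 1) * N)
    cases = [  # (probability, (rmw, sw, sr))
        ((N - G) * (N - G - 1) * den, (3, 0, 0)),        # (11)
        (2 * (N - G) * (G - 3) * den, (3, 0, 0)),        # (12)
        (2 * (N - G) * den, (2, 0, 0)),                  # (13)
        (2 * (N - G) * den, (2, 0, 0)),                  # (14)
        ((G - 3) * (G - 4) * den, (3, 0, 0)),            # (15)
        (2 * (G - 3) * den, (2, 0, 0)),                  # (16)
        (2 * (G - 3) * den, (2, 0, 0)),                  # (17)
        (2 * den, (0, 1, 0)),                            # (18)
        (2 * (N - G) * den, (0, 2, G - 3)),              # (19)
        (2 * (G - 3) * den, (0, 2, G - 3)),              # (20)
        (2 * den, (0, 1, G - 3)),                        # (21)
        (2 * den, (0, 1, G - 3)),                        # (22)
    ]
    assert sum(p for p, _ in cases) == 1                 # the cases are exhaustive
    rmw = sum(p * c[0] for p, c in cases)
    sw = sum(p * c[1] for p, c in cases)
    sr = sum(p * c[2] for p, c in cases)
    # Closed form of Equation (23)
    assert rmw == F((N - 3) * (3 * N - 4), (N - 1) * N)
    assert sw == F(2 * (2 * N - 3), (N - 1) * N)
    assert sr == F(2 * (G - 3), N)
    return rmw, sw, sr

print(raid6_f2_write_cost(N=21, G=11))
```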

6. NUMERICAL RESULTS

We first specify the disk parameters which are used to obtain mean disk service times. The maximum throughputs for clustered RAID disk arrays are then given.

6.1. Disk parameters

As stated in Section 3, we assume that disk requests are to small blocks which are uniformly distributed over all disk blocks. The mean disk service times for processing SR, SW and RMW requests are denoted by x̄_SR, x̄_SW and x̄_RMW respectively. We have ignored disk controller overhead, since it is small, and cache hits in the onboard cache, since they have no effect when disk accesses are random [26]. The mean service time for SR requests is x̄_SR = x̄_seek + x̄_latency + x̄_transfer, which is the sum of the mean seek, mean latency and mean transfer times. The mean service times for SW and RMW requests are given as x̄_SW = x̄_SR + T_hst and x̄_RMW = x̄_SR + T_rot, where T_hst and T_rot denote the head settling time and disk rotation time respectively. A first-come, first-served (FCFS) disk scheduling policy is postulated and influences the following analysis.

The seek distance pmf for uniform accesses to disk cylinders is given by p_D[0] = 1/C and p_D[d] = 2(C − d)/C², 1 ≤ d ≤ C − 1, where C is the number of disk cylinders. This expression holds under the assumption that accesses to disk blocks are uniform, provided that the disk has the same number of sectors per track. It can be used to approximate the pmf in zoned disks, where outer tracks hold more sectors than inner tracks [26] and the probability of accessing a disk cylinder is proportional to the number of blocks on the cylinder. A method to compute the seek distance distribution and the moments of seek time is given in [15], but in computing the mean disk service time we have used the method in [27], which lends itself to a more intuitive derivation. The mean rotational latency for accesses to small disk blocks is approximated by one-half of the mean disk rotation time. The mean disk transfer time is determined by the cylinder on which the disk block resides, so that a weighted average based on cylinder capacities is used for this purpose.

Numerical results in this study are based on 7200 RPM, 9.17 GB IBM Ultrastar 18ES disk drives, whose detailed characteristics are given at [28]. Although higher RPM disk drives are readily available, 7200 RPM disk drives are commonly used in servers. The disk rotation time is T_rot = 8.33 ms, so that the mean latency is x̄_latency = 4.16 ms. The outermost and innermost tracks have 390 and 247 sectors respectively. Each disk cylinder has five tracks. The transfer time of 4 KB blocks is between 0.17 and 0.27 ms, with a mean x̄_transfer = 0.21 ms. The mean seek time is x̄_seek = 7.16 ms and the head settling time is T_hst = 0.8 ms. We have x̄_SR = 11.54 ms, x̄_SW = 12.34 ms and x̄_RMW = 19.87 ms. The disk can sustain 87 requests per second to read 4 KB blocks.

The assumption that disk requests arrive according to a Poisson process makes it possible to utilize the M/G/1 queueing model to estimate the mean response time of user requests, R_SR = W + x̄_SR, where W is the mean waiting time according to the M/G/1 model, which applies to all requests processed by the disk [16, 17]. A significant improvement in R_SR is achievable by giving a higher priority to read requests. Simulation studies using the Disk Array Simulator tool have shown that our analysis provides a good estimate of the mean disk response time of an individual disk and also of the mean response time for various (unclustered) RAID levels with FCFS scheduling [17]. These studies quantify the effect of the overhead resulting from fault tolerance on RAID5 and RAID6 performance by comparing their mean response times and maximum throughputs against RAID0 disk arrays. This study is concerned only with the maximum throughput and can be extended to non-FCFS disk scheduling by appropriately adjusting the mean disk service time.
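The quoted service times follow directly from the stated drive parameters; a sketch of the arithmetic (zoning details and the seek-time moments are omitted, so the results differ from the quoted values only by rounding):

```python
# Mean service times for the 7200 RPM IBM Ultrastar 18ES, per the parameters above.

T_rot = 8.33        # ms, full rotation
T_seek = 7.16       # ms, mean seek (FCFS, uniform accesses)
T_hst = 0.8         # ms, head settling time
x_transfer = 0.21   # ms, mean 4 KB transfer (weighted over zones)

x_latency = T_rot / 2.0                      # paper quotes 4.16 ms
x_SR = T_seek + x_latency + x_transfer       # paper quotes 11.54 ms
x_SW = x_SR + T_hst                          # paper quotes 12.34 ms
x_RMW = x_SR + T_rot                         # paper quotes 19.87 ms

print(round(x_SR, 2), round(x_SW, 2), round(x_RMW, 2))
print(round(1000.0 / x_SR))                  # ~87 4 KB reads per second
```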

FIGURE 3. PG size versus maximum throughput per millisecond, R : W = 1 : 0.

FIGURE 4. PG size versus maximum throughput per millisecond, R : W = 3 : 1.

6.2. Maximum system throughputs

We report the maximum throughput F^i_max(L, fr, N, G), where i denotes the number of failed disks. We consider RAID5 and RAID6 disk arrays, denoting their levels by L = 5 and L = 6 respectively. We set the number of disks to N = 21 and vary the group size as 3 ≤ G ≤ N. We do not consider G = 2, since it amounts to mirroring. In Figures 3, 4 and 5 we plot the maximum attainable throughput F^i_max(.) (per millisecond) versus the PG size G for RAID5 with i = 0, 1 and RAID6 with i = 0, 1, 2 disk failures, based on the following fractions of read requests: fr = 1.0, fr = 0.75 and fr = 0.5. Less formally, the R:W (read:write) ratios are 1:0, 3:1 and 1:1. Higher fractions of write requests are not meaningful for OLTP environments. We have also added the maximum throughput for RAID0 as a metric against which the cost of fault tolerance can be gauged. The following observations can be made.

For fr = 1.0, or a read–write ratio R : W = 1 : 0, RAID5 and RAID6 with i = 0 yield the same maximum throughput, which also equals that of RAID0. The obvious requirement is that the number of disks be the same in all cases. In fact, F^0_max(5, 1.0, N, G) = F^0_max(6, 1.0, N, G′), where G ≠ G′. In degraded mode F^1_max(5, 1.0, N, G) = (1/2) F^0_max(5, 1.0, N, G), whereas the degradation for RAID6, F^1_max(6, 1.0, N, G), is slightly smaller.

As far as fr = 0.75 is concerned, F^0_max(6, 0.75, N, G) is lower than the corresponding RAID5 throughput, which is due to the fact that RAID6 requires the updating of two check blocks rather than just one. In the case of single disk failures F^1_max(5, fr, N, G) ≈ (1/2) F^0_max(5, fr, N, G), whereas the percentage reduction is smaller for RAID6. For RAID6 with two disk failures F^2_max(6, 0.75, N, G) ≤ (1/3) F^0_max(6, 0.75, N, G).
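Curves of the kind plotted in Figures 3–5 can be approximated directly from the cost functions; a sketch for RAID5 follows (the RAID6 cases follow the same pattern using Equations (7), (9) and (23)). Service times are those of Section 6.1, and the function names are ours.

```python
# Approximate maximum throughput (requests/ms) versus PG size G, RAID5 with N = 21.
# Cost functions: Equations (1)-(2) in normal mode, (5)-(6) in degraded mode.

x_SR, x_SW, x_RMW = 11.54, 12.34, 19.87   # ms
N = 21

def x_raid5(G: int, fr: float, failures: int) -> float:
    """Mean disk service time per user request for RAID5 with 0 or 1 failed disks."""
    fw = 1 - fr
    if failures == 0:
        return fr * x_SR + fw * 2 * x_RMW
    read = (N + G - 2) / N * x_SR                                   # Equation (5)
    write = (2 * (N - 2) / N) * x_RMW + (2 / N) * x_SW + ((G - 2) / N) * x_SR  # Equation (6)
    return fr * read + fw * write

def f_max(G: int, fr: float, failures: int) -> float:
    surviving = N - failures
    return surviving / x_raid5(G, fr, failures)        # requests per millisecond

for G in range(3, N + 1, 3):
    print(G, round(f_max(G, 0.75, 0), 3), round(f_max(G, 0.75, 1), 3))
```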


We finally consider fr = 0.5, or R : W = 1 : 1. In normal mode there is an even further reduction in F^0_max(L, 0.5, N, G) with respect to fr = 1.0 and fr = 0.75. There is a reduction of 55% (resp. 66%) in the maximum throughput of RAID5 (resp. RAID6) with G = N with respect to RAID0. The difference in the maximum throughputs of RAID5 and RAID6 is even more pronounced, since RAID6 updates two check blocks per write, whereas RAID5 updates only one. For RAID5 with a single disk failure F^1_max(5, 0.5, N, G) ≈ (1/2) F^0_max(5, 0.5, N, G). For RAID6, F^0_max(6, 0.5, N, G) is much lower than the maximum throughput for read requests only. The maximum throughput drops by one-third for a single disk failure and by one-half for two disk failures.

7. CONCLUSIONS

We have introduced a systematic method for evaluating the cost of processing read and write requests for various modes of operation in clustered RAID5 and RAID6 disk arrays via tree diagrams. The cost functions are expressed as sums of multiples of basic RAID operators. The analysis is rather detailed and takes into account the fact that each row in the RAID5 and RAID6 disk arrays holds more than one PG; i.e. it is the check disks corresponding to a data block that affect performance. Interestingly this detail does not affect the final results, so that the analysis could have been carried out more easily by first doing a probabilistic analysis of an unclustered system and substituting the costs associated with clustering at the end, i.e. substituting G for N where appropriate. In other words the PG size does not affect the probability of the global events that influence processing costs.

Given disk characteristics, the cost functions can be translated into mean disk service times in normal and degraded operating modes. These cost functions can be utilized to obtain the maximum throughput attainable by RAID5 and RAID6 systems. The analysis is also applicable to EVENODD and RDP disk arrays with small symbol sizes [16]. We have quantified the effect of the PG size G and the read to write ratio on performance. Our results show that the selection of smaller values for G makes it possible to build RAID5 and RAID6 disk arrays exhibiting very small degradation in performance with respect to normal mode, while tolerating one and two disk failures respectively. This is in contrast to some RAID1 organizations which have a higher level of redundancy yet cope poorly in terms of performance [4].

The maximum throughputs reported in this study are of interest from the viewpoint of their relative, rather than absolute, values. We have assumed that disk requests are uniformly distributed over all disk blocks, resulting in relatively long seek distances and hence mean seek times. A beta distribution across disk cylinders, approximating an organ pipe organization, would be a better representation for seek distances. The assumption that the disk scheduling policy is FCFS simplifies the analysis, since the mean service time does not depend on the number of requests available for disk scheduling, but it provides rather pessimistic results. Most disk controllers implement a more aggressive disk scheduling policy than FCFS. The shortest processing time first policy, with 32 randomly placed disk requests available for scheduling, cuts the mean disk service time by half and doubles the throughput of individual disks [29].

We have ignored the effect of the disk array controller cache. In addition to read hits, certain efficiencies in processing write requests result in improved performance as follows. First, according to the fast-write paradigm, writes are cached in NVRAM at the disk array controller. A dirty block may be overwritten while it resides in NVRAM, resulting in a reduced rate of write requests to disk. When the NVRAM is full, the destaging of blocks is optimized to take advantage of disk geometry, e.g. destaging blocks on neighboring cylinders. Results from I/O trace analysis to capture these effects have been incorporated into the computation of mean service time in [11, 15].

Generalizations to more than two check disks are possible [30]. With k check disks k + 1 RMWs are required, and with k disk failures and 100% read requests the load on each surviving disk is increased (k + 1)-fold, unless clustered RAID is adopted. Note that reconstruct writes become more desirable for higher k [18].

Different applications require different RAID levels to attain improved performance. The heterogeneous disk array (HDA) allows multiple virtual arrays (VAs) with different RAID levels to share space on the same disk [31]. Clustered RAID5 and RAID6 are desirable when disk capacity utilization is low and disk (bandwidth) utilization is high. They allow higher flexibility in allocating VAs, while ensuring that disk saturation does not occur. The formulas derived in this paper can be used as input to a data allocation algorithm for an HDA.

ACKNOWLEDGEMENTS

Figures 1 and 2 were prepared by Ms C. Liu and Figures 3–5 by Mr J. Xu. We acknowledge the support of the NSF through Grant 0105485 in Computer System Architecture.

REFERENCES

[1] Patterson, D. A., Gibson, G. A. and Katz, R. H. (1988) A case for redundant arrays of inexpensive disks (RAID). In Proc. ACM SIGMOD Int. Conf. on Management of Data, Chicago, IL, June 1–3, pp. 109–116. ACM Press.
[2] Chen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H. and Patterson, D. A. (1994) RAID: high-performance, reliable secondary storage. ACM Comput. Surv., 26(2), 145–185.
[3] Gray, J. and Shenoy, P. J. (2000) Rules of thumb in data engineering. In Proc. 16th IEEE Int. Conf. on Data Engineering (ICDE 2000), San Diego, CA, February 28–March 3, pp. 3–12.
[4] Thomasian, A. and Xu, J. (2005) Reliability and Performance of Mirrored Disk Organizations. Technical Report ISL-2004-06, Integrated Systems Laboratory, Computer Science Department, NJIT. Submitted for publication.
[5] Gibson, G. A. (1992) Redundant Disk Arrays: Reliable, Parallel Secondary Storage. MIT Press.
[6] Muntz, R. R. and Lui, J. C. S. (1990) Performance analysis of disk arrays under failure. In Proc. 16th Int. Conf. on Very Large Data Bases (VLDB), Brisbane, Queensland, Australia, August 13–16, pp. 162–173.
[7] Ng, S. W. and Mattson, R. L. (1994) Uniform parity distribution in disk arrays with multiple failures. IEEE Trans. Comput., 43(4), 501–506.
[8] Holland, M., Gibson, G. A. and Siewiorek, D. P. (1994) Architectures and algorithms for on-line failure recovery in redundant disk arrays. J. Distrib. Parallel Dat., 2(3), 295–335.
[9] Merchant, A. and Yu, P. S. (1996) Analytic modeling of clustered RAID with mapping based on nearly random permutation. IEEE Trans. Comput., 45(3), 367–373.
[10] Park, C. I. (1995) Efficient placement of parity and data to tolerate two disk failures in disk array systems. IEEE Trans. Parall. Distr. Syst., 6(11), 1177–1184.
[11] Menon, J. M. (1994) Performance of RAID5 disk arrays with read and write caching. J. Distrib. Parallel Dat., 11(3), 261–293.
[12] Chen, S. Z. and Towsley, D. (1993) The design and evaluation of RAID 5 and parity striping disk array architectures. J. Parall. Distr. Comput., 17(1/2), 58–74.
[13] Gray, J., Horst, B. and Walker, M. (1990) Parity striping of disc arrays: low-cost reliable storage with acceptable throughput. In Proc. 16th Int. Conf. on Very Large Data Bases (VLDB), Brisbane, Queensland, Australia, August 13–16, pp. 148–161.
[14] Thomasian, A. and Menon, J. (1994) Performance analysis of RAID5 disk arrays with a vacationing server model. In Proc. 10th IEEE Int. Conf. on Data Engineering (ICDE), Houston, TX, February 14–18, pp. 111–119.
[15] Thomasian, A. and Menon, J. (1997) RAID5 performance with distributed sparing. IEEE Trans. Parall. Distr. Syst., 8(6), 640–657.
[16] Han, C. and Thomasian, A. (2003) Performance of two disk failure tolerant disk arrays. In Proc. Int. Symp. on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), Montreal, Quebec, Canada, July 20–24, pp. 572–579. The Society for Modeling and Simulation International.
[17] Thomasian, A., Han, C., Fu, G. and Liu, C. (2004) A performance evaluation tool for RAID disk arrays. In Proc. 1st Int. Conf. on Quantitative Evaluation of Systems (QEST'04), Enschede, The Netherlands, September 27–30, pp. 8–17.
[18] Thomasian, A. (2005) Read-modify vs reconstruct writes for RAID. Inform. Process. Lett., 93(4), 163–168.
[19] Menon, J. M. (1995) A performance comparison of RAID5 and log-structured arrays. In Proc. 4th IEEE Int. Symp. on High Performance Distributed Computing (HPDC), Washington, DC, August 2–4, pp. 167–178.
[20] Trivedi, K. S. (2002) Probability and Statistics with Reliability, Queueing, and Computer Science Applications. Wiley-Interscience.
[21] Takagi, H. (1991) Queueing Analysis—A Foundation of Performance Evaluation. North-Holland.
[22] Thomasian, A., Fu, G. and Ng, S. W. (2005) Analysis of Rebuild Processing in RAID5 Disk Arrays. Technical Report ISL-2004-05, Integrated Systems Laboratory, Computer Science Department, NJIT.
[23] Boxma, O. J. and Cohen, J. W. (1991) The M/G/1 queue with permanent customers. IEEE J. Sel. Area. Commun., 9(2), 179–184.
[24] Blaum, M., Brady, J., Bruck, J. and Menon, J. (1995) EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput., 44(2), 192–202.
[25] Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J. and Sankar, S. (2004) Row-diagonal parity for double disk failure correction. In Proc. 3rd USENIX Conf. on File and Storage Technologies (FAST'04), San Francisco, CA, March 31–April 2.
[26] Ng, S. W. (1998) Advances in disk technology: performance issues. IEEE Computer, 31, 75–81.
[27] Thomasian, A. (2005) Comment on 'RAID5 performance with distributed sparing'. IEEE Trans. Parall. Distr. Syst., to appear.
[28] Diskspecs. Validated disk parameters. Available at http://www.pdl.cmu.edu/DiskSim/diskspecs.html.
[29] Thomasian, A. and Liu, C. (2005) Empirical Analysis of the SATF and SCAN Disk Scheduling Policies. Technical Report ISL-2005-02. Submitted for publication.
[30] Alvarez, G. A., Burkhard, W. A. and Cristian, F. (1997) Tolerating multiple failures in RAID with optimal storage and uniform declustering. In Proc. 24th Int. ACM Symp. on Computer Architecture (ISCA), Denver, CO, June 2–4, pp. 62–72.
[31] Thomasian, A. and Han, C. (2005) Heterogeneous disk array: design and data allocation. In Proc. 2005 Int. Symp. on Performance Evaluation of Computer and Telecommunication Systems (SPECTS'05), Cherry Hill, NJ, July 24–28, pp. 617–624. The Society for Modeling and Simulation International.
[32] Hall, M. (1986) Combinatorial Theory. Wiley.
[33] Durstenfeld, R. (1964) Algorithm 235: random permutation. Commun. ACM, 7(7), 420.

APPENDIX A: CLUSTERED RAID DESIGNS

Balancing the load for updating parities in RAID5 and RAID6 can be handled easily using the left symmetric data layout. Data allocation in clustered RAID to balance the update load is a more challenging problem. Given N and G there are (N choose G) ways of choosing a parity group, but this number may be too large to yield a balanced allocation on finite capacity disks, at least for the fraction of disk capacity that is utilized [6]. BIBDs were proposed in [7, 8] as an alternative. NRP is an alternative to BIBD to balance the load for updating parities [9]. We also describe the RM2 disk array, which can tolerate two disk failures and is in addition a clustered RAID [10].

BIBD data layout
A BIBD data layout is a grouping of N distinct objects into b blocks, such that each block contains G objects, each object occurs in exactly r blocks, and each pair of objects appears in exactly L blocks [32]. Only three out of the five variables are free, since bG = Nr and r(G − 1) = L(N − 1). For N = 10, G = 4, the number of PGs is b = 15, the number of domains (different PGs) per disk is r = 6, and the number of PGs common to any pair of disks is L = 2, as shown in Table 1, adapted from [7]. Many more BIBD tables can be derived from the specifications of BIBD designs given in [32]. On the other hand, BIBDs are not available for all values of N and G, e.g. a layout for N = 33 with G = 12 does not exist, but G = 11 and G = 13 can be used instead [8]. In the case of RAID5 (resp. RAID6), the P (resp. P and Q) parities can be consistently assigned as the last (resp. last two) SUs in each PG.

TABLE 1. BIBD data layout with N = 10 disks and PG size G = 4. Each column lists the six PGs to which the disk belongs.

Disk #:  1   2   3   4   5   6   7   8   9   10
PGs:     1   1   2   3   1   2   3   1   2   3
         2   4   4   4   5   6   7   5   6   7
         3   6   5   5   8   8   8   9   10  11
         4   7   7   6   10  9   9   12  12  12
         8   9   10  11  11  11  10  14  13  13
         12  13  14  15  13  14  15  15  15  14

NRP data layout
Disk array space is organized as an M × N matrix, with M rows corresponding to stripes and N columns corresponding to disks. PGs of size G < N are placed sequentially on the N × M SUs in the array, so that PG i occupies SUs iG through iG + (G − 1). The initial allocation with N = 10 and G = 4 is shown in Table 2 (Pi–j stands for the parity of SUs Di through Dj). Note that the parity SUs appear on only one-half of the disks in this example, so that the parity update load is not balanced. This load can be balanced by randomizing the placement of blocks on the disks. The row number I is used as a seed to a pseudo-random number generator to generate a random permutation of {0, 1, . . . , N − 1}, given as P_I = {P0, P1, . . . , P_{N−1}}. The permutation is generated using Algorithm 235 in [33].¹ If mod(N, G) = 0 then the same random permutation is repeated G times, and if mod(N, G) ≠ 0 then the random permutation is repeated K = LCM(N, G)/N times, where LCM(N, G) denotes the least common multiple of N and G. Given the random permutation P1 = {0, 9, 7, 6, 2, 1, 5, 3, 4, 8} for the first row in Table 2, and since K = LCM(10, 4)/10 = 2, the same permutation is repeated in the second row, as shown in Table 3. The random permutation allocates approximately the same number of parity blocks per disk. Note that the SUs of a PG straddling two rows are mapped onto different disks, so that (i) the RAID5 paradigm can be used for recovery and (ii) all the data SUs in the PG can be accessed in parallel. For a given block number, knowing N, the SU size and the RAID level, we can determine the row number I. The SU in which the block resides is obtained by applying the permutation using I as a seed. The other SUs in the PG are also easily determined. This data layout can be used in conjunction with RAID6 disk arrays, where the last two SUs in a PG are assigned to be P and Q.

¹ Shuffle array a[i], i = 0, . . . , n − 1: for i := n step −1 until 2 do {j := entier(i × random + 1); b := a[i]; a[i] := a[j]; a[j] := b}.

TABLE 2. Allocation of PGs before permutation (N = 10, G = 4).

Disk #:  0     1      2    3       4    5      6    7      8    9
         D0    D1     D2   P0–2    D3   D4     D5   P3–5   D6   D7
         D8    P6–8   D9   D10     D11  P9–11  D12  D13    D14  P12–14
         D15   D16    D17  P15–17  D18  D19    D20  P18–20 D21  D22
         D23   P21–23 D24  D25     D26  P24–26 D27  D28    D29  P27–29

TABLE 3. Permuted data blocks with the nearly random permutation method (N = 10, G = 4).

Disk #:  0    1      2    3     4    5    6     7    8       9
         D0   D4     D3   P3–5  D6   D5   P0–2  D2   D7      D1
         D8   P9–11  D11  D13   D14  D12  D10   D9   P12–14  P6–8

The RM2 disk array
The RM2 method is defined as follows [10]: 'Given a redundancy ratio p and the number of disks N, construct N PGs each of which consists of 2(M − 1) data blocks and one parity block such that each data block should be included in two groups, where M = 1/p.' An algorithmic solution to this problem is based on an N × N redundancy matrix (RM), which determines the data layout for a given redundancy ratio p. Details are omitted here for brevity. For N odd, N ≥ 3M − 2, and for N even, N ≥ 4M − 5. The data layout for an RM2 array with N = 7 and M = 3 is shown in Table 4.

TABLE 4. Data layout for RM2 with N = 7 and M = 3.

Disk:  D0    D1    D2    D3    D4    D5    D6
       p0    p1    p2    p3    p4    p5    p6
       d2,3  d3,4  d4,5  d5,6  d0,6  d0,1  d1,2
       d1,4  d2,5  d3,6  d4,0  d5,1  d6,2  d0,3

Single disk failures in RM2 can be handled easily, but for two disk failures we need a recovery path. In Table 4 consider an access to d2,3 with disks D0 and D3 failed. Utilizing p2 and the associated PG we have d2,3 ← p2 ⊕ d2,5 ⊕ d6,2 ⊕ d1,2. However, if D2 fails instead of D3 then d2,3 cannot be reconstructed using p3 directly, since d3,6 is not available.


However, d3,6 can be reconstructed using p6, so we have the following two steps: (i) d3,6 ← p6 ⊕ d5,6 ⊕ d0,6 ⊕ d6,2; (ii) d2,3 ← p3 ⊕ d3,4 ⊕ d3,6 ⊕ d0,3. Therefore, the recovery path for block d2,3 is d3,6 → d2,3 and has a length of two. With disks D0 and D1 failed, the path to reconstruct d1,4 is d2,5 → d2,3 → d3,4 → d1,4. On average 2.4 disk accesses are required per (surviving) disk. In fact, the RM2 load with two disk failures is unbalanced. Disk loads can be balanced by randomizing the allocations on successive rows, e.g. using the NRP method. Balanced loads are assumed in the analysis that follows.

The analysis of RM2 in degraded mode is based on a function F(N, M) which denotes the number of blocks read for rebuild processing [16]. The function is obtained by applying curve fitting to simulation results. Validation against detailed simulations showed that this analysis is not acceptably accurate, which is due to the variability in the number of blocks accessed. The hybrid method utilized in [17] incorporates the moments of disk service time, which are obtained via simulation, into the M/G/1 formula for the mean waiting time. RM2 outperforms RAID6 without clustering when there is a single disk failure, since it is a clustered RAID. However, RM2 tends to be outperformed by RAID6 when there are two disk failures.
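A sketch of the NRP mapping described in Appendix A: each row's SUs are permuted with a Durstenfeld (Algorithm 235) shuffle seeded by the row group number. The seeding scheme and generator are illustrative, so the sketch will not reproduce the exact permutation P1 of Table 3.

```python
import math
import random

def row_permutation(seed: int, N: int) -> list:
    """Durstenfeld shuffle (Algorithm 235) of 0..N-1; the seeding scheme is illustrative."""
    rng = random.Random(seed)
    perm = list(range(N))
    for i in range(N - 1, 0, -1):
        j = rng.randint(0, i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def nrp_layout(rows: int, N: int = 10, G: int = 4) -> list:
    """Sequential PG placement (Table 2) followed by the per-row permutation (Table 3)."""
    K = math.lcm(N, G) // N          # number of consecutive rows sharing a permutation
    layout = []
    for r in range(rows):
        perm = row_permutation(r // K, N)
        row = [None] * N
        for pos in range(N):
            su = r * N + pos                    # stripe unit number in the initial layout
            pg, offset = divmod(su, G)
            label = f"P{pg}" if offset == G - 1 else f"D{su - pg}"
            row[perm[pos]] = label              # SU at position pos goes to disk perm[pos]
        layout.append(row)
    return layout

for row in nrp_layout(4):
    print(row)
```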

APPENDIX B: ANALYSIS OF BASIC RAID6

The normal mode of operation is similar to that of basic RAID5, except that three rather than two RMWs are required to update D, P and Q.

Single disk failure
The probability that the data block D is available or not for reading and the associated costs are

Pr[D] = (N − 1)/N ⇒ DSR,    Pr[D̄] = 1/N ⇒ (N − 2)DSR.

Note that either parity can be used in reconstructing D. In the case of writes there are three cases: (i) the D, P and Q blocks are not affected by the failure; (ii) one of the two check disks is affected by the failure; (iii) the data disk is affected by the failure:

Pr[D ∧ P ∧ Q] = (N − 3)/N ⇒ 3DRMW,

Pr[D ∧ P̄ ∧ Q] = Pr[D ∧ P ∧ Q̄] = 1/N ⇒ 2DRMW,

Pr[D̄ ∧ P ∧ Q] = 1/N ⇒ (N − 3)DSR + 2DSW.

The costs of processing read and write requests correspond to Equations (8) and (9) for clustered RAID6 in Section 5.2 respectively, with the substitution G = N.

Two disk failures
The probability that the data block D is available or not for reading and the associated costs are

Pr[D] = (N − 2)/N ⇒ DSR,    Pr[D̄] = 2/N ⇒ (N − 2)DSR.

In fact, when D is not available there are two cases, depending on whether the corresponding failed block is a data or a parity block; however, both cases incur the same cost. The cost of processing read requests is then given by Equation (10) in Section 5.3.1 with the substitution G = N. We have the following probabilities and cost functions, which follow directly from the branches of the tree diagram in Figure 1:

Pr[D ∧ P ∧ Q] = (N − 3)(N − 4) / [(N − 1)N] ⇒ 3DRMW,    (24)

Pr[D ∧ P̄ ∧ Q] = 2(N − 3) / [(N − 1)N] ⇒ 2DRMW,    (25)

Pr[D ∧ P ∧ Q̄] = 2(N − 3) / [(N − 1)N] ⇒ 2DRMW,    (26)

Pr[D ∧ P̄ ∧ Q̄] = 2 / [(N − 1)N] ⇒ DSW,    (27)

Pr[D̄ ∧ P ∧ Q] = 2(N − 3) / [(N − 1)N] ⇒ (N − 3)DSR + 2DSW,    (28)

Pr[D̄ ∧ P̄ ∧ Q] = 2 / [(N − 1)N] ⇒ (N − 3)DSR + DSW,    (29)

Pr[D̄ ∧ P ∧ Q̄] = 2 / [(N − 1)N] ⇒ (N − 3)DSR + DSW.    (30)

The overall cost for processing read and write requests is given by Equation (23) in Section 5.3.2, with the substitution G = N.
