Performance, reliability, and performability of a hybrid RAID array and ...

1 downloads 0 Views 618KB Size Report
Jun 21, 2012 - Performance, reliability, and performability of a hybrid RAID array and a comparison with traditional RAID1 arrays. Alexander Thomasian · Yujie ...
Cluster Comput (2012) 15:239–253 DOI 10.1007/s10586-012-0216-9

Performance, reliability, and performability of a hybrid RAID array and a comparison with traditional RAID1 arrays Alexander Thomasian · Yujie Tang

Received: 15 October 2011 / Accepted: 25 May 2012 / Published online: 21 June 2012 © Springer Science+Business Media, LLC 2012

Abstract We describe a hybrid mirrored disk organization patented by LSI Logic Corp. and compare its performance, reliability, and performability with traditional mirrored RAID1 disk organizations and RAID(4 + ),  ≥ 1. LSI RAID has the same level of redundancy as mirrored disks, but also utilizes parity coding. Unlike RAID1, which cannot tolerate all two disk failures, LSI RAID similarly to RAID6 is 2 Disk Failure Tolerant (2DFT), but in addition it can tolerate almost all three disk failures, while RAID1 organizations are generally 1DFT. We list analytic expressions for the reliability of various RAID1 organizations and use enumeration when the reliability expression cannot be obtained analytically. An asymptotic expansion method based on disk unreliabilities is used for an easy comparison of RAID reliabilities. LSI RAID performance is evaluated with the Read-Modify-Write (RMW) and ReConstruct Write (RCW) methods to update parities. The combination of the two methods is used to balance data and parity disk loads, which results in maximizing the I/O throughput. The analysis shows that LSI RAID has an inferior performance with respect to basic mirroring in processing an OLTP workload, but it outperforms RAID6. LSI RAID in A. Thomasian () · Y. Tang Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen, China e-mail: [email protected] Y. Tang e-mail: [email protected] A. Thomasian Thomasian & Associates, 17 Meadowbrook Rd., Pleasantville, NY 10570, USA Y. Tang ECE Dept., University of Waterloo, Waterloo, Ontario, Canada

spite of its higher Mean Time to Data Loss (MTTDL) is outperformed by other RAID1 organizations as far as its performability is concerned, i.e., the number of I/Os carried out by the disk array operating at maximum I/Os Per Second (IOPS) until data loss occurs. A survey of RAID1 organizations and distributed replicated systems is also included. Keywords Disk mirroring · RAID1 organizations · Parity encoding · Hybrid mirrored disks · LSI RAID · SSPiRAL RAID · RAID5 · RAID6 · Multiway replication · Performance analysis · Reliability analysis, · Performability Abbreviations BM Basic Mirroring CD Chained Declustering CRAID Clustered RAID Ddisk Data disk DoutD Data out-Degree GRD Group Rotate Declustering HRAID Hierarchical RAID HST Head Settling Time ID Interleaved Declustering IOPS I/Os per Second kDFT k Disk Failure Tolerant LSE Latent Sector Error MDS Maximum Distance Separable MTTDL Mean Time to Data Loss MTTF Mean Time to Failure OLTP OnLine Transaction Processing OSM Orthogonal Striping and Mirroring PCM Permanent Customer Model Pdisk Parity disk PinD Parity in-Degree RAID Redundant Array of Independent Disks RCW ReConstruct Write

240

RMD RMW RPM RS SADA SSPiRAL VSM XOR

Cluster Comput (2012) 15:239–253

Rotated Mirrored Declustering Read-Modify-Write Rotations Per Minute Reed-Solomon code Self-Adaptive Disk Array Survivable Storage using Parity in Redundant Array Layouts Vacationing Server Model eXclusive OR

1 Introduction Disk mirroring which corresponds to RAID level 1 (RAID1) [4] has the following advantages over rotated parity RAID5 arrays with erasure coding: (i) Doubling of disk access bandwidth for read requests. (ii) Less costly data updates by avoiding the small write penalty [4]. (iii) More efficient ondemand reconstruction and rebuild processing. Most RAID1 organizations can tolerate the failure of half of their disks, but there are two disk failures which may lead to data loss, RAID1 disk arrays similarly to RAID5 disk arrays are therefore kDFT with k = 1 [39]. We consider a hybrid RAID organization patented by LSI Logic Corp. [47], referred hereafter as LSI RAID, which combines mirroring and parity coding. Each Data disk (Ddisk) in LSI RAID is protected by two Parity Disks (Pdisks) and vice-versa each Pdisk protects two Ddisks. LSI RAID is therefore 2DFT, although it tolerates almost all three disk failures. The reliability, performance, and performability of the following RAID1 arrays are evaluated in [36, 38]: 1-Basic Mirroring (BM). 2-Group Rotate Declustering (GRD) [5]. 3-Interleaved Declustering (ID) [25]. 4-Chained Declustering (CD) [12, 13], which was later adopted in the Petal disk array [27]. The asymptotic reliability analysis method in [35] can be used to rank RAID1 reliabilities against each other and other arrays. In [38] we compare the performance of RAID1 organizations against each other and RAID5 and RAID6 with zero, one, and two disk failures using queueing analysis. The Mean Time to Data Loss (MTTDL) metric is used to compare their reliabilities. We start the paper with a review of RAID1 organizations. LSI RAID performance is compared with the Read-ModifyWrite (RMW) and ReConstruct Write (RCW) methods for updating parities. A combination of the two methods is used to balance the loads on Ddisk and Pdisks with the goal of maximizing the array throughput, i.e., the maximum I/Os Per Second (IOPS) that can be sustained by it. We extend the asymptotic reliability analysis method in [35] to show that LSI RAID is more reliable than other RAID1 organizations and also RAID5 and RAID6, but less reliable than RAID7 [39] and the SSPiRAL disk arrays [2]. LSI RAID is

shown to have an MTTDL higher than RAID7. While LSI RAID does not tolerate all three disk failures, it tolerates some four disk failures, which RAID7 does not. We extend the performability evaluation of the four RAID1 organizations in [38] to LSI RAID. A comprehensive performance evaluations of several RAID organizations considered in [9] concludes that LSI RAID has a superior performance with respect to most 2DFTs, but no comparison with other RAID1 organizations is reported. LSI RAID performance is compared with RAID1 BM organization and RAID6 in this study. The paper is organized as follows. Section 2 is a brief introduction to RAID arrays with erasure coding. In Sect. 3 we review RAID1 organizations, including recent proposals. In Sect. 4 we evaluate the performance of LSI RAID organization with RMW and RCW methods and compare its performance with BM and RAID6. The combination of the RMW and RCW methods to optimize performance is also investigated. In Sect. 5 we list the reliability equations derived for RAID1 in [36] and extend the analysis to LSI RAID and the SSPiRAL disk array [2]. In Sect. 6 we outline the approximate reliability analysis method in [35] and apply it to LSI RAID. In Sect. 7 we obtain the maximum throughput for LSI RAID operating with variable number of failed disks. These maximum IOPS along with those obtained in [38] are used to compare RAID1 performabilities with LSI RAID. In Sect. 8 we propose multidimensional hybrid mirrored disks. Section 9 summarizes the contributions of this work.

2 Brief introduction to RAID Maximum Distance Separable (MDS) arrays incur the minimum level of redundancy by utilizing the capacity of k disks to tolerate k disk failures [39]. RAID5 is an MDS 1DFT with a single parity disk to tolerate one disk failure. RAID(4 + k), k ≥ 1 arrays are MDS kDFTs using Reed-Solomon coding. The cost of computing check codes in RAID6 can be reduced via parity coding, such as in the case of EVENODD and RDP, which have been extended to 3DFTs [39]. X-code is another MDS 2DFT array with N disks where N is prime. Its P and Q parities are placed horizontally as the last two rows of successive N × N segments [39]. There is a similarity to RM2, which also places parities horizontally, but is not MDS [21]. Striping in RAID arrays balances disk loads by partitioning large datasets into strips, which are placed round-robin across all disks [4]. Striping is applied to most RAID arrays with various levels of redundancy. RAID0 is an exception in that it utilizes striping but not redundancy. In RAID(k + 4), k ≥ 1 strips per row (or stripe) are assigned to check strips. These are placed according to the left symmetric organization [4], repeating right to left diagonals, to balance the par-

Cluster Comput (2012) 15:239–253

ity update load. The performance penalty for higher fault tolerance is quantified in [37] by comparing RAID0, RAID5, and RAID6 performance in normal and degraded operating modes for an OnLine Transaction Processing (OLTP) workload. This workload, which generates accesses to small randomly placed blocks is also postulated in this study. On demand reconstruction of a missing block on a failed disk in RAID(4 + k) arrays with 1 ≤ i ≤ k disk failures requires parallel accesses to the corresponding N − k blocks. This fork-join access is completed when all blocks are accessed and eXclusive-ORed (XOR)ed to reconstruct the missing block. RAID5 with a single disk failure results in the doubling of the read load on surviving disks. This results in an increased mean response time in disk accesses. In Clustered RAID (CRAID) the number of disks in each parity group (G) is less than the number of disks in the array (N ), hence G < N , so that the increase in the read load of surviving disks is α = (G − 1)/(N − 1) < 1, e.g., α = 1/3 for N = 10 and G = 4 [19, 33, 39]. CRAID with G = 2 is tantamount to data replication. Balance Incomplete Block Designs (BIBD) and Nearly Random Placements (NRP) are two methods to balance check strip placement in CRAID [33, 39]. Rebuild is the systematic reconstruction of the contents of a failed disk on a spare disk. The reconstruction of a Rebuild Unit (RU) in RAID5 requires accessing the corresponding N − 1 RUs from surviving disks, XORing them, and writing the reconstructed RU on: (1) a spare disk in the case of dedicated sparing [30], (2) spare areas in the case of distributed sparing [31], (3) overwriting check strips in the case or restriping [42]. (4) parity sparing combines two RAID5 arrays into one [16]. Rebuild may be unsuccessful due to a second disk failure before it is completed. Latent Sector Errors (LSEs) investigated in [23], are a much more likely cause of unsuccessful rebuilds than secondary disk failures [15] and the main reason for the introduction of 2DFTs. LSEs can be dealt with by disk scrubbing [26] and the Intra-Disk Redundancy (IDR) method [7]. The superiority of the latter method is shown in [15]. Prefailure rebuild in RAID(4 + ),  ≥ 1 is preferable to postfailure rebuild in that it requires the copying of the single disk, which has been diagnosed to be about to fail. As compared to RAID5 and RAID6, rebuild in RAID1 requires the copying of a single disk, so that the volume of data to be read and the possibility of encountering LSEs is reduced significantly. An improvement to disk copying time based on out-of-order opportunistic processing of rebuild reads is proposed in [3] in conjunction with two rebuild methods proposed for RAID5 arrays: (i) Vacationing Server Model (VSM) [30], (ii) Permanent Customer Model (PCM) [18]. VSM gives lower priority to rebuild reads than user requests, while PCM processes rebuild reads at the same priority as user requests, although a single request is introduced after

241

each rebuild read is completed. VSM is therefore expected to result in a lower response time for user requests compared to PCM. It is argued in [39] allows more rebuild reads to be processed without interruption, i.e., additional seeks, so that a shorter rebuild time is to be expected. The proposed method in [3] gives priority to the reading of unread rebuild units closest to the read/write head after completing user requests.

3 LSI RAID and related RAID1 arrays The total number of RAID1 disks considered in [36] and [38] is N = 2M, although an even number of disks is not required by RAID1 organizations, which utilize half of the disk capacity as mirrors. Basic Mirroring (BM) in its simplest form consists of a pair of disks, but for higher volumes of data there are N = 2M disks or M > 1 disk pairs. Similarly to RAID5 and RAID6 data is striped across disks to balance disk loads. Logically contiguous data blocks in each stripe on each disk are referred to as a strip. A multidisk RAID1 with BM organization with N = 8 is shown in Fig. 1, which may be considered a RAID1/0 array, since we have mirrored RAID0 arrays. In RAID0/1 we have a RAID0 array whose logical disks are mirrored disks. Although disk loads are balanced across disk pairs, the failure of a disk results in the doubling of the load on its pair. Group Rotate Declustering (GRD) places primary disks on one side and secondary disks at the other as shown in Fig. 2, similarly to Fig. 1, except that the data strips on secondary disks are rotated from row to row. The advantage of GRD over BM is that upon the failure of a single disk, its Cluster 1 1 A E I M

2 B F J N

Cluster 2 3 C G K O

4 D H L P

5 A E I M

6 B F J N

7 C G K O

8 D H L P

Fig. 1 Basic Mirroring with N = 2M disks. Primed blocks are mirrors

Cluster 1 1 A E I M

2 B F J N

Cluster 2 3 C G K O

4 D H L P

5 A H K P

6 B E L M

Fig. 2 Group Rotate Declustering with N = 2M disks

7 C F I N

8 D G J O

242

Cluster Comput (2012) 15:239–253

Cluster 1 1 A b3 c2 d1

2 B a1 c3 d2

Cluster 2 3 C a2 b1 d3

4 D a3 b2 c1

5 E f3 g2 h1

6 F e1 g3 h2

Cluster 1 7 G e2 f1 h3

8 H e3 f2 g1

Fig. 3 Interleaved Declustering with N = 8 disks, K = 4 disks per cluster, and c = 2 clusters

1 B0 B4 B8 M9 M10 M11

2 B1 B5 B9 M6 M7 M8

3 B2 B6 B10 M3 M4 M5

4 B3 B7 B11 M0 M1 M2

Fig. 4 RAID-x architecture with M = 4 disks in one cluster

read load is evenly distributed over the M disks at “the other side”. Up to M disk failures will not lead to data loss as long as they are all on one side, but the probability that a second disk failure leads to data loss is high: M/(2M − 1) > 0.5. Interleaved Declustering (ID) partitions the disk array to equal-sized clusters, e.g., for N = 8, c = 2 clusters with K = N/c = 4 disks each. Disks hold primary and secondary data blocks, where the secondary data blocks are copies of primary data blocks. We use capital letters for primary and small letters for secondary data blocks as shown in Fig. 3. The advantage of ID over BM is that the read load increase due to a single disk failure is 1/(K − 1) < 1/2 for K > 2. Unlike BM and GRD (and also CD, discussed below), which tolerate up to I = M disk failures, a maximum of I = c disk failures can be tolerated by ID or one disk failure per cluster. Dual Striping method differs from ID in that it combines large and small strips to improve performance [17]. Database queries requiring table scans and OLTP workloads accessing small data blocks, access data from the larger and smaller strips, respectively. Having the data distributed over small strips reduces data access skew for OLTP applications, while table scans for ad hoc query processing and decision support are processed more efficiently by accessing large strips. Orthogonal Striping and Mirroring (OSM) or RAID-X proposed in [14] has a similarity to ID, as shown in Fig. 4. Data strips on primary disks are placed diagonally in the secondary areas of the other disks in a cluster. Chained Declustering (CD) is an improvement over ID in terms of reliability. The space on disk Di , 1 ≤ i ≤ N is equipartitioned to primary and secondary areas. Data blocks in the primary area of one disk are replicated on the secondary area of the following disk modulo N , as shown in Fig. 5. In the case of disk failures, fractional routing of read requests can be used to attain a balanced read load [38]. CD can tolerate up to M disk failures, as long as no two failed disks are contiguous. Rotational Mirrored Declustering (RMD) was proposed in [6] to support high data availability for Video on Demand (VoD) servers. RMD stores replicas in different arrays in a rotated manner. LSI RAID places a Pdisk (parity disk) between pairs of Ddisks (data disks). Pdisks can be conceptually placed in

1 A H

2 B A

3 C B

4 D C

5 E D

6 F E

7 G F

8 H G

Fig. 5 Chained Declustering with N = 2M disks. Primed blocks are mirrors

a second row, so that with N = 2M = 4, there are four Ddisks and four Pdisks (denoted by Ds and P s) we have Pi,i+1(mod M) = Di ⊕ Di+1(mod M) , 1 ≤ i ≤ 4. We instead place all N = 2M = 8 disks in one row as shown below: (D1 , P1,2 , D2 , P2,3 , D3 , P3,4 , D4 , P4,1 ). We have Pi,i+1 = Di ⊕ Di+1(mod M) , 1 ≤ i ≤ M. The Data out-Degree (DoutD) is the number of parity elements to which a data element contributes and the Parity in-Degree (PinD) is the number of data elements used in computing a parity [9]. Both DoutD and PinD are set to two in this case. LSI RAID tolerates all two disk failures and most three disk failures. Three consecutive disk failures can be tolerated in one half of cases, when the middle disk is a Pdisk. If the three disks (D1 , P1,2 , D2 ) fail then block d1 on D1 can be reconstructed as d1 = d4 ⊕ p4,1 and block d2 on D2 can be reconstructed as d2 = d3 ⊕ p2,3 . Finally, p1,2 = d1 ⊕ d2 . Chained recovery is not possible when the middle disk is a Ddisk or 2-out-of-3 failed consecutive disks are Pdisks, e.g., (P1,2 , D2 , P2,3 ). Consequently, only one half of three consecutive disk failures can be tolerated. Striping with shifted stripes, as shown below, where ds and ps represent strips in the first row, and d¯ and p¯ represent strips in the second row. This data layout has the advantage of balancing disk loads for processing updates, but cannot tolerate consecutive three disk failures: (d1 , p1,2 , d2 , p2,3 , d3 , p3,4 , d4 , p4,1 ), (p¯ 4,1 , d¯1 , p¯ 1,2 , d¯2 , p¯ 2,3 , d¯3 , p¯ 3,4 , d¯4 ). M = N/2 disk failures can be tolerated as long as all Ddisks survive. The rebuild load for a Ddisk can be shared among the two disk pairs on both sides, but the spare disks will constitute a

Cluster Comput (2012) 15:239–253

243

Disk0

Disk1

Disk2

Disk3

Disk4

Disk5

Disk6

P0 D2,3 D1,4

P1 D3,4 D2,5

P2 D4,5 D3,6

P3 D5,6 D4,0

P4 D0,6 D5,1

P5 D0,1 D6,2

P6 D1,2 D0,3

Fig. 6 RM2 Disk Layout for N = 7 and p = 1/3. Disk0 . . . Disk6 are the seven disks. P0 , . . . , P6 are the parity blocks for the seven parity groups. Data block Di,j is protected by two parity blocks Pi and Pj

bottleneck if the utilizations of the disks being read are low, so that the rate of rebuild writes exceeds the bandwidth of the spare disk. RM2 Disk Arrays. The RM2 disk array is a nonMDS 2DFT proposed in [21]. There is a similarity to LSI RAID in that each data block is protected by two parity blocks, but each parity blocks protects 2M − 2 data blocks, where M is the inverse of the redundancy level p or M = 1/p. For p = 1/3 and M = 3, N = 7 is the smallest number satisfying the inequalities: p ≥ 3/(N + 2) or p ≥ 4/(N + 5) for an odd or even number of disks, A RAID6 with N = 7 disks has a lower redundancy level 2/7 < 1/3, so that RM2 is nonMDS. The layout for data strips with M = 3 and N = 7 is shown in Fig. 6, where for each row of parity strips, we have two rows of data strips. For p = 1/2 there are 2M − 2 = 2 data strips in a parity group, so that there is a parity strip associated with two data strips. For even N the condition 1/2 ≥ 3/(N + 2) yields N ≥ 4. Setting N = 8 we have eight disks which share space between data and parity strips. The data strips protected by parities are determined by the algorithm specified in [21], but parity placements are satisfactory as long as that they are different from the data strips they protect, e.g., (P1,2 , P2,3 , P3,4 , P4,5 , P5,6 , P6,7 , P7,8 , P8,1 ), (D8 , D1 , D2 , D3 , D4 , D5 , D6 , D7 ). Self-Adaptive Disk Array (SADA) is a mirrored disk organization which is similar to LSI RAID in reverse, as illustrated by the following example from [20]. SADA combines mirroring with parity coding as disks are lost, so that it gradually reverts from RAID0/1 to the low redundancy RAID5 and eventually to RAID0, thus parity protection is available preceding the last step and furthermore all data blocks can be accessed directly without resorting to XORing. The starting point is a standard mirrored organization with no parity disks: four disk pairs A1 , A2 , B1 , B2 , C1 , C2 , D1 , D2 holding two copies of datasets A, B, C, and D. If B1 fails extra protection is provided for B by setting A1 = A ⊕ B. If D1 fails the system sets C2 = C ⊕ D. If D2 fails the system sets A1 = (A ⊕ B) ⊕ (C ⊕ D) and C2 = D. In effect this is a RAID5 array with one parity disk. Finally, if B2 fails the system XORs A from A2 , C from C1 , and D from C2 with A1 so that A1 = B.

SSPiRAL (Survivable Storage using Parity in Redundant Array Layout) extends the LSI RAID paradigm to the case where Pdisks are computed over m = 3 Ddisks and each Ddisk is protected by m = 3 Pdisks, so that DoutD = PinD = 3 [2]. For N = 8 the four Ddisks A, B, C, and D are protected by the four Pdisks holding A ⊕ B ⊕ C, B ⊕ C ⊕ D, C ⊕ D ⊕ A, and D ⊕ A ⊕ B. If Ddisks and Pdisks are placed in two rows then each Pdisk is the XOR of the Ddisk above it and the two Ddisks that follow it, modulo M = N/2. It is easy to verify that up to three disk failures can be tolerated in all cases. If all Ddisks fail they can be reconstructed using Pdisks, which is not the case for LSI RAID. There is data loss if a Ddisk and   the three Pdisks in which it participates fail. There are 84 = 70 configurations with four disk failures. Enumeration shows that for N = 8 data loss occurs in 1/5th of four disk failure cases [2]. Repair with up to four disk failures is utilized in this study to increase the probability that there is no data loss at the end of the “economic lifespan” of the disk array. This probability is determined by transient reliability analysis of the system, but this analysis has the following shortcomings: (i) The repair rate is set proportional to the number of failed disks and does not take into account potential hardware bottlenecks. (ii) An infinite supply of spare disks is postulated. (iii) Repair rates are exponentially distributed. so that due to its memoryless property as a new disk fails the rebuild process at all disks under repair is restarted. (iv) The analysis does not take into account LSEs, which are the main cause of rebuild failures [15]. B-code has similarities to LSI RAID [48]. It stores parities associated with the data in each column. In the case of n = 3 bits per column and  = 2n = 6 columns, we have Bˆ 6 , (a dual B6 code) as shown below based on Fig. 1 in [48]:   a1 a2 + a3 a4 + a6

a2 a3 + a4 a5 + a1

a3 a4 + a5 a6 + a2

a4 a5 + a6 a1 + a3

a5 a6 + a1 a2 + a4

a6 a1 + a2 a3 + a5

Like the Reed-Solomon, EVENODD, and Rotated Diagonal Parities (RDP) [39], B-Code is MDS, it is parity based, and has several other properties listed in [48]. Weaver codes have a similarity to LSI RAID, where for Weaver(n, k.t) each parity element has an in-degree k and the data elements have out-degree t [10]. As shown in Fig. 2 in [10] each strip has r data elements and q parity elements, so that the efficiency is e = r/(r + q) = k/(k + t). Since k ≤ t the maximum efficiency of this code is 50 %. Weaver(n, 2, 2) corresponds to LSI RAID. Multiway placement is an extension of mirroring to more than 2-way replication [24]. Three-way versions of chained, group rotate, and standard mirroring are specified. The Shifted Declustering (SD) provides optimal parallelism, since the data replicas are distributed evenly. The similarity of the data layout for DS to that proposed in [1] requires further investigation.

244

Replica placement in the context of large-scale data storage systems to attain increased availability is investigated in [44]. Three data placements are considered: (a) Declustered with r replicas of an object placed randomly at some r of the n nodes. (b) Clustered the n nodes are divided into disjoint sets of r nodes. (c) k-Clustered the n nodes are partitioned into disjoint sets of k nodes and the declustered placement is followed at these nodes. It is shown that for a replication factor of two all placements have an MTTDL within a factor of two. Placement of more than two replicas in clustered and declustered organizations is considered in [45]. Declustered placement spreads replicas across all other nodes, while a minimum number of nodes are used by clustered placement. It is assumed that average lifetime of a node is of the order of δ −1 = 105 hours and given c bytes per node and a rebuild bandwidth b it takes c/b = 10 hours to rebuild a node, so that δc/b  1. Given that the probability that the system experiences data loss is PDL , than MTTDL ≈ [nδPDL ]−1 . Equation (4) in the paper leads to the formula for PDL . For r = 2MTTDLclus ≈ b/(ncδ 2 ) and MTTDLdeclus ≈ b/(2ncλ2 ), so that the MTTDL is inversely proportional to the number of nodes. For r = 3MTTDLclus ≈ b2 /(nc2 δ 3 ) MTTDLdeclus ≈ (n − 1)b2 /(4nc2 δ 3 ), which is almost independent of the number of nodes. The issue of the reliability of data storage systems under bandwidth constraints for rebuild processing is discussed in [46].

4 Performance analysis OLTP workloads read and update small, randomly placed blocks. The Disk Array Controller (DAC) has a large cache, which satisfies a significant fraction of Single Read (SR) accesses. LSI RAID is inferior to other RAID1 organizations in processing read requests, since they provide double the bandwidth of LSI RAID directly, without resorting to XOR processing. In the case of LSI RAID a block d1 at Ddisk D1 can be accessed directly or reconstructed as (p1,2 ⊕ d2 ) or (p4,1 ⊕ d4 ). This may be necessary to increase Di ’s access bandwidth or reduce the access time if it is overloaded. This action will however reduce the overall array throughput. In estimating disk loads for BM, we assume that read requests are routed uniformly over the two disks. We do not consider dynamic routing of read requests, which may result in a reduction in mean disk service time, such as sending requests to a disk that reduces the mean seek distance from C/3 to C/5, where C is the number of disk cylinders [34]. It was observed in this study that disk scheduling has a more significant impact on performance than disk routing. We postulate FCFS scheduling so that the mean disk service time does not improve with increased disk queuelengths resulting from higher arrival rates. This would be

Cluster Comput (2012) 15:239–253

the case with the Shortest Access Time First (SATF) policy, which minimizes disk positioning time (sum of seek time and latency) [40]. Writing of data blocks in BM and other basic RAID1 organizations requires two Single Writes (SWs). LSI RAID behaves similarly to RAID6 in that two parity blocks, in addition to the data block, need to be updated [37, 39]. We postulate an XOR capability at the disk drives. The writing of block d1new on D1 requires a RMW (Read-Modify-Write) diff = d1old ⊕ d1new , which is access to read d1old , compute d1 then sent to the two corresponding Pdisks P1,2 and P4,1 . After one disk rotation d1new overwrites d1old . RMW accesses are then applied to Pdisks to read corresponding pold blocks, which are XORed with ddiff , and overwritten after one disk rotation. In fact RMW need not be an atomic disk access and opportunistic disk accesses are possible during disk rotation. A safer way to update data and parity blocks, which avoids “write holes” is to compute the modified parity blocks at the DAC [37]. In the cases of RAID(4 + k), k ≥ 1 arrays this requires issuing k +1 SR accesses to read the data and check blocks, followed by k + 1 Single Writes (SW) accesses to write them. In effect each RMW is implemented as an SR followed by an SW, which seems to require more disk service time than a RMW access. In fact, when there are no intervening disk accesses between the SR and SW accesses, the disk read/write head remains on the same track, the total disk service time is the same as an RMW access. In the case of LSI RAID requires three SRs and three SWs. Synchronization is required for the rare case when two data blocks being updated concurrently on two Ddisks affect the same parity block on a Pdisk. NonVolatile Storage (NVS) provides a fast-write capability, so that updates are considered completed as soon as they are written onto the NVS cache at the DAC. Dirty blocks may be overwritten several times, before they are destaged (written to disk) when the cache capacity is exhausted. Higher disk access efficiency is attainable by batching destages, since duplexed NVS provides the same reliability level as magnetic disks. These two effects were quantified via disk trace analysis and taken into account in the RAID5 performance analysis in [31]. These effects are not considered in this study, therefore the disk load for updates is overestimated. The Reconstruct Write (RCW) method for RAID5 is applicable to LSI RAID [32]. To update block d1 on D1 we access block d2 on D2 to compute the parity block: p1,2 = d1new ⊕ d2 on P1,2 . Similarly, d4 on D4 is accessed to compute p4,1 = d1new ⊕ d4 on P4,1 . Two SRs and three SWs are required for updating a single block. In comparing the relative performance of the RMW and RCW methods for LSI we only consider disk loads, rather than the load at the DAC and disk drive controller. since disks constitute the bottleneck resource. The three compo-

Cluster Comput (2012) 15:239–253

nents of disk service time are seek time, latency, and transfer time. The mean seek time x seek is determined by the placement of blocks being accessed and the disk scheduling policy [40]. The mean rotational latency: x lat is approximately one half of disk rotation time (Trot ) for accesses to small disk blocks with FCFS scheduling, but is reduced by the SATF scheduling policy. The mean transfer time x xf er is negligibly small compared to the positioning time, since 8 KB (kilobyte) blocks accessed by OLTP applications correspond to sixteen 512 B sectors, which is a small fraction of sectors on a track. The mean service times for SR, SW, and RMW requests are: x SR = x seek + x lat + x xf er , x SW = x SR + Th , x RMW = x SR + Trot , where Th is the Head Settling Time (HST) and Trot = 60, 000/RPM in milliseconds (ms), where RPM stands for rotation per minute. The rate of requests to mirrored disk pairs is denoted by λ, so that the total rate is Λ = Mλ for N = 2M disks. The fraction of reads is fR and the fraction of writes (updates) is fW = 1 − fR . In numerical examples we consider 10,000 RPM disk drives with Trot = 6 ms and x seek = 4 ms. We ignore the transfer time for small 8 KB blocks (x xf er ≈ 0), and also assume that Th ≈ 0. It follows that x SW ≈ x SR = 10 ms and x RMW ≈ 16 ms. Analyses of I/O traces in earlier studies led to the conclusion that read accesses tend to dominate writes (updates) in OLTP workloads, so that the read to write ratio R:W = 4:1 and the fraction of reads and updates: fR = R/(R + W ) = 0.8. With the availability of very large caches R:W = 1:1 or fR = fW = 0.5, since although read requests are issued more frequently by the application, a large fraction of read requests are satisfied by the cache, The load for LSI RAID with the RMW method is: RMW = λ(f x ρDdisk R SR + fW x RMW ). The load on adjacent Pdisks is: ρPRMW disk = 2λfW x RMW . The factor of two is due to the fact that each Pdisk is adjacent to two Ddisks. The Ddisk is a bottleneck for fR x SR > fW x RMW . Let R:W = ρ RMW /ρ RCW then for f = 0.2: F 4:1 ≈ 1.8 and FD/P W Ddisk P disk D/P 1:1 for fW = 0.5: FD/P ≈ 0.77. With the RCW method the RCW = λ[f x Ddisk load is: ρDdisk R SR + fW (2x SR + x SW )]. The load on each adjacent Pdisk is ρPRCW disk = 2fW λx SW . For fW = 0.2, FD/P = 0.35 and for fW = 0.5, FD/P = 2. For the RMW method the load at Ddisks is lower than the RCW method, but it is higher for Pdisks. Denoting the R:W , for fW = 0.2, relative utilization of Ddisks as FRMW/RCW 4:1 1:1 FRMW/RCW = 0.84 and for fW = 0.5, FRMW/RCW = 0.71. The RMW method is preferable to RCW for read-intensive

245

workloads. The relative utilizations of Pdisks are given as: x RMW /x SW ≈ 2.2, regardless of the R:W ratio. Given the reversal in relative Ddisk and Pdisk utilizations for RMW and RCW methods, we balance Ddisk and Pdisk loads by processing a fraction β of updates as RMWs and a fraction 1 − β as RCWs. We obtain the value of β for varying R:W ratios by solving the following equality: RMW RCW RCW βρDdisk + (1 − β)ρDdisk = βρPRMW disk + (1 − β)ρP disk ,

β=

x SR (1 + fW )x SR − fW x SW ≈ . fW (x RMW + 2x SR − x SW ) fW (x RMW + x SR )

The approximation holds for x SW ≈ x SR , since Th ≈ 0. For fW = 0.2, β > 1, i.e., all updates should be processed as RMWs, while for fW = 0.5, β ≈ 0.7. Equalizing Ddisk and Pdisk utilizations insures that there is no bottleneck and the maximum IOPS for LSI RAID is maximized. We next compare LSI RAID performance with the BM organization based on their maximum throughputs when disks approach full utilization. The load per disk for BM is ρ = λ[(fR /2)x SR + fW x SW ], where λ is the rate per disk pair. By setting ρ = 1 the maximum disk throughput for N disks is obtained as: −1  ΛBM . max = N (fR /2)x SR + fW x SW For LSI we only consider the RMW method, since it incurs RMW = 1 we have: a lower load at Ddisks. Setting ρDdisk −1 ΛLSI max = M[fR x SR + fW x RMW ] .

The relative throughput of BM with respect to LSI for a read/write ratio R:W is: R:W FBM/LSI =

2M[fR x SR + fW x RMW ] . M[(fR /2)x SR + fW x SW ]

4:1 1:1 FBM/LSI ≈ 3.6 and FBM/LSI ≈ 2.4, which is the performance degradation to attain a higher reliability. A RAID6 disk array with the same volume of data as an LSI RAID with N = 2M disks has M + 2 disks. Since there are three RMW accesses per write then the load per disk is: R:W ρRAI D6 = fR x SR + 3fW x RMW . We have: R:W FRAI D6/LSI =

(M + 2)[fR x SR + fW x RMW ] . M[fR x SR + 3fW x RMW ]

4:1 1:1 It follows FRAI D6/LSI ≈ 0.51 and FRAI D6/LSI ≈ 1/3. LSI has superior performance with respect to RAID6 with the same capacity for data, but more so for 2M disks. As most RAID1 arrays LSI RAID is inferior with respect to RAID5 and RAID6 for full stripe reads and writes [4]. The selection of the RAID level (RAID1 versus RAID5) to minimize disk array loads is discussed in [41]. The updating of a data block on a Ddisk in SSPiRAL requires the updating of three Pdisks. In the case of an update

246

Cluster Comput (2012) 15:239–253

to the A Ddisk, the difference block is applied as RMWs to the three parity disks in which A participates. The RCW method reads the corresponding blocks from B, C, and D disks to compute the three parities, so that three SR requests and four RW requests are required. The RMW method is preferable, especially for high RPM disks.

5 Reliability analysis The reliability R(t) of a system is the probability that it works at time t given that R(0) = 1 [43]. Although studies of disk reliability have shown that a Weibull distribution yields a better approximation to the time to disk failure than the exponential distribution [8, 22]. Most mathematical analyses of RAID reliability starting with [8], adopt a Continuous Time Markov Chain (CTMC) model, with exponential time to disk failure and repair time, although the latter is more difficult to justify than the former. Disk reliability is given as R(t) = e−δt , where δ is the disk failure rate [43]. The Mean  ∞ Time to Failure (MTTF) for disks then equals MTTF = 0 R(t)dt = 1/δ. There are many difficulties associated with modeling repairs in RAID1, as exemplified by the criticism of the analysis in [2] in Sect. 2. We therefore do not consider repair processing in this study. Figure 1 in [38], reproduced below, depicts the Markov chain depicting system transitions, where Si denotes the state with i disk failures. The transition probabilities among these states are: S0 → S 1 → S 2 → · · · → S I . Data loss may occur following the first disk failure, since RAID1 arrays are generally kDFTs with k = 1, while LSI RAID is a 2DFT. The transition probability leading to data loss is qi . Si → F ,

A(N, i)r N −i (1 − r)i .

In the case of ID with c clusters and K = N/c disks per cluster, we can have only one disk failure per cluster and any one of the n disks in the cluster can fail.

c A(N, i) = K i , 0 ≤ i ≤ c. (5) i The expression for A(N, i) for CD is derived in [36]:



N −i −1 N −i A(N, i) = + , 1 ≤ i ≤ M. (6) i −1 i The probability that RAID1 survives i disk failures, the probability that state Si of the Markov chain specified earlier is visited is given by Vi . A(i) Vi = N  ,

1 ≤ i ≤ I.

(7)

i

Given the transition probabilities among the states of CTMC: pi = P [Si−1 → Si ] and qi = P [Si−1 → F ] = 1 − pi and with the initialization V0 = 1, Vi can also be calculated as Vi =

i

pi ,

1 ≤ i ≤ M.

(8)

j =0

Let A(N, i) , 0 ≤ i ≤ I denote the number of cases that do not lead to data loss with i disks failures. The maximum number of disk failures that can be tolerated without data loss for mirroring with N = 2M disks is I = M for most RAID1 organizations, but I = c for the ID organization. A(N, 0) = 1 by definition and A(N, i) = 0 for i > M. Setting r = R(t) to simplify the notation, the reliability of RAID arrays can be expressed as follows: M

In the case of GRD up to M disks can fail as long as they are on “one side” (see Fig. 2).

M A(N, i) = 2 , 0 ≤ i ≤ M. (4) i

Vi = Vi−1 pi ,

2 ≤ i ≤ I.

RRAI D (N ) =

Given n-way replication with M = N/n groups of n-way replicated disks the above formula can be simply extended as follows:

M i (3) A(N, i) = n , 0 ≤ i ≤ M. i

(1)

i=0

In the case of BM up to M disk failures can be tolerated, as long as one disk in each pair survives:

M i A(N, i) = 2 , 0 ≤ i ≤ M. (2) i

For example, in the case of GRD pi = 2(M − i)/(2M − i), i = 0, M, so that:   i 2 Mi M −i M −1 ··· , pj = Vi = N  = 2M − 1 2M − i i

1 ≤ i ≤ M.

j =0

A closed form expression for A(N, i) for LSI RAID is not available, but can be obtained using enumeration to obtain the number of cases with data loss. As explained in Sect. 3 and elaborated in Table 1, three consecutive disk failures may lead to data loss. We have used zeroes to denote failed disks and ones to denote nonfailed disks. The leftmost disk in the tables is a Ddisk. The identity of disks changes as we rotate the failed disks, so in the case of the first row only four cases lead to data loss, when the middle disk is a Ddisk.

Cluster Comput (2012) 15:239–253

247

Table 1 Configurations with 3 disk failures, 4 of which lead to data loss. Total number of possible 4th disk failures and number of cases they lead to data loss (brackets all cases and parentheses half of the cases) Configurations

All/DataLoss

3 → 4 cases

Failure

1

0 0 0 [1] 1 1 1 [1]

8/4

4×5

4+4

2

0 0 (1) 0 [1] 1 1 [1]

8/0

8×5

8+4+4

3

0 0 [1] [1] 0 1 1 1

8/0

8×5

4+4

4

0 0 [1] 1 1 0 [1] [1]

8/0

8×5

4+4+4

5

0 0 [1] 1 1 [1] 0 (1)

8/0

8×5

4+4+8

6

0 [1] 0 [1] 0 1 1 1

8/0

8×5

4+4

7

0 (1) 0 0 1 1 1 [1]

8/0

8×5

8+4

56/4

260

80

Table 2 All cases with four disk failures for LSI RAID Configurations

All cases

Data loss cases

1

00001111

8

8

2

00010111

8

4

3

00011011

8

4

4

00011101

8

4

5

00100111

8

4

6

00110011

4

0

7

00101101

8

0

8

00110101

8

0

9

00101011

8

0

10

01010101

2

1

70

25

Tables 1 and 2 list all possible configurations with four disk failures. The probability of data loss with three and four disk failures is: q3 = 4/56 = 1/14 and q4 = 80/260 = 20/65 = 4/13. The latter probability is given as 16/65 in Sect. 3.1.2 in [2], since the analysis misses the case that data loss occurs in half of the cases with a pair of two consecutive disk failures as given by Row 5 in the table, i.e., the configuration 00100111 shifted once, or: {d1 , p 1,2 , d 2 , P 2, 3, d 3 , p 3,4 , d4 , p4,1 }. In Table 2 all cases with four consecutive disk failures lead to data loss, while half of the cases with three disk failed (rows 2–4) lead to data loss, as before. In row 10 data loss occurs when all four Ddisks are broken and unlike the SSPiRAL data layout recovery is not possible when all four Disks have failed. It follows   that in the case of LSI RAID with N = 8, A(8, i) = 8i , 1 ≤ i ≤ 2, A(8, 3) = 52 and A(N, 4) = 45, Vi = 1, 0 ≤ i ≤ 2, V3 = 13/14, and V4 = 9/14.

Fig. 7 Reliability versus normalized time

While we have not derived a closed form expression for the reliability of LSI RAID, the number of cases leading to data loss can be easily enumerated. For example, in rows 2–4 in Table 2, the 4th failed disk can occupies three possible positions, so that the total number of cases leading to data loss is twelve. The number of cases leading to data loss is further multiplied by the number of possible rotations. SSPiRAL tolerates all disk failures up to three disks: Vi = 1 and A(N, i) = Ni for 0 ≤ i ≤ 3. Eighty percent of four disk failures are tolerated for N = 8, so that V4 = 4/5  and A(8, 4) = (4/5) 84 = 56. In Fig. 7 we plot the reliabilities for various RAID1 organizations and RAID5/6/7 for N = 8 disks versus time normalized with respect to disk MTTFs, which is δt for the exponential disk failure rate δ.1 For small values of t, as expected, RAID7 has the highest reliability, because it tolerates all three disk failures, while LSI RAID is second since it tolerates most three disk failures. RAID6 has the higher reliability than BM for small values of t, but there is a crossover point since BM can tolerates more than two disk failures. RAID5 has the lowest reliability since it does not tolerate more than one disk failure. This graph can be used to determine if the economic lifespan objective is met. The values for pi and Vi for different RAID1 organizations for N = 8 are given in Table 3. The MTTDL is the mean passage time to the failed state, which is the weighted sum of the mean holding times (Hi ) in various states. In a system with no repairs the holding time is simply the time to next failure, or inverse of the disk failure rate: 1 Figure

5 in [36] is different in that the system reliability is plotted versus decreasing disk reliabilities, not normalized time.

248

Cluster Comput (2012) 15:239–253

Table 3 Transition probabilities and number of visits to Markov chain states for N = 8 disks RAID

P [S0 −→ S1 ]

P [S1 −→ S2 ]

P [S2 −→ S3 ]

P [S3 −→ S4 ]

V0

V1

V2

V3

V4

BM

1

2 3 1 3

2 5 1 5

1

1

1

1

4 7 1 7

8 35 1 35

1

1

0

0

1

1

6 7 3 7 4 7 5 7

1

1

1

2 7 13 14

1

1

1

1

1 35 9 14 4 5

GRD

1

ID

1

CD

1

6 7 3 7 4 7 5 7

LSI

1

1

2 5 13 14

SSP

1

1

1

1 10 9 13 4 5

Table 4 MTTDLs as a ratio and a fraction of the MTTF (δ −1 ) and the first term in asymptotic reliability expression with ε denoting the unreliability RAID5

BM

CD

GRD

ID

RAID6

LSI

RAID7

SSP

15 56δ

163 280δ

379 840δ

3 8δ

61 168δ

73 168δ

82 105δ

533 840δ

701 840δ

0.268δ −1 N  2 2 ε

0.582δ −1

0.451δ −1

0.375δ −1

0.363δ −1

Nε 2

N (N −1)ε 2

N (N −c)ε 2

4

2c

0.435δ −1 N  3 3 ε

0.781δ −1 N  N  3 3 − 2 ε

0.635δ −1 N  4 4 ε

0.8345δ −1   1 N 4 5 4 ε

N ε2 2

Hi = [(N − i)δ]−1 . MTTDL =

M

Vi Hi =

i=0

M i=0

Vi . (N − i)δ

(9)

In a RAID system with no repair the MTTDL can be expressed as:

∞ MTTDL = RRAI D (N )dt 0

= 0

M ∞

A(N, i)r N −i (1 − r)i dt.

(10)

i=0

The integration can be carried out symbolically for r = e−δt , but otherwise numerical integration methods are applicable. The MTTDL for various RAID1 organizations, but also for RAID5/6/7 are given for systems without repair. It is interesting to note that LSI RAID has the highest MTTDL, followed only by RAID7 and BM. MTTDLs for various arrays discussed in this paper are summarized in Table 4. It is important to note that the result for SSPiRAL only applies to N = 8 and that it is more reliable than RAID7, since in addition to all three disk failures it tolerates some four disk failures.

6 Approximate reliability analysis Rather than specifying the disk reliability distribution, we express the disk reliability as r = 1 − ε. Assuming that the MTTF for disks is a million hours or 114 years, a good approximation for a small t (ts ) for an exponential distribution

is: R(ts ) ≈ 1 − δts , so that after 3 years R(3) ≈ 1 − 3/114 = 0.975 and ε = 0.025  1. The approximate reliability analysis in [35] expresses the system reliability as a power series of disk unreliabilities. However, since ε  1, only its smallest power need to be retained for reliability comparison purposes. We use r = 1 − r = ε to simplify the notation. For n-way replication allows n − 1 disk failures to be tolerated. Rn−way (n) = 1 − (1 − r)n = 1 − (r)n = 1 − ε n .

(11)

For multidisk RAID1 organizations with N disks at most M = N/2 disk failures can be tolerated. For the BM organization we utilize the expression for A(N, i) given by (2), but only retain the ε 2 term for comparison purposes.2 RBM (N ) = r + N r N

N −1

M N −2 2 r +4 r r 2

≈ 1 − N ε 2 /2.

(12)

For GRD instead of using A(N, i) in (4), we consider an alternative method by considering mirroring where each virtual disk consists of M disks, so that RM = r M .3 RGRD (N ) = 1 − (1 − RM )2 = 2r M − r 2M ≈ 1 − N (N − 1)ε 2 /4.

2 We 3 The

have corrected the first part of (7) in [35]. final result given here is slightly different from (9) in [35].

(13)

Cluster Comput (2012) 15:239–253

249

In the case of ID we simply use the expression for A(N, i) given by (5). RI D (N ) ≈ 1 − (N/2)(N/c − 1)ε 2 .

(14)

In the case of CD the reliability expression givenin(6) is not required. It can be argued that while there are N2 failures of two disks, there are N consecutive two disk failures that lead to data loss, so that A(N, 2) = N (N − 1)/2 − N . RCD (N ) ≈ r N + N r N −1 r +

N (N − 3) N −2 2 r r 2

≈ 1 − Nε 2 .

(15)

Note that BM is the most reliable among the four organizations followed by CD, while the relative reliability of GRD and ID depends on c. For c = N/2 ID is equivalent to BM. LSI RAID  tolerates all single and double disk failures: A(N, i) = Ni , 0 ≤ i ≤ 2, but only half of three adjacent   disk failures, so that A(N, 3) = N3 − N/2 = (N 3 − 3N 2 − N)/6. We have: RLSI (N) ≈

3

A(N, i)r N −i (1 − r)i ≈ 1 −

i=0

N 3 ε . 2

(16)

LSI RAID is more reliable than the aforementioned RAID1 organizations and also RAID5 and RAID6. According to row 5 in Table 2 two pairs of adjacent disk failures, with an intervening Pdisk, may also lead to data loss in LSI RAID. There is no need to consider this case, since the probability of four disk failures is much smaller than three disk failures. It follows from the RAID5 reliability expression: RRAI D5 (N ) ≈ 1 − N (N − 1)ε 2 /2, that it is less reliable than the four RAID1 organizations, since it cannot tolerate more than one disk failure. In the case of RAID(k + 4), k ≥ 1, which is an kDFT it can be shown by induction that:

N RRAI D(4+k) ≈ 1 − ε k+1 k+1

N + (k + 1) ε k+2 − · · · (17) k+2 According to (16) for N = 8 disks RLSI (8) ≈ 1 − 4ε 3 , while it follows from (17) that: RRAI D6 (8) ≈ 1 − 56ε 3 . LSI RAID is more reliable than RAID6, because in addition to all two disk failures, it tolerates most three disk failures. A RAID6 array requires N  = 6 disks to hold the

same volume of data as LSI, so that RRAI D6 (6) = 1 − 20ε 3 , which is still less reliable than LSI RAID. It follows from (17) that for RAID7 with N = 8:

N 4 RRAI D7 (8) ≈ 1 − ε = 1 − 70ε 4 . 4 SSPiRAL tolerates all disk failures up to three and assuming that 1/5th of four disk failures for N = 8 lead to data loss:

1 N 4 RSSP iRAL ≈ 1 − ε . (18) 5 4 The second term in the asymptotic expansion is 56ε 4 for N = 8, which confirms that RAID7 is more reliable than SSPiRAL. RAID(4 + k) for k = 4 which tolerates all four disk failures: RRAI D8 (8) ≈ 1 − 85 ε 5 = 1 − 56ε 5 . The above discussion is summarized in Table 4. We conclude our discussion with a comparison of the reliability of RAID1/0 versus RAID0/1. RAID1/0 is a mirrored disk with two virtual disks, where each virtual disk is a RAID0 array with M disks, whose reliability is designated as R0 . 2  R1/0 = 1 − (1 − R0 )2 = 1 − 1 − r M = 2r M − r 2M ≈ 1 − M 2 ε 2 . RAID0/1 is RAID0 array where each disk is a mirrored disk. M  M  R0/1 = 1 − (1 − r)2 = 2r − r 2 ≈ 1 − Mε 2 . It is interesting to note that R1/0 < R0/1 , since RAID1/0 is considered failed if one disk on either side fails, since data is replicated at RAID0 level. This inequality can be shown easily for smaller values of M without resorting to the asymptotic expansion.

7 LSI RAID performability In Table 5 we list the maximum load incurred by read and write requests. These costs are affected by not only the total number of failed disks, but also the number of failed Ddisks. In what follows we use x to specify the service time for SR and SW requests, whose inverse is the maximum throughput in those states. A RMW request is an SR followed by an SW, so it counts as two requests. RCWs entail SRs and SWs. The values given are the maximum read and write costs for a given failure configuration. An example is given to explain why the maximum write load with four disk failures three of which are Ddisks is 6x,

250 Table 5 Maximum disk access costs for varying number of disk and Ddisk failures. We use x to denote mean disk service times

Cluster Comput (2012) 15:239–253

1 failure

2 failures

3 failures

4 failures

Cases

Read cost

Write cost

Fraction

0 data disk failures

x

3x

1 data disk failure

1.5x

4x

1 2 1 2

0 data disk failures

x

3x

1 data disk failure

1.5x

4x

2 data disk failures

2x

4x

0 data disk failures

x

2x

1 data disk failure

1.5x

4x

2 data disk failures

2x

5x

3 data disk failures

4x

5x

0 data disk failures

x

x

1 data disk failure

1.5x

3x

2 data disk failures

2x

5x

3 data disk failures

4x

6x

6 28 16 28 6 28 4 52 20 52 24 52 4 52 1 45 8 45 20 45 16 45

which corresponds to the last line in Table 5. Consider the array, where d or p denotes a broken disk. {d 1 , p1,2 , d 2 , p2,3 , d 3 , p3,4 , d4 , p4,1 }

(1) To write d1new , we need to update p1,2 and p4,1 . Read d4 and p4,1old to compute d1old , compute d1diff = d1old ⊕ d1new , then compute and write p1,2new = p1,2old ⊕ d1diff . Read d4 and compute and write p4,1 = d1new ⊕ d4 . (2) To write d2new we need to update p1,2 and p2,3 . Read d4 and p4,1 to compute d1 = d4 ⊕ p4,1 . Given d1 compute and write p1,2 = d1old ⊕ d2new . To update p2,3 compute then d2old = d1 ⊕ p1,2 , d2diff = d2new ⊕ d2old , and p2,3new = p2,3old ⊕ d2diff . (3) To write d3new requires updating only p2,3 , since p3,4 is broken. We can read d4 and p4,1 and compute d1 = d4 ⊕ p4,1 , then read p1,2 and obtain d2 = d1 ⊕ p1,2 . Compute and write p2,3 = d2 ⊕ d3new . (4) To update d4new we use the RMW method to update d4 and p4,1 . In other words if Ddisks are accessed with rate λ then P4,1 will be accessed with rate 6λ. The maximum throughput in IOPS (I/O’s per Second) is: Ti with i failed disks is given as the number of nonfailed Ddisks divided by their mean service time for a given fraction of reads and writes. Maximum throughputs with zero to four disk failures are given in Fig. 8 postulating x = 10 ms. For example, the maximum throughput with one disk failure for fR = 0 is the average of the two cases with no and one Ddisk failure: (400 + 3 × 66.7)/2 = 300 IOPS.

Fig. 8 Maximum throughput given as I/Os per Second (IOPS) for varying number of failed disks and varying fractions of writes: fW = 0, fW = 0.2, and fW = 0.5

The performability measure combines the failure process with performance [11]. In this study we define performability as the number of I/O requests that are processed by a system from the beginning of its operation to the point that data loss occurs. Referring back to the Markov chain model specifying the disk failure process, the performability is obtained by summing over all intervals with 0 ≤ i ≤ M failed disks of the duration of the interval times the maximum read throughput during the interval, denoted by Ti . Note similarity to (9) to compute the MTTDL. P=

M i=0

v i Ti . (N − i)δ

(19)

Cluster Comput (2012) 15:239–253

251

Data blocks on data disk Di,j are protected by parity blocks at the four Pdisks Pi±1,j ±1 and vice-versa. All additions and subtractions are modulo N , so that there is wraparound both horizontally and vertically. Pi,j = Di,j −1 ⊕ Di+1,j ⊕ Di,j +1 ⊕ Di−1,j . It takes four disks to reconstruct a block on a failed disk, one of its four Pdisks and the three Ddisks surrounding that Pdisk. For example, if D4,4 fails, to reconstruct it using P3,4 we compute: D4,4 = P3,4 ⊕ D3,3 ⊕ D2,4 ⊕ D3,5 .

Fig. 9 Performability for LSI RAID compared with the other four RAID1 organizations

In Fig. 9 we plot the performability for the four RAID1 organizations plus LSI RAID. Similarly to [38] we only consider read requests to simplify the discussion. We set δ = 10−6 hours−1 and have multiplied times (in hours) by IOPS per second, so that the performability is 3600 times higher. It is interesting to note that in spite of its higher MTTDL than other RAID1 organizations, LSI RAID has the smaller performability, which is due to the fact that it has four Ddisks. A tool for performability analysis of storage systems allowing repair, was developed in a recent study at Sun/ Oracle [29]. A Markovian model was used to evaluate the probabilities associated with normal and degraded states based on field reliability data collected from customer sites. Fault injection tests were conducted to measure the performance of the storage system in degraded states with an internal performance benchmark.

8 Multidimensional hybrid mirrors We propose an extension to LSI RAID in two dimensions with DoutD = PinD = 4. Shown below is an N × N with N = 2M = 8 disks placed in a square. It is a mirrored array in that N 2 /2 disks are parity disks. ⎛ ⎞ D1,1

⎜ P2,1 ⎜D ⎜ 3,1 ⎜ P4,1 ⎜ ⎜D5,1 ⎜ ⎜ P6,1 ⎝ D7,1 P8,1

P1,2 D2,2 P3,2 D4,2 P5,2 D6,2 P7,2 D8,2

D1,3 P2,3 D3,3 P4,3 D5,3 P6,3 D7,3 P8,3

P1,4 D2,4 P3,4 D4,4 P5,4 D6,4 P7,4 D8,4

D1,5 P2,5 D3,5 P4,5 D5,5 P6,5 D7,5 P8,5

P1,6 D2,6 P3,6 D4,6 P5,6 D6,6 P7,6 D8,6

D1,7 P2,7 D3,7 P4,7 D5,7 P6,7 D7,7 P8,7

P1,8 D2,8 ⎟ P3,8 ⎟ ⎟ D4,8 ⎟ ⎟. P5,8 ⎟ ⎟ D6,8 ⎟ ⎠ P7,8 D8,8

The contents of most failed disks can be reconstructed in one step, but with a very large number of disk failures the recovery of the contents of a disk may require multiple steps. The minimum number of disk failures that will lead to data loss is N , when the failed disks constitute “diagonals”. In the system under consideration diagonals with negative slopes are as follows: (1,1)–(8,8); (3,1)–(8,6) and (1,7–(2,8); (5,1)–(8,3) and (1,5)–(4,8); (7,1)–(1,3) and (1,3)–(6.8). Diagonals with positive slopes are as follows: (7,1)–(1,7) and (8,8); (5,1)–(1,5) and (8,6)–(6,8); (3,1)–(1,3) and (8,4)– (4,8); (1,1) and (8,2)–(1,7). The scheme may be extended to three dimensions with DoutD = PinD = 6.

9 Conclusions We have described the LSI RAID, where each Ddisk (data disk) is protected by two Pdisks (parity disks) and each Pdisk protects two Ddisks. LSI RAID can be classified as RAID1 or a mirrored disk because of its redundancy level, but unlike RAID1 which is 1DFT, it is a 2DFT, which also tolerates almost all three disk failures, excepting three consecutive disk failures with two Pdisks. We describe the RMW and RCW methods to update parities and evaluate their performance. For an OLTP workload LSI RAID has an inferior performance with respect to other RAID1 organizations, but it outperforms RAID6. We propose a method to balance Ddisk and Pdisk loads in LSI RAID to maximize the IOPS. The reliability comparison of RAID arrays is carried out by: (i) deriving the reliability expression for RAID arrays and plotting the reliability graph versus time; (ii) using a power series expansion method, which does not require the derivation of the full reliability expression. Although a closed form analytic expression is not derived for LSI RAID, we outline an enumerative approach to estimate the reliability of the LSI RAID method. Systems with repair are not considered in this study, because of the weakness

252

of analytic methods in this case, as stated in discussing the analysis in [2]. Disk arrays with nonMDS coding are relevant to the discussion of hybrid RAID. Data disks placed in a square can be protected by parity disks placed in a column and a row [4]. The Grid code protects data strips placed in two dimensions with horizontal and vertical codes, providing protection against k horizontal and  vertical strips [28], In the simplest case we have a Single Parity Code (SPC), which can be used both as a horizontal and vertical code (k = 1 or  = 1). RDP or EVENODD are 2DFT horizontal codes, so that k = 2 and the X-code is a 2DFT vertical code with  = 2, which is restricted to a prime number of disks.

References

1. Alvarez, G.A., Burkhard, W.A., Stockmeyer, L.J., Cristian, F.: Declustered disk array architectures with optimal and near-optimal parallelism. In: Proc. 25th Ann'l Int'l Symp. on Computer Architecture (ISCA 1998), Barcelona, Spain, June, pp. 109–120 (1998)
2. Amer, A., Long, D.D.E., Paris, J.-F., Schwarz, T.: Increased reliability with SSPiRAL data layouts. In: Proc. 16th Int'l Symp. on Modeling, Analysis, and Simulation of Computer and Telecomm. Systems (MASCOTS'08), Baltimore, MD, Sept., pp. 189–198 (2008)
3. Bachmat, E., Schindler, J.: Analysis of methods for scheduling low priority disk drive tasks. In: Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, Los Angeles, CA, June, pp. 55–65 (2002)
4. Chen, P.M., Lee, E.K., Gibson, G.A., Katz, R.H., Patterson, D.A.: RAID: high-performance, reliable secondary storage. ACM Comput. Surv. 26(2), 145–185 (1994)
5. Chen, S.-Z., Towsley, D.F.: A performance evaluation of RAID architectures. IEEE Trans. Comput. 45(10), 1116–1130 (1996)
6. Chen, M.S., Hsiao, H.-I., Li, C.-S., Yu, P.S.: Using rotational mirrored declustering for replica placement in a disk-array-based video server. Multimed. Syst. 5(6), 371–379 (1997)
7. Dholakia, A., Eleftheriou, E., Hu, X.-Y., Iliadis, I., Menon, J., Rao, K.K.: A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors. ACM Trans. Storage 4(1), 1 (2008)
8. Gibson, G.A.: Redundant Disk Arrays: Reliable, Parallel Secondary Storage. MIT Press, Cambridge (1992)
9. Hafner, J.L., Deenadhayalan, V., Kanungo, T., Rao, K.K.: Performance metrics for erasure codes in storage systems. IBM Research Report RJ 10231, Almaden, CA, USA, August (2004)
10. Hafner, J.L.: WEAVER codes: highly fault tolerant erasure codes for storage systems. In: Proc. 4th USENIX Conf. on File and Storage Technologies (FAST'05), San Francisco, CA, December, pp. 211–224 (2005)
11. Haverkort, B.R., Marie, R., Rubino, G., Trivedi, K.S.: Performability Modelling: Techniques and Tools. Wiley, New York (2001)
12. Hsiao, H.-I., DeWitt, D.J.: Chained declustering: a new availability strategy for multiprocessor database machines. In: Proc. IEEE Int'l Conf. on Data Engineering (ICDE'90), Los Angeles, CA, February, pp. 456–465 (1990)
13. Hsiao, H.-I., DeWitt, D.J.: A performance study of three high availability data replication strategies. Distrib. Parallel Databases 1(1), 53–80 (1993)
14. Hwang, K., Jin, H., Ho, R.S.C.: Orthogonal striping and mirroring in distributed RAID for I/O-centric cluster computing. IEEE Trans. Parallel Distrib. Syst. 13(1), 26–44 (2002)
15. Iliadis, I., Haas, R., Hu, X.-Y., Eleftheriou, E.: Disk scrubbing versus intradisk redundancy for RAID storage systems. ACM Trans. Storage 7(2), 5 (2011)
16. Menon, J., Mattson, D.: Comparison of sparing alternatives for disk arrays. In: Proc. 19th Ann'l Int'l Symp. on Computer Architecture (ISCA 1992), Gold Coast, Australia, May, pp. 318–329 (1992)
17. Merchant, A., Yu, P.S.: Analytic modeling and comparisons of striping strategies for replicated disk arrays. IEEE Trans. Comput. 44(3), 419–433 (1995)
18. Merchant, A., Yu, P.S.: Analytic modeling of clustered RAID with mapping based on nearly random permutation. IEEE Trans. Comput. 45(3), 367–373 (1996)
19. Muntz, R.R., Lui, J.C.S.: Performance analysis of disk arrays under failure. In: Proc. 16th Int'l Conf. on Very Large Data Bases, Brisbane, Queensland, Australia, August, pp. 162–173 (1990)
20. Paris, J.-F., Schwarz, T.J.E., Long, D.D.E.: Self-adaptive disk arrays. In: Proc. 8th Int'l Symp. on Stabilization, Safety, and Security of Distributed Systems (SSS 2006), Dallas, TX, November, pp. 469–483 (2006)
21. Park, C.-I.: Efficient placement of parity and data to tolerate two disk failures in disk array systems. IEEE Trans. Parallel Distrib. Syst. 6(11), 1177–1184 (1995)
22. Schroeder, B., Gibson, G.A.: Understanding disk failure rates: what does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage 3(3), 8 (2007)
23. Schroeder, B., Damouras, S., Gill, P.: Understanding latent sector errors and how to protect against them. ACM Trans. Storage 8(3), 8 (2010)
24. Shang, P., Wang, J., Zhu, H., Gu, P.: A new placement-ideal layout for multiway replication storage system. IEEE Trans. Comput. 60(8), 1142–1156 (2011)
25. Teradata: DBC/1012 database computer system manual release 2.0. Document No. C10-0001-02, Teradata Corp., November (1985)
26. Kari, H.H.: Latent sector faults and reliability of disk arrays. Ph.D. thesis, University of Helsinki, Espoo, Finland (1997)
27. Lee, E.K., Thekkath, C.A.: Petal: distributed virtual disks. In: Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), Cambridge, MA, October, pp. 84–92 (1996)
28. Li, M., Shu, J., Zheng, W.: GRID codes: strip-based erasure codes with high fault tolerance for storage systems. ACM Trans. Storage 4(4), 15 (2009)
29. Sun, H., Tyan, T., Johnson, S., Elling, R., Talagala, N., Wood, R.B.: Performability analysis of storage systems in practice: methodology and tools. In: Proc. 3rd Int'l Service Availability Symp. (ISAS 2006), Helsinki, Finland, May 2006. Lecture Notes in Computer Science, vol. 4328, pp. 62–75. Springer, Berlin (2006) (revised selected papers)
30. Thomasian, A., Menon, J.: Performance analysis of RAID5 disk arrays with a vacationing server model for rebuild mode operation. In: Proc. IEEE Int'l Conf. on Data Engineering (ICDE'94), Houston, TX, February, pp. 111–119 (1994)
31. Thomasian, A., Menon, J.: RAID5 performance with distributed sparing. IEEE Trans. Parallel Distrib. Syst. 8(6), 640–657 (1997)
32. Thomasian, A.: Reconstruct versus read-modify writes in RAID. Inf. Process. Lett. 93(4), 163–168 (2005)
33. Thomasian, A.: Clustered RAID arrays and their access costs. Comput. J. 48(6), 702–713 (2005)
34. Thomasian, A.: Mirrored disk routing and scheduling. Clust. Comput. 9(4), 475–484 (2006)
35. Thomasian, A.: Shortcut method for reliability comparisons in RAID5. J. Syst. Softw. 79(11), 1599–1605 (2006)
36. Thomasian, A., Blaum, M.: Mirrored disk organization reliability analysis. IEEE Trans. Comput. 55(12), 1640–1644 (2006)
37. Thomasian, A., Fu, G., Han, C.: Performance of two-disk failure-tolerant disk arrays. IEEE Trans. Comput. 56(6), 799–814 (2007)
38. Thomasian, A., Xu, J.: Reliability and performance of mirrored disk organizations. Comput. J. 51(6), 615–629 (2008)
39. Thomasian, A., Blaum, M.: Higher reliability redundant disk arrays: organization, operation, and coding. ACM Trans. Storage 5(3), 7 (2009)
40. Thomasian, A.: Survey and analysis of disk scheduling methods. Comput. Archit. News 39(2), 8–25 (2011)
41. Thomasian, A., Xu, J.: RAID level selection for heterogeneous disk arrays. Clust. Comput. 14(2), 115–127 (2011)
42. Thomasian, A., Tang, Y.: Performance, reliability, and performability aspects of hierarchical RAID. In: Proc. 6th Int'l Conf. on Networking, Architecture, and Storage (NAS 2011), Dalian, China, July, pp. 92–101 (2011)
43. Trivedi, K.S.: Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd edn. Wiley, New York (2001)
44. Venkatesan, V., Iliadis, I., Hu, X.-Y., Haas, R., Fragouli, C.: Effect of replica placement on the reliability of large-scale data storage systems. In: Proc. 18th Ann'l IEEE/ACM Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS'10), Miami, FL, August, pp. 79–88 (2010)
45. Venkatesan, V., Iliadis, I., Fragouli, C., Urbanke, R.: Reliability of clustered vs. declustered replica placement in data storage systems. In: Proc. 19th Ann'l IEEE/ACM Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS'11), Singapore, August, pp. 307–317 (2011)
46. Venkatesan, V., Iliadis, I., Haas, R.: Reliability of data storage systems under network rebuild bandwidth constraints. In: Proc. 20th Ann'l IEEE/ACM Int'l Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS'12), Washington, D.C., August, pp. 79–88 (2012)
47. Wilner, A.: Multiple drive failure tolerant RAID system. US Patent 6,327,672, December 2001
48. Xu, L., Bohossian, V., Bruck, J., Wagner, D.G.: Low-density MDS codes and factors of complete graphs. IEEE Trans. Inf. Theory 45(6), 1817–1836 (1999)

Alexander Thomasian received his Ph.D. degree in Computer Science from UCLA. He spent two years as a Sr. Staff Scientist at Burroughs Corp. (now UNISYS) and fifteen years at IBM's T. J. Watson Research Center as a Research Staff Member. He was a faculty member at Case Western Reserve University, the University of Southern California, and the New Jersey Institute of Technology. His research on storage systems at NJIT was funded by NSF, Hitachi Global Storage Technologies, and AT&T Research. He was an Outstanding Visiting Scientist at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen, China, for the year starting in November 2010. At Thomasian & Associates he conducts research and consulting in the area of storage systems. He has published about 150 book chapters, journal, and conference papers in the areas of concurrency control in databases, high-dimensional indexing and dimensionality reduction, performance analysis, and storage systems. He is the coauthor of four US patents and two dozen invention disclosures, and a recipient of the IBM Outstanding Innovation Award. He was an area editor of IEEE Trans. on Parallel and Distributed Systems and has served on the program committees of numerous IEEE and ACM conferences. He has been a Fellow of the IEEE since January 2000.

Yujie Tang received her M.S. degree in Information and Communications Engineering from the Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, in December 2010. She was a research staff member at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, for six months before joining the University of Waterloo, Ontario, Canada, to pursue the Ph.D. degree. She is affiliated with the Broadband Communications Research Group led by Prof. Jon W. Mark, with interests in cognitive radio, cooperative networks, and the smart grid.
