Higher Reliability Redundant Disk Arrays: Organization, Operation, and Coding
ALEXANDER THOMASIAN, Thomasian and Associates
MARIO BLAUM, Universidad Complutense de Madrid (UCM)
Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by exclusive-ORing the corresponding blocks on surviving disks. RAID5 can mask a single disk failure, but it is vulnerable to data loss if a second disk failure occurs. The RAID5 rebuild process systematically reconstructs the contents of a failed disk on a spare disk, returning the system to its original state, but the rebuild process may be unsuccessful due to unreadable sectors. This has led to two disk failure tolerant arrays (2DFTs), such as RAID6 based on Reed-Solomon (RS) codes. EVENODD, RDP (Row-Diagonal-Parity), the X-code, and RM2 (Row-Matrix) are 2DFTs with parity coding. RM2 incurs a higher level of redundancy than two disks, while the X-code is limited to a prime number of disks. RDP is optimal with respect to the number of XOR operations at the encoding, but not for short write operations. For small symbol sizes EVENODD and RDP have the same disk access pattern as RAID6, while RM2 and the X-code incur a high recovery cost with two failed disks. We describe variations to RAID5 and RAID6 organizations, including clustered RAID, different methods to update parities, rebuild processing, disk scrubbing to eliminate sector errors, and the intra-disk redundancy (IDR) method to deal with sector errors. We summarize the results of recent studies of failures in hard disk drives. We describe Markov chain reliability models to estimate RAID mean time to data loss (MTTDL) taking into account sector errors and the effect of disk scrubbing. Numerical results show that RAID5 plus IDR attains the same MTTDL level as RAID6, while incurring a lower performance penalty. We conclude with a survey of analytic and simulation studies of RAID performance and tools and benchmarks for RAID performance evaluation.
Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—Secondary Storage; D.4.5 [Operating Systems]: Reliability—Fault-tolerance; B.8.1 [Performance and Reliability]: Fault-tolerance—Coding methods
Authors’ addresses: A. Thomasian, Thomasian and Associates, 17 Meadowbrook Road, Pleasantville, NY 10570; email: [email protected]; M. Blaum, Grupo de Análisis, Seguridad y Sistemas (GASS), Facultad de Informática, Despacho 431, Universidad Complutense de Madrid (UCM), C/ Profesor José García Santesmases s/n, 28040 Madrid, Spain; email: mario.blaum@fdi.ucm.es.
General Terms: Design, Algorithms, Reliability
Additional Key Words and Phrases: Disk array, RAID, reliability evaluation, disk failure studies, performance evaluation
ACM Reference Format: Thomasian, A. and Blaum, M. 2009. Higher reliability redundant disk arrays: organization, operation, and coding. ACM Trans. Storage 5, 3, Article 7 (November 2009), 59 pages. DOI = 10.1145/1629075.1629076 http://doi.acm.org/10.1145/1629075.1629076
1. INTRODUCTION

Redundant Arrays of Inexpensive/Independent Disks (RAID) is a popular classification for disk arrays [Patterson et al. 1988], which was developed as part of a paradigm to replace the large form-factor, expensive disks used in conjunction with mainframe computers of the late 1980s with an array of inexpensive, small-capacity, small form-factor disks used in personal computers (PCs). While the same storage capacity could be achieved at a reduced overall cost, the increased number of components resulted in lower reliability. This argument was true in the 1980s, but large form-factor disks are not manufactured anymore and have been replaced by small form-factor disks with very high capacities. Techniques based on replication and erasure coding have been introduced to allow the recovery of failed disks to attain sufficiently high reliability levels [Patterson et al. 1988; Gibson 1992; Chen et al. 1994]:

(1) Replication, which is used in mirrored disks or RAID level 1 (RAID1).
(2) Erasure coding via a single parity, which is used in RAID3, RAID4, and RAID5.
(3) The Hamming erasure code, which is used solely in RAID2.
(4) Erasure coding via the Reed-Solomon (RS) code [MacWilliams and Sloan 1977], which was introduced to protect against two disk failures [Chen et al. 1994].

Specialized parity codes, such as EVENODD and RDP, provide protection against two disk failures at a lower computational cost.

Disk loads in RAID may be unbalanced when hot files are allocated on some of the disks. Striping deals with access skew by partitioning large files into fixed-size stripe units (SUs) or strips, for example, 64 KB in size, which are allocated in round-robin manner over the disks. Most RAID levels incorporate striping, while RAID0 has no redundancy, but only striping. Disk arrays can be classified as k-disk failure tolerant (kDFT): RAID0 is a 0DFT, RAID1 and RAID5 are 1DFTs, and RAID6 is a 2DFT. Coding techniques have also been developed for 3DFTs (see Section 2).

This tutorial extends and updates Chen et al. [1994] with emphasis on 2DFTs. We allow some level of redundancy to make this presentation self-contained. We emphasize parity-based coding techniques applicable to 2DFTs, but not RS codes, which are discussed in texts and tutorials [Chen et al. 1994; Plank 1997; Plank and Ding 2005; Plank 2005]. The discussion of reliability modeling in Section 4 is complementary to Chen et al. [1994]. The emphasis is on recent studies of whole-disk reliability, but also of media errors. We review
recent work on extending RAID reliability modeling to incorporate sector errors, the effect of intra-disk redundancy (IDR) [Dholakia et al. 2008], and the effect of disk scrubbing to preemptively fix sector errors.

This section starts with a discussion of trends in storage technology, techniques to improve the performance of disk drives, and alternative storage technologies. This is followed by an introduction to the different RAID levels. We conclude with a review of the organization of the paper.

1.1 Storage Technology

We start with a review of trends in disk technology. This is followed by a discussion of the components of disk service time and disk scheduling methods. We proceed with a discussion of the effect of the memory hierarchy on the performance of database applications, since they have stringent performance requirements. We conclude with a brief review of alternative storage technologies.

1.1.1 Trends in Disk Technology. Hard disk drives (HDDs) are classified in Anderson et al. [2003] into Enterprise Storage (ES) used in servers and Personal Storage (PS) used in PCs. ES drives are more expensive than PS drives because of their more advanced technology. They provide lower seek times and higher rotations per minute (RPM), for example, 10,000 or 15,000 RPM versus 7200 RPM. ES HDDs tend to have a higher data rate than PS HDDs due to their higher RPM, but this is not always so, since PS disks may have a higher recording density and larger platter diameters. Power consumption, which increases with the cube of the RPM, is reduced by adopting smaller-diameter platters, say 2.5 inches for 15,000 RPM and 3.3 inches for 10,000 RPM, versus 3.7 inches for 7200 RPM PS drives. This loss in areal capacity may be compensated by stacking disk platters on top of each other. PS drives are intended for use several hours a day, while ES drives run all the time. HDDs are also classified by their interface: Small Computer System Interface (SCSI) or Fibre Channel (FC) is used for ES, while Advanced Technology Attachment (ATA) or IDE is used for PS. There are also Serial Attached SCSI (SAS) and Serial ATA (SATA). More recently, HDDs with SATA are replacing magnetic tape storage for the backup of ES drives and are also used for archival storage.

Catastrophic disk failures, such as the failure of disk electronics and disk head crashes, are identified by appropriate fault detection mechanisms, end-to-end tests, and disk interface protocol violations. Disk failures in RAID are known as erasures [Gibson 1992], since the location of the failed disk is known. Check blocks at the RAID level are then used for error correction, rather than detection, across disks. Data is organized on disk as 512- or 520-byte sectors with an Error Correcting Code (ECC), which may be 40 bytes long [Jacob et al. 2008]. Longer 4096-byte sectors are also under discussion. The majority of soft bit errors, which amount to one bit in 10^5 bits, are corrected by the ECC, so that the hard error rate is one in 10^14 bits in PS drives and one in 10^15 bits in ES drives (see Section 18.9 in Jacob et al. [2008]). Media failures are handled by relocating the contents of bad sectors to healthy disk sectors. The sectors on a drive are identified by logical block addresses (LBAs). Assigning the LBA of a faulty sector to the next physical sector is referred to as slipping, while sparing utilizes empty sectors.
In online transaction processing (OLTP), transaction updates are logged by the database management system (DBMS) to ensure transaction durability and atomicity [Ramakrishnan and Gehrke 2003]. In the case of a system failure due to a power outage or a software or hardware error that does not involve a disk failure, the server is restarted and the disks are brought up-to-date using log records to reflect the effect of committed transactions. Nearline storage (a tape library enhanced with robotic arms), which was used for disk backup, is being replaced with less expensive disks for this purpose [Gray and Shenoy 2000].

Disk reliability has been increasing over time and the mean time to failure (MTTF) for disks exceeds a million hours. Still, the time to the first disk failure is only a thousand hours for 1000 disks with an MTTF of a million hours. This is strictly so when the time to disk failure is exponentially distributed [Trivedi 2002], which was shown to be a good approximation to the disk failure process in Gibson [1992]. Performance is degraded when operating with a failed disk. The cost of downtime varies with the line of business and can be quite high (see Figure 1.3 in Hennessy and Patterson [2006]). Remote synchronous backup is desirable for critical applications [Ji et al. 2003], where the secondary site can assume the role of the primary site.

1.1.2 Disk Arm Scheduling in Magnetic Disk Drives. While Solid State Disks (SSDs) are expected to eventually replace magnetic disk drives, the latter are expected to remain the workhorse for data storage. A brief description of their operation is therefore provided to clarify some of the discussions in later sections. The access time of magnetic disks, which are mechanical in nature, has three components: the seek time to move the disk arm to the appropriate track, the rotational latency for the desired block to reach the read/write head, and the transfer time to read or write it [Ruemmler and Wilkes 1994; Ng 1998]. The disk recording density has been increasing steadily: 29% per year until 1988, 60% per year in 1988–1996, 100% per year in 1997–2003, and 30% per year since then [Hennessy and Patterson 2006]. The increased linear recording density, combined (to a lesser degree) with higher disk RPMs, has resulted in an increased disk transfer rate (50% in the year 2000 [Gray and Shenoy 2000]). The rotational latency, which can be approximated by one half of the disk rotation time for accesses to small blocks, is also improved by higher RPMs. Seek time is lower due to the increased track density, but there is an accompanying increase in head settling time (HST), which is higher for writes than for reads [Ng 1998]. In fact, it is recommended in Hsu and Smith [2003] that different seek time characteristics be used for read and write accesses. In modern disks the read/write head covers multiple tracks, so that the seek time amounts to head settling time [Schlosser et al. 2005]. This has implications on how sequential data is stored on disk. The disk arm positioning time (the sum of seek time and rotational latency) was improving at a rate of 8% annually at the turn of the century [Gray and Shenoy 2000]. In view of rapidly increasing disk capacities, disk access bandwidth for randomly placed blocks is a limiting factor. Increasing cache sizes at different levels of the memory hierarchy are beneficial in lowering the overall miss rate, so that the disk access rate per gigabyte (GB) is expected to decrease.
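As a rough illustration of how the three access-time components just described combine, the following minimal Python sketch estimates the expected service time of a read of a randomly placed small block. The drive parameters in the call are assumptions made for illustration only; they are not figures quoted in this article.

def expected_service_time_ms(avg_seek_ms, rpm, block_kb, transfer_mb_s):
    """Rough estimate of the time to read a randomly placed small block."""
    rotation_ms = 60000.0 / rpm            # time for one full revolution
    latency_ms = rotation_ms / 2.0         # mean rotational latency is about half a revolution
    transfer_ms = block_kb / 1024.0 / transfer_mb_s * 1000.0
    return avg_seek_ms + latency_ms + transfer_ms

# Assumed figures for a 15,000 RPM enterprise drive reading a 4 KB block:
print(expected_service_time_ms(avg_seek_ms=3.5, rpm=15000, block_kb=4, transfer_mb_s=100))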
Access bandwidth can be improved by disk arm scheduling. The shortest seek time first (SSTF) and SCAN methods minimize the seek distance and hence the disk seek time and access time [Denning 1967]. Since the disk rotation time is significant with respect to the seek time, the Shortest Access/Positioning Time First (SATF/SPTF) policy minimizes the positioning time, which is the sum of seek time and rotational latency. SATF outperforms the SSTF and SCAN methods at heavier loads [Worthington et al. 1994; Hsu and Smith 2004; Thomasian and Liu 2002]. It can be shown that SATF with a queue of 32 disk requests attains one half of the mean service time of the First-Come, First-Served (FCFS) policy [Anderson et al. 2003; Thomasian and Liu 2005]. The effect of up to 256 disk requests on disk throughput is given in Figure 6.2 in Hennessy and Patterson [2006]. Track-aligned extents are another method to improve performance, by ensuring that records are not split across tracks [Schindler et al. 2002].

A straightforward implementation of priorities for SATF results in a significant degradation in maximum throughput, since not all disk requests will be available for scheduling. Priorities can be implemented and performance improved by discounting the positioning time of higher priority requests, that is, multiplying it by a discount factor 0 < t < 1 [Thomasian and Liu 2002]. Starvation associated with SATF can be avoided by increasing the priority of requests according to their waiting time w, that is, discounting the positioning time in proportion to the waiting time: the positioning time is multiplied by t^i, with i = 0 for w < W and i ≥ 1 for W + (i − 1)σ_W ≤ w ≤ W + iσ_W, where W and σ_W denote the measured mean and standard deviation of the waiting time of disk requests [Thomasian and Liu 2004] (a sketch of this rule is given below).

More sophisticated disk scheduling policies are required for disks processing a stream of requests to large data blocks for multimedia applications, as well as accesses to small data blocks [Balafoutis et al. 2003]. The performance goal is to maximize the number of multimedia streams, while minimizing the mean response time for accesses to small blocks. More generally, in large-scale storage systems there is the issue of isolating applications from each other, which is achieved via hierarchical I/O scheduling [Shenoy and Vin 2002]. In Wong et al. [2006] a long-term partitioning of disk bandwidth is accomplished via pools; for example, a pool supports up to ten media streams. The number of sessions started by active processes should not exceed the pool allocation. Fair sharing implies that all active sessions and pools will receive, in addition to their allocation, additional best-effort resources. The disk scheduler takes into account when the release time or deadline of a disk request is exceeded.

Disk performance can be improved by rearranging its data blocks; for example, the organ-pipe organization reduces the mean seek time for disk accesses by placing the most frequently accessed blocks on the middle disk cylinders [Jacob et al. 2008]. Even further improvements are possible by taking into account the sequence of file accesses, which reduces the rotational latency as well. The Automatic Locality Improving Storage (ALIS) scheme relocates frequently accessed data and co-locates blocks accessed in sequence [Hsu et al. 2005].
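The following minimal Python sketch illustrates the SATF variant with the waiting-time-based discounting described above. The request representation, the positioning-time estimator, and the constants t, W, and σ_W are assumptions made for illustration; they are not taken from the cited papers.

import math

def pick_next(queue, now, positioning_time, t=0.8, W_mean=10.0, W_std=5.0):
    """Return the queued request with the smallest discounted positioning time.

    positioning_time(req) is assumed to estimate seek plus rotational latency
    from the current head position; requests are dicts with an 'arrival' field.
    """
    best, best_cost = None, math.inf
    for req in queue:
        w = now - req["arrival"]                      # waiting time of this request
        i = 0 if w < W_mean else 1 + int((w - W_mean) // W_std)
        cost = positioning_time(req) * (t ** i)       # long-waiting requests look "cheaper"
        if cost < best_cost:
            best, best_cost = req, cost
    return best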
At a higher level, we have data centers serving multiple independent, rapidly changing workloads, some of which have Quality of Service (QoS) requirements. The Facade controller, placed between the host computers and the storage system, throttles requests from appropriate sources to prevent overload [Lumb et al. 2003]. This is especially important when the storage system is operating in degraded mode.

RAID performance is analyzed in several studies assuming Poisson arrivals and FCFS disk scheduling [Chen and Towsley 1993; Menon 1994; Merchant and Yu 1996; Thomasian and Menon 1994; Thomasian and Menon 1997; Thomasian et al. 2004], so that the service of requests at each disk can be analyzed as an M/G/1 queueing system [Kleinrock 1975; Lavenberg 1983; Takagi 1991]. The performance analysis of zoned disks in Thomasian and Menon [1997] is generalized in Thomasian [2006b]. Most other disk scheduling disciplines are not work-conserving and are difficult to analyze. The FCFS policy provides a lower bound to performance compared to the SCAN and SATF policies [Worthington et al. 1994; Thomasian and Liu 2002]. Performance analyses of RAID are surveyed in Section 5.

1.1.3 Effect of the Memory Hierarchy on Performance. OLTP is a challenging workload, since it incurs accesses to randomly placed small disk blocks [Ramakrishnan et al. 1992]; that is, the positioning time is considerable with respect to the transfer time of small blocks. Transaction Processing Council (TPC, http://www.tpc.org) benchmarks compare OLTP throughput subject to a given percentile of the transaction response time not exceeding a certain threshold. This response time is heavily influenced by the number of disk accesses carried out on behalf of transactions. In fact, a large fraction of disk accesses may be obviated by the database and file buffers in main memory; for example, the higher two or three levels of a B+ tree index may be cached [Ramakrishnan and Gehrke 2003]. The contents of the volatile portion of the disk array controller (DAC) cache also assist in reducing the number of disk accesses. Currently, the onboard disk cache, which is part of the disk drive enclosure [Ruemmler and Wilkes 1994; Ng 1998; Jacob et al. 2008], is rather small and does not contribute to the hit rate for random accesses.

Part of the DAC cache is protected by an uninterruptible power supply (UPS) and hence constitutes nonvolatile RAM/storage (NVRAM/NVS). The reliability of duplexed NVRAM is comparable to that of magnetic disk drives, hence the fast-write feature considers a write completed as soon as the block is written onto duplexed NVRAM [Menon and Cortney 1993]. This feature allows the writing of modified data blocks to disk to be deferred. Read requests can be given a higher priority than writes, and this results in an improved response time for read requests, since they affect transaction response time in OLTP. Dirty blocks held in NVRAM are written out to disk, or destaged, when the buffer becomes full [Treiber and Menon 1995] or when it is being filled rapidly [Varma and Jacobson 1998]. Deferred destaging has two advantages [Menon 1994; Thomasian and Menon 1997]: (i) the overwriting of dirty blocks in NVRAM obviates
unnecessary destages (writes) to disk (note the similarity to the write-back policy for CPU caches [Hennessy and Patterson 2006]); (ii) batching requests allows the destage process to achieve a lower disk utilization for destaging. RAID performance is determined by the performance of the cache, which can be quantified by disk trace analysis, so that cache effects can be incorporated into simulations [Zabback et al. 1996] or analytic models [Menon 1994; Thomasian and Menon 1997]. There have been several studies of cache performance [Smith 1985; Treiber and Menon 1995], and a few analytical models have been developed [McNutt 2000; Tay and Zhou 2006].

1.1.4 Alternatives to Hard Disk Drives. In spite of the availability of very large dynamic random access memory (DRAM) memories, magnetic disks are the preferred storage medium, because of their nonvolatility and their high storage capacity at a very low cost per bit. DRAM can be made nonvolatile using a UPS (uninterruptible power supply). The projection in Gibson [1992] that the cost per bit for DRAM would become lower than the per-bit disk cost has not materialized so far. The drop in the price of magnetic disk storage has led to the replacement of magnetic tape libraries for backup with inexpensive disk drives. There are other memory technologies challenging magnetic disks in different application domains [Hennessy and Patterson 2006]. Flash memories have the same bandwidth as disks, but latency-wise are 100–1000 times faster. The per-GB price of flash memories is higher than that of disks, but it is dropping rapidly with respect to DRAM's $52 per GB price in 2007; that is, flash prices were $42 in 2005, $15 in 2006, and $7 in 2007 [Mathews et al. 2008]. An early concern with flash memories was that they accept a limited number of writes. Hybrid disks are equipped with flash memories [Mathews et al. 2008], which allow write caching and disk spindown (stopping disk rotation in idle periods). There has been significant recent interest in MicroElectroMechanical Systems (MEMS) storage [Carley et al. 2000], with a significantly lower cost than DRAM and an access time that is an order of magnitude faster than disk. Approximately ten storage class memories, which are potential disk drive replacements, are reviewed in Freitas and Wilcke [2008]. In this tutorial we concentrate on magnetic disks, since they are currently the dominant secondary storage medium.

1.2 Overview of RAID Levels

As noted earlier, RAID0 does not provide redundancy, but only striping for load balancing purposes. SUs (stripe units) or strips in a row constitute a stripe, and large datasets occupy multiple stripes. Techniques to determine the optimal SU size are discussed in Section 4.4 in Chen et al. [1994].

RAID1 or disk mirroring predates the RAID classification. It stores data redundantly to increase its availability; for example, in basic mirroring (BM) data is replicated on two disks. The doubling of data access bandwidth for read requests is beneficial in view of rapidly increasing disk capacities and the fact that random disk access time is improving slowly. When a disk
fails, BM has the disadvantage that the read load of the surviving disk is doubled. There are several RAID1 organizations which allow a more balanced load across disks when a disk failure occurs. The tradeoff between load balancing and reliability for these organizations is quantified in Thomasian and Blaum [2006] and Thomasian and Xu [2008]. In RAID1 arrays with numerous disk drives, the failure of two drives which shadow each other will lead to data loss, hence RAID1 is a 1DFT.

RAID5 (resp. RAID6) disk arrays dedicate the capacity of one (resp. two) disks among N disks to check blocks to protect against as many disk failures. A block on a failed disk in RAID5 can be reconstructed by accessing all the corresponding blocks on the surviving disks and exclusive-ORing (XORing) them. Assuming that disk loads are balanced due to striping, the read load on the surviving disks will be doubled when the system is operating in degraded mode. This may be unacceptable if disk utilizations in normal operating mode were already high. Upon a single disk failure in RAID6, a data block can be reconstructed by reading the corresponding data blocks and one of the check disks. In the case of two disk failures, all the corresponding blocks need to be read to reconstruct the blocks on the failed disks.

Clustered RAID or parity declustering is a solution to the disk overload problem, which was proposed for RAID5 arrays in Muntz and Lui [1990], but is also applicable to RAID6. Parity declustering disassociates the parity group size (denoted by G) from the number of disks in the array N, with G ≤ N. When all requests are reads, the increase in disk load is given by the declustering ratio: α = (G − 1)/(N − 1) < 1; for example, with N = 10 and G = 5 the load on the surviving disks increases by α = 4/9 ≈ 0.44, rather than doubling. Two methods to implement clustered RAID, Balanced Incomplete Block Designs (BIBDs) [Ng and Mattson 1994; Holland et al. 1994; Holland 1994] and Nearly Random Permutations (NRPs) [Merchant and Yu 1996], are described in Section 3.3.

Parity striping allocates data blocks sequentially on disk, but leaves enough space on each disk for parity [Gray et al. 1990], so that the user has control over file allocation to gear it to the application. For example, relational tables to be joined via table scans may be placed on independent disks for parallel access [Ramakrishnan and Gehrke 2003]. Disk load imbalance or access skew must then be dealt with by file placement. A technique to improve data placement, while minimizing data movement, is proposed in Wolf [1989]. A "disk cooling" algorithm for load redistribution uses "heat tracking" at the block level to balance disk loads [Scheuermann et al. 1994]. Data allocation in disk arrays has also been considered as a file placement problem, with files modeled as vectors with size and access rate as the two coordinates [Hill 1994].

Forum [Borowsky et al. 1997], Minerva [Alvarez et al. 2001], and the Disk Array Designer (DAD) [Anderson et al. 2005] (renamed from Ergastulum) are three generations of design tools for the "attribute mapping problem" [Golding et al. 1995], whose goal is to create a self-configuring and self-managing storage system. Data allocation is handled as a bin-packing problem. More specifically, online best-fit bin packing with random order [Kenyon 1996] is utilized in DAD. The Aqueduct online data migration policy [Lu et al. 2002], implemented in the context of Minerva, deals with changing access patterns, new applications
and resources, and equipment failures, which require the load to be rebalanced. Data migration is done online using a control-theoretic approach to meet QoS requirements.

RAID5 with a single disk failure, in addition to having poor performance in degraded mode operation, is susceptible to data loss if a second disk fails. This issue is dealt with by systematically reconstructing the contents of a failed disk on a spare disk. This requires reading successive rebuild units (RUs), for example, tracks, from the surviving disks, XORing them, and writing them onto a spare disk. RAID6, which tolerates two disk failures, is motivated by the fact that the rebuild process in RAID5 might fail due to a secondary disk failure, but this is rare unless disk failures are highly correlated. The rebuild process is more likely to fail due to uncorrectable sector errors, which are in the form of latent sector failures (LSFs) [Blaum et al. 1995; Kari 1997]. Whole-disk/whole-disk and whole-disk/media failure combinations may lead to disk array failure [Corbett et al. 2004]. Media/media failure combinations are of little concern since they are rare [Corbett et al. 2004]. In fact, the mean time to media failure is 30% of the mean time to disk failure, so that it is important to detect sector failures before they lead to data loss due to an unsuccessful rebuild process. Media failures can be dealt with by disk scrubbing, which systematically reads disk sectors and reconstructs and relocates unreadable blocks [Kari 1997; Schwarz et al. 2004; Iliadis et al. 2008]. While disk scrubbing reduces the occurrence of LSFs, it does not entirely eliminate them, so that 2DFTs are still required. IDR is a low-cost method to mask sector errors [Dholakia et al. 2008].

StorageTek (acquired by Sun in 2005, which is itself being acquired by Oracle in 2009) developed the Iceberg RAID6 array, which utilized RS coding and the Log-Structured Array (LSA) paradigm, discussed in the following [Chen et al. 1994; Friedman 1995]. HP's Surestore can be configured statically as either RAID 5DP (double parity) or RAID10 (mirroring with stripes). Surestore is a follow-on to HP's AutoRAID, which utilized RAID1 as a cache for RAID5 data, with RAID5 following the LSA paradigm [Wilkes et al. 1996].

RAID6 can also be implemented using specially designed parity codes such as EVENODD [Blaum et al. 1995], a variation of which is utilized in IBM's DS8000 array (this system can also be configured as RAID5 and RAID10), RDP (row-diagonal parity) [Corbett et al. 2004], utilized by NetApp, or the code in Blaum and Roth [1999]. All these codes are Maximum Distance Separable (MDS), defined in Section 2, with minimum distance 3 [MacWilliams and Sloan 1977]. This means that by using two parity columns (disks), they can recover the information lost in any two erased columns. The X-code, which is also MDS, places the parity in rows rather than in columns, but is only applicable to a prime number of disks [Xu and Bruck 1999]. While the EVENODD method also requires a prime number of disks, additional virtual disks can be added to reach a prime number [Blaum et al. 1995]. RM2 disk arrays place parities in a row, but they are not always MDS, that is, they might exceed the minimum level of redundancy [Park 1995].
Log-Structured File Systems (LFS) minimize the disk access bandwidth required for carrying out write requests [Rosenblum and Ousterhout 1992], since it is argued that with a large cache most read accesses are satisfied by the cache and it is the write requests that require disk accesses. NetApp's Write Anywhere File Layout (WAFL) is a similar concept [Hitz et al. 1994]. It provides snapshots, which are online backups, so that deleted files can be recovered right away. Data fragmentation is dealt with by scheduled defragmentation for randomly written data that needs to be read sequentially. LFS at the level of disk arrays, or the Log-Structured Array (LSA), is implemented as full stripe writes in RAID5 [Menon 1995; Wilkes et al. 1996] or RAID6, as in Iceberg [Chen et al. 1994], so that the check strips can be computed on-the-fly. This is especially attractive in RAID6, which requires two check strips to be updated.

Hierarchical RAID (HRAID) uses a hierarchy of RAID controllers [Baek et al. 2001]. RAID1/5 consists of mirrored RAID5s, while RAID5/1 is a RAID5 whose logical disks are mirrored. Multilevel RAID (MRAID) differs from HRAID in that it relies on logical associations among disk arrays [Thomasian 2006a], which may be storage nodes or bricks [Gray 2002; Fleiner et al. 2006]. Each brick, consisting of a controller, a cache, and multiple disks, constitutes the smallest replaceable unit (SRU). MRAID can deal with the failure of storage nodes, but also with disk failures inside each node, via replication or erasure coding. For example, in the case of RAID5/5 with N nodes and N disks per node, there are two sets of parities. P parities protect against single disk failures in each node, by accessing the surviving N − 1 disks in that node. Q parities are used to recreate the contents of the disks at a failed node by accessing the corresponding blocks at the N − 1 surviving nodes. The reliability of these RAID organizations is compared in Section 4.3.

1.3 Organization of the Article

In Section 2 we describe coding techniques applicable to 2DFTs: RS, EVENODD, RDP, and X-codes. RS has traditionally been associated with RAID6. The computational complexity of EVENODD was compared with RS in Blaum et al. [1995], so in this work we compare EVENODD with RDP. For the sake of completeness, we describe the organization and operation of RAID5 in detail before proceeding to RAID6 disk arrays in Section 3. This is followed by a brief description of LSA. We next describe two clustered RAID implementations based on BIBD and NRP layouts, and RM2, which is both a 2DFT and a clustered RAID [Park 1995]. Load imbalance in degraded mode operation in RAID6 and RM2 is briefly mentioned. In Section 4 we review recent studies to estimate overall disk drive reliability, as well as sector failures in RAID. Reliability analyses to study the effect of disk scrubbing and the intra-disk redundancy (IDR) scheme are discussed next. We conclude with a shortcut method for reliability analysis, which simplifies the comparison of certain disk array reliabilities. We review performance evaluation studies of disk systems and of RAID5 and RAID6 disk arrays in Section 5. Conclusions are presented in Section 6. Acronyms and abbreviations are given at the end of the article.
2. CODING FOR MULTIPLE DISK FAILURES

In this section, we describe different coding schemes to recover from multiple failures. From a coding point of view, the model corresponds to erasure correction, that is, the failed locations are known. This model is similar to the one of multitrack tape recording, which has been studied by different authors [Fuja et al. 1989; Patel 1985]. However, these last two references use a convolutional type of approach, which, among other problems, has infinite error propagation. The codes to be described in this section are of block type. Block codes do not have the problem of infinite error propagation, since an uncorrectable error is always limited by the size of the block. Another advantage of block codes over convolutional codes is that they do not have overhead parity at the end of the block.

We say that a code is binary if its symbols are bits, that is, elements of the binary field GF(2), as opposed to nonbinary codes, whose symbols are over larger fields like GF(256) (bytes). Binary block codes correcting erasures are presented in Hellerstein et al. [1994]. The problem with binary codes is that they are not Maximum Distance Separable (MDS). MDS codes meet the Singleton bound: in order to correct r erasures, they require only r parity disks. So, Hellerstein et al. [1994] suggested that research was needed to study nonbinary codes that are MDS, and we review such techniques in this section. Let us also point out Newberg and Wolf [1994], which optimizes the layout of disks for some of the two-dimensional schemes described in Hellerstein et al. [1994].

The most common class of MDS codes are Reed-Solomon (RS) codes [MacWilliams and Sloan 1977]. RS codes are based on finite field arithmetic. Since they are described in numerous books on coding theory, we omit their description here; for a tutorial, see Blaum [2005]. Another tutorial, from the point of view of the systems programmer, can be found in Plank [1997], although that paper has some errors and should be treated with caution (for instance, the dispersal matrix A does not have the desired properties). For a correction to Plank [1997], see Plank and Ding [2005] and also Plank [2005]. RS codes have been proposed and used for RAID architectures [Blaum and Ouchi 1994]. Some of the best implementations of RS codes for the correction of erasures are based on Cauchy matrices [Plank and Xu 2006].

The problem of finding MDS block array codes with two parities has been solved [Blaum 1987; Goodman and Sayano 1990; Goodman et al. 1993]. There, it is shown that an (m − 1) × m array code with horizontal and diagonal parity lines (with a toroidal topology) can correct any (m − 1)-bit column in error if and only if m is a prime. Of course the codes can also correct two erased columns, making them very attractive for disk array technology. We will study these MDS codes in the next subsection. Although such codes are not optimal in the number of updates or of encoding operations, they are important historically, since they introduce the concept of horizontal and diagonal parity: the rest of the codes to be presented use variations of this concept, and the proofs that they are MDS, as well as their recursive decoding algorithms, proceed along the same lines.

In Section 2.2 we shall discuss the EVENODD family of block array codes. The EVENODD codes have two parity columns that are independent of each other. This makes updating the array less cumbersome.

In Section 2.3 we present the Row-Diagonal Parity (RDP) family of codes, which is intermediate between those in the previous two sections and optimizes the number of XORs at the encoding. In Section 2.4 we discuss the problem of the optimal number of updates. In particular, we present a distributed parity scheme, the X-code. As can be seen, we concentrate on the problem of correcting two erasures, since it is the most important problem in practice. However, correction of multiple erasures is also interesting, both from a practical as well as a theoretical point of view. For instance, the codes of Section 2.1 have been extended to multiple parities by describing them as Reed-Solomon codes over particular polynomial rings [Blaum and Roth 1993]. The generalization to multiple parities of the EVENODD codes of Section 2.2 is treated in Blaum et al. [1996]. There, it is shown that the extension to three parities still gives an MDS code, but this is not true in general. The extension of the RDP codes presented in Section 2.3 to multiple parities, which remained an open problem, has recently been solved [Blaum 2006; Blaum et al. 2002; Fujita 2006]. The problem of multiple erasures is also treated in two other papers [Feng et al. 2005a; Feng et al. 2005b], although the codes described there are not MDS in general.

There are several other works that are relevant, although it is beyond the scope of this review to discuss each of them in detail. For instance, in Hafner [2005] the Weaver codes are presented; although such codes have several interesting properties from the implementation point of view, they are not MDS. Hafner et al. [2005] study the problem of failures that exceed the erasure-correcting capability of the code due to hard errors in localized sectors, and they propose matrix methods to recover from these errors. See also Hafner et al. [2008] for further techniques on the subject. In Hafner [2006], the so-called HoVer codes, based on horizontal and vertical parity, allow for recovery of up to 4 erasures, although they are not MDS either. For an analysis of LDPC (Low Density Parity-Check) codes for erasure correction, see Plank and Thomason [2007]. The so-called Liberation Codes [Plank 2008a] present good parameters for RAID6 applications, and in most cases approach optimal encoding. A special case of a minimum-density code with a codeword length of 8 is presented in Plank [2008b]. For a systematic code with a minimal number of 1s in its parity-check matrix, see Blaum and Roth [1999]. An experimental study to compare the performance of open source erasure codes for RAID6 is reported in Plank et al. [2009]. The performance metric is the encoding/decoding speed in MB/second, with RDP demonstrating the best performance.
In Section 2.3 we present the Row-Diagonal Parity (RDP) family of codes, that is intermediate between those in the previous two sections and optimizes the number of XORs at the encoding. In Section 2.4 we discuss the problem of the optimal number of updates. In particular, we present a distributed parity scheme, the X-code. As we can see, we concentrate on the problem of correcting two erasures, since it is the most important problem in practice. However, correction of multiple erasures is also interesting both from a practical as well as a theoretical point of view. For instance, the codes of Section 2.1 have been extended to multiple parities by describing them as Reed-Solomon codes over particular polynomial rings [Blaum and Roth 1993]. The generalization to multiple parities of the EVENODD codes of Section 2.2 is treated in Blaum et al. [1996]. There, it is shown that the extension to three parities still gives an MDS code, but this is not true in general. The extension of the RDP codes presented in Section 2.3 to multiple parities, which remained as an open problem, has been recently solved [Blaum 2006; Blaum et al. 2002; Fujita 2006]. The problem of multiple erasures is also treated in two other papers [Feng et al. 2005a; Feng et al. 2005b], although the codes described there are not MDS in general. There are several other works that are relevant, although it is beyond the scope of this review to discuss each of them in detail. For instance, in Hafner [2005] the Weaver codes are presented; although such codes have several interesting properties from the implementation point of view, they are not MDS. Hafner et al. [2005] study the problem of failures that exceed the erasurecorrecting capability of the code due to hard errors in localized sectors, and they propose matrix methods to recover from these errors. See also Hafner et al. [2008] for further techniques on the subject. In Hafner [2006], the so-called HoVer codes, based on horizontal and vertical parity, allow for recovery of up to 4 erasures, although they are not MDS either. For an analysis of LDPC (Low Density Parity-Check) codes for erasure-correction, see Plank and Thomason [2007]. The so-called Liberation Codes [Plank 2008a] present good parameters for RAID6 applications, and in most cases approach optimal encoding. A special case of a minimum density with a codeword length of 8 is presented in Plank [2008b]. For a systematic code with a minimal number of 1s in its parity-check matrix, see Blaum and Roth [1999]. An experimental study to compare the performance of open source erasure codes for RAID6 is reported in Plank et al. [2009]. The performance metric is the encoding/decoding speed in MB/second, with RDP demonstrating the best performance. 2.1 A Family of MDS Block Array Codes with Two Parities Assume that we have m tracks or disks. Let the information in each disk be divided into symbols, where a symbol is a binary vector (bit, byte, sector, etc.). For simplicity, we assume that each symbol is a bit. The codes we want to construct are defined by means of (m − 1) × m arrays (ai, j ) 0≤i≤m−2 , such that ai, j represents the i-th symbol in the j -th column (disk). 0≤ j ≤m−1
[Fig. 1. The two types of parity check lines of the code B2(5): horizontal parity lines (left) and diagonal parity lines (right).]
We will assume that the last two columns (disks) carry the parity. However, in order to avoid bottleneck effects when repeated "write" operations are involved, this distribution of the parity may be rotated for different blocks in the disk. We shall further add a fictitious zero row to the array, that is, a_{m−1,j} = 0 for 0 ≤ j ≤ m − 1. This is merely done to make the notation easier in the encoding procedure. With this convention, the arrays are now square m × m arrays. We define the code B2(m) as the set of arrays (a_{i,j})_{0 ≤ i,j ≤ m−1} with a_{m−1,j} = 0 for all j, satisfying the following two equations (⟨m⟩_n denotes the unique integer l such that 0 ≤ l < n and l ≡ m (mod n); for instance, ⟨7⟩_5 = 2 and ⟨−2⟩_5 = 3):

$$\bigoplus_{j=0}^{m-1} a_{i,j} = 0, \quad 0 \le i \le m-2, \qquad (1)$$

$$\bigoplus_{l=0}^{m-1} a_{\langle j-l\rangle_m,\, l} = 0, \quad 0 \le j \le m-1. \qquad (2)$$
Equations (1) and (2) define the encoding. There are two types of parity check equations: horizontal (as given by Equation (1)) and diagonal (as given by Equation (2)). The array has the topology of a torus, so the diagonals wrap around. The symbols 0, 1, 2, 3, and 4 in Figure 1 depict the horizontal and diagonal parity lines of the code when m = 5. Since a fictitious row has been added, we have a 5 × 5 array. Code B2(m) can trivially correct a single column of erased symbols; for that, the horizontal parity check symbols alone would already have been enough. Before we describe how to decode two erased columns and prove statements about it, we give one typical example.

Example 2.1. Consider the array on the left in Figure 2, which is a 5 × 5 array such that the last row is a fictitious all-zero row. It contains two erased columns: columns 1 and 4. To decode, start with either of the two columns, say column 1, and consider the diagonal through the known entry (4, 1) at the bottom of this column. All elements on this diagonal are known, except where it meets the other erased column. So this entry, which is a_{1,4}, can now be computed. It must be 1. With the horizontal parity, a_{1,1} can now also be computed. Its value is 0. This concludes the first cycle of the algorithm. So after row 4, which was completely known, now row 1 is also completely known. Starting with the diagonal going through position (1, 1), we find a_{3,4} = 1 (notice that 4 − (2)(3) ≡ 3 (mod 5)) and subsequently a_{3,1} = 1 (each time row j becomes known, row j − 3 modulo 5 becomes known next).
Fig. 2. An example of the decoding of two erasures.

Before decoding:        After decoding:
0 ? 1 0 ?               0 1 1 0 0
1 ? 1 1 ?               1 0 1 1 1
0 ? 0 1 ?               0 0 0 1 1
1 ? 0 1 ?               1 1 0 1 1
0 0 0 0 0               0 0 0 0 0
Continuing in this way we get a_{0,4} = 0 and a_{0,1} = 1, and finally, a_{2,4} = 1 and a_{2,1} = 0. We can now state the main theorem of this section.

THEOREM 2.1. Code B2(m) can correct up to two erased columns if and only if m is prime.

The requirement that m is a prime number is not a serious limitation in multitrack magnetic recording and disk array applications, since the code can always be shortened to any number of columns by assuming that a certain number of information columns are zero.

PROOF OF THE "ONLY IF" PART. Suppose that m is not prime and let d be a nontrivial divisor of m. Consider the array with ones in positions (i, 0) and (i, d), where i runs over all possible multiples of d, 0 ≤ i ≤ m − 1. This array can be viewed as the all-zero array in B2(m) with two erroneous columns added to it. However, this array is itself in B2(m), as can be easily checked. So when replacing columns 0 and d by erasures, there is no unique decoding.

We will prove the "if" part of the theorem by presenting a decoding algorithm for two erased columns. Observe that the encoding algorithm is the special case of the decoding algorithm in which u = m − 2 and v = m − 1.

Algorithm 2.1 (2-Erasure Decoding Algorithm for B2(m) Codes). Let m be a prime number and let (r_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m−1} be a received array, which is a word (a_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m−1} in B2(m) with two erased columns, say columns u and v with 0 ≤ u < v ≤ m − 1. First add a fictitious bottom row of zeros, that is, a_{m−1,j} = 0 for 0 ≤ j ≤ m − 1. Then proceed as follows:

s ← v − u
for l = 1 to m − 1 do
    compute a_{⟨−1−ls⟩_m, v} from the diagonal parity check equation through (⟨−1−(l−1)s⟩_m, u)
    compute a_{⟨−1−ls⟩_m, u} from the horizontal parity check equation through (⟨−1−ls⟩_m, v)
end
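The following is a minimal Python sketch of Algorithm 2.1; the representation of the received array as a list of m rows, with the fictitious zero row already appended and erased entries set to None, is an assumption made for this illustration only. Running it on the received array of Example 2.1 reproduces the decoded array of Figure 2.

def decode_two_erasures(a, u, v):
    """Recover erased columns u < v of an m x m array in B2(m), m prime.

    a is a list of m rows of m entries; row m-1 is the fictitious all-zero row,
    and the erased data entries of columns u and v may be None (they are never read).
    """
    m = len(a)
    s = v - u
    for l in range(1, m):
        # The diagonal through (<-1-(l-1)s>, u) has index j, where a cell (i, c)
        # lies on diagonal j if i + c = j (mod m); it meets column v at row i_v.
        i_v = (-1 - l * s) % m
        j = (i_v + v) % m
        a[i_v][v] = 0
        for c in range(m):
            if c != v:
                a[i_v][v] ^= a[(j - c) % m][c]     # diagonal parity check (Equation (2))
        a[i_v][u] = 0
        for c in range(m):
            if c != u:
                a[i_v][u] ^= a[i_v][c]             # horizontal parity check (Equation (1))
    return a

# Received array of Example 2.1 (columns 1 and 4 erased; last row is the fictitious zero row):
received = [[0, None, 1, 0, None],
            [1, None, 1, 1, None],
            [0, None, 0, 1, None],
            [1, None, 0, 1, None],
            [0, 0,    0, 0, 0   ]]
for row in decode_two_erasures(received, 1, 4):
    print(row)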
PROOF OF ALGORITHM 2.1 AND THE "IF" PART OF THEOREM 2.1. The proof follows from the following two observations. First, at each stage of the computations of a_{⟨−1−ls⟩_m, v} and a_{⟨−1−ls⟩_m, u} by means of the parity check equations, all entries involved are known except for the one to be computed. Second, since m is prime, the greatest common divisor of s = v − u and m is one, that is, gcd(s, m) = 1, so the multiples of s (mod m) cover all indices from 1 to m − 1. In other words, all erased values will be computed.

The above also shows that the minimum (column) distance of B2(m) is 3 when m is prime. Since the code can correct any two erased columns, it can also correct a single column in error. There is an algorithm for correcting one column in error, but since the RAID model deals with erasures only, we omit it here [Blaum 1987; Blaum et al. 1998].

The codes described above are MDS but very short: their length is roughly equal to the size of the symbol, while Reed-Solomon codes, also being MDS, are much longer. If the symbols have length m, Reed-Solomon codes have length up to 2^m. So, Reed-Solomon codes can always be used instead of B2(m) codes, but the converse is not necessarily true. However, B2(m) codes have lower complexity, since they involve exclusive-OR operations only, avoiding arithmetic over a Galois field. RAID architectures are an example of an application for which the length of the code is comparable to or smaller than the size of the symbols. Another application is multi-track magnetic recording [Patel 1985; Prusinkiewicz and Budkowski 1976]. For more on complexity, see Sections 2.2 and 2.3.

Let us point out that the construction with two parities described in this subsection can be extended to more parities [Blaum and Roth 1993]; in order to do that, however, the simple description of the code with parity lines of different slopes is not enough. An algebraic description of these codes as Reed-Solomon codes over the ring of polynomials modulo 1 + x + ··· + x^{m−1}, m a prime, is required. For reasons of space, we omit this description here and refer the reader to Blaum and Roth [1993] or to Blaum et al. [1998].

2.2 Codes for Correction of Two Erasures with Independent Parities

In Section 2.1, we presented a family of array codes that can correct any two erasures using horizontal and diagonal parity lines. The decoding of two erasures involves a simple recursion. The drawback of the construction is that the processes of encoding and of small write operations also involve a recursion. In applications like RAID architectures, the size of each individual symbol may be as big as a whole sector. As a consequence, we will want a minimal number of parity symbols affected by the update of a single information symbol. The scheme in Section 2.1 forces the updating of most of the parity symbols each time an information symbol is updated.

In this section, we present an efficient encoding procedure, called EVENODD [Blaum et al. 1995], that is based on exclusive-OR operations and independent parities. Since the parities are independent, there is no recursion during the encoding procedure. The decoding algorithm is slightly more complex than the one presented in the previous section. Since it uses the same principle, we will omit it here.
[Fig. 3. Parity lines of EO2(5): horizontal parity lines (left) and diagonal parity lines (right), with ∞ marking the special diagonal that defines S.]
As a result of the simple encoding procedure, a short write operation is greatly simplified, since any modified information symbol affects only two parity symbols most of the time.

We will assume that the elements of the code are (m − 1) × (m + 2) arrays (a_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m+1}, m a prime number, with the information stored in columns 0 to m − 1 and the parity stored in columns m and m + 1. We also assume, as in the previous subsection, that there is an imaginary 0-row after the last row, that is, a_{m−1,j} = 0 for 0 ≤ j ≤ m + 1 (with this convention, the array is now an m × (m + 2) array). Next we define the code as follows.

Definition 2.1. Given m prime, the code EO2(m) is the set of arrays (a_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m+1} such that, for each i, 0 ≤ i ≤ m − 2,
$$a_{i,m} = \bigoplus_{t=0}^{m-1} a_{i,t}, \qquad (3)$$

$$a_{i,m+1} = S \oplus \bigoplus_{t=0}^{m-1} a_{\langle i-t\rangle_m,\, t}, \qquad (4)$$

where S is defined by

$$S = \bigoplus_{t=1}^{m-1} a_{m-1-t,\, t}. \qquad (5)$$
Equations (3) and (4) define the encoding. As in the previous subsection, we have two types of parity: horizontal parity (as given by Equation (3)) and diagonal parity (as given by Equation (4)). Column m is simply the exclusive-OR of columns 0, 1, . . . , m − 1 (depicted on the left of Figure 3). Column (m + 1) carries the diagonal parity according to Equation (4). Notice that one diagonal is missing in (4). This diagonal consists of the entries (m − 2, 1), (m − 3, 2), . . . , (0, m − 1) (depicted by ∞ in Figure 3 on the right) and in fact it defines S. So it is this special diagonal that decides the parity of the other diagonals. In fact, if the special diagonal has EVEN parity then so do the others. Otherwise, they all have ODD parity. This is why this scheme is called the EVENODD scheme.
It also follows from the previous discussion that

$$\bigoplus_{i=0}^{m-2} a_{i,m} = \bigoplus_{i=0}^{m-2}\bigoplus_{j=0}^{m-1} a_{i,j} = S \oplus \bigoplus_{i=0}^{m-2} a_{i,m+1}.$$

In other words, the parity S of the special diagonal can also be computed as follows:

S = the binary sum of the elements in columns m and m + 1.  (6)
Code EO2(m) defined above can recover the information lost in any two columns, as can the B2(m) code defined in Section 2.1. Therefore, the minimum distance of the code is 3, in the sense that any nonzero array in the code has at least 3 nonzero columns. The proof relies on the fact that m is a prime number and is based on ideas similar to those in Section 2.1. Without this assumption, the resulting code does not have minimum distance 3, and it cannot retrieve any two erased columns. The next example illustrates the encoding for m = 5.

Example 2.2. Let m = 5, and let the symbols be denoted by a_{i,j}, 0 ≤ i ≤ 3, 0 ≤ j ≤ 6. The parity symbols are in columns 5 and 6. A practical implementation of this example is to consider 7 disks numbered 0 through 6, where each disk has 4 sectors; the data sectors are on disks 0, 1, 2, 3, and 4, and the parity sectors are on disks 5 and 6. Equation (5) gives S = a_{3,1} ⊕ a_{2,2} ⊕ a_{1,3} ⊕ a_{0,4}. According to Equations (3) and (4), the parity symbols are obtained as follows:

a_{l,5} = a_{l,0} ⊕ a_{l,1} ⊕ a_{l,2} ⊕ a_{l,3} ⊕ a_{l,4},  0 ≤ l ≤ 3,
a_{0,6} = S ⊕ a_{0,0} ⊕ a_{3,2} ⊕ a_{2,3} ⊕ a_{1,4},
a_{1,6} = S ⊕ a_{1,0} ⊕ a_{0,1} ⊕ a_{3,3} ⊕ a_{2,4},
a_{2,6} = S ⊕ a_{2,0} ⊕ a_{1,1} ⊕ a_{0,2} ⊕ a_{3,4},
a_{3,6} = S ⊕ a_{3,0} ⊕ a_{2,1} ⊕ a_{1,2} ⊕ a_{0,3}.

For instance, assume that we want to encode the 5 data columns below on the left (we add the imaginary 0-row). We have to fill up the last two columns with the encoded symbols. Notice that S = a_{3,1} ⊕ a_{2,2} ⊕ a_{1,3} ⊕ a_{0,4} = 1. Therefore, the diagonals will have odd parity. The encoding gives the array on the right:

Data array (with imaginary 0-row):    Encoded array:
1 0 1 1 0                             1 0 1 1 0 1 0
0 1 1 0 0                             0 1 1 0 0 0 0
1 1 0 0 0                             1 1 0 0 0 0 1
0 1 0 1 1                             0 1 0 1 1 1 0
0 0 0 0 0                             0 0 0 0 0 0 0
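The encoding of Example 2.2 can be reproduced with the following minimal Python sketch of Equations (3)–(5). The data layout (a list of m − 1 rows of m bits) is an assumption made for illustration, and the code mirrors the equations rather than an optimized implementation.

def evenodd_encode(data, m):
    """Encode an (m-1) x m data array into an (m-1) x (m+2) EO2(m) array, m prime."""
    rows = m - 1
    # Append two zero parity slots per row and a fictitious zero row to simplify indexing.
    a = [row[:] + [0, 0] for row in data] + [[0] * (m + 2)]
    S = 0
    for t in range(1, m):
        S ^= a[m - 1 - t][t]                      # Equation (5): the special diagonal
    for i in range(rows):
        for t in range(m):
            a[i][m] ^= a[i][t]                    # Equation (3): horizontal parity
        a[i][m + 1] = S
        for t in range(m):
            a[i][m + 1] ^= a[(i - t) % m][t]      # Equation (4): diagonal parity
    return a[:rows]

# The data array of Example 2.2 (rows 0..3, columns 0..4):
data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
for row in evenodd_encode(data, 5):
    print(row)   # parity columns 5 and 6 match the encoded array shown above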
Instead of giving the actual decoding algorithm of the EVENODD code EO2(m) for correction of two erasures, we give an example that illustrates the idea behind it.
Example 2.3. Consider EO2(5), as in Example 2.2, and assume that we have the array below at the left, in which columns (disks) 0 and 2 have been erased (lost):

Received array (columns 0 and 2 erased):    Decoded array:
? 0 ? 1 0 1 1                               0 0 0 1 0 1 1
? 1 ? 0 0 0 1                               1 1 0 0 0 0 1
? 1 ? 0 0 1 1                               0 1 0 0 0 1 1
? 1 ? 1 1 0 0                               1 1 0 1 1 0 0
0 0 0 0 0 0 0
This is the main case for the decoding algorithm: two information columns have been erased. The cases in which at least one of the two parity columns has been erased are special cases that are easy to handle. The first step is finding the parity S of the diagonals. This value follows directly from (6). In the array above, we can see that the exclusive-OR of the bits in the two parity columns is 1; therefore the diagonals have odd parity. Once S has been established, the decoding proceeds very much like Algorithm 2.1 for B2(m) codes. Next, the algorithm starts a recursion to retrieve the missing bits a_{l,0} and a_{l,2}, 0 ≤ l ≤ 3. We first need an entry where we can start. Either entry (4, 0) or (4, 2) would serve this purpose. For instance, taking the diagonal through (4, 0) (that is, entries (4, 0), (3, 1), (2, 2), (1, 3), and (0, 4), together with the "fictitious" parity entry (4, 6)), we see that only one entry is unknown: a_{2,2}. Since the diagonal has parity S = 1, we find a_{2,2} = 0. Using the horizontal (even) parity, we obtain a_{2,0} = 0. Next, we consider the diagonal going through entry (2, 0), and so on. In this way we find successively: a_{0,2} = 0, a_{0,0} = 0, a_{3,2} = 0, a_{3,0} = 1, a_{1,2} = 0, a_{1,0} = 1. The result is depicted above on the right.

The codes EO2(m) can be extended to codes EOr(m) that consist of (m − 1) × (m + r) arrays with m information columns and r parity columns, each parity column encoded independently. More precisely, given m, we say that the code EOr(m) is the set of arrays (a_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m+r−1} such that, for each l, 0 ≤ l ≤ m − 2,
and for each s, 1 ≤ s ≤ r − 1,

$$a_{l,m} = \bigoplus_{t=0}^{m-1} a_{l,t}, \qquad (7)$$

$$a_{l,m+s} = S_s \oplus \bigoplus_{t=0}^{m-1} a_{\langle l-st\rangle_m,\, t}, \qquad (8)$$

where

$$S_s = \bigoplus_{t=1}^{m-1} a_{\langle -1-st\rangle_m,\, t}. \qquad (9)$$
Therefore, EOr(m) has even parity over horizontal lines, as given by Equation (7), and even or odd parity over the lines of slope s, 1 ≤ s ≤ r − 1, as given by Equation (8).
[Fig. 4. Parity lines of code RDP2(5).]
Bit S_s, given by Equation (9), determines the parity of the line of slope s. It can be proven that if m is prime, then the code EO3(m) is MDS, that is, it has minimum (column) distance 4. However, for r ≥ 4, this is in general not true any more. For details on the generalization of EVENODD, we refer the reader to Blaum et al. [1996]. For a specific architecture that generalizes EVENODD to three erasures, see Huang and Xu [2005].

Regarding complexity, it was shown in Blaum et al. [1995], by counting the number of encoding operations, that the EVENODD code has lower complexity than a corresponding RS code. There are several methods to implement RS codes that improve complexity, but they are always more complex than array codes based on simple parity. The RDP codes to be presented in the next section minimize the number of encoding operations.

2.3 The Row-Diagonal Parity (RDP) Codes

The RDP codes [Corbett et al. 2004] are an intermediate family of codes between the codes of Sections 2.1 and 2.2. Given a prime number m, the codes take m − 1 information columns of length m − 1 each, to which again two parity columns are added: a horizontal parity column and a diagonal parity column. Thus, the code is an (m − 1) × (m + 1) array. The horizontal parity column takes horizontal parity only (without including the diagonal parity column). The diagonal parity column takes the diagonal parity including the horizontal parity column; that is, the parity lines are not independent as they are in the EVENODD code, and the horizontal parity has to be computed first. The horizontal and diagonal parity lines of an RDP code with m = 5 are illustrated in Figure 4. There is a diagonal that can be ignored at the encoding. The diagonal parities are always even, as opposed to EVENODD, whose diagonal parities are either all even or all odd. This allows for some saving in the number of XORs at the encoding. Moreover, it was proven in Corbett et al. [2004] that this number of XORs (for the dimensions of the code) is minimal under the assumption that the code is MDS.

Let us write explicitly the encoding equations. Given m prime, the code RDP2(m) is the set of arrays (a_{i,j})_{0 ≤ i ≤ m−2, 0 ≤ j ≤ m} such that (assuming, as usual, that
am−1, j = 0) for each i, 0 ≤ i ≤ m − 2,
ai,m−1 =
m−2
ai,t
(10)
ai−tm ,t
(11)
t=0
ai,m =
m−1 t=0
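Similarly, here is a hedged sketch of the RDP_2(m) encoding of Equations (10) and (11); identifiers are ours, and the fictitious row m − 1 is taken to be all zeros.

```python
def rdp2_encode(info):
    """Encode an (m-1) x (m-1) information array per Equations (10)-(11), m prime."""
    m = len(info) + 1
    a = [row[:] + [0, 0] for row in info]   # append columns m-1 (row parity) and m (diagonal)
    for i in range(m - 1):
        for t in range(m - 1):              # Equation (10): parity of the information row
            a[i][m - 1] ^= a[i][t]
    for i in range(m - 1):
        for t in range(m):                  # Equation (11): diagonal includes column m-1
            r = (i - t) % m
            if r != m - 1:                  # row m-1 is the fictitious all-zero row
                a[i][m] ^= a[r][t]
    return a

# Example: m = 5, a 4 x 4 information array.
print(rdp2_encode([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 0]]))
```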
Fig. 5. Parity lines of code RDP3 (5).
The decoding of two erased columns proceeds similarly to the decoding of the codes described in Sections 2.1 and 2.2. As stated above, the RDP codes minimize the number of XORs required for the encoding operation. A generalization of RDP codes was obtained independently in Blaum [2006] and Fujita [2006]. This generalization defines a code RDP_r(m), m prime, r ≥ 2, of length m + r − 1, where the horizontal parity line (slope 0) is given by (10), while the parity line of slope s, 1 ≤ s ≤ r − 1, is given by

a_{i,m+s−1} = ⊕_{t=0}^{m−1} a_{⟨i−st⟩_m, t}                            (12)
The parity lines of code RDP_3(5) are illustrated in Figure 5. Interestingly, it turns out that the codes RDP_r(m) are MDS when the generalized EVENODD codes EO_r(m) defined in the previous subsection are. For instance, when r = 3, RDP_3(m) is MDS for any prime number m. However, for r ≥ 4, this is not true in general, but depends on the prime number m considered. For details of the connection between EO_r(m) and RDP_r(m) see Blaum [2006].

2.4 Short Write Operations

An important feature of codes for disk arrays is the short write operation. As stated at the beginning of Section 2.2, we assume that a short write is an update of only one information bit in the information columns. The EVENODD code EO_2(m) requires the updating of two additional bits in the parity columns in most cases, for a total of 3 updates. However, when the bit to be updated occurs in the special diagonal, then all of the bits in the second parity column need to be updated. We can easily verify that on average, EO_2(m) requires 4 − 2/m updates [Blaum et al. 1996]. The RDP code described in Section 2.3 also requires an average of 4 − 2/m updates. However, a lower bound on the number of updates for systematic codes is 3 + 1/m [Blaum and Roth 1999]. Moreover, this last reference shows a construction achieving this lower bound. Notice that in the codes defined so far, information bits and parity bits are in different columns. In coding theory, such codes are called systematic. If we drop the condition that the codes are systematic, then the situation changes. There are MDS codes meeting the lower bound of 3 updates if we allow the parity to be distributed among all columns. In other words, each column contains both information and parity bits; there are no columns dedicated exclusively to parity. To the best of our knowledge, the first paper presenting a construction of MDS codes with distributed parity and 3 updates is Zaitsev et al. [1983]. This result and its generalizations were also studied in Blaum and Roth [1999], Xu and Bruck [1987], and Xu et al. [1999].
Fig. 6. The two types of parity check lines of the X-code: lines of slope 1 and lines of slope −1.
In particular, the X-code [Xu and Bruck 1999] provides an easy array code for distributed parity using two diagonal parity lines. Many ad-hoc constructions are described in Baylor et al. [1987], Park [1995], and Xu et al. [1999]. In particular, this latest reference shows that finding optimal distributed parity codes for the correction of two columns is equivalent to a well-known problem in graph theory: finding graphs that have a perfect factorization. Distributed parity has advantages over dedicated parity. Certainly, there are no bottlenecks in a distributed parity scheme: no disk receives more accesses than the others, so there is no need to rotate the parity columns. However, distributed parity codes cannot be shortened in general. That forces us to choose a code for a certain number of disks, and the code will have to be changed if this number of disks changes in the future. In other words, the system is not scalable. For instance, the X-code consists of m columns, m a prime. If we want to implement the X-code in a real system, we have to demand that the system contains a prime number of disks, a requirement that may be too restrictive.

Let us finish this section by describing the X-code, which illustrates the concepts behind distributed parity and is much in the spirit of the codes that have been described so far. Consider the X-code with m = 5. The parity lines are described in Figure 6. Notice that the parities are independent. Now, assume that we want to encode the array

0 1 1 0 1
1 0 0 1 0
0 0 1 0 1

As we can see, each column contains 3 bits of information and two bits of parity, so the code is not systematic on columns (it is systematic on bits) and the parity is distributed. Using the lines of slope 1 following Figure 6, we obtain

0 1 1 0 1
1 0 0 1 0
0 0 1 0 1
0 1 0 1 1
Similarly, using the lines of slope −1, the second parity row is obtained:

0 1 1 0 1
1 0 0 1 0
0 0 1 0 1
1 0 0 1 1

The final encoded array is then

0 1 1 0 1
1 0 0 1 0
0 0 1 0 1
0 1 0 1 1
1 0 0 1 1
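The following sketch reproduces the m = 5 example above. The diagonal offsets below are one concrete indexing choice of ours that matches the arrays shown; see Xu and Bruck [1999] for the formal definition of the parity lines.

```python
# X-code example, m = 5: information in rows 0..2, parity in rows 3 and 4.
m = 5
info = [[0, 1, 1, 0, 1],
        [1, 0, 0, 1, 0],
        [0, 0, 1, 0, 1]]

row_p1 = [0] * m          # parity row along one diagonal direction ("slope 1")
row_p2 = [0] * m          # parity row along the other direction ("slope -1")
for i in range(m):
    for k in range(m - 2):
        row_p1[i] ^= info[k][(i - k - 2) % m]
        row_p2[i] ^= info[k][(i + k + 2) % m]

for row in info + [row_p1, row_p2]:
    print(*row)
# Prints the final encoded array of the text:
# 0 1 1 0 1 / 1 0 0 1 0 / 0 0 1 0 1 / 0 1 0 1 1 / 1 0 0 1 1
```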
Regarding short write operations, each time a single information bit is modified, then exactly two parity bits need to be modified as well, achieving the lower bound. The decoding of two erased columns is done similarly to the other algorithms presented. For details, see Xu and Bruck [1999]. 3. RAID DISK ARRAYS Operation of RAID5 and RAID6 disk arrays in normal and degraded modes is discussed, followed by a description of LSA. We describe two techniques to allocate data and parity blocks in clustered RAID: BIBD [Holland et al. 1994; Ng and Mattson 1994] and NRP [Merchant and Yu 1996]. RM2 in addition to being a 2DFT is a clustered RAID [Park 1995]. Load imbalance in RAID6 and RM2 disk arrays in degraded mode operation is identified and approaches to alleviate it are discussed. We conclude with a discussion of rebuild processing in RAID5. 3.1 RAID5 and RAID6 Organization and Operation A RAID4 disk array with N disks utilizes a single parity SU (stripe unit) per stripe (row), so that the redundancy level is 1/N . The RAID4 organization, which dedicates one disk to parity, out of N disks, has two disadvantages: (i) The bandwidth of the parity disk is not available for read accesses. (ii) The parity disk may become a bottleneck for a write intensive workload. These two shortcomings of RAID4 are alleviated in RAID5 by distributing the parity blocks evenly over all N disks. Of the numerous data layouts considered in Lee and Katz [1993] the left-symmetric layout, which has desirable properties over others, is shown in Figure 7 for both RAID5 and RAID6 disk arrays. Note that SUs in RAID5 are numbered in such a way that maximum parallelism can be achieved in data access, i.e., N consecutive SUs can be read from N independent disks. Two out of N SUs in a stripe are utilized as check disks in RAID6 and the redundancy level is 2/N . A check block allocation method similar to RAID5 is utilized in RAID6, except that there are two right-to-left diagonals. ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
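As an illustration of the left-symmetric layout just described, the following sketch uses a closed-form mapping of our own (not taken from Lee and Katz [1993]) that reproduces the RAID5 layout of Figure 7 for N = 6 disks.

```python
# Left-symmetric RAID5 mapping (our formulation); reproduces Figure 7 for N = 6.
def left_symmetric(stripe, k, N):
    """Disks holding the k-th data SU (k = 0..N-2) and the parity SU of a stripe."""
    parity_disk = (N - 1 - stripe) % N
    data_disk = (parity_disk + 1 + k) % N
    return data_disk, parity_disk

N = 6
for stripe in range(6):
    row = [""] * N
    first = stripe * (N - 1)                      # first data SU number in the stripe
    row[(N - 1 - stripe) % N] = f"P{first}-{first + N - 2}"
    for k in range(N - 1):
        disk, _ = left_symmetric(stripe, k, N)
        row[disk] = f"D{first + k}"
    print(row)
```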
SUs holding data are denoted by D. The parity SU is denoted by P in RAID5, and the two check SUs in RAID6 are denoted by P and Q. Data and parity blocks are denoted by d, p, and q, for example, d_0 ∈ D0. For RAID5 with N = 6 disks, the parity SU for the first row, which is stored on the sixth disk, is computed as: P0-4 = D0 ⊕ D1 ⊕ D2 ⊕ D3 ⊕ D4. RAID1 may be considered a special case of RAID5 with a 50% redundancy level and even parity. Updates to small data blocks are a matter of concern in RAID5, but also in RAID6, because of the high overhead of keeping the corresponding parity blocks up-to-date. This is referred to as the small write penalty [Chen et al. 1994]. Several techniques to deal with this problem are listed in Section 4.1 in Chen et al. [1994]: (i) buffering and caching allows fast writes [Menon and Cortney 1993], but in addition the updating of the disk blocks can be deferred; (ii) floating parity allocates extra space for parity blocks, so that parity blocks can be written with minimal rotational latency [Menon et al. 1993]; (iii) parity logging logs the difference between the old and modified data blocks onto disk, sorts the log according to increasing logical block addresses (LBAs) of parity blocks to be updated, and then updates the parities [Stodolsky et al. 1994]. Variable scope parity protection is yet another method to reduce the overhead of the small write penalty [Franaszek et al. 1996]. The system keeps track of empty disk blocks, which are assumed to hold zeroes, so that there is no need to read them for updating parities. The effect of this method on performance is quantified in Franaszek and Robinson [1997]. Full stripe writing is another method to minimize the parity update overhead. For example, P0-4 can be computed on-the-fly as the corresponding data SUs are written onto consecutive disks. Full stripe writes are usually limited to batch applications, but may be induced by the LSA paradigm [Menon 1995] (see Section 3.2).

The small write penalty in RAID5 can be explained by considering the updating of block d_0 ∈ D0 from d_0^old to d_0^new, when the old data and parity blocks are not cached. This requires four disk accesses to update the parity p_{0-4}: (i) read the old data: d_0^old; (ii) write the new data: d_0^new to replace d_0^old; (iii) read the old parity: p_{0-4}^old; (iv) compute the new parity as p_{0-4}^new = p_{0-4}^old ⊕ d_0^old ⊕ d_0^new and write it. If the parity can be computed at the disks, then the value of a modified data block, say d_0^new, and its address and length (LBA and number of sectors) can be sent to the respective disk. The disk reads the old data block (d_0^old) and XORs it with the new block to compute the difference block d_0^diff = d_0^old ⊕ d_0^new, which is sent to the parity disk. The disks holding data and parity blocks should be placed on different buses for higher reliability [Gibson 1992] (see Option 2 in Figure 7 for orthogonal RAID in Chen et al. [1994]). A more reliable interconnection structure, where each disk is connected to multiple buses for higher reliability, is described in Ng [1994b]. The parity is updated at the disk according to p_{0-4}^new = p_{0-4}^old ⊕ d_0^diff. Higher data integrity is attained by carrying out the parity calculations at the DAC, but this requires four disk accesses [Thomasian et al. 2007b]. RMW (read-modify-write) requests are more efficient than two individual requests to read and update the data and parity blocks, since the updating is done one disk rotation after the block is read.
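The four-access small-write procedure above amounts to two XOR identities; a minimal sketch with illustrative block values (names and data are ours) follows.

```python
# RAID5 small-write parity update; blocks are modeled as integers XORed bitwise.
def rmw_update(p_old, d_old, d_new):
    """Read-modify-write: p_new = p_old XOR d_old XOR d_new."""
    d_diff = d_old ^ d_new            # the difference block sent to the parity disk
    return p_old ^ d_diff

def reconstruct_write(d_new, other_data):
    """Reconstruct write (RCW): recompute the parity from the new block and the
    unmodified data blocks of the stripe."""
    p = d_new
    for d in other_data:
        p ^= d
    return p

data = [0b1010, 0b0111, 0b0001, 0b1100, 0b0110]          # d0..d4 (illustrative)
parity = data[0] ^ data[1] ^ data[2] ^ data[3] ^ data[4]  # p0-4
d0_new = 0b0011
# The two update methods yield the same new parity:
assert rmw_update(parity, data[0], d0_new) == reconstruct_write(d0_new, data[1:])
```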
     Disk 0   Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
0    D0       D1       D2       D3       D4       P0-4
1    D6       D7       D8       D9       P5-9     D5
2    D12      D13      D14      P10-14   D10      D11
3    D18      D19      P15-19   D15      D16      D17
4    D24      P20-24   D20      D21      D22      D23
5    P25-29   D25      D26      D27      D28      D29
     RAID Level 5 with Left-Symmetric Layout

     Disk 0   Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
0    D0       D1       D2       D3       P0-3     Q0-3
1    D5       D6       D7       P4-7     Q4-7     D4
2    D10      D11      P8-11    Q8-11    D8       D9
3    D15      P12-15   Q12-15   D12      D13      D14
4    P16-19   Q16-19   D16      D17      D18      D19
5    Q20-23   D20      D21      D22      D23      P20-23
     RAID Level 6: P+Q parity

Fig. 7. Data and redundancy organization in RAID5 and RAID6. Shaded blocks are parities. D0 to D4 are the five data SUs and P0-4 is the corresponding parity SU. In the case of RAID6, P0-3 and Q0-3 are the two check SUs corresponding to D0 - D3.
RMW requests for data and parity can be issued simultaneously, but this has the disadvantage that d_0^diff (the difference block) may not be available in time to update the parity block. This may result in an unnecessary rotation on the parity disk and also a delay in completing the update. Opportunistic or freeblock accesses to other blocks are possible while the disk rotates [Lumb et al. 2002]. The after read-out policy [Chen and Towsley 1996] (as shown in Figure 8) initiates the access to the parity disk only after d_0^diff is computed [Menon 1994; Thomasian and Menon 1994; Thomasian and Menon 1997]. This method is efficient in that there is no wasted processing time, but may require more time to complete than the first method. A third method is the before service policy [Chen and Towsley 1993], which initiates the updating of the parity block when the request for the data block reaches the head of the queue. As in the first method, it is possible for d_0^diff not to be available in time to update the parity. The Reconstruct Write (RCW) method computes the parity as: p_{0-4}^new = d_0^new ⊕ d_1 ⊕ d_2 ⊕ d_3 ⊕ d_4. The RMW and reconstruct write methods of processing updates are compared in Thomasian [2005a].

Operation with one failed disk in RAID5 is referred to as degraded mode operation, as opposed to normal mode operation [Menon 1994; Thomasian and Menon 1994]. Small blocks on the failed disk can be reconstructed on demand by reading and XORing the corresponding blocks on surviving disks. For example, if disk 0 fails, block d_0 can be reconstructed as: d_0 = d_1 ⊕ d_2 ⊕ d_3 ⊕ d_4 ⊕ p_{0-4}. To reconstruct d_0 on demand, the DAC initiates an (N − 1)-way fork-join request to the corresponding blocks on surviving disks. All N − 1 disk accesses need to be completed before d_0 can be reconstructed by XORing them in the join phase of the fork-join request. Each surviving disk processes its own read requests, plus fork-join requests to reconstruct blocks on the failed disk, so that the read load at the surviving disks is doubled. The higher disk utilization results in an increase in disk response time, which is quantified in Thomasian et al. [2007b]. There are two cases for small writes in degraded mode: (i) The disk at which the parity block corresponding to the data block to be updated resides has failed. For example, if d_0 is to be updated and the sixth disk, at which p_{0-4} resides, has failed, then only d_0 needs to be updated. (ii) The disk at which the data block to be updated resides has failed.
For example, if the first disk fails, then d_0^old, which is needed to compute the new parity, is not available. d_0^old can be reconstructed by invoking an (N − 1)-way fork-join request, but it is more efficient to compute the parity directly with an (N − 2)-way fork-join request: p_{0-4}^new = d_0^new ⊕ d_1 ⊕ d_2 ⊕ d_3 ⊕ d_4. The rebuild process systematically reconstructs the contents of a failed disk on a spare disk, while the processing of user requests is in progress. If the first disk fails, then its first SU is reconstructed as follows: D0 = D1 ⊕ D2 ⊕ D3 ⊕ D4 ⊕ P0-4 (capital Ds here refer to RUs (rebuild units) rather than SUs). Rebuild processing in RAID5 is discussed in detail in Section 3.7.

The check SUs P and Q in RAID6 are placed in parallel right-to-left repeating diagonals across all disks, as shown in Figure 7. To update a small block, two check blocks need to be updated, which is the minimal level of overhead for 2DFTs, but double the overhead in RAID5. The reconstruct write method may be used to minimize disk accesses in this case. The intra-disk redundancy (IDR) coding scheme, which can be utilized in addition to RAID5 and RAID6, is proposed in Dholakia et al. [2008]. The redundancy is applied to segments consisting of n data and m redundant sectors with ℓ = n + m, so that the storage efficiency is s_e = 1 − m/ℓ. With single parity check (SPC) coding the check sectors in the segment should be updated when a sector in the segment is modified. Two sectors need to be updated with RS coding. In interleaved parity-check (IPC) coding, consecutive sectors are logically placed in a √ℓ × √ℓ square matrix and the parities are computed one column at a time. Check sectors are placed in the center of the segment and the whole segment is read in one access, so that the number of sectors to be read is reduced. A simulation study is reported in Dholakia et al. [2008] to compare the performance of IDR with RAID5 and RAID6 methods.

3.2 Log-Structured Arrays (LSA)

The LSA paradigm, which was first utilized in StorageTek’s Iceberg RAID6 disk array, alleviates the small write penalty for updating two check blocks [Chen et al. 1994; Friedman 1995]. LSA is used for the RAID5 component of AutoRAID [Wilkes et al. 1996], but the term was coined in Menon [1995], also in the context of a RAID5 array. LSA is an extension of LFS (log-structured file system) [Rosenblum and Ousterhout 1992], which buffers modified files in main memory rather than writing them to disk individually. As a write buffer fills up, its contents are written to disk with minimum overhead, for example, a single seek to write a whole cylinder. While this makes writing to disk efficient, there is the overhead of garbage collecting the “holes” created by the old versions of the files and updating the file directories. The WAFL (Write Anywhere File Layout) has similarities to LFS, which was discussed in Section 1. LFS attains efficiency by sequentially writing large chunks of data and associated metadata to eliminate seeks for writing data on disk, since it is postulated that few read accesses are required, because such accesses are eliminated due to cache hits. A comparison of LFS with the fast UNIX file system [McKusick et al. 1984] has shown that either file system can be superior to the other [Seltzer et al. 1993], but there is a disagreement about the conclusions of this study.
This problem is studied using the linear and hierarchical reuse model in McNutt [1994]. LSA carries out full stripe writes after the DAC cache is filled with N − 1 SUs in RAID5 and N − 2 SUs in RAID6, in order to compute the single or the two check SUs on-the-fly efficiently. To create empty stripes, live data blocks are extracted from stripes and intermixed with newly created data blocks [Friedman 1995]. An analytic study in Menon [1995] concludes that LSA outperforms basic RAID5, but it is argued in McNutt [1994] that stripe (segment) reads and writes for garbage collection should be interruptible. This will however affect LSA efficiency. Garbage collection in the Iceberg is addressed in Friedman [1995], but also analyzed in McNutt [1994]. LSA, as described in Menon [1995], allows data to be stored in compressed form, for example, using the Lempel-Ziv (LZ) algorithm. This is possible because data is not written in place, since updated data may not compress as well as it did before it was modified. The relationship between the average segment occupancy (ASO) and the best segment occupancy (BSO), that is, the segment with the lowest utilization, is given as ASO = (1 − BSO)/(−ln(BSO)) (ln denotes the natural logarithm). This implies that the garbage collection overhead decreases with the LSA occupancy. The age-threshold LSA garbage collection algorithm [Menon and Stockmeyer 1998; Stockmeyer 2001] is compared in these works against LFS’s greedy and age-threshold algorithms [Rosenblum and Ousterhout 1992]. It is argued that segments that have reached the age threshold are not likely to become emptier due to future writes. The fitness algorithm has two parts [Butterworth 1999]: (i) a segment filling algorithm sorts tracks into segments during destage; (ii) an algorithm for free-space collection based on the criterion: segment age × (free space)² / user space. Simulation results have shown that the new algorithm is more efficient than age-threshold from the viewpoint of free space collected per segment [Butterworth 1999].

3.3 Clustered RAID Data Layouts

Clustered RAID uses a parity group of size G, which is smaller than the number of disks in the array (G < N). The motivation for clustered RAID is that in order to reconstruct a block in degraded mode (with one failed disk) we need to access G − 1 rather than N − 1 surviving disks, so that compared to a regular RAID5 the increase in disk load in processing read requests is given by the declustering ratio α = (G − 1)/(N − 1) [Muntz and Lui 1990]. Access costs in clustered RAID are obtained in Thomasian [2005b]. The check blocks for RAID5 and RAID6 are distributed evenly across all disks by using the left-symmetric data layout [Lee and Katz 1993], but more sophisticated techniques are required for clustered RAID (other requirements are given in Section 3.4). For a given N and G there are \binom{N}{G} different groups multiplied by G (for the position of the parity block), so that a rather large table will be required to obtain the addresses of the parity and buddy blocks [Muntz and Lui 1990]. Muntz and Lui [1990] suggest that this is a combinatorial design problem, but no solution is provided.
Table I. BIBD Data Layout with N = 10 Disks and Parity Group Size G = 4

Disk #            1   2   3   4   5   6   7   8   9  10
Parity Group #    1   1   2   3   1   2   3   1   2   3
                  2   4   4   4   5   6   7   5   6   7
                  3   6   5   5   8   8   8   9  10  11
                  4   7   7   6  10   9   9  12  12  12
                  8   9  10  11  11  11  10  14  13  13
                 12  13  14  15  13  14  15  15  15  14
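As a check on the layout in Table I, the following sketch (our code) verifies the BIBD properties discussed in the text: r = 6 parity groups per disk, L = 2 groups shared by any pair of disks, b = 15 groups in total, and each group spanning G = 4 disks.

```python
from itertools import combinations

disks = [                       # parity groups mapped to each of the 10 disks (Table I)
    {1, 2, 3, 4, 8, 12}, {1, 4, 6, 7, 9, 13}, {2, 4, 5, 7, 10, 14},
    {3, 4, 5, 6, 11, 15}, {1, 5, 8, 10, 11, 13}, {2, 6, 8, 9, 11, 14},
    {3, 7, 8, 9, 10, 15}, {1, 5, 9, 12, 14, 15}, {2, 6, 10, 12, 13, 15},
    {3, 7, 11, 12, 13, 14},
]
assert all(len(d) == 6 for d in disks)                          # r = 6 groups per disk
assert all(len(a & b) == 2 for a, b in combinations(disks, 2))  # L = 2 shared groups
groups = [g for d in disks for g in d]
assert len(set(groups)) == 15                                   # b = 15 parity groups
assert all(groups.count(g) == 4 for g in set(groups))           # each group on G = 4 disks
print("BIBD properties verified")
```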
A data layout based on BIBD (balanced incomplete block designs) [Hall 1986] was proposed by two independent groups: one at IBM Almaden [Ng and Mattson 1994] and another at CMU [Holland et al. 1994; Holland 1994]. The NRP (nearly random permutation) data layout is an alternative to BIBD to balance the load for updating parities [Merchant and Yu 1996]. In Section 3.5 we describe the RM2 disk array, which is a clustered RAID as well.

3.3.1 Balanced Incomplete Block Designs (BIBDs). A BIBD data layout is a grouping of N distinct objects into b blocks, such that each block contains G objects, each object occurs in exactly r blocks, and each pair of objects appears in exactly L blocks [Hall 1986]. Only three out of the five variables are free, since bG = Nr and r(G − 1) = L(N − 1). For N = 10 and G = 4, the number of parity groups is b = 15, the number of domains (different parity groups) per disk is r = 6, and the number of parity groups common to any pair of disks is L = 2. This is shown in Table I, adopted from Ng and Mattson [1994]. BIBD tables can be derived from the BIBD designs given in Hall [1986], but these designs are not available for all values of N and G; for example, a layout for N = 33 with G = 12 does not exist, but G = 11 or G = 13 can be used instead [Holland et al. 1994].

3.3.2 Nearly Random Permutations (NRPs). Disk array space is organized as an M × N matrix, with M rows corresponding to stripes and N columns corresponding to disks. Parity groups of size G < N are placed sequentially, row-first, on the M × N elements of the matrix, so that parity group i occupies SUs iG through iG + (G − 1). In the case of RAID5 (resp. RAID6) the P (resp. P and Q) parities can be consistently assigned as the last SU (resp. the last two SUs) in each parity group. The initial allocation with N = 10 and G = 4 is shown in Table II, where Pi-j stands for the parity of SUs Di through Dj. The parity SUs in the example appear on only one half of the disks, so that the parity update load is not balanced. The problem is alleviated by randomizing the placement of blocks on the disks, so that approximately the same number of parity blocks are allocated per disk. The row number I is used as a seed to a pseudo-random number generator to obtain a random permutation of {0, 1, . . . , N − 1}, given as P_I = {P_0, P_1, . . . , P_{N−1}}. The permutation may be generated using Algorithm 235 [Durstenfeld 1964], which is given below in simplified form. Consider an array A with N elements and set n = N.
Table II. Allocation of Parity Groups Before Permutation (N = 10, G = 4)

Disk #                0    1       2    3       4    5       6    7       8    9
Parity groups         D0   D1      D2   P0-2    D3   D4      D5   P3-5    D6   D7
(initial allocation)  D8   P6-8    D9   D10     D11  P9-11   D12  D13     D14  P12-14
                      D15  D16     D17  P15-17  D18  D19     D20  P18-20  D21  D22
                      D23  P21-23  D24  D25     D26  P24-26  D27  D28     D29  P27-29
Table III. Permuted Data Blocks with the Nearly Random Permutation Method (N = 10, G = 4). Note that Only Two Rows are Shown

Disk #            0    1       2    3     4    5    6     7    8       9
Final allocation  D0   D4      D3   P3-5  D6   D5   P0-2  D2   D7      D1
                  D8   P9-11   D11  D13   D14  D12  D10   D9   P12-14  P6-8
L: Pick a random number k between 1 and n. If k ≠ n, then swap A_n ↔ A_k. Set n = n − 1 and go back to L if n ≥ 2. If N mod G = 0, then each random permutation is used for a single row; otherwise the same random permutation is repeated for K = LCM(N, G)/N consecutive rows, where LCM(N, G) denotes the least common multiple of N and G. For example, given the random permutation P_1 = {0, 9, 7, 6, 2, 1, 5, 3, 4, 8} for the first row in Table II, and since K = LCM(10, 4)/10 = 2, the same permutation is repeated on the second row as well, as shown in Table III. Note that SUs of a parity group straddling two rows are mapped onto different disks, since initially all SUs were on different disks and the same permutation is applied to both rows (this is the case for D6, D7, D8, and P6-8). This makes it possible to apply the RAID5 paradigm for recovery and to access all the data SUs in the parity group in parallel. Given N, the SU size, and the RAID level, we can determine the row number I for a given block number. The SU in which the block resides is obtained by applying the permutation using I as a seed (see the sketch below).

3.4 Discussion and Other Designs

Six properties for ideal layouts are given in Holland et al. [1994]: (i) Single failure correcting: the SUs in the same stripe are mapped to different disks. (ii) Balanced load due to parity: all disks have the same number of parity SUs mapped onto them. (iii) Balanced load in failed mode: the reconstruction workload should be balanced across all disks. (iv) Large write optimization: each stripe should contain N − 1 contiguous SUs, where N is the parity group size. (v) Maximal read parallelism: the reading of n ≤ N disk blocks entails accessing n disks. (vi) Efficient mapping: the function that maps physical to logical addresses is easily computable. Some data layouts addressing the above problems are as follows. The Permutation Development Data Layout (PDDL) is a mapping function described in Schwarz et al. [1999]. It has excellent properties and good performance under both light and heavy loads, like the PRIME [Alvarez et al. 1998] and DATUM [Alvarez et al. 1997] data layouts, respectively.
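Returning to the NRP mapping of Section 3.3.2, the following sketch (our code) seeds a Fisher-Yates shuffle (Algorithm 235) per group of K rows and applies it to the initial allocation of Table II. Because a different pseudo-random number generator is used, it produces a valid NRP layout but not the particular permutation shown in Table III.

```python
import random
from math import lcm     # Python 3.9+

def row_permutation(seed, N):
    rng = random.Random(seed)           # the row (group) number acts as the seed
    perm = list(range(N))
    for n in range(N - 1, 0, -1):       # Fisher-Yates shuffle
        k = rng.randint(0, n)
        perm[n], perm[k] = perm[k], perm[n]
    return perm

def permute_rows(layout, N, G):
    K = lcm(N, G) // N                  # number of rows sharing one permutation
    out = []
    for r, row in enumerate(layout):
        perm = row_permutation(r // K, N)
        new_row = [None] * N
        for j, su in enumerate(row):
            new_row[perm[j]] = su       # SU at column j moves to disk perm[j]
        out.append(new_row)
    return out

initial = [["D0", "D1", "D2", "P0-2", "D3", "D4", "D5", "P3-5", "D6", "D7"],
           ["D8", "P6-8", "D9", "D10", "D11", "P9-11", "D12", "D13", "D14", "P12-14"]]
for row in permute_rows(initial, N=10, G=4):
    print(row)
```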
3.5 The RM2 Disk Array

RM2 is a 2DFT proposed in Park [1995], which poses and solves the following problem. “Given a redundancy ratio p and N disks, construct N parity groups each of which consists of 2(M − 1) data blocks and one parity block such that each data block should be included in two groups, where M = 1/p.” The redundancy ratio is the fraction of disk blocks, or equivalently rows, dedicated to parity out of all blocks or rows. There is one row of parities per M − 1 rows of data blocks per segment, hence p = 1/M. The parity blocks fill the row and are hence distributed evenly over the disks. An algorithmic solution to this problem is based on an N × N redundancy matrix (RM), where each column corresponds to a disk and each row corresponds to a parity group. The columns of RM are called placement vectors. The values of the elements of RM, RM_{i,j}, are defined as follows: RM_{i,j} = −1: the parity block of disk j belongs to parity group i. RM_{i,j} = 0: none of the blocks in disk j belongs to parity group i. RM_{i,j} = k, 1 ≤ k ≤ M − 1: the kth data block of disk j belongs to group i. An RM2 data layout is constructed as follows:
— (i) select the target redundancy ratio p and set M = 1/p.
— (ii) select the number of disks N so that N ≥ 3M − 2 if N is odd or N ≥ 4M − 5 if N is even.
— (iii) construct a seed placement vector for M and N (the T here stands for transpose):

(−1, M−1, M−2, . . . , 2, 1, 1, 2, . . . , M−2, M−1, 0, . . . , 0)^T,

with a total of N elements.
— (iv) construct the N × N RM matrix column-by-column by rotating the seed placement vector.

For example, given p = 1/3 and M = 3, N = 7 is the smallest number satisfying the inequalities. The seed placement vector is (−1, 2, 1, 1, 2, 0, 0)^T. The RM matrix and data layout are shown in Figure 8. The parity group size for RM2 is 2M − 1 blocks, since each parity block protects 2M − 2 data blocks. The inequalities imply that p = 1/M ≥ 3/(N + 2) or p ≥ 4/(N + 5), which means that RM2 has a higher redundancy ratio than the 2/N of RAID6, e.g., for N = 7: 1/3 > 2/7. A data block on a failed disk can be reconstructed by accessing the surviving data blocks and one of the parities, which are guaranteed not to be on the failed disk. For two disk failures we need a recovery path. In Figure 9, b_i^j refers to the block in column (or disk) i and row j. To access block d_{2,3}, or b_0^1, assuming that disks 0 and 3 have failed, we proceed as follows. Utilizing p_2 and the associated parity group we have d_{2,3} = p_2 ⊕ d_{2,5} ⊕ d_{6,2} ⊕ d_{1,2}. However, if disk 2 fails instead of disk 3, then d_{2,3} cannot be reconstructed using p_3 directly, since d_{3,6} is not available. However, d_{3,6} can be reconstructed using p_6, so we have the following two steps: (1) d_{3,6} = p_6 ⊕ d_{5,6} ⊕ d_{0,6} ⊕ d_{6,2}. (2) d_{2,3} = p_3 ⊕ d_{3,4} ⊕ d_{3,6} ⊕ d_{0,3}.
(a) RM (redundancy matrix), M = 3 and N = 7:

        D0  D1  D2  D3  D4  D5  D6
PG0     -1   0   0   2   1   1   2
PG1      2  -1   0   0   2   1   1
PG2      1   2  -1   0   0   2   1
PG3      1   1   2  -1   0   0   2
PG4      2   1   1   2  -1   0   0
PG5      0   2   1   1   2  -1   0
PG6      0   0   2   1   1   2  -1

(b) Corresponding disk layout:

        D0   D1   D2   D3   D4   D5   D6
        P0   P1   P2   P3   P4   P5   P6
        D23  D34  D45  D56  D06  D01  D12
        D14  D25  D36  D40  D51  D62  D03

Fig. 8. A Sample RM2 Layout. (a) RM (redundancy matrix) with M=3 and N=7, (b) Corresponding Disk Layout. D0 . . . D6 are data disks, PG0 . . . PG6 are parity groups. Block Di,j is protected by two parity blocks Pi and Pj.
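The redundancy matrix of Figure 8(a) can be generated mechanically from the seed placement vector; the following sketch (our code) does so for general M and N by rotating the seed column by column, per step (iv) above.

```python
def rm2_matrix(M, N):
    """Build the N x N RM2 redundancy matrix from the seed placement vector."""
    seed = [-1] + list(range(M - 1, 0, -1)) + list(range(1, M)) + [0] * (N - (2 * M - 1))
    # column j of RM is the seed vector rotated down by j positions
    return [[seed[(i - j) % N] for j in range(N)] for i in range(N)]

for row in rm2_matrix(M=3, N=7):
    print(row)      # prints the 7 x 7 matrix of Figure 8(a), one parity group per row
```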
Fig. 9. Recovering from double disk failures in RM2. The recovery of block b_0^1, which is covered by parity groups 2 and 3, can proceed as follows: (1) Read blocks b_3^1, b_4^1, b_5^2, b_6^0. (2) Reconstruct block b_2^2 by the equation b_2^2 = b_3^1 ⊕ b_4^1 ⊕ b_5^2 ⊕ b_6^0. (3) Read blocks b_1^1, b_3^0, b_6^2. (4) Reconstruct block b_0^1 by the equation b_0^1 = b_2^2 ⊕ b_1^1 ⊕ b_3^0 ⊕ b_6^2.
Therefore, the recovery path for block d_{2,3} has a length of two. With disks D0 and D1 failed, the path to reconstruct d_{1,4} is d_{2,5} → d_{2,3} → d_{3,4} → d_{1,4}.

3.6 Load Imbalance in Degraded Mode

It was observed via simulation that the load of the RM2 disk array with one and two disk failures is unbalanced. The same observation could be made by an enumeration of various configurations. Several methods to deal with load imbalance are given in Thomasian et al. [2007b], for example, load balancing using the NRP method described in Section 3.3.
Simulation results for RAID6 with a single disk failure show a load imbalance in processing reads and writes. This load imbalance can be verified by a simple analysis, assuming all blocks are accessed uniformly [Thomasian et al. 2007b]. The load imbalance due to reads can be alleviated by randomly selecting N − 2 out of the N − 1 blocks on surviving disks to rebuild a missing block. The small load imbalance in processing write requests can be eliminated by using NRP.

3.7 Rebuild Processing in RAID5

Rebuild is the systematic reconstruction of the contents of a failed disk on a spare disk. The rebuild process in RAID5 reads consecutive Rebuild Units (RUs), which may correspond to tracks, from surviving disks, XORing them to reconstruct missing RUs, which are then written onto a spare disk drive. Reading a track is advantageous because of the zero latency read (ZLR) capability [Ng 1998]; that is, the reading of a track can be started almost immediately, at the first sector encountered, and takes one disk rotation. The self-monitoring analysis and reporting technology (SMART) [Smartmontools 2008] monitors disk errors. The contents of a disk whose failure is imminent are copied onto a spare disk before it fails. This is especially helpful in RAID5 and RAID6, because the reading of all surviving disks is bypassed. The recent increase in disk drive capacities has resulted in an increase in the time required to read the contents of a disk for rebuild, but also for backup. Assuming that other factors, such as the bandwidth of buses and the bandwidth of the hardware/software mechanism for computing parities, do not affect the time to compute the contents of the spare disk, rebuild time can be approximated by the time to read each one of the disks. The time to read a disk can be approximated as the ratio of the disk capacity (say 1 TB) and its average transfer rate (say 200 MB/second), which amounts to roughly an hour and a half for an idle disk. In spite of smaller drive diameters, the larger disk capacity is made possible by the increased areal recording density [Gray and Shenoy 2000]. The areal density is the product of the linear recording density and the number of tracks per inch. Rebuild in offline mode takes the least possible time, since the contents of a disk can be read in an uninterrupted manner. Rebuild time is then the product of the total number of tracks per disk and the disk rotation time, where the number of tracks is the number of (active) disk surfaces times the number of tracks per surface. High data availability is a RAID5 requirement, so that the processing of user requests is continued in degraded mode, even when rebuild is in progress. Rebuild in online mode takes longer than rebuild in offline mode, because the disk arm movement to process external requests disallows the efficient reading of successive blocks. In the worst case, the reading of each rebuild unit (RU), for example, a track, requires a seek, but more than one RU can be read per seek at lower disk utilizations in processing external requests. A rather complex analysis is required to estimate the reading time of each track, which varies as rebuild progresses and is affected by the intensity of external requests [Thomasian and Menon 1997; Thomasian et al. 2007a].
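For concreteness, a back-of-the-envelope version of the offline rebuild estimate above, using the illustrative numbers quoted in the text (assumed values, not measurements):

```python
# Offline rebuild time as capacity / average transfer rate.
capacity_mb = 1_000_000          # 1 TB expressed in MB
transfer_rate_mb_s = 200         # average transfer rate in MB/second
read_time_h = capacity_mb / transfer_rate_mb_s / 3600
print(f"time to read an idle 1 TB disk: {read_time_h:.2f} hours")   # ~1.4 hours
```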
A hot spare allows the rebuild process to be started right away, thus reducing the window of vulnerability to data loss due to a second disk failure. Given that the MTTF (mean time to failure) of modern drives is over a million hours, the possibility of a second failure during rebuild is rather small, unless disk drive failures are correlated, for example, they belong to the same manufacturing batch. The sparing overhead may be reduced by allowing the spare disk to be shared among several RAID5 arrays. Markov chain modeling is used in Ng [1994a] to show that sharing spare disks has little effect on Mean Time to Data Loss (MTTDL) (see Section 4), but has significant effect on the mean time to service calls, which affects the operational cost of the array. The wasted bandwidth of the spare drive in dedicated sparing can be eliminated by adopting distributed sparing, that is, assigning enough empty space on each drive to hold the contents of a failed drive [Menon and Mattson 1992; Thomasian and Menon 1997]. The rebuild process is slowed down due to the fact that rebuild reads contend with rebuild writes for disk bandwidth of surviving disk, but disk loads remain balanced. With read redirection materialized data blocks are read directly from the spare drive or spare areas, rather than reconstructed on demand by accessing surviving disk drives [Muntz and Lui 1990]. In dedicated sparing the utilization of surviving disks is lowered and this results in speeding up the rebuild process. In this case the utilization of the spare disk remains below the utilization of surviving disk during rebuild, until the rebuild process is completed at which point all disk utilizations become equal. Note that the spare disk needs to be accessed when materialized data blocks are updated, even when read redirection is not in effect. The spare disk may become a bottleneck in clustered RAID [Muntz and Lui 1990], because of the parallelism available to the rebuild process, since more than one RU may be rebuilt when the next stripe is read. The resulting backlog of RUs to be written can be reduced by modifying the fraction of reads redirected to the spare disk, so that disk loads are balanced [Muntz and Lui 1990; Merchant and Yu 1996; Fu et al. 2004b]. In parity sparing there are two parity groups (RAID5 arrays) with N1 and N2 disks. When a disk fails, the SUs of one of the parities are used as spare areas for the other array. RAID6 with N = N1 + N2 drives is a better starting point than two RAID5 disk arrays, because it provides a higher reliability. When the first disk fails in RAID6 (see Figure 7), the array first reverts to RAID5 by using the Q parities as spare areas. After a second disk failure, RAID5 can revert to RAID0 by using the P parities as spare areas. The RU is the minimum amount of data read by the rebuild process for reconstruction. The larger the RU size the smaller the rebuild time, since fewer seeks will be required to complete rebuild processing. The larger RU size results in longer disk access times, during which external requests may be delayed. The time to process user request is increased by the mean residual (remaining) time to process rebuild reads, which is determined by the analysis in Thomasian and Menon [1997] and Thomasian et al. [2007a]. The increase in the waiting time of user requests should equal one half of rebuild time, for example, one half of ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
disk rotation time. The analysis is more complex since the first in a sequence of rebuild reads requires a seek, while there is no seek associated with the rebuild reads that follow (see the discussion of the vacationing server model that follows). This tradeoff is quantified via simulation in Hou et al. [1993]. Setting the RU size to a track results in variable RU sizes in zoned disks with Zoned Bit Recording (ZBR) [Ng 1998; Jacob et al. 2008], which violates the definition of the RU and complicates buffer space management for rebuild processing. The onboard cache or track buffer [Ruemmler and Wilkes 1994; Ng 1998] may be used to hold sectors read from a track, so as to obviate the rereading of track data. In fact, the whole track may be read into the buffer, but only the next consecutive RU is made available to the rebuild process implemented at the DAC (disk array controller). When ZBR is in effect it is preferable to start the rebuild process with the outer tracks, since in this manner more data is rebuilt per disk rotation. There are many options for rebuild processing [Holland et al. 1994]. In disk-oriented rebuild the reading of data from disks is carried out independently, but may be constrained by the buffer space available at the DAC. Successive RUs are read from surviving disks and XORed to rebuild the missing RU, which is then written to the spare disk. Partially reconstructed RUs are held in a rebuild buffer at the DAC. The lack of synchronization in reading from disks, when disk loads are unbalanced, may result in buffer overflow. If so, rebuild reading from a disk running ahead of others should be throttled. In stripe-oriented rebuild, rebuild is carried out one stripe at a time. Because of the fork-join synchronization delay associated with each stripe, this method does not utilize disk bandwidth as efficiently as disk-oriented rebuild and is outperformed by it [Holland et al. 1994; Holland 1994]. Only disk-oriented rebuild is considered in the following discussion. The vacationing server model (VSM) for rebuild is based on the vacationing server model in queueing theory [Kleinrock 1975; Takagi 1991]. Ordinarily, a disk (server) alternates between the processing of user requests, which corresponds to a busy period, and an idle period. The busy period resumes when a new external request arrives. With VSM the server takes one vacation after another while there are no external requests in the queue. Vacations correspond to the reading of successive RUs from a disk, until an external request arrives. The first rebuild read in a sequence requires a seek, while the following rebuild read reads the successive track (or RU) with no seek [Thomasian and Menon 1994, 1997]. We have a multiple vacation model with variable durations [Takagi 1991]. While RAID5 is in rebuild mode each disk processes its own requests as well as fork-join requests on behalf of the failed disk. By completing the processing of user requests first, VSM in effect gives external requests a higher (nonpreemptive) priority over rebuild requests. This is justified since the response time of user requests affects application performance directly. Figure 10 shows the processing of external requests and rebuild processing during idle periods on a surviving disk in a RAID5 disk array according to VSM. Rebuild processing is stopped when an external request arrives. The reading of the first RU requires a seek, while successive RUs can be read without incurring a seek. The queueing analysis of VSM with multiple vacations of variable durations [Takagi 1991] is applied to RAID5 with dedicated sparing in Thomasian and Menon [1994] and with distributed sparing in Thomasian and Menon [1997], respectively.
Fig. 10. Key parameters associated with rebuild processing [Thomasian et al. 2007a].
More accurate analyses of rebuild time in RAID5 with larger RU sizes and ZBR are given in Thomasian et al. [1997]. The permanent customer model (PCM) was proposed and analyzed in Boxma and Cohen [1991]. PCM was applied to rebuild processing in RAID5 in Merchant and Yu [1996]. Each drive processes one rebuild read access at a time; that is, as soon as a rebuild request is processed it is immediately replaced by a new rebuild read at the tail of the queue of disk requests. A simulation study of rebuild processing shows that VSM outperforms PCM, not only from the viewpoint of rebuild time, but also in disk response time in processing external requests during rebuild [Fu et al. 2004b]. The reason for the latter is that PCM carries out rebuild accesses at the same priority level as external requests, while VSM initiates rebuild requests only when a drive is idle, and the processing of rebuild requests is stopped when an external request becomes available. Rebuild time is lower with VSM than with PCM, since more consecutive RUs are read without incurring seeks. This can be explained by the fact that rebuild reads in PCM, which spend time in the queue of user requests, have a higher probability of being intercepted by user requests, so that there are fewer opportunities for the processing of consecutive rebuild requests. The probabilities that rebuild requests are intercepted by user requests arriving with rate λ under VSM and PCM rebuild are given by Fu et al. [2004b] as

P_{VSM} = 1 − exp(−λ x̄_{RU}),    P_{PCM} = 1 − exp(−λ(x̄_{RU} + W_{RU})),

where x_{RU} denotes the time to read an RU under the VSM and PCM methods, treated as a constant set equal to its mean value x̄_{RU} to simplify the discussion.
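A numeric illustration of these interception probabilities follows; the arrival rate, RU read time, and the PCM queueing delay W_RU (defined in the text below) are assumed values chosen only to show the qualitative effect.

```python
from math import exp

lam = 50.0       # user request arrival rate (requests/second), assumed
x_ru = 0.006     # mean time to read one RU (seconds), assumed
w_ru = 0.020     # mean PCM queueing delay of a rebuild read (seconds), assumed

p_vsm = 1 - exp(-lam * x_ru)
p_pcm = 1 - exp(-lam * (x_ru + w_ru))
print(f"P_VSM = {p_vsm:.3f}, P_PCM = {p_pcm:.3f}")   # P_VSM < P_PCM
```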
W_{RU} is the mean waiting time in the queue for a rebuild request in PCM. Since x̄_{RU} + W_{RU} > x̄_{RU}, it follows that P_{VSM} < P_{PCM}. Rebuild processing in mirrored disks with ordered and greedy rebuild policies under PCM and VSM is investigated in Bachmat and Schindler [2002]. The greedy policy selects the track closest to the last completed user request (note the similarity to free-block scheduling [Lumb et al. 2002]). It is shown via simulation and analytic results that the greedy policy provides little improvement in response time. As far as rebuild time is concerned: (i) VSM greedy outperforms PCM greedy in rebuild time, except at very high disk utilizations; (ii) the greedy policy outperforms the ordered policy for both PCM and VSM, and the difference increases with the drive utilization. It should be noted that the greedy policy is not applicable to RAID5, since it would result in excessive buffer requirements for rebuild. Several variations of rebuild processing are discussed in Muntz and Lui [1990]. Baseline rebuild reconstructs the contents of a failed disk on a spare disk, but also updates the spare disk when a block has already been materialized on it. Improvements to rebuild performance with respect to baseline rebuild are described in Muntz and Lui [1990]. Rebuild with read redirection sends read requests intended for the failed disk directly to materialized disk blocks on the spare disk, while baseline rebuild uses fork-join requests to reconstruct a block which is already materialized on the spare disk. There are two positive effects: (i) the load on surviving disks is lowered, which results in improved response times; (ii) the response time for accesses to the failed disk improves, since fork-join requests tend to take more time due to the synchronization effect. Piggybacking is a method to improve rebuild time by writing data blocks reconstructed on demand on the spare disk [Muntz and Lui 1990]. The problem with this method is that rebuilding one of many blocks on a disk track does not result in a reduced rebuild time. In fact, for low α = (G − 1)/(N − 1), when the spare disk is a bottleneck, piggybacking has a detrimental effect on rebuild time. This effect, which contradicts the analysis in Muntz and Lui [1990], has been shown via simulation in Holland et al. [1994] and Holland [1994]. A variation of piggybacking is considered in Fu et al. [2004b], where instead of rebuilding one block, the whole track on which the block resides is rebuilt. Simulation results show that this option results in an increased response time for user requests, which is due to increased disk utilizations. The preemption of rebuild reads (and writes) can be used to improve the response time of user requests. A split-seek option, which abandons a rebuild read after a seek is completed, is analyzed in Thomasian and Menon [1994]. Preemptions of rebuild read requests during the latency and transfer phases are evaluated via simulation in Thomasian [1995]. The improvement in user response times should be weighed against the increase in rebuild time; that is, even more disk requests will encounter an increased response time [Thomasian 1995]. One scheme to minimize rebuild time is to provide several anchor points, which serve as starting points for this process. For example, placing anchor points at cylinders C/4 and 3C/4 will initially reduce the mean seek distance from C/3 to C/8 cylinders.
Simulation studies in Holland [1994] have shown negligible improvement in performance, but an increased buffer space requirement. The anchors for rebuild may be based on current hotspots in the workload. The Popularity-based multithreaded Reconstruction Optimization (PRO) method proposed in Tian et al. [2007] adjusts the anchors based on the hotspots. This has two advantages: (i) carrying out rebuild reads in disk areas being accessed by external workloads minimizes the seek distance; (ii) rebuilding the hot-spot areas reduces the disk load on surviving disks more quickly than would be possible otherwise. Rebuild time can be reduced by associating empty/full tags with RUs, so that the rebuilding of RUs in a stripe can be bypassed altogether, or only nonempty RUs need be read. Heterogeneous Disk Arrays (HDAs) allow Virtual Arrays (VAs) at different levels to share space on a set of physical disks [Thomasian et al. 2005]. Rebuild processing in HDA can progress one VA at a time, starting with the more critical (highly accessed) VAs. The time to read the contents of an idle disk at zero load, that is, T_rebuild(ρ) with ρ = 0, equals the number of disk tracks multiplied by the disk rotation time, plus additional delays due to track and cylinder skews, faulty sectors, etc. Rebuild time is expanded due to the processing of user requests, and to a first approximation T_rebuild(ρ) ≈ T_rebuild(0)/(1 − ρ). The analysis in Thomasian et al. [2007a] takes into account the variation in ρ as rebuild progresses. Curve-fitting against simulation results with track-size RUs yields the following empirical equation for the rebuild time when all external requests are reads to small blocks: T_rebuild(ρ) = T_rebuild(0)/(1 − βρ) with β ≈ 1.75 [Fu et al. 2004a]. I/O workload outsourcing for boosting RAID reconstruction performance by setting up a surrogate RAID was investigated in Wu et al. [2009]. The surrogate RAID may be a RAID1 or RAID5. Hot mirroring was proposed in Mogi and Kitsuregawa [1996] to improve RAID5 performance in normal mode as well as in rebuild mode. Hot data blocks are stored on mirrored disks, so that this method is similar to HP’s AutoRAID array [Wilkes et al. 1996]. RAID6 rebuild with two failed disks is expected to be rare, unless disk failures are correlated. In the rare case when a second disk fails after a fraction of the first disk has been rebuilt, it is best to complete the processing of the first disk, while concurrently rebuilding the corresponding blocks on the second disk.

4. RAID RELIABILITY EVALUATION

Reliability is an important system attribute. Downtime in computer systems results in a loss of revenue, which depends on the platform being supported [Hennessy and Patterson 2006]. Our discussion in this section is restricted to HDDs, which in addition to causing computer failures may result in data loss. If an HDD is not part of a redundancy group or backed up, the loss of data resulting from its failure is especially critical if the data is irreproducible (images from a mission to another planet) or very expensive to reproduce (rerunning a complex physics experiment).
In this section, we first review studies to determine HDD reliability. We next discuss reliability models of RAID systems, concluding with a shortcut method to compare reliabilities [Thomasian 2006c].

4.1 Failures in Hard Disk Drives

Although individual disk drives are highly reliable, with an MTTF (mean time to failure) exceeding a million hours, many computer installations have hundreds and even thousands of drives. Disks are the most frequently replaced hardware component in two out of three computer installations considered in Schroeder and Gibson [2007], which is due to the large number of disks with respect to other components in a computer system. HDD reliability is a difficult characteristic to measure, because it depends on disk utilization (duty hours), temperature, altitude, operating environment, spindle start/stops, etc. The reliability reduction due to some of these effects is quantified in Anderson et al. [2003]. It is shown, for example, that the reliability of ES (enterprise storage) drives with a SCSI/FC interface is higher than that of PS (personal storage) and nearline HDDs with a SATA/IDE interface [Anderson et al. 2003]; that is, unrecoverable sector errors for SATA are ten times more likely than for SCSI/FC drives [Iliadis et al. 2008]. The HDD failure rate versus time has the form of a bathtub curve [Gibson 1992; Trivedi 2002; Schroeder and Gibson 2007], which is initially high in the infant mortality period. The burn-in process is used to eliminate disks that fail in this period. The failure rate decreases as the infant mortality period comes to an end, remains flat during the useful life period, and then increases in the wearout period. The early failure period lasts one year and the wearout period starts in years 5-7 [Schroeder and Gibson 2007]. The International Disk Drive Equipment and Materials Association (IDEMA, http://www.idema.org) proposed a new standard for reporting HDD reliability, that is, the MTTF (mean time to failure) is specified as a step function for months 1-3, 4-6, 7-12, and 13-60. This effect is studied in Xin et al. [2005] using a Markov chain model and simulation to estimate the data loss probability versus time. The “cohort effect” arises when a batch of new drives is introduced into a system simultaneously to replace disks that have reached their End of Designed Life (EODL). A higher redundancy level is proposed to protect disks in their infancy. A statistical analysis of returned HDDs in Gibson [1992] showed that the time to disk failure can be approximated by an exponential or a Weibull distribution [Trivedi 2002]. Continuous Time Markov Chain models (CTMCs) of RAID are reported in [Gibson 1992; Trivedi 2002]. It is assumed that the disk repair time, which is tantamount to the rebuild time, is exponentially distributed, but this is known not to be true. A semi-Markov chain model is used in [Malhotra and Trivedi 1993], and simulation with a constant repair (rebuild) time is used in Gibson and Patterson [1993], to show that the assumption that repair is exponentially distributed yields accurate results. Reliability modeling results in Gibson [1992] are summarized in Gibson and Patterson [1993] and Chen et al. [1994].
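As an illustration of the kind of CTMC calculation cited above, the following sketch evaluates the textbook two-state RAID5 model with exponential failure and repair times; it is not the exact model used in Gibson [1992], and it ignores sector errors, scrubbing, and correlated failures. The array size, MTTF, and MTTR values are illustrative assumptions.

```python
def raid5_mttdl(N, mttf_h, mttr_h):
    """MTTDL of RAID5 under the standard two-failure Markov model (hours)."""
    lam = 1.0 / mttf_h          # per-disk failure rate
    mu = 1.0 / mttr_h           # rebuild (repair) rate
    # states: all disks up -> one disk failed -> data loss (absorbing)
    return ((2 * N - 1) * lam + mu) / (N * (N - 1) * lam ** 2)

years = raid5_mttdl(N=8, mttf_h=1_000_000, mttr_h=10) / (24 * 365)
print(f"MTTDL ~ {years:,.0f} years")
```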
Most data on HDD reliability in the past was based on field studies or accelerated aging experiments carried out by disk manufacturers [Pinheiro et al. 2007]. A repository of disk failure data at Google, with Self-Monitoring Analysis and Reporting Technology (SMART) used as an indicator of HDD health, led to the following observations: (i) The Annualized Failure Rate (AFR), which differs from the manufacturer Annualized Return Rate (ARR), is not affected by disk utilization, but rather by age. (ii) Older drives tend to fail more with increasing temperature, etc. Similarly to Gibson [1992], Schroeder and Gibson [2007] deals with the time to replacement, which is denoted by F(t). The following results are reported in this study: (i) the hazard rate h(t) = f(t)/(1 − F(t)), with f(t) the density function, for example, h(t) = λ for the exponential distribution; (ii) the AutoCorrelation Function (ACF); and (iii) Long Range Dependencies (LRD) quantified by the Hurst coefficient. Interestingly, the replacement rates for SATA disks are not worse than those for SCSI or FC disks. The Computer Failure Data Repository (CFDR, http://cfdr.usenix.org) originated with a two-year study of a large number of HDDs [Schroeder and Gibson 2007]. This study shows neither an infant mortality rate nor a constant failure rate, but instead that the failure rate varies with the make of the drive, vintage or batch, and slightly increases with time. One key observation is that while disk MTTFs are stated as one to 1.5 million hours, which corresponds to an AFR of at most 0.88%, annual disk replacement rates between 2 and 4% are common. Rather than catastrophic disk failures, Kari [1997] deals with sector errors leading to LSFs. Since LSFs can significantly reduce RAID5 reliability due to unsuccessful rebuilds, “auditing” algorithms are proposed to fix sector errors. Disk scrubbing reads successive disk sectors, computes their checksums, and compares them with the Data Integrity Segment (DIS) [Bairavasundaram et al. 2008]. In the case of an inconsistency, the data block is reconstructed using the RAID5 paradigm. The advantage of disk scrubbing is that sector errors are detected early, before they hinder the completion of the rebuild process. Four disk scrubbing methods are proposed in Kari [1997]. The original Disk Scanning Algorithm (DSA) starts a scrubbing request when the disk is idle after a certain waiting time wt has expired, so that user requests will be delayed by at most one scrubbing request. The adaptive DSA adjusts wt based on system activity; that is, wt is increased for heavier loads and decreased for lighter loads. Simplified DSA maintains two queues, one for user and one for scanning requests. After the user queue has been empty for a sufficiently long interval, a scanning request is issued. Rather than starting DSA right away, it can be issued periodically, for example, once a day. DSA with the VERIFY command transfers read data into internal disk buffers, where the consistency of the data is checked. The following deficiencies are identified in earlier studies of HDD failure processes in Elerath [2007] and Elerath and Pecht [2007]: (1) Failure rates are not constant in time. (2) Failure rates change based on production vintage.
(3) Failure distribution can be a mixture of multiple distributions, because of production vintages. (4) Repair rates are not constant, and repairs have a minimum time to complete. (5) Permanent errors can occur at any time. (6) Latent defects must be considered in the model.

Sector errors are analyzed from several viewpoints in Bairavasundaram et al. [2007]: (i) their dependence on age and disk size; (ii) the number of errors per disk for disks with at least one error, as well as the spatial and temporal locality of errors; (iii) the types of requests that encounter these errors; (iv) their correlation with recovered errors and not-ready conditions. This study was carried out over 1.53 million enterprise and nearline (archival) disks at NetApp over a 32-month period. The conclusions of this study are summarized in Table I.

A related study analyzes data corruption in the storage stack: checksum mismatches, identity discrepancies, and parity discrepancies [Bairavasundaram et al. 2008]. The cyclic redundancy checksum (CRC), the remainder of dividing a binary string by a polynomial, is used in addition to the ECC [Jacob et al. 2008]. A 4 KB block is stored using eight 520-byte sectors in an enterprise drive, which leaves 64 bytes for the DIS (data integrity segment). It takes nine 512-byte sectors in nearline drives to store a 4 KB block and a 64-byte DIS, with some space left empty. The DIS holds the checksum of the data block and the identity of the data block, which can be used to ensure that the block being accessed belongs to the appropriate inode (a UNIX data structure which holds basic information about a file). HDD failures logged by NetApp's Autosupport system are used in this study, and it is observed that the probability of developing checksum mismatches is higher for nearline than for enterprise HDDs. This probability varies with HDD model and age, but not with HDD size or the workload. Additional publications in this area are available.4

4 http://www.cs.wisc.edu/wind/Publications/.

Disk scrubbing in a Massive Array of (mainly) Idle Disks (MAID) is considered in Schwarz et al. [2004]. The obvious advantage of MAID is reduced power consumption, but each power off/on has about the same effect on disk reliability as running the disks all the time. Random, deterministic, and opportunistic scrubbing, that is, scrubbing disks that are powered up for access anyway, are considered in this study. Opportunistic scrubbing is motivated by the fact that the mandatory powerups required for deterministic scrubbing result in reduced disk reliability. Simulation results show that the expected number of data losses decreases sharply with at least one scrub per year, but more frequent scrubs result in higher data loss, since disk reliability decreases with the number of times disks are powered on.

A more careful analysis of disk scrubbing than Schwarz et al. [2004] is reported in Iliadis et al. [2008]. The probability of an error due to a write is Pw, and writes constitute a fraction rw of disk accesses, which arrive at rate h. The probability of an error in reading a sector without scrubbing is Pe = rw Pw [Iliadis et al. 2008]. The probabilities of sector failure for a deterministic scrubbing period and a random (exponentially distributed) scrubbing period are Ps = [1 − (1 − e^{−hT_S})/(hT_S)]Pe and Ps = [hT_S/(1 + hT_S)]Pe, respectively, where T_S denotes the mean scrub period. It follows that deterministic scrubbing is preferable, since random scrubbing yields roughly double the value of Ps when hT_S ≪ 1. Furthermore, Ps ≤ Pe ≤ Pw.
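To make the comparison concrete, here is a minimal sketch that evaluates the two expressions above for the sector failure probability under deterministic and random scrubbing; the parameter values are placeholders and are not taken from Iliadis et al. [2008].

```python
import math

def ps_deterministic(pe: float, h: float, t_s: float) -> float:
    """Sector failure probability with a deterministic scrub period T_S."""
    x = h * t_s
    return (1.0 - (1.0 - math.exp(-x)) / x) * pe

def ps_random(pe: float, h: float, t_s: float) -> float:
    """Sector failure probability with exponentially distributed scrub periods of mean T_S."""
    x = h * t_s
    return (x / (1.0 + x)) * pe

# Placeholder values: error probability per write, access rate, and scrub period.
pe, h, t_s = 1.0e-8, 0.01, 24.0
det, rnd = ps_deterministic(pe, h, t_s), ps_random(pe, h, t_s)
print(det, rnd, rnd / det)   # random scrubbing gives the larger Ps (roughly 2x for small h*T_S)
```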
4.2 Reliability Analysis in RAID5 Disk Arrays

Most analytic reliability studies utilize the CTMC model, which requires the time to HDD failure and the RAID rebuild time to be exponentially distributed [Gibson 1992; Gibson and Patterson 1993; Malhotra and Trivedi 1993; Schwarz 1994; Kari 1997; Elerath 2007]. Advanced analytic methods and simulation can be adopted otherwise. A unified technique for reliability analysis of repairable systems with spares under Markovian assumptions, which leads to an analytic expression for the distribution of the time to system failure, is developed in Ng and Avizienis [1980]. The System Availability Estimator (SAVE) package, which was developed at IBM Research [Blum et al. 1994], incorporates, in addition to a similar analytic solution method, a rare-event simulator for analytically intractable cases [Nicola et al. 1993]. Sharpe is a reliability modeling and performance analysis package developed at Duke University [Sahner et al. 1996].5 There are also commercial packages for this purpose.

5 http://www.ee.duke.edu/~kst.

The CTMC for the reliability model of a RAID5 with N + 1 disks has three states: the fault-free (normal), critical (degraded), and failed states, denoted by S_{N+1}, S_N, and S_{N−1}, respectively. The disk failure rate is λ and the repair rate is μ, so that the Mean Time to Repair (MTTR) equals 1/μ. The analysis determines the MTTDL, which is the mean passage time from the initial fault-free state (S_{N+1}) to the final failure state (S_{N−1}). The transition S_{N+1} → S_N corresponds to the failure of any one of the N + 1 disks and has rate (N + 1)λ. The transition S_N → S_{N+1}, which corresponds to repair via rebuild, has rate μ. A second disk failure, via the transition S_N → S_{N−1} with rate Nλ, leads to data loss. The analysis leads to the following reliability expression [Gibson 1992] (also see Example 3.84 in Trivedi [2002]):
R_{\mathrm{RAID5}}(t) = \frac{a e^{bt} - b e^{at}}{a - b}, \qquad a, b = -\frac{1}{2}\left[(2N+1)\lambda + \mu\right] \pm \frac{1}{2}\sqrt{\lambda^2 + \mu^2 + 2(2N+1)\lambda\mu}.

The MTTDL is given as:

\mathrm{MTTDL} = \int_0^{\infty} R_{\mathrm{RAID5}}(t)\,dt = \frac{(2N+1)\lambda + \mu}{N(N+1)\lambda^2} \approx \frac{\mathrm{MTTF}^2}{N(N+1)\,\mathrm{MTTR}}. \qquad (13)
a is much larger than b, since λ is much smaller than μ (λ ≪ μ). Consequently, R(t) can be approximated by a single exponential [Gibson 1992]:

R_{\mathrm{RAID5}}(t) \approx e^{-t/\mathrm{MTTDL}}. \qquad (14)
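A brief sketch that evaluates Equation (13) and cross-checks it against the mean passage time computed directly from the transient part of the CTMC generator; the disk parameters are illustrative placeholders.

```python
import numpy as np

def mttdl_raid5(n: int, mttf: float, mttr: float) -> float:
    """Closed-form MTTDL of Equation (13) for a RAID5 with N+1 disks."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    return ((2 * n + 1) * lam + mu) / (n * (n + 1) * lam ** 2)

def mttdl_ctmc(n: int, mttf: float, mttr: float) -> float:
    """Mean passage time from S_{N+1} to S_{N-1} via the transient generator matrix."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    q = np.array([[-(n + 1) * lam, (n + 1) * lam],     # S_{N+1}: any of the N+1 disks fails
                  [mu, -(n * lam + mu)]])              # S_N: rebuild (mu) or second failure (N*lam)
    return -np.linalg.solve(q, np.ones(2))[0]          # expected time to absorption from S_{N+1}

n, mttf, mttr = 7, 1.0e6, 10.0   # placeholders: 8 disks, 10-hour rebuild
print(mttdl_raid5(n, mttf, mttr), mttdl_ctmc(n, mttf, mttr))   # the two agree
```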
An alternate approach to obtain the MTTDL, without first deriving R_{RAID5}(t), is given in Gibson [1992]; it is based on obtaining the passage time from S_{N+1} to S_{N−1}. An even simpler approach to derive the MTTDL is given in Thomasian [2006c] and also in Dholakia et al. [2008]. At S_N there are two competing transitions: if the repair process finishes first, then the rebuild is successful and the system recovers with probability ps = μ/(μ + Nλ) = 1/(1 + N × MTTR/MTTF); otherwise there is data loss. The distribution of the number of successful rebuilds is given by the modified geometric distribution [Trivedi 2002]:
P_succ(k) = (1 − ps) ps^k, k ≥ 0. The mean number of successful rebuilds is given as K = ps/(1 − ps) = MTTF/(N × MTTR). The mean time to data loss is dominated by the time spent at S_{N+1}, which is exponentially distributed with mean MTTF/(N + 1). Multiplied by K, this yields the MTTDL given by Equation (13). Since the number of visits to this state has a geometric distribution, the total time is exponentially distributed, as given by Equation (14).

The RAID5 analysis has two weaknesses.

(1) Number of spares. It is assumed that there is an infinite number of spare drives, since the number of successful rebuilds is not bounded. This number is limited in some environments, such as storage bricks. The CTMC can be modified to reflect the fact that there is a finite number of spares. For example, a RAID5 with N + 1 active disks and two spare disks has five states: S_{N+3−i}, 0 ≤ i ≤ 4. A failure at S_N will lead to data loss.

(2) LSFs. The possibility of an unsuccessful rebuild due to LSFs is not taken into account.

The effect of ignoring LSFs was quantified in Blaum et al. [1995]. A disk array with N = 96 disks, consisting of six RAID5 arrays with G = 16 disks each, with MTTF = 200,000 hours and MTTR = 1 hour, has an MTTDL = MTTF^2/(N(G − 1)MTTR) ≈ 3,000 years [Blaum et al. 1995]. With fifteen surviving 3 GB disks with 6 million sectors per disk, there are 90 million sectors to be read. Given that one bit out of 10^13 bits is uncorrectable, the probability of reading all 90 million sectors successfully is 0.96. In other words, 4% of all disk failures lead to data loss. A similar example is given in Corbett et al. [2004].

Markov chain models for RAID5 and RAID6 reliability, which take into account the effect of uncorrectable errors during rebuild, are given in Kari [1997], Rao et al. [2006], and Dholakia et al. [2008]. We review the analysis of RAID5 disk arrays with N drives in Rao et al. [2006]. The probability of encountering an uncorrectable error during rebuild is h = (N − 1) Cdisk HER, where Cdisk is the disk drive capacity in bytes and HER is the rate of hard (or uncorrectable) errors per byte read. The transition rate for S_N → S_{N−1} is (1 − h)Nλ, but there is also a direct transition to the failed state, S_N → S_{N−2}, with rate Nλh. As before, we have two other transitions: S_{N−1} → S_N with repair rate μ, and S_{N−1} → S_{N−2} for the second disk failure. Alternatively, the original Markov chain model can be modified with two transitions out of S_{N−1}, one leading to a successful rebuild and the other to data loss: S_{N−1} → S_N with rate μ(1 − h) and S_{N−1} → S_{N−2} with rate μh.

An analysis of RAID5 and RAID6 reliability, taking into account the effect of the IDR (intra-disk redundancy) scheme of Section 1, is given in Dholakia et al. [2008]. Strips or SUs are divided into segments of length ℓ, with n data sectors and m check sectors, so that ℓ = n + m. There are S bits per sector and the probability that a bit is in error is P_bit. The probability of an uncorrectable sector error is P_S = 1 − (1 − P_bit)^S. When no coding is applied (m = 0), the probability that a segment is in error is P_seg^none = 1 − (1 − P_S)^ℓ. There are n_d = C_d/(ℓS) segments on a disk of capacity C_d bits. The probability of an uncorrectable failure for a k-disk-failure-tolerant
array in critical mode with k failed disks is:

P_{\mathrm{uf}}^{(k)} = 1 - \left(1 - P_{\mathrm{seg}}\right)^{\frac{(N-k)C_d}{\ell S}},
where the exponent denotes the number of disk segments read in critical rebuild mode, that is, k = 1 for RAID5 and k = 2 for RAID6. The Single Parity (SP), Reed-Solomon (RS), and Interleaved Parity (IP) codes are analyzed in Dholakia et al. [2008]. The probability of a segment encountering an unrecoverable error is given as:
P_{\mathrm{seg}}^{\mathrm{SP}} = \left[1 + \frac{(\ell - 2)G_2 - 1}{B}\right] P_S, \qquad
P_{\mathrm{seg}}^{\mathrm{RS}} \approx P_{\mathrm{seg}}^{\mathrm{IP}} = \left[1 + \frac{(\ell - m - 1)G_{m+1} - \sum_{j=1}^{m} G_j}{B}\right] P_S.
The mean burst length is B = \sum_{j=1}^{\infty} j\, b_j, where b_j is the probability of a burst of length j, and G_m = \sum_{j=m}^{\infty} b_j. It is interesting to note that IP delivers the same probability of error as the more complex RS method. The probability that the critical mode in RAID5 ends because of another disk failure or an unrecoverable error is denoted by P_fhr. The probability of a successful rebuild is therefore 1 − P_fhr = (1 − P_fr)(1 − P_uf^{(1)}), where 1 − P_fr is the probability of no disk failure during rebuild and 1 − P_uf^{(1)} is the probability that no uncorrectable segment is encountered. The following new MTTDL, which takes into account uncorrectable segment errors, is derived in Dholakia et al. [2008]:

\mathrm{MTTDL} = \frac{(2N-1)\lambda + \mu}{N\lambda\left[(N-1)\lambda + \mu P_{\mathrm{uf}}^{(1)}\right]}.
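The following sketch ties the above quantities together numerically: it computes P_S, the segment error probabilities, P_uf^(1), and the resulting MTTDL. The burst-length distribution b_j (and hence B and G_m), as well as all numerical parameter values, are assumptions made only to keep the example self-contained; a truncated geometric burst-length model is used purely for illustration and is not taken from Dholakia et al. [2008].

```python
def seg_probs(ell: int, m: int, p_s: float, b: list):
    """Segment error probabilities without coding, with SP, and with IP,
    following the approximations given above. b[j-1] is the burst-length pmf b_j."""
    bbar = sum(j * bj for j, bj in enumerate(b, start=1))        # mean burst length B
    G = lambda k: sum(b[k - 1:])                                 # G_k = P(burst length >= k)
    p_none = 1.0 - (1.0 - p_s) ** ell
    p_sp = (1.0 + ((ell - 2) * G(2) - 1.0) / bbar) * p_s
    p_ip = (1.0 + ((ell - m - 1) * G(m + 1) - sum(G(j) for j in range(1, m + 1))) / bbar) * p_s
    return p_none, p_sp, p_ip

def mttdl_idr(n: int, mttf: float, mttr: float, p_uf1: float) -> float:
    """MTTDL of a RAID5 with N disks, including uncorrectable segment errors."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    return ((2 * n - 1) * lam + mu) / (n * lam * ((n - 1) * lam + mu * p_uf1))

# Illustrative parameters only: 512-byte sectors, segments of 128 sectors with 8 checks,
# bit error probability 1e-14, geometric burst lengths of mean 1.25, 300 GB disks.
S, ell, m, p_bit = 512 * 8, 128, 8, 1.0e-14
p_s = 1.0 - (1.0 - p_bit) ** S
r = 1.0 / 1.25
b = [r * (1.0 - r) ** (j - 1) for j in range(1, 40)]             # truncated geometric pmf
p_none, p_sp, p_ip = seg_probs(ell, m, p_s, b)
n, mttf, mttr, c_d = 8, 1.0e6, 10.0, 300.0e9 * 8                 # capacity c_d in bits
p_uf1 = 1.0 - (1.0 - p_ip) ** ((n - 1) * c_d / (ell * S))
print(p_none, p_sp, p_ip, p_uf1, mttdl_idr(n, mttf, mttr, p_uf1))
```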
Numerical results have shown that RAID5 plus IDR attains the same MTTDL as RAID6. While IDR incurs longer transfers, this overhead is negligible compared to the extra disk access required by RAID6 with respect to RAID5 [Thomasian et al. 2007b]. The reliability analysis in Dholakia et al. [2008] is extended in Iliadis et al. [2008] to incorporate the effect of disk scrubbing, where it is shown numerically that for realistic parameters IDR achieves an MTTDL very close to that attained when there are no sector failures, while disk scrubbing performs well for a very low rate of requests.

4.3 Shortcut Method for Reliability Analysis

The shortcut method utilizes the asymptotic expansions of reliability expressions presented in Thomasian [2006c]. Let r denote the reliability of each disk and ε = 1 − r its unreliability. For highly reliable disks ε ≪ 1. The reliability of n-way replicated drives is given by R_n = 1 − (1 − r)^n = 1 − ε^n. It follows that the reliability of RAID1 with two disks is R_2 = 1 − ε². The reliabilities of RAID5 and RAID6 disk arrays with N disks per array are given as:
R_{\mathrm{RAID5}} = r^N + N r^{N-1}(1-r) = N r^{N-1} - (N-1) r^N \approx 1 - \tfrac{1}{2} N(N-1)\varepsilon^2,

R_{\mathrm{RAID6}} = r^N + N r^{N-1}(1-r) + \binom{N}{2} r^{N-2}(1-r)^2 \approx 1 - \tfrac{1}{6} N(N-1)(N-2)\varepsilon^3.

Generally, the reliability of a RAID array tolerating n − 1 disk failures is given by:

R_{\mathrm{RAID}} \approx 1 - \sum_{j \ge n} a_j \varepsilon^j \approx 1 - a_n \varepsilon^n. \qquad (15)

Only the smallest power of ε, namely ε^n, should be retained, since ε ≪ 1. Consider two multilevel RAID systems as described in Baek et al. [2001]: (i) RAID1/5, or mirrored RAID5s; (ii) RAID5/1, which is a RAID5 with its logical disks mirrored. Showing the inequality R_{RAID5/1} > R_{RAID1/5} is not a simple task, but resorting to the asymptotic expansion method we have [Thomasian 2006c]:

R_{\mathrm{RAID1/5}} = 1 - \left[1 - (1-\varepsilon)^N - N\varepsilon(1-\varepsilon)^{N-1}\right]^2 \approx 1 - \tfrac{1}{4} N^2(N-1)^2 \varepsilon^4,

R_{\mathrm{RAID5/1}} = N(1-\varepsilon^2)^{N-1} - (N-1)(1-\varepsilon^2)^N \approx 1 - \tfrac{1}{2} N(N-1)\varepsilon^4.

It takes at least four disk failures for data loss to occur in both systems, but the number of such cases is \binom{N}{2}^2 for RAID1/5 and \binom{N}{2} for RAID5/1. It follows that R_{RAID5/1} > R_{RAID1/5} for ε ≪ 1, which is the reliability range of interest.
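A quick numerical check of the comparison above, evaluating the exact reliability expressions and their asymptotic approximations for RAID1/5 and RAID5/1; N and ε are arbitrary illustrative values.

```python
def r_raid15(n: int, eps: float) -> float:
    """Exact reliability of mirrored RAID5s (RAID1/5) with N disks per array."""
    f5 = 1.0 - (1.0 - eps) ** n - n * eps * (1.0 - eps) ** (n - 1)   # failure prob. of one RAID5
    return 1.0 - f5 ** 2

def r_raid51(n: int, eps: float) -> float:
    """Exact reliability of a RAID5 over N mirrored pairs (RAID5/1)."""
    return n * (1.0 - eps ** 2) ** (n - 1) - (n - 1) * (1.0 - eps ** 2) ** n

n, eps = 8, 1.0e-3
print(r_raid15(n, eps), 1.0 - 0.25 * n**2 * (n - 1)**2 * eps**4)   # exact vs. asymptotic
print(r_raid51(n, eps), 1.0 - 0.5 * n * (n - 1) * eps**4)
print(r_raid51(n, eps) > r_raid15(n, eps))                          # RAID5/1 is the more reliable
```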
5. RAID PERFORMANCE EVALUATION SURVEY

We give a brief overview of techniques applicable to the performance evaluation of disks and disk arrays. This is followed by a survey of analytical and simulation studies of disk arrays. Tools for simulating disks and disk arrays are discussed last. It should be noted that the building of robust performance models remains an area of current research. A recent study gives three reasons why models tend to be brittle [Thereska and Ganger 2008]: (i) models are usually built by model designers rather than system designers; (ii) the model does not reflect the fact that the system has bugs or is misconfigured; (iii) models themselves can have bugs. The IRONModel developed as part of this study attempts to identify the source of a discrepancy and to evolve the model to adjust to it. An analytic or simulation model that predicts performance superior to that of the actual system can be used to locate and fix system bugs, thus improving its performance.

5.1 Techniques for RAID Performance Evaluation

Performance evaluation studies of disks and disk arrays can be classified into three categories:
(i) Analytic models. These models are used in preliminary evaluations of new designs, comparisons of alternative designs, and parametric studies. Numerical results from high-level analytic models, as well as from the random-number-driven simulations discussed below, provide relative rather than absolute performance metrics. Analytical queueing network models (QNMs) have been used to evaluate the overall performance of computer systems, especially for the purpose of capacity planning [Lazowska et al. 1984]. QNMs estimate the system throughput and device utilizations more accurately than response times. QNM parameters measured by software monitors are in the form of mean service demands (or total service times) of jobs utilizing a device [Lazowska et al. 1984]. Hardware monitors can be used to measure device utilizations. There is a validation step to compare the mean response time obtained by analysis versus measurement, among other things. After a QNM is validated, that is, shown to be acceptably accurate, it can be used for performance prediction, for example, predicting the effect of a faster processor on performance. A model may be accurate only for a certain range of parameters, for example, for transaction arrival rates at which main memory, which is a passive resource, does not constitute a bottleneck.

Analytic solutions require favorable modeling assumptions for the sake of mathematical tractability. For example, it can be argued that the cumulative effect of disk requests generated by transactions running at a high degree of concurrency or multiprogramming level (MPL) at a computer system follows a Poisson arrival process with respect to the disk drives in the computer system. The delay encountered at each disk can then be modeled as an M/G/1 queueing system with Poisson arrivals and general service times, whose mean waiting time is W = \lambda \overline{x^2}/(2(1 - \rho)), where λ is the arrival rate of disk accesses, \overline{x^i} is the ith moment of the disk service time, and the utilization factor is \rho = \lambda \bar{x}. Product-form QNMs require miscellaneous simplifying assumptions, such as FCFS scheduling and exponential service times [Lazowska et al. 1984]. They can be analyzed rather efficiently using the convolution and mean value analysis algorithms [Lavenberg 1983]. Numerous approximate techniques have been developed to solve non-product-form QNMs [Lazowska et al. 1984].

(ii) Simulations. A discrete event simulator may utilize a synthetic workload obtained from a pseudo-random number generator or a trace [Traeger et al. 2008]. Traces are obtained by monitoring a computer system or its subsystems while running a user workload or a benchmark. Various aspects of simulation are dealt with in Chapters 5-8 of Lavenberg [1983]: (i) generating random numbers and the transformation from uniform to nonuniform distributions; (ii) statistical analysis of simulation results (output analysis); (iii) simulator design and programming; (iv) the RESQ simulation tool for extended QNMs. HyPerformix Workbench6 is based on the Performance Analyst's Workbench System (PAWS) tool developed at the University of Texas at Austin, which preceded RESQ.

6 http://www.hyperformix.com.
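For concreteness, a minimal helper for the M/G/1 mean waiting time quoted under (i) above; the service-time moments used here are arbitrary placeholders.

```python
def mg1_wait(lam: float, m1: float, m2: float) -> float:
    """Pollaczek-Khinchine mean waiting time: W = lam * E[X^2] / (2 * (1 - rho))."""
    rho = lam * m1
    if rho >= 1.0:
        raise ValueError("queue is unstable (rho >= 1)")
    return lam * m2 / (2.0 * (1.0 - rho))

# Placeholder disk service time: mean 10 ms, second moment 150 ms^2.
lam, m1, m2 = 0.05, 10.0, 150.0            # arrival rate per ms, ms, ms^2
print(mg1_wait(lam, m1, m2) + m1)          # mean response time = waiting time + service time
```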
General purpose simulation packages such as ARENA [Kelton et al. 2006] may be unsuitable, not only from the viewpoint of computational efficiency, but also because of the difficulty of specifying intricate queueing policies. The cost of developing a discrete event simulator in a high-level language, such as Java, can be reduced significantly by utilizing a package such as CSIM,7 since it provides many useful simulation primitives, for example, event scheduling, statistics gathering, etc. The network simulator (ns2) provides primitives to facilitate network simulation.8

Workload characterization, while a tedious task, should be a fundamental aspect of any performance study [Ferrari 1984]. Several synthetic workload generation schemes for modeling disks are proposed in Ganger [1995]: (i) Simple: requests are independent and uniformly distributed, the read/write ratio is given, and the access rate is constant. (ii) Nonuniform: the starting point for requests is generated according to the distribution of interrequest distances. (iii) Aggressive: in addition to (ii), uses a 2-state Markov chain model for read and write requests. (iv) Interleaved: uses multiple streams. As would be expected, simulation experiments show that the accuracy improves from (i) to (iv).

Self-similarity [Willinger et al. 1997] of the arrival process of disk workloads is investigated in several studies [Ganger 1995; Gribble et al. 1998]. It was observed and used in Gomez and Santonja [2000] for generating a synthetic workload. The ON/OFF process is based on periods of high and low activity. There is high variability in the length of the periods, which follow a heavy-tailed distribution: P(X > x) \sim \alpha x^{-\alpha}, x \to \infty, 1 < \alpha < 2. Less active processes are better represented by an M/G/\infty model, that is, requests arrive according to a Poisson process and service times follow a heavy-tailed distribution. A combination of the two is used in Gomez and Santonja [2000]. Some traces analyzed in Zabback et al. [1996] exhibited self-similar behavior, but for the sake of mathematical tractability it was assumed that the arrival process is Poisson with an average arrival rate computed over a relatively long interval. We have used I/O traces at the UMASS Trace Repository9 to show that postulating Poisson arrivals underestimates the mean response time significantly [Thomasian and Liu 2004]. It can be argued in the case of batch arrivals of size n with a low arrival rate for batches that there is no interference among requests from different batches. Although the mean arrival rate may be very small, the requests in a batch interfere with each other; for example, the ith request in each batch has to wait for the i − 1 requests ahead of it.

Even trace-driven simulation is unable to evaluate the performance of single disks very accurately with respect to measurement results [Ruemmler and Wilkes 1994; Worthington et al. 1994]. This is due to the difficulty of modeling the onboard disk cache, since disk drive manufacturers may not disclose the intricacies of the cache replacement and prefetching policies [Shriver et al. 1998; Barve et al. 1999].

7 http://www.mesquite.com.
8 http://www.isi.edu/nsnam/ns.
9 http://traces.cs.umass.edu
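As an illustration of the ON/OFF construction described above, a short sketch that draws heavy-tailed (Pareto) period lengths and Poisson arrivals during ON periods; the shape parameter, the rate, and the simple alternation of ON and OFF periods are assumptions made for illustration and do not reproduce the generator of Gomez and Santonja [2000].

```python
import random

def pareto_period(alpha: float, x_min: float = 1.0) -> float:
    """Heavy-tailed period length with P(X > x) ~ x**(-alpha), 1 < alpha < 2."""
    return x_min / (1.0 - random.random()) ** (1.0 / alpha)

def on_off_arrivals(n_periods: int, rate_on: float, alpha: float = 1.5) -> list:
    """Arrival times: Poisson arrivals during ON periods, none during OFF periods."""
    t, arrivals = 0.0, []
    for i in range(n_periods):
        length = pareto_period(alpha)
        if i % 2 == 0:                          # ON period
            s = t + random.expovariate(rate_on)
            while s < t + length:
                arrivals.append(s)
                s += random.expovariate(rate_on)
        t += length                             # OFF periods contribute no arrivals
    return arrivals

random.seed(1)
print(len(on_off_arrivals(1000, rate_on=2.0)))
```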
(iii) Benchmarks. TPC (Transaction Processing Performance Council) benchmarks were developed for database and web applications,10 which are affected by the performance of the I/O subsystem. The Storage Performance Council (SPC) has developed benchmarks for the performance evaluation of disk arrays.11 These and other benchmarks are reviewed in Traeger et al. [2008]. The SPC-1 benchmark represents an OLTP, database, and mail server environment. The ratio of read to write requests is 40% versus 60%, and the ratio of random to sequential requests is 61% versus 39%. SPC-2 emphasizes sequential processing: (i) simple sequential processing of large files; (ii) scans or joins of relational tables; (iii) video-on-demand (VOD) playback. It is worthwhile to note that SPC-1 uses the random walk model in McNutt [1994], while SPC-2 uses the multi-stream model proposed in Ganger [1995]. SPC-3 focuses on the performance aspects of "storage management, hierarchical storage management, content management, and information lifecycle management". The first benchmark in that suite deals with backup/restore.

10 http://www.tpc.org
11 http://www.storageperformance.org
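A toy sketch of drawing a request mix with the SPC-1 proportions quoted above (40% reads, 61% random accesses); it is not the SPC-1 workload generator, whose address and timing behavior (based on the random walk model of McNutt [1994]) is far more involved.

```python
import random

def draw_request():
    """Draw (operation, access pattern) with the SPC-1 proportions cited in the text."""
    op = "read" if random.random() < 0.40 else "write"
    pattern = "random" if random.random() < 0.61 else "sequential"
    return op, pattern

random.seed(0)
sample = [draw_request() for _ in range(100_000)]
reads = sum(op == "read" for op, _ in sample) / len(sample)
rnd = sum(p == "random" for _, p in sample) / len(sample)
print(f"reads: {reads:.1%}, random: {rnd:.1%}")
```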
5.2 Analytic Models of Disk Arrays

Analytic models of disks and disk arrays can be roughly classified based on the distribution of their service time, while the arrival process is usually postulated to be Poisson, with exponential interarrival times; this process has the memoryless property and is referred to as Markovian (denoted by M). A queueing system with a single server, Poisson arrivals, and exponential service times is denoted by M/M/1, while M/G/1 is used to represent general service times. M/M/1 and M/G/1 models are attractive in that they have closed-form solutions [Lavenberg 1983; Takagi 1991].

There have been several early queueing analyses of disk and drum systems [Coffman and Denning 1972], which are elaborated in Coffman and Hofri [1990]. Some of these studies are based on advanced queueing analysis. Markovian models have been used successfully in several studies to investigate the performance of RAID5 disk arrays in normal, degraded, and rebuild modes [Menon 1994]. This study utilizes the approximate formula in Nelson and Tantawi [1988] for the mean response time of the fork-join requests arising in degraded mode operation. It is inaccurate to represent disk service times with an exponential distribution, because of its memoryless property. The Markovian model also requires all requests to have the same mean service time at a node with FCFS scheduling. More accurate performance evaluation of RAID5 is possible with an M/G/1 model. The performance of RAID5 with parity striping [Gray et al. 1990] is
compared with that of a striped RAID5, using an M/G/1 model, in Chen and Towsley [1993]. The trade-off is between the unbalanced load (with parity striping) and the load increase caused by single logical requests spanning multiple disks and hence being processed as multiple physical requests. The choice of a small SU size, which requires multiple disk accesses to satisfy a single logical request, is incompatible with the rule of thumb that the SU size should be larger than prevalent logical accesses. A comparison of the performance of RAID5 with different RAID1 organizations is given in Chen and Towsley [1996] and Thomasian and Xu [2008]. The latter also considers operation in degraded mode.

An M/G/1 queueing model is used in Thomasian and Menon [1994, 1997] for the analysis of RAID5 disk arrays in normal, degraded, and rebuild modes with dedicated (resp. distributed) sparing. The analysis in Thomasian and Menon [1994] is extended in Thomasian et al. [2007a] to take into account disk zoning and to obtain a more accurate estimate of the mean residual service time of rebuild requests. The analysis of the PCM (permanent customer model) for rebuild, which is given in Merchant and Yu [1996], is based on Boxma and Cohen [1991] (see Section 3.7). The analysis of RAID5 disk arrays is extended to RAID6 in Thomasian et al. [2004, 2007b]. Disk utilizations for clustered RAID5 and RAID6 in normal and degraded modes are obtained in Thomasian [2005b]. This work can be easily extended to obtain disk response times with an M/G/1 model.

An analytic throughput model is reported in Uysal et al. [2001], which includes a cache model, a controller model, and a disk model. The model is validated against a disk array, showing a 15% prediction error on average. Mean Value Analysis (MVA) is applied to evaluate the performance of a mirrored disk system with the SCAN policy, which is subjected to a stream of requests from finite sources [Varki et al. 2004]. Measurement results show the analysis to be acceptably accurate. The analysis of the SCAN policy in this paper is improved in Thomasian and Liu [2005].

5.3 Simulation Studies of Disk Arrays

Analysis or a random-number-driven simulation is useful in preliminary studies of disks and disk arrays. More conclusive results can be obtained by using trace-driven simulations with multiple representative traces [Worthington et al. 1994; Thomasian and Liu 2004]. Random-number-driven simulation is used in evaluating the performance of parity sparing in Chandy and Narasimha Reddy [1993] and of several disk arrays in Alvarez et al. [1997]. Simulation was also used [Thomasian and Menon 1994; Thomasian and Menon 1997; Thomasian et al. 2004; Thomasian et al. 2007b] to validate the corresponding analytic results. The performance of rebuild processing with VSM and PCM was compared in Fu et al. [2004a]. Simulation studies of various options in rebuild processing and clustered RAID are given in Holland et al. [1994] and Fu et al. [2004b]. The effect of preempting rebuild requests after the completion of a seek, or even in the latency and transfer phase, was studied via simulation in Thomasian [1995].
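As a toy example of using random-number-driven simulation to validate an analytic result, the sketch below simulates a single FCFS disk with Poisson arrivals and compares the estimated mean response time with the M/G/1 formula of Section 5.1; the service-time distribution and the rates are placeholders, not taken from the cited studies.

```python
import random

def simulate_fcfs_disk(lam: float, service, n: int = 200_000, seed: int = 1) -> float:
    """Estimate the mean response time of an FCFS single-disk queue with Poisson arrivals."""
    random.seed(seed)
    t, free_at, total = 0.0, 0.0, 0.0
    for _ in range(n):
        t += random.expovariate(lam)        # next arrival time
        start = max(t, free_at)             # wait if the disk is still busy
        free_at = start + service()         # departure time
        total += free_at - t                # response time = waiting + service
    return total / n

# Placeholder service time: uniform between 5 and 15 ms (mean 10 ms, E[X^2] = 325/3 ms^2).
lam, service = 0.05, lambda: random.uniform(5.0, 15.0)
m1, m2 = 10.0, 325.0 / 3.0
analytic = m1 + lam * m2 / (2.0 * (1.0 - lam * m1))    # M/G/1 mean response time
print(simulate_fcfs_disk(lam, service), analytic)       # the two should be close
```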
It is difficult to evaluate the performance of disk arrays via trace-driven simulation, because the trace addresses should cover a large logical disk space. A simulation study investigating the performance of a heterogeneous RAID1 system, MEMS-based storage mirrored by a magnetic disk, is reported in Uysal et al. [2003]. Specialized simulation and analytic tools have been developed to evaluate the performance of MEMS-based storage [Griffin et al. 2000]. A simulation study of RAID5 and RAID6 with and without the IDR (intra-disk redundancy) method is reported in Dholakia et al. [2008]. An interesting aspect of this study is the comparison of the MTTDL in various systems. The MTTDL attained with disk scrubbing is compared with that of IDR in Iliadis et al. [2008], where it is shown that IDR outperforms disk scrubbing for an interesting range of parameters.

A performance study of disk array caches was reported in Smith [1985]. A comprehensive trace-driven simulation study of a cached RAID5 design was presented in Treiber and Menon [1995], while in Varma and Jacobson [1998] this technique was used to investigate the performance of a destaging policy. A hybrid method was used in Menon [1994] and Thomasian and Menon [1997], where an I/O trace was first analyzed to determine the locality of blocks from the destaging viewpoint. This information was then used to calibrate the analytic model, for example, with the number of blocks destaged on a track and their distances.

5.4 Tools for Simulating Disk Arrays

We first report on several tools for modeling disk drives. A detailed simulation study to evaluate the performance of the HP 97560 disk drive was reported in Ruemmler and Wilkes [1994]. The probability distribution of the I/O time obtained via simulation and via measurement is plotted for four different simulation models of the disk drive. The demerit figure is defined as the root mean square of the distance between the two curves. The four models considered and their demerit figures (given in parentheses) are as follows: (i) setting the I/O time to its mean value (35%); (ii) transfer time proportional to the I/O size, seek time linear in the distance, uniform rotational latency (15%); (iii) measured seek time profile and head switch time (6.2%); (iv) rotational position modeling and detailed disk layout (2.6%). A less detailed simulator for the same disk drive reported a respectable demerit figure of 3.9% in Kotz et al. [1999]. The Pantheon simulator developed at HP Labs [Wilkes 1996] was used in the evaluation of AutoRAID [Wilkes et al. 1996]. RAIDframe is a simulation, as well as a rapid prototyping, tool for RAID disk arrays [Courtright et al. 1996]. It has been superseded by the DiskSim simulation package [Bucy et al. 2008]. This package is an extension of the concept of using system-level models to evaluate I/O subsystem designs [Ganger and Patt 1998]. The Disk Array Simulator (DASim) [Thomasian et al. 2004] was developed to simulate clustered and unclustered RAID5 and RAID6 in various operating modes, including rebuild.

A RAID configuration tool (RAIDtool) is described in Zabback et al. [1996]. A three-step methodology is followed: (i) collect I/O traces; (ii) analyze the traces to characterize the workload; (iii) run the simulator with this workload to optimize parameters, such as the SU size. Two reasons are given for not using trace-driven simulation: (i) it is slower than random-number-driven simulation and hence unsuitable for interactive use; (ii) it is difficult to vary parameters, for example, to increase the arrival rate of requests.
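A minimal sketch of one way to compute a demerit figure as defined above, as the root mean square difference between measured and simulated response-time distribution curves evaluated at the same points, expressed relative to the mean measured I/O time; the precise operationalization in Ruemmler and Wilkes [1994] may differ, and the sample values are placeholders.

```python
import math

def demerit(measured, simulated):
    """Root mean square distance between two curves evaluated at the same points."""
    assert len(measured) == len(simulated)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(measured, simulated)) / len(measured))

# Placeholder curves: I/O times (ms) at the same set of distribution percentiles.
measured  = [4.1, 6.3, 9.8, 14.2, 22.5]
simulated = [4.0, 6.6, 10.4, 13.8, 23.9]
mean_io = sum(measured) / len(measured)
print(f"demerit = {100.0 * demerit(measured, simulated) / mean_io:.1f}% of the mean I/O time")
```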
6. CONCLUSIONS

The reliability and performance of storage systems constitute an important aspect of overall computer system reliability and performance. Decreasing disk costs and increasing disk capacities make disk arrays providing very large disk spaces affordable. The very large number of disks in computer installations for web search and e-commerce makes disk backup time consuming. This is an incentive to provide highly reliable storage systems, which require minimal manual intervention. RAID6 disk arrays, which were introduced a dozen years ago, are gaining popularity because of their higher reliability compared to RAID5. Multilevel RAID arrays are a more recent development, which can cope with disk drive as well as array controller failures.

Rather than concentrating solely on RAID6 disk arrays, we start the discussion with RAID5 disk arrays, since most techniques used in RAID5 are also applicable to RAID6. In fact, RAID6 was introduced as a means to cope with correlated disk failures and LSFs (latent sector failures) resulting in unsuccessful rebuilds in RAID5. Methods developed for dealing with the small write penalty in RAID5 are applicable to RAID6. Parity declustering, the reconstruct write method, and the LSA (log-structured array) paradigm are applicable to both RAID levels.

We describe alternative methods for rebuild processing in RAID5, which is a critical stage in RAID5 operation, due to the types of additional requests that need to be processed: (i) extra processing to deal with missing data blocks; (ii) rebuild requests. There is a significant increase in disk load; for example, the disk load due to read requests is doubled. The clustered RAID paradigm can be used to maintain the load increase at an acceptably low level, at the cost of an increased level of redundancy. The VSM (vacationing server model) processes rebuild requests at a lower priority than external requests, so it does not affect disk utilization, but it adversely affects the response times of external requests.

There has been recent renewed interest in quantifying disk failures, including LSFs (latent sector failures). Disk scrubbing and IDR (intra-disk redundancy) are two methods to increase RAID reliability. Parametric studies with realistic parameters have shown that IDR outperforms scrubbing, which is intuitively appealing since IDR is enabled all the time. In addition, improved versions of IDR combined with RAID5 result in an MTTDL approaching that of RAID6, while obviating the need for updating two parities.

In Thomasian et al. [2004] and Thomasian et al. [2007b] a cost model for RAID is reported, which can be used to estimate the maximum throughput for given disk characteristics. The disk response time can be estimated analytically under favorable modeling assumptions, such as Poisson arrivals. Estimating the cache miss rate, which determines the rate of read requests, and the effect of NVS (non-volatile storage) on destage processing can best be handled by trace-driven simulation.
The RS code, which is the standard technique to attain error correction for two disk failures in RAID6, requires specialized hardware. Parity-based techniques to protect against two disk failures are described in [Blaum 1987; Blaum et al. 1995; Blaum and Roth 1993; Blaum and Roth 1999; Goodman and Sayano 1990; Goodman and Sayano 1993; Corbett et al. 2004]. The RM2 data layout protects each data block with two parity blocks, but it does not always attain the minimum degree of redundancy. The X-code, which also uses two parities, incurs the minimum level of redundancy, but is restricted to a prime number of disks. Some of the coding techniques described in this paper can be extended to disk arrays tolerating more than two disk failures [Blaum et al. 2002; Blaum and Roth 1993].

APPENDIX
Acronyms and Abbreviations
ACF - AutoCorrelation Function
AFR - Annualized Failure Rate
ARR - Annualized Return Rate
BIBD - Balanced Incomplete Block Designs
BM - Basic Mirroring
CFDR - Computer Failure Data Repository
DAC - Disk Array Controller
DBMS - Database Management System
DRAM - Dynamic Random Access Memory
DSA - Disk Scanning Algorithm
EODL - End of Designed Life
FCFS - First-Come, First-Served
HDA - Heterogeneous Disk Array
HDD - Hard Disk Drive
IDR - Intra-Disk Redundancy
IP - Interleaved Parity (in conjunction with IDR)
kDFT - k Disk Failure Tolerant
LBA - Logical Block Address
LCM - Least Common Multiple
LFS - Log-Structured File System
LSA - Log-Structured Array
MAID - Massive Array of (mainly) Idle Disks
MEMS - Micro-ElectroMechanical System
MDS - Maximum Distance Separable
MTTDL - Mean Time to Data Loss
MTTF - Mean Time To Failure
MTTR - Mean Time To Repair
LRD - Long Range Dependency
NRP - Nearly Random Permutation
ns2 - network simulator 2
NVRAM - Non-Volatile Random Access Memory
NVS - Non-Volatile Storage
OLTP - OnLine Transaction Processing
PCM - Permanent Customer Model
PDDL - Permutation Development Data Layout
RAID - Redundant Array of Independent/Inexpensive Disks
RCW - Reconstruct Write
RDP - Row Diagonal Parity
RM2 - Redundancy Matrix
RMW - Read-Modify-Write
RS - Reed-Solomon
RU - Rebuild Unit
SATA - Serial Advanced Technology Attachment
SATF - Shortest Access Time First
SCSI - Small Computer System Interface
SMART - Self-Monitoring Analysis and Reporting Technology
SP - Single Parity (in conjunction with IDR)
SPC - Storage Performance Council
SU - Stripe Unit (or strip)
UPS - Uninterruptible Power Supply
VA - Virtual Array
VSM - Vacationing Server Model
XOR - Exclusive-OR
ZBR - Zoned Bit Recording
ZLR - Zero Latency Read

REFERENCES

ALVAREZ, G. A., BURKHARD, W. A., AND CRISTIAN, F. 1997. Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA'97). 62–72.
ALVAREZ, G. A., BURKHARD, W. A., STOCKMEYER, L. J., AND CRISTIAN, F. 1998. Declustered disk array architectures with optimal and near optimal parallelism. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA'98). 109–120.
ALVAREZ, G. A., BOROWSKY, E., GO, S., ROMER, T. H., BECKER-SZENDY, R., GOLDING, R., MERCHANT, A., SPASOJEVIC, M., VEITCH, A., AND WILKES, J. 2001. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Trans. Comput. Syst. 19, 4, 483–518.
ANDERSON, D., DYKES, J., AND RIEDEL, E. 2003. More than an interface—SCSI vs ATA. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST'03). 245–257.
ANDERSON, E., KALLHALLA, M., SPENCE, S., SWAMINATHAN, R., AND WANG, Q. 2005. Quickly finding near-optimal storage system designs. ACM Trans. Comput. Syst. 23, 4, 337–374.
BACHMAT, E. AND SCHINDLER, J. 2002. Analysis of methods for scheduling low priority disk drive tasks. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 55–65.
BAEK, S. H., KIM, B. W., JEUNG, E., AND PARK, C. W. 2001. Reliability and performance of hierarchical RAID with multiple controllers. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC'01). 246–254.
BAIRAVASUNDARAM, L. N., GOODSON, G. R., PASUPATHY, S., AND SCHINDLER, J. 2007. An analysis of latent sector errors in disk drives. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 289–300.
BAIRAVASUNDARAM, L. N., GOODSON, G. R., SCHROEDER, B., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. 2008. An analysis of data corruption in the storage stack. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST'08). 223–238.
BALAFOUTIS. E., PANAGAKIS, A., TRIANTAFILLIOU, P. NERJES, G., MUTH, P., AND WEIKUM, G. 2003. Clustered scheduling algorithms for mixed media disk workloads in a multimedia server. Cluster Comput. 6, 1, 75–86. BARVE, R., SHRIVER, E. A. M., GIBBONS, P., HILLYER, B. K., MATIAS, Y., AND VITTER, J. S. 1998. Modeling and optimizing I/O throughput of multiple disks on a bus. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 83–92. BAYLOR, S., CORBETT, P. F., AND PARK, C. 1999. Efficient method for providing fault tolerance against double device failures in multiple device systems. US Patent 5,862,158. BLAUM, M. 1987. A class of byte-correcting array codes. Res. Rep. RJ 5652 (57151). IBM Almaden Research Center, San Jose, CA. BLAUM, M. AND ROTH, R. M. 1993. New array codes for multiple phased burst correction. IEEE Trans. Inform. Theory 39, 1, 66–77. BLAUM, M. AND OUCHI, K. 1994. Method and means for B-adjacent coding and rebuilding data from up to two unavailable DASDs in a DASD array. U.S. Patent 5,333,143. BLAUM, M., BRADY, J., BRUCK, J., AND MENON, J. 1995. EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Comput. 44, 2, 192–202. BLAUM, M., BRUCK, J., AND VARDY, A. 1996. MDS array codes with independent parity symbols. IEEE Trans. Inform. Theory 42, 2, 529–542. BLAUM, M., FARRELL, P. G., AND VAN TILBORG, H. C. A. 1998. Array codes. In Handbook of Coding Theory, V. S. Pless and W. C. Huffman, Eds., Elsevier Science, Amsterdam, The Netherlands, Chapter 22, 1855–1909. BLAUM, M. AND ROTH, R. M. 1999. On lowest-density MDS codes. IEEE Trans. Inform. Theory 45, 1, 46–59. BLAUM, M., BRADY, J., BRUCK, J., MENON, J., AND VARDY, A. 2002. The EVENODD Code and Its Generalizations. In High Performance Mass Storage and Parallel I/O: Technologies and Applications, H. Jin. T. Cortes, and R. Buyya, Eds., Wiley, New York, NY, 187–205. BLAUM, M. 2005. An Introduction to Error-Correcting Codes. In Coding and Signal Processing for Magnetic Recording Systems. B. Vasic and E. M. Kurtas, Eds., Chapter 9, CRC Press, Orlando, FL. BLAUM, M. 2006. A family of MDS array codes with a minimal number of encoding operations. In Proceedings of the International Symposium on Information Theory (ISIT’06). 2784–2788. BLUM, A., GOYAL, A., HEIDELBERGER, P., LAVENBERG, S. S., NAKAYAMA, M., AND SHAHABUDDIN, P. 1994. Modeling and analysis of system dependability using the system availability estimator. In Proceedings of the 24th IEEE Annual International Symposium on Fault-Tolerant Computing Systems (FTCS-24), 137–141. BOROWSKY, E., GOLDING, R., MERCHANT, A., SHRIVER, E., SPASOJEVIC, M., AND WILKES, J. 1997. Using attribute-managed storage to achieve QoS. In Proceedings of the 5th International Conference on Workshop on Quality of Service. 203–206. BOXMA, O. J. AND COHEN, J. W. 1991. The M/G/1 queue with permanent customers. IEEE J. Select. Areas Comm. 9, 2, 179–184. BUCY, J. S., SCHINDLER, J., SCHLOSSER, S. W., GANGER, G. R., AND CONTRIBUTORS. 2008. The DiskSim simulation environment version 4.0 reference manual. Tech. rep. CMU-PDL-08-101. BUTTERWORTH, H. E. 1999. The design of segment filling and selection algorithms for efficient free-space collection in a log-structured array. IBM Hursley, UK, unpublished manuscript. CARLEY, L. R., GANGER, G. R., AND NAGLE, D. F. 2000. MEMS-based integrated-circuit mass-storage systems. Comm. ACM 43, 11, 72–80. CHANDY, J. AND NARASIMHA REDDY, A. L. 1993. Failure evaluation of disk array organizations. 
In Proceedings of the 13th International Conference on Distributed Computing Systems (ICDCS’93), 319–326. CHEN, P. M., LEE, E. K., GIBSON, G. A., KATZ, R. H., AND PATTERSON, D. A. 1994. RAID: Highperformance, reliable secondary storage. ACM Comput. Surv. 26, 2, 145–185. CHEN, S.-Z. AND TOWSLEY, D. F. 1993. The design and evaluation of RAID 5 and parity striping disk array architectures. J. Parall. Distrib. Comput. 10, 1/2, 41–57. CHEN, S.-Z. AND TOWSLEY, D. F. 1996. A performance evaluation of RAID architectures. IEEE Trans. Comput. 45, 10, 1116–1130. COFFMAN, JR. E. G. AND DENNING, P. 1972. Operating Systems Principles. Prentice-Hall, 1972. ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
COFFMAN, JR. E. G. AND HOFRI, M. 1990. Queueing models of secondary storage devices. In Stochastic Analysis of Computer and Communication Systems. H. Takagi, Ed., Elsevier Science, Amsterdam, The Netherlands, 549–588. CORBETT, P. F., ENGLISH, B., GOEL, A., GRCANAC, T., KLEIMAN, S., LEONG, J., AND SANKAR, S. 2004. Rowdiagonal parity for double disk failure correction. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST’04). COURTRIGHT II, W. V., HOLLAND, M. C., GIBSON, G. A., REILLY, L. N., AND ZELENKA, J. 1996. RAIDframe: A rapid prototyping tool for RAID systems. Parallel Data Laboratory, CMU. http://www.pdl.cmu.edu/RAIDframe. DENNING, P. J. 1967. Effects of scheduling in file memory operations. In Proceedings of the AFIPS Spring Joint Computer Conference on (SJCC). DHOLAKIA, A., ELEFTHERIOU, E., HOU, X.-Y., ILIADIS, I., MENON, J., AND RAO, K. K. 2008. Analysis of a new intra-disk redundancy scheme for high reliability RAID storage systems in the presence of unrecoverable errors. ACM Trans. Storage 4, 1. DURSTENFELD, R. 1964. Algorithm 235: Random permutation. Comm. ACM 7, 7, 420. ELERATH, J. H. 2007. Reliability model and assessment of RAID incorporating latent defects and non-homogeneous Poisson process events. Tech. rep. Mechanical Engineering Department, University of Maryland. ELERATH, J. G. AND PECHT, M. 2007. Enhanced reliability modeling of RAID storage systems. Proceedings of the 37th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), 175–184. FENG, G.-L., DENG, R. H., BAO, F., AND SHEN, J.-C. 2005a. New efficient MDS array codes for RAID. Part I: Reed-Solomon-like codes for tolerating three disk failures. IEEE Trans. Comput. 54, 9, 1071–1080. FENG, G.-L., DENG, R. H., BAO, F., AND SHEN J.-C. 2005b. New efficient MDS array codes for RAID. Part II: Rabin-like codes for tolerating multiple (greater than or equal to 4) disk failures. IEEE Trans. Comput. 54, 12, 1473–1483. FERRARI, D. 1984. On the foundations of artificial workload design. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. FLEINER, C., GARNER, R. B., HAFNER, J. L., RAO, K. K., HOSEKOTE, D. R. K., WILCKE, W., AND GOLDER, J. S. 2006. Reliability of modular mesh-connected intelligent storage brick systems. IBM J. Res. and Devel. 50, 2/3, 199-208. FRANASZEK, P. A., ROBINSON, J. T., AND THOMASIAN, A. 1996. RAID level 5 with free blocks/parity cache. US Patent 5,522,032. FRANASZEK, P. A. AND ROBINSON, J. T. 1997. On variable scope of parity protection in disk arrays. IEEE Trans. Comput. 46, 2, 234–240. FREITAS, R. F. AND WILCKE, W. W. 2008. The next storage system technology. IBM J. Res. Devel. 52, 4–5, 439–448. FRIEDMAN, M B. 1995. The performance and tuning of a StorageTek Iceberg RAID6 disk subsystem. Trans. Comput. Measure. Group. 77–88. FU, G., THOMASIAN, A., HAN, C., AND NG, S. W. 2004. Rebuild strategies for redundant disk arrays. In Proceedings of the 12th NASA Goddard, 21st IEEE Conference on Mass Storage and Technologies (MSST’04). FU, G., THOMASIAN, A., HAN, C., AND NG, S. W. 2004b. Rebuild strategies for clustered RAID. In Proceedings of the International Symposium on Performance Evaluation Computer and Telecommunication Systems (SPECTS’04). 598–607. FUJA, T., HEEGARD, C., AND BLAUM, M. 1989. Cross parity check convolutional codes. IEEE Trans. Inform. Theory 35, 6, 1264–1276. FUJITA, H. 2006. Modified low-density MDS array codes. In Proceedings of the International Symposium on Information Theory (ISIT’06). 
2789–2793. GANGER. G. 1995. Generating synthetic workloads. In Proceedings of the 21st Computer Measurement Group, 1263–1269. GANGER, G. R. AND PATT, Y. N. 1998. Using system-level models to evaluate I/O subsystem designs. IEEE Trans. Comput. 47, 6, 667–678. GIBSON, G. A. 1992. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. MIT Press, Cambridge, MA. ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
GIBSON, G. A. AND PATTERSON, D. A. 1993. Designing disk arrays for high reliability. J. Parall. Distrib. Comput. 17, 1–2, 4–27. GOLDING, R., SHRIVER, E., SULLIVAN, T., AND WILKES, J. 1995. Attribute-managed storage. In Proceedings of the Workshop on Modeling and Specification of I/O. GOMEZ, M. E. AND SANTONJA, V. 2000. A new approach in the modeling and generation of synthetic disk workload. In Proceedings of the 8th Annual Meeting of the IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’00). GOODMAN, R. AND SAYANO, M. 1990. Size limits on phased burst error correcting array codes. Electron. Lett. 26, 55–56. GOODMAN, R., MCELIECE, R. J., AND SAYANO, M. 1993. Phased burst error correcting codes. IEEE Trans. Inform. Theory 39, 2, 684–693. GRAY, J, HORST, B., AND WALKER, M. 1990. Parity striping of disk arrays: Low-cost reliable storage with acceptable throughput. In Proceedings of the 16th International Conference on Very Large Data Bases. 148–159. GRAY, J. AND SHENOY, P. J. 2000. Rules of thumb in data engineering. In Proceedings of the 16th Annual IEEE International Conference on Data Engineering (ICDE’00). 3–12. GRAY, J. 2002. Storage bricks have arrived (Keynote Speech), First USENIX Conference on File and Storage Technologies (FAST’02), 56–65. GRIBBLE, S. D., MANKU, G. S., ROSELLI, D. S., BREWER, E. A., GIBSON, T. J., AND MILLER, E. L. 1998. Self-similarity in file systems. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 141–150. GRIFFIN, J. L., SCHLOSSER, S. W., GANGER, G. R., AND NAGLE, D. 2000. Modeling and performance of MEMS-based storage devices. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. HAFNER, J. L. 2005. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST’05). HAFNER, J. L., DEENADHAYALAN, V. W., RAO, K. K., AND TOMLIN, J. A. 2005. Matrix methods for lost data reconstruction in erasure codes. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST’05). HAFNER, J. L. 2006. HoVer erasure codes for disk arrays. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’06). 217–226. HAFNER, J. L., DEENADHAYALAN, V., BELLUOMINI, W. AND K. RAO. 2008. Undetected disk errors in RAID arrays. IBM J. Res. Develop. 52, 4/5. HALL, M. 1986. Combinatorial Theory, Second Edition, Wiley-Interscience, New York, NY. HELLERSTEIN, L., GIBSON, G. A., KARP, R. M., AND KATZ, R. H. 1994. Coding techniques for handling failures in large disk arrays. Algorithmica 12, 2/3, 182–208. HENNESSY, J. L. AND PATTERSON, D. A. 2006. Computer Architecture: A Quantitative Approach: 4th Ed. Morgan-Kaufman Publishers, San Mateo, CA. HILL, E. A. 1994. System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices, U.S. Patent 5345584. HITZ, D., LAU, J., AND MALCOLM, M. 1994. File system design for an NFS file server appliance. In Proceedings of the USENIX Conference, 235–246. HOLLAND, M. C., GIBSON, G. A. AND SIEWIOREK, D. P. 1994. Architectures and algorithms for on-line failure recovery in redundant disk arrays. J. Distrib. Parall. Datab. 11, 3 295–335. HOLLAND, M. C. 1994. On-line data reconstruction in redundant disk arrays. Ph.D. 
Thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA. HOU, R. Y., MENON, J., AND PATT, Y. N. 1993. Balancing I/O response time and disk rebuild time in a RAID5 disk array. In Proceedings of the 26th Hawaii International Conference on System Sciences (HICSS 26), Vol. I, 70–79. HSU, W. W. AND SMITH, A. J. 2003. Characteristics of I/O traffic in personal computer and server workloads. IBM Syst. J. 42, 2, 347–372. HSU, W. W. AND SMITH, A. J. 2004. The performance impact of I/O optimizations and disk improvements. IBM J. Res. Develop. 48, 2, 255-269. HSU, W. W., SMITH, A. J., AND YOUNG, H. C. 2005. The automatic improvement of locality in storage systems. ACM Trans. Comput. Syst. 23, 4, 424–473. ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
HUANG, C. AND XU. L. 2008. STAR: An efficient coding scheme for correcting triple storage node failures. IEEE Trans. Comput. 57, 7, 899–901. ILIADIS, I., HAAS, R. HU, X.-Y., AND ELEFTHERIOU, E. 2008. Disk scrubbing versus intra-disk redundancy for high-reliability RAID storage systems. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 241–252. JACOB, B., NG, S. W., AND WANG, D. T. 2008. Memory Systems: Cache, DRAM, and Disk. Morgan Kaufmann Publishers. JI, M., VEITCH, A. C., WILKES, J. 2003. Seneca: Remote mirroring done write. In Proceedings of the USENIX Annual Technical Conference. 253–268. KARI, H. H. 1997. Latent sector faults and reliability of disk arrays. Doctor of Technology Thesis, University of Technology, Espoo, Finland. http://www.tcs.hut.fi/~hhk/. KELTON, W. D., SADOWKSI, R. P., AND STURROK, D. E. 2006. Simulation with Arena, 4th Ed., McGrawHill, New York, NY. KENYON, C. 1996. Best-fit bin-packing with random order. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA). 359–364. KLEINROCK, L. 1975. Queueing Systems, Vol. I: Theory. Wiley, New York, NY. KOTZ, D. F., ROH, S. B., AND RADHAKRISHNAN, S. 1999. A detailed simulation model of the HP 97560 disk drive. Dartmouth College, Hanover, NH. http://www.cs.dartmouth.edu/~dfk/diskmodel. LAVENBERG, S. S. 1983. Computer Performance Modeling Handbook. Academic Press, New York, NY. LAZOWSKA, E. D., ZAHORJAN, J., GRAHAM. G. S., AND SEVCIK, K. C. 1984. Quantitative Systems Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, Upper Saddle River, NJ. http://www.cs.washington.edu/homes/lazowska/qsp/. LEE, E. K. AND KATZ, R. H. 1993. The performance of parity placements in disk arrays. IEEE Trans. Comput. 42, 6, 651–664. LU, C., ALVAREZ, G. A., AND WILKES, J. 2002. Aqueduct: Online data migration with performance guarantees. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 219–230. LUMB, C. R., SCHINDLER, J., AND GANGER, G. R. 2002. Freeblock scheduling outside of disk firmware. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 275– 288. LUMB, C. R., MERCHANT, A., AND ALVAREZ, G. A. 2003. Facade: Virtual storage device with performance guarantees. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST’03). MACWILLIAMS, F. J. AND SLOANE, N. J. A. 1977. The Theory of Error-Correcting Codes. North Holland, Amsterdam, The Netherlands. MALHOTRA, M. AND TRIVEDI, K. S. 1993. Reliability analysis of redundant arrays of inexpensive disks. J. Paral. Distrib. Comput. 17, 1/2, 146–151. MATHEWS, J., TRIKA, S., HENSGEN, D., COULSON, R. AND GRIMSRUD, K. 2008. Intel Turbo Memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems. ACM Trans. Storage 4, 2. MCKUSICK, M. K., JOY, W. N., LEFFLER, S. J., AND FABRY, R. S. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3, 181–197. MCNUTT, B. 1994. Background data movement in a log-structured disk subsystem. IBM J. Res. Develop. 38, 1, 47–58. MCNUTT, B. 2000. The Fractal Structure of Data Reference: Applications to the Memory Hierarchy. Kluwer Academic Publishers, Norwell, MA. MENON, J. AND MATTSON, D. 1992. Distributed sparing in disk arrays. In Proceedings of the 37th Annual IEEE Computer Society Conference (COMPCON’92). 410–421. MENON, J., ROCHE, J., AND KASSON, J. 1993. Floating parity and data disk arrays. J. Parall. Distrib. Comput. 17, 1/2, 129–139. MENON, J. 
AND CORTNEY, J. 1993. The architecture of a fault-tolerant cached RAID controller. In Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA’93). 76–86. MENON, J. 1994. Performance of RAID5 disk arrays with read and write caching. J. Distrib. Parall. Datab. 11, 3, 261–293. ACM Transactions on Storage, Vol. 5, No. 3, Article 7, Publication date: November 2009.
MENON, J. 1995. A performance comparison of RAID5 and log-structured arrays. In Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing (HPDC’95). 167–178.
MENON, J. AND STOCKMEYER, L. 1998. An age threshold algorithm for garbage collection in log-structured arrays and file systems. IBM Research Report RJ 10120, Almaden Research Center. 119–132.
MERCHANT, A. AND YU, P. S. 1996. Analytic modeling of clustered RAID with mapping based on nearly random permutation. IEEE Trans. Comput. 45, 3, 367–373.
MOGI, K. AND KITSUREGAWA, M. 1996. Hot mirroring: A study to hide parity update penalty and degradations during rebuilds for RAID5. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 183–194.
MUNTZ, R. R. AND LUI, J. C. S. 1990. Performance analysis of disk arrays under failure. In Proceedings of the 16th International Conference on Very Large Data Bases (VLDB). 162–173.
NELSON, R. AND TANTAWI, A. 1988. Approximate analysis of fork-join synchronization in parallel queues. IEEE Trans. Comput. 37, 6, 736–743.
NEWBERG, L. AND WOLF, D. 1994. String layouts for a redundant array of inexpensive disks. Algorithmica 12, 2/3, 209–224.
NG, S. W. 1994a. Crosshatch disk array for improved reliability and performance. In Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA’94). 255–264.
NG, S. W. 1994b. Sparing for redundant disk arrays. Distrib. Paral. Datab. 2, 2, 133–149.
NG, S. W. AND MATTSON, R. L. 1994. Uniform parity distribution in disk arrays with multiple failures. IEEE Trans. Comput. 43, 4, 501–506.
NG, S. W. 1998. Advances in disk technology: Performance issues. IEEE Comput. 40, 1, 75–81.
NG, Y. W. AND AVIZIENIS, A. 1980. A unified reliability model for fault-tolerant computers. IEEE Trans. Comput. 29, 1, 1002–1011.
NICOLA, V. F., SHAHABUDDIN, P., HEIDELBERGER, P., AND GLYNN, P. W. 1993. Fast simulation of steady-state availability in non-Markovian highly dependable systems. In Proceedings of the 23rd Annual International Symposium on Fault Tolerant Computing (FTCS-23). 38–47.
PARK, C.-I. 1995. Efficient placement of parity and data to tolerate two disk failures in disk array systems. IEEE Trans. Parall. Distrib. Syst. 6, 11, 1177–1184.
PATEL, A. M. 1985. Adaptive cross parity code for a high density magnetic tape subsystem. IBM J. Resear. Develop. 29, 5, 546–562.
PATTERSON, D. A., GIBSON, G. A., AND KATZ, R. 1988. A case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data. 109–116.
PINHEIRO, E., WEBER, W. D., AND BARROSO, L. A. 2007. Failure trends in a large disk drive population. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’07).
PLANK, J. S. 1997. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Softw. Pract. Exper. 27, 9, 995–1012.
PLANK, J. S. AND DING, Y. 2005. Note: Correction to the 1997 tutorial on Reed-Solomon coding. Softw. Pract. Exper. 35, 2, 178–194.
PLANK, J. S. 2005. Erasure codes for storage applications (tutorial). In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST’05).
PLANK, J. S. AND XU, L. 2006. Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications. In Proceedings of the 5th IEEE International Symposium on Network Computing and Applications (NCA06).
PLANK, J. S. AND THOMASON, M. G. 2007. An exploration of non-asymptotic low-density parity-check erasure codes for wide-area storage applications. Paral. Process. Lett. 17, 103–123.
PLANK, J. S. 2008a. The RAID-6 liberation codes. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08).
PLANK, J. S. 2008b. A new minimum density RAID-6 code with a word size of eight. In Proceedings of the 7th IEEE International Symposium on Network Computing and Applications (NCA-08).
PLANK, J. S., LUO, J., SCHUMAN, C. D., XU, L., AND WILCOX-O’HEARN, C. 2009. A performance evaluation and examination of open-source erasure coding libraries for storage. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST’09).
PRUSINKIEWICZ, P. AND BUDKOWSKI, S. 1976. A double-track error-correction code for magnetic tape. IEEE Trans. Comput. 25, 6, 642–645.
RAMAKRISHNAN, K. K., BISWAS, P., AND KAREDLA, R. 1992. Analysis of file I/O traces in commercial computing environments. In Proceedings of the Joint ACM SIGMETRICS/Performance’92 Conference on Measurement and Modeling of Computer Systems. 78–90.
RAMAKRISHNAN, R. AND GEHRKE, J. 2003. Database Management Systems, 3rd Ed. McGraw-Hill, New York, NY.
RAO, K. K., HAFNER, J. L., AND GOLDING, R. A. 2006. Reliability for networked storage nodes. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’06). 237–248.
ROSENBLUM, M. AND OUSTERHOUT, J. K. 1992. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1, 26–52.
RUEMMLER, C. AND WILKES, J. 1994. An introduction to disk drive modeling. IEEE Comput. 27, 3, 17–28.
SAHNER, R. A., TRIVEDI, K. S., AND PULIAFITO, A. 1996. Performance and Reliability Analysis of Computer Systems. Kluwer Academic Publishers, Norwell, MA.
SCHEUERMANN, P., WEIKUM, G., AND ZABBACK, P. 1994. “Disk cooling” in parallel disk systems. Data Engin. Bull. 17, 3, 29–40.
SCHINDLER, J., GRIFFIN, J. L., LUMB, C. R., AND GANGER, G. R. 2002. Track-aligned extents: Matching access patterns to disk drive characteristics. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). 259–274.
SCHLOSSER, S. W., PAPADIMANOLAKIS, S., SHAO, M., SCHINDLER, J., AILAMAKI, A., FALOUTSOS, C., AND GANGER, G. R. 2005. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST’05).
SCHROEDER, B. AND GIBSON, G. A. 2007. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage 3, 3, Article No. 8.
SCHWARZ, T. J. E. 1994. Reliability and performance of disk arrays. Ph.D. Thesis, University of California, San Diego, CA.
SCHWARZ, T. J. E., STEINBERG, J., AND BURKHARD, W. A. 1999. Permutation development data layout (PDDL) disk array declustering. In Proceedings of the 5th IEEE Symposium on High Performance Computer Architecture (HPCA). 214–217.
SCHWARZ, T. J. E., XIN, Q., MILLER, E. L., LONG, D. D. E., HOSPODOR, A., AND NG, S. W. 2004. Disk scrubbing in large archival storage systems. In Proceedings of the 12th IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’04). 409–418.
SELTZER, M. I., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. 1993. An implementation of a log-structured file system for UNIX. In Proceedings of the USENIX Winter Technical Conference. 307–326.
SHENOY, P. J. AND VIN, H. M. 2002. A disk scheduling framework for next generation operating systems. Real-Time Syst. 22, 1–2, 9–48.
SHRIVER, E., MERCHANT, A., AND WILKES, J. 1998. An analytic behavior model for disk drives with readahead caches and request reordering. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 182–191.
SMARTMONTOOLS. 2008. Self-Monitoring Analysis and Reporting Technology (SMART) disk drive monitoring tools. http://sourceforge.net/projects/smartmontools/.
SMITH, A. J. 1985. Disk cache: Miss ratio analysis and design considerations. ACM Trans. Comput. Syst. 3, 3, 161–203.
STOCKMEYER, L. 2001. Simulations of the age-threshold and fitness free space collection algorithms on a long trace. IBM Res. Rep. RJ 10222, Almaden Research Center, CA.
STODOLSKY, D., HOLLAND, M., COURTRIGHT II, W. C., AND GIBSON, G. A. 1994. Parity logging disk arrays. ACM Trans. Comput. Syst. 12, 3, 206–235.
TAKAGI, H. 1991. Queueing Analysis: Foundations of Performance Evaluation, Vol. 1: Vacation and Priority Systems, Part 1. North-Holland, Amsterdam, The Netherlands.
TAY, Y. C. AND ZOU, M. 2006. A page fault equation for modeling the effect of memory size. Perform. Eval. 63, 2, 99–130.
THERESKA, E. AND GANGER, G. R. 2008. IRONModel: Robust performance models in the wild. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 253–264.
THOMASIAN, A. AND MENON, J. 1994. Performance analysis of RAID5 disk arrays with a vacationing server model for rebuild mode operation. In Proceedings of the 10th IEEE International Conference on Data Engineering (ICDE’94). 111–119.
THOMASIAN, A. 1995. Rebuild options in RAID5 disk arrays. In Proceedings of the 7th IEEE Symposium on Parallel and Distributed Systems. 511–518.
THOMASIAN, A. AND MENON, J. 1997. RAID5 performance with distributed sparing. IEEE Trans. Parall. Distrib. Syst. 8, 6, 640–657.
THOMASIAN, A. AND LIU, C. 2002. Some new disk scheduling policies and their performance. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 266–267.
THOMASIAN, A. AND LIU, C. 2004. Performance evaluation for variations of the SATF scheduling policy. In Proceedings of the International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS’04). 431–437.
THOMASIAN, A., HAN, C., FU, G., AND LIU, C. 2004. A performance tool for RAID disk arrays. In Proceedings of the Conference on Quantitative Evaluation of Systems (QEST’04). 8–17.
THOMASIAN, A. 2005a. Read-modify-writes versus reconstruct writes in RAID. Inform. Process. Lett. 93, 4, 163–168.
THOMASIAN, A. 2005b. Access costs in clustered RAID disk arrays. Comput. J. 48, 6, 702–713.
THOMASIAN, A., BRANZOI, B. A., AND HAN, C. 2005. Performance evaluation of a heterogeneous disk array architecture. In Proceedings of the 13th IEEE/ACM Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’05). 517–520.
THOMASIAN, A. AND LIU, C. 2005. Comment on “Issues and challenges in the performance analysis of real disk arrays.” IEEE Trans. Parall. Distrib. Syst. 16, 11, 1103–1104.
THOMASIAN, A. 2006a. Multi-level RAID for very large disk arrays—VLDAs. ACM Perform. Eval. Rev. 33, 4, 17–22.
THOMASIAN, A. 2006b. Comment on “RAID5 performance with distributed sparing.” IEEE Trans. Parall. Distrib. Syst. 17, 4, 399–400.
THOMASIAN, A. 2006c. Shortcut method for reliability comparisons in RAID. J. Syst. Softw. 79, 11, 1599–1605.
THOMASIAN, A. AND BLAUM, M. 2006. Mirrored disk reliability and performance. IEEE Trans. Comput. 55, 12, 1640–1644.
THOMASIAN, A., FU, G., AND NG, S. W. 2007a. Analysis of rebuild processing in RAID5 disk arrays. Comput. J. 50, 2, 1–15.
THOMASIAN, A., HAN, C., AND FU, G. 2007b. Performance evaluation of two-disk failure tolerant arrays. IEEE Trans. Comput. 56, 6, 799–814.
THOMASIAN, A. AND XU, J. 2008. Reliability and performance of mirrored disk organizations. Comput. J. 51, 6, 615–629.
TIAN, L., FENG, D., JIANG, H., ZHOU, K., ZENG, L., CHEN, J., WANG, Z., AND SONG, Z. 2007. PRO: A popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07).
TRAEGER, A., ZADOK, E., JOUKOV, N., AND WRIGHT, C. P. 2008. A nine year study of file system and storage benchmarking. ACM Trans. Storage 4, 2, Article No. 5.
TREIBER, K. AND MENON, J. 1995. Simulation study of cached RAID5 designs. In Proceedings of the 1st IEEE Symposium on High Performance Computer Architecture (HPCA). 186–197.
TRIVEDI, K. S. 2002. Probability and Statistics with Reliability, Queueing, and Computer Science Applications, 2nd Ed. Wiley, New York, NY.
UYSAL, M., ALVAREZ, G., AND MERCHANT, A. 2001. Analytical throughput model for modern disk arrays. In Proceedings of the 9th IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’01). 183–192.
UYSAL, M., MERCHANT, A., AND ALVAREZ, G. 2003. Using MEMS-based storage in disk arrays. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST’03).
VARKI, E., MERCHANT, A., XU, J., AND QIU, X. 2004. Issues and challenges in the performance analysis of real disk arrays. IEEE Trans. Parall. Distrib. Syst. 15, 4, 559–574.
VARMA, A. AND JACOBSON, Q. 1998. Destage algorithms for disk arrays with non-volatile storage. IEEE Trans. Comput. 47, 2, 228–235.
WILLINGER, W., TAQQU, M. S., SHERMAN, R., AND WILSON, D. V. 1997. Self-similarity through high variability: Statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Trans. Netw. 5, 1, 71–86.
WILKES, J., GOLDING, R. A., STAELIN, C., AND SULLIVAN, T. 1996. The HP AutoRAID hierarchical storage system. ACM Trans. Comput. Syst. 14, 1, 108–136.
WILKES, J. 1996. The Pantheon storage-system simulator. Tech. rep. HPL-SSP-95-14, HP Labs, Palo Alto, CA.
WOLF, J. L. 1989. The placement optimization program: A practical solution to the disk assignment problem. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 1–10.
WONG, T. M., GOLDING, R. A., LIN, C., AND BECKER-SZENDY, R. A. 2006. Zygaria: Storage performance as a managed resource. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’06). 125–134.
WORTHINGTON, B. L., GANGER, G. R., AND PATT, Y. N. 1994. Scheduling for modern disk drives and non-random workloads. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. 241–251.
WU, S., JIANG, H., FENG, D., TIAN, L., AND MAO, B. 2009. WorkOut: I/O workload outsourcing for boosting RAID reconstruction performance. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST’09).
XIN, Q., SCHWARZ, T. J. E., AND MILLER, E. L. 2005. Disk infant mortality in large storage systems. In Proceedings of the 13th Annual Meeting of the IEEE Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’05). 125–134.
XU, L. AND BRUCK, J. 1999. X-Code: MDS array codes with optimal encoding. IEEE Trans. Inform. Theory 45, 1, 272–276.
XU, L., BOHOSSIAN, V., BRUCK, J., AND WAGNER, D. G. 1999. Low-density MDS codes and factors of complete graphs. IEEE Trans. Inform. Theory 45, 6, 1817–1826.
ZABBACK, P., RIEGEL, J., AND MENON, J. 1996. The RAID configuration tool. In Proceedings of the 3rd International Conference on High Performance Computing (HiPC’96). 55–61.
ZAITSEV, G. V., ZINOV’EV, V. A., AND SEMAKOV, N. V. 1983. Minimum-check-density codes for correcting bytes of errors, erasures, or defects. Probl. Inform. Trans. 19, 197–204.
Received October 2007; revised July 2008; accepted March 2009