Data Allocation in Heterogeneous Disk Arrays
Alexander Thomasian and Jun Xu¹
Shenzhen Institutes of Advanced Technologies
[email protected] and
[email protected]
Abstract—We consider the allocation of Virtual Arrays (VAs) with different RAID levels in a Heterogeneous Disk Array (HDA). This is expected to provide significant cost savings by consolidating multiple disk arrays into one. Disk load balancing across disk arrays is a by-product of HDA. We provide an example of the potential performance benefits expected from HDA. The number of Virtual Disks (VDs) required to materialize a VA is determined by upper bounds on VD bandwidth and space per physical disk drive. Only RAID1 and RAID5 are considered in this paper, but we develop the analysis for other RAID levels as well. A VA allocation is successful if disks sharing VAs with a failed disk are not overloaded in degraded mode. Single-pass data allocation methods for HDA are evaluated with a synthetic stream of allocation requests. Experimental results show that allocation methods minimizing the maximum disk bandwidth and capacity utilization, or their variance across all disks, yield the maximum number of allocated VAs. When the disk bandwidth is the bottleneck resource we utilize clustered RAID5 arrays to increase the number of allocated VAs.

I. INTRODUCTION

Data availability is important because of the high cost of downtime in applications such as e-commerce. The loss of certain data is unacceptable, because it is not reproducible or is very costly to reproduce. The RAID paradigm is a solution to the disk failure problem [6]. The RAID level is determined by the application availability requirements, which may be specified as k Disk Failure Tolerant (kDFT), k ≥ 1 [25]. RAID level 0 (RAID0) uses striping but no redundancy, hence it is a 0DFT, while RAID1 is based on mirroring, which tolerates all single disk failures, hence it is a 1DFT. RAID(k + 4) corresponds to arrays tolerating k disk failures using parity and Reed-Solomon codes [25]. We concentrate on 1DFT arrays in this study: RAID1 and RAID5 (rotated parity disk arrays) [6]. RAID1 is appropriate for OnLine Transaction Processing (OLTP), which is associated with high access rates to small, randomly placed data blocks. RAID1 doubles the disk bandwidth available to read requests, which affects transaction response times. RAID1 processes small writes efficiently, while RAID(k + 4) incurs the small write penalty, i.e., the updating of a small block requires the reading and writing of k + 1 disks in kDFTs. In the case of RAID5 this entails reading the old data and parity blocks from disk, eXclusive-ORing (XORing) the old and new data blocks to compute the difference block, and writing the new data and parity blocks. A RAID level accommodating the most stringent availability requirements for some datasets would incur an unnecessarily high update overhead on others not requiring a high level of data protection [24]. Intermixing RAID arrays is beneficial from the viewpoint of balancing disk loads, which results in improved performance.

We propose a Heterogeneous Disk Array (HDA), whose controller supports multiple RAID levels. HDA allows multiple Virtual Arrays (VAs) associated with different workloads and reliability requirements to share its disk space. Rather than providing multiple disk arrays, cost savings are accrued by combining multiple VAs into a single HDA. Heterogeneity at both the RAID and the disk level was investigated in an earlier version of HDA [19], [20]. We no longer consider disk level heterogeneity, although a high level of performance variability may exist among disk drives from the same family. In [19], [20] we placed files into RAID1 and RAID5 containers on disks according to their RAID level. Additional container space at the two RAID levels is allocated after such space is exhausted. File allocations, unlike container allocations, take into account disk utilization as well as disk capacity. This version of HDA has a similarity to Object-Based Storage Devices (OSDs) as in [14], in that files are allocated into containers, with width W = 2 for RAID1 and 2 ≤ W ≤ N for RAID5.

HP's AutoRAID [28] and the RAID Engine and Optimizer (REO) [11] are examples of storage systems supporting multiple RAID levels. EVENODD, RDP, and X-code are 2DFTs which only require an XOR capability [25]. HP's AutoRAID implements a hierarchical disk array with RAID1 at the higher level and RAID5 at the lower level [28]. AutoRAID is initially filled with data formatted as RAID1, but as disk space is exhausted RAID1 data is demoted to the space-efficient, but update-inefficient, RAID5 format. Thrashing is potentially a problem when the size of the active working set exceeds the size of RAID1 storage. HDA, similarly to AutoRAID, offers two RAID levels, but data migration between the two levels is not explored. Redundant Array of Independent Filesystems (RAIF) [10] and the Panasas parallel file system [14], [27] handle RAID at the file level. EMC's Centera allocated small files as RAID1 and large files as RAID5 [7].

HP developed a succession of tools for self-configuring and self-managing storage systems: (i) Forum [4], (ii) Minerva [1], and (iii) Ergastulum, renamed Disk Array Designer (DAD) [3]. Minerva selects a RAID level for the set of stores (allocation requests) and utilizes a complex solver to assign stores onto disk arrays. DAD uses best-fit bin-packing with randomization and backtracking. RAID level selection for DAD, RAID1/0 versus RAID5, is addressed in [2]. An analytic approach to select RAID levels based on the characteristics of allocation requests is proposed in [26]. This study implicitly postulates a RAID5 that allows the coexistence of updates to small blocks and large updates via the Log-Structured Arrays (LSA) paradigm [25]. The RAID level yielding the lower disk load is selected. The current paper does not rely on this work and assumes that the RAID level is specified a priori, i.e., at the time of the data allocation request.

Vertical and horizontal sharing of disk space among disk arrays with different RAID levels in HDA, as shown in Fig. 1, provides more flexibility than vertical sharing alone, i.e., dedicating n < N disks to RAID1 and N − n disks to RAID5. This is because the resource demands for the different RAID levels are not known a priori. Sharing disk space has a load balancing effect, which results in improved disk response times.

¹ Former PhD student at the CS Dept. at the New Jersey Institute of Technology.
Fig. 1. HDA with four disks and five VAs with different RAID levels: two RAID1 VAs with 2 disks each, two RAID5 VAs with 3 disks each, and a RAID0 VA with one disk.

A synthetic stream of allocation requests for VAs is used in this study to compare the effectiveness of data allocation methods for HDA, since such data is not currently readily available. Each allocation request comprises the RAID level, the VA size, and the associated load, as if the VA were allocated as a RAID0, i.e., with no redundancy. Based on the VA load and its size we determine the VA width (denoted by W), which is the number of Virtual Disks (VDs) required for VA allocation. This is accomplished by using two bounds restricting the VD bandwidth and capacity utilization at the disk drives. W is set to the maximum of the two widths according to these criteria. RAID1 arrays with the basic mirroring data layout [23] are assigned a width W = 2, but the method used for RAID5 can also be used to determine W. Upon a disk failure the read load at the surviving disk is doubled in basic RAID1, but more sophisticated RAID1 organizations result in a lower load increase across surviving disks [23]. Such RAID1 organizations are expected to result in an increased number of allocations.

RAID5 arrays with width W may have different strip sizes and parity group sizes (G). HDA is a clustered RAID [13], since W < N, but another level of clustering is attainable by setting G < W. Clustered RAID5 has the benefit that the load increase in degraded mode is reduced. In the case of read requests the load increase is given by the declustering ratio α = (G − 1)/(N − 1) < 1 [13].

The allocation problem (at the VD level) may be formulated as a vector-packing problem [8], which is 2-dimensional bin-packing. Small rectangles representing VD bandwidth and space requirements are allocated into larger rectangles representing disk bandwidth and capacity. Various optimization methods to maximize the number of allocations are applicable. The problem is made more difficult by the fact that allocation requests are malleable, so that the larger the number of VDs used to allocate a RAID5 VA, the smaller the bandwidth and space requirements per VD. An increased number of allocations is expected to be attainable with smaller VDs due to the higher bin-packing efficiency. Another instance of malleability is clustered RAID5 (see Section V-E).

We evaluate several single-pass data allocation methods, which take into account both the bandwidth and capacity requirements of the VDs constituting a VA. An allocation is viable if it does not overload a disk or exceed its capacity. A novel aspect of this study is that we take into account disk failures, i.e., VA allocations are carried out to accommodate the load with a single disk failure, since redundancy beyond k = 1 deals with media failures [25]. Experimentation with synthetic allocation requests shows that methods which consider both disk bandwidth and capacity are more robust than methods taking into account disk bandwidth alone. Disk capacity is less of a constraint in high capacity modern disk drives, but our data allocation method is also applicable to HDAs with heterogeneous devices, e.g., small capacity SSDs. Allocations balancing disk capacity utilizations may balance disk loads, since the access rate to datasets tends to be proportional to their sizes, but the access rate per GB varies across different datasets. The RAID1 and RAID5 arrays considered in this study are postulated to have significantly different access rates per GB. We also consider capacity-bound, bandwidth-bound, and balanced workloads, but do not consider mixtures of such workloads in the allocation studies.

The paper is organized as follows. Section II gives an example of how HDA can improve performance with respect to independent disk arrays. Section III specifies the characteristics of VA allocation requests and provides expressions for VA load in normal and degraded mode, which are used to determine the VA width and the load per VD. Section IV is concerned with data allocation in HDA. Following the bin-packing framework for data allocation, the methods utilized in this study for allocating VDs are listed. Section V reports experimental results using a synthetic stream of allocation requests. Subsection V-A describes the setup of the experiment. Subsection V-B provides the modeling assumptions and parameter settings. Subsection V-C compares the efficiency of the allocation methods. A sensitivity study of the parameters used in the allocation study is reported in Subsection V-D. Subsection V-E determines the RAID5 declustering ratio maximizing the number of VA allocations. Section VI summarizes the conclusions of this study and discusses areas of further research.

II. ANALYTIC JUSTIFICATION FOR HDA
A simple example is used to show that HDA, which allows disk space sharing by different RAID levels via horizontal partitioning of disk space, can improve performance with respect to a disk array which only allows vertical partitioning.
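The response-time figures derived in the next paragraph can be checked with a minimal M/M/1 sketch in Python; the mean service time x_d and the arrival rates below are illustrative assumptions rather than measured values:

```python
# Sketch of the M/M/1 response-time comparison of Section II (illustrative values).
x_d = 1.0            # mean disk service time (arbitrary time unit)
N = 8                # disks in the HDA
lam_r1 = 1.6 / x_d   # chosen so that each of the two C1 RAID1 disks has utilization 0.8
lam_r5 = lam_r1      # assumed RAID5 arrival rate (any value keeping utilization below 1 works)

def mm1_response(lam_per_disk, x):
    rho = lam_per_disk * x
    assert rho < 1, "disk would be saturated"
    return x / (1 - rho)

# C1: RAID1 confined to n = 2 dedicated disks.
r_c1 = mm1_response(lam_r1 / 2, x_d)                 # = 5 x_d for rho = 0.8

# C2: RAID1 and RAID5 share all N disks.
r_c2 = mm1_response((lam_r1 + lam_r5) / N, x_d)

# C2 with priority for RAID1 accesses: only the RAID1 load matters (per the text).
r_c2_prio = mm1_response(lam_r1 / N, x_d)            # = 1.25 x_d for rho'' = 0.2

print(r_c1, r_c2, r_c2_prio)
```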
When disk request arrivals follow a Poisson process with rate λ and disk service times have an exponential distribution with mean x_d, we have an M/M/1 queueing system [12], which has been utilized successfully in analyzing RAID5 disk arrays. The mean disk response time is R = x_d/(1 − ρ), where ρ = λ x_d is the disk utilization factor. The arrival rates of read requests to the RAID1 and RAID5 arrays are Λ_R1 and Λ_R5, respectively. Two configurations are considered for N = 8 disks:
C1 Vertical Disk Partitioning: n = 2 disks are dedicated to RAID1 and N − n = 6 disks to RAID5.
C2 Horizontal Disk Partitioning: The RAID1 array is allocated over N = 8 disks, occupying 2/N of the space on each disk. The remaining space on the N disks is more than adequate to hold the RAID5 data previously allocated on six disks, because the wider RAID5 requires less parity space.
The RAID1 response time for C1 is R_R1 = x_d/(1 − ρ), where the arrival rate per disk is λ = Λ_R1/2 and the disk utilization is ρ = λ x_d. For C2, ρ′ = [(Λ_R1 + Λ_R5)/N] x_d and R′_R1 = x_d/(1 − ρ′). If ρ′ < ρ, or equivalently Λ_R5 < (N/2 − 1)Λ_R1, then R′_R1 < R_R1. For N = 8, C2 improves R′_R1 with respect to C1 when Λ_R1 > Λ_R5/3. RAID1 response times in C2 can be improved further by processing RAID1 accesses at a higher priority than RAID5 accesses, in which case R″_R1 = x_d/(1 − ρ″), where ρ″ = (Λ_R1/N) x_d, since only the disk utilization due to RAID1 accesses affects R″_R1 [12]. For C1 with ρ = (Λ_R1/2) x_d = 0.8, R_R1 = x_d/(1 − 0.8) = 5 x_d, while for C2, ρ″ = 2 × 0.8/N = 0.2 and R″_R1 = x_d/(1 − 0.2) = 1.25 x_d, i.e., a fourfold improvement.
We next apply the approximate reliability analysis in [22]; it follows from the two expressions below that C1 is more reliable than C2. The reliability of each disk is denoted by r = 1 − ε, ε ≪ 1, e.g., for an exponential distribution with failure rate 10⁻⁶ per hour ε = 0.0025 after three years. The reliability of RAID1/0 with p disk pairs is R1(p) = [1 − (1 − r)^2]^p, while the reliability of a RAID5 with w disks is R5(w) = r^w + w r^(w−1)(1 − r), hence:
R_C1 = R1(1) R5(6) = [1 − (1 − r)^2][r^6 + 6 r^5 (1 − r)] ≈ 1 − 16 ε^2.
R_C2 = R1(4) R5(8) = [1 − (1 − r)^2]^4 [r^8 + 8 r^7 (1 − r)] ≈ 1 − 32 ε^2.

C2 is more vulnerable than C1 to data loss with two disk failures.

III. DATA ALLOCATION IN HDA

This section is organized as follows. In Section III-A we specify the characteristics of the VA allocation requests. In Sections III-B and III-C we provide analytic expressions for system load in normal and degraded modes. The former load estimates are used to obtain the widths of the VAs and the latter to estimate disk bandwidth requirements.

A. Virtual Allocation Requests

We are concerned with allocating VAs in an HDA with N disks. Requests to allocate VAs are processed in strict FCFS order, i.e., VA_i is allocated before VA_{i+1}, until a VA allocation fails because not all of its VDs can be allocated. The number of successful allocations is the metric used to compare the efficiency of data allocation methods. In practice unsuccessful allocation requests are not lost, but rather allocated to another HDA. As elaborated in Section VI, given a fixed number of VAs, minimizing the number of HDAs required to carry out all of the VA allocations would then be the performance metric. Each VA is specified as follows:
∙ The RAID level of a VA. This is specified a priori as ℓ = 1 for RAID1 and ℓ = 5 for RAID5. The RAID level is determined probabilistically given the composition of RAID1 versus RAID5 arrays, e.g., RAID1:RAID5 = 1:3. Equivalently, the fractions of RAID1 and RAID5 allocation requests are f_1 = 0.25 and f_5 = 0.75, respectively.
∙ VA size. The size V_i of VA_i is generated according to a RAID level dependent distribution, but as if the data were to be allocated as a RAID0 array (with no redundancy). The effective sizes of VAs, denoted by V′_i for VA_i, are utilized in allocation. The size of a RAID1 VA is V′_i = 2 V_i, since the disk space requirement is doubled. Since RAID0/5/6/7 VAs are kDFT arrays with k = 0/1/2/3, the effective size of these VAs is V′_i = V_i W_i/(W_i − k).
∙ VA loads. We are interested in determining disk utilizations, rather than disk response times, so that we need to specify the arrival rate of requests, the mean disk service time (over different request sizes), and the fraction of reads and writes. We assume all requests are to small, randomly placed disk blocks, as in an OLTP workload, so that the disk bandwidth is specified with respect to this workload. The arrival rate to VA_i is Λ_i = V_i κ_ℓ, where κ_ℓ is the I/O intensity per GB for RAID level ℓ. The per GB rate for RAID1 is set higher than that for RAID5 (κ_1 > κ_5). These parameters are varied in the experimental study to emulate I/O-bound, balanced, and capacity-bound workloads. The fraction of read and write requests to VA_i is denoted by r_i and w_i = 1 − r_i, respectively. Alternatively, R:W denotes the relative number of reads and writes, so that r_i = R/(R + W) and w_i = W/(R + W).
An additional workload, such as sequential accesses to large files, may also be considered. Disk service time in this case is determined by the disk transfer time. Such requests can be accommodated using disk utilization rather than access bandwidth to limit the number of allocations.

B. Estimating VA Widths Based on Load in Normal Mode

In estimating VA loads we simplify the discussion by assuming that the rate of read requests is determined by misses at an appropriately sized cache, and by noting the potential reduction in disk service time for write requests due to a Non-Volatile Storage (NVS) cache [16]: (i) overwriting of dirty blocks, and (ii) potential locality in destaging dirty blocks.
The load across the W_i disks of a RAID0 VA is the arrival rate of requests to the array multiplied by the mean disk service time x_disk = r_i x_SR + w_i x_SW [12]:

ρ′_i = Λ_i x_disk = Λ_i [r_i x_SR + w_i x_SW].  (1)

The load divided by the number of VDs (W_i) yields the utilization per VD for RAID0: ρ_i = ρ′_i / W_i.
VAs designated as RAID1 are assigned W_i = 2 disks, but given balanced loads due to striping, the method developed in this study can be applied to W > 2. We ignore the effect of judicious routing of read requests on disk utilization [21]. The load of VA_i, provided it is a RAID1, is:

ρ′_i = Λ_i [r_i x_SR + 2 w_i x_SW].  (2)
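As a concrete illustration of Equations (1)-(2), the following minimal Python sketch generates a synthetic allocation request as specified in Section III-A and computes its normal-mode load; the service times, per-GB rates, and mean sizes are assumptions patterned on the settings discussed later in Section V-B:

```python
import random

# Illustrative sketch of a VA allocation request (Section III-A) and its
# normal-mode load per Equations (1)-(2). All numeric values are assumptions.
x_SR, x_SW = 0.0115, 0.0125        # assumed mean SR/SW service times (seconds)
KAPPA = {1: 33.0, 5: 3.3}          # assumed I/O intensity per GB (kappa_1 = 10 kappa_5)
MEAN_SIZE_GB = {1: 0.25, 5: 0.75}  # assumed mean RAID0-equivalent sizes
F1 = 0.25                          # fraction of RAID1 requests (RAID1:RAID5 = 1:3)

def generate_request(r=0.75):
    level = 1 if random.random() <= F1 else 5
    V = random.expovariate(1.0 / MEAN_SIZE_GB[level])  # RAID0-equivalent size, GB
    Lam = KAPPA[level] * V                              # arrival rate, IOPS
    return level, V, Lam, r

def normal_mode_load(level, Lam, r):
    w = 1.0 - r
    if level == 1:   # Eq. (2): each write goes to both disks of the pair
        return Lam * (r * x_SR + 2.0 * w * x_SW)
    # Eq. (1): RAID0-equivalent load; RAID5 write costs are refined by
    # Methods A-C, Eqs. (3)-(4), below.
    return Lam * (r * x_SR + w * x_SW)

level, V, Lam, r = generate_request()
print(level, round(V, 3), round(Lam, 2), round(normal_mode_load(level, Lam, r), 4))
```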
The utilization of each VD due to writes and balanced read accesses is:

ρ_i = ρ′_i / 2 = Λ_i [(r_i/2) x_SR + w_i x_SW].

Logical read and write requests are processed as Single Read (SR) and Single Write (SW) accesses. In the case of RAID5/6/7 the updating of data and check blocks may be accomplished via Read-Modify-Write (RMW) requests, which read, modify, and then write data and check blocks after one disk rotation. The mean service times for SR, SW, and RMW requests are denoted by x_SR, x_SW, and x_RMW, respectively. The mean disk service times for the three types of requests are given in Section V-B. Given an updated data block we consider three methods for updating check blocks.

Method A: Compute the P/Q/R check blocks for RAID5/6/7, which are 1/2/3DFT arrays, at the Disk Array Controller (DAC). Given the new block d_new, this is accomplished by accessing d_old, if it is not already cached. The old check blocks p_old, q_old, and r_old are also read, and p_new, q_new, and r_new are computed and written to disk. A logical write request requires k + 1 = 2/3/4 SRs to read the old blocks and as many SWs to write the new ones. The resulting VA load is:

ρ′_i = Λ_i [r_i x_SR + (k + 1) w_i (x_SR + x_SW)].  (3)

Method B: This method postulates the capability to compute the new check blocks at the disks, given the difference block for the data. In the case of RAID5, an SR to read the old data block is followed by an XOR to compute the difference block d_diff = d_old ⊕ d_new, which is then sent to the appropriate check disk via the DAC. The new data block is written to disk after one rotation. After receiving the difference block, the parity disk reads the old parity block and computes the new parity block, which is then written onto disk after one rotation. For RAID6 and RAID7 this postulates a capability at the disk level to compute the appropriate check block given d_diff. The VA load using this method, according to [16], is:

ρ′_i = Λ_i [r_i x_SR + (k + 1) w_i x_RMW].  (4)

Method C: The DAC uses d_diff to compute p_diff, q_diff, and r_diff for RAID5/6/7 disk arrays utilizing k = 1, 2, 3 check blocks, respectively. These difference blocks can then be XORed with p_old, q_old, and r_old at the check disks to yield p_new, q_new, and r_new, respectively. The advantage of Method C with respect to Method B is that only an XOR capability is required at the disks, but it incurs the same disk access cost as Method B, and like Method B it is more susceptible to failures than Method A. This is because Method A logs its operations to recover from failures. Method A substitutes Method B's disk rotation time with a seek and latency. Issuing SR accesses to the check blocks after d_old has been read allows the almost immediate issuing of the SW accesses after the SR requests are completed. If there are no intervening disk accesses at the data and check blocks, an extra seek is not required, but almost a full disk rotation is incurred, so that x_SR + x_SW ≈ x_RMW in this case.

The width of the VA to implement RAID0/5/6/7 arrays is determined by the maximum disk utilization allowed per VD on each disk (ρ_max) and a maximum capacity constraint per VD (v_max), which is expressed as a fraction of all disk capacities:

W_i^bandwidth = ⌈ρ′_i / ρ_max⌉.  (5)

W_i^capacity = ⌈V_i / v_max⌉ + k.  (6)

W_i is given as:

W_i = min[ max(W_i^bandwidth, W_i^capacity), N ].  (7)

For max(W_i^bandwidth, W_i^capacity) > N the VA may be allocated over n_i = ⌈max(W_i^bandwidth, W_i^capacity)/N⌉ HDAs. Higher values of ρ_max and v_max can also be utilized for larger VAs. Limiting the per VD bandwidth utilization to a small fraction of the disk bandwidth reduces the possibility of disk overload when VA loads are underestimated.

Allocating a RAID5 VA across all N disks provides the maximum level of parallelism for read accesses and minimizes the space dedicated to check blocks, but has the disadvantage that if a single disk fails, the read load at the N − 1 surviving disks is doubled. It also increases the load for reconstructing blocks, e.g., during rebuild. For a RAID5 VA with allocation width W < N the per VD load is higher than for W = N, but fewer disk drives are affected by a disk failure, since effectively HDA is a clustered RAID5 [13].

C. Load in Degraded Mode

In order to make an allocation safe with respect to disk failures, we need to make sure that disk bandwidths are not exceeded in degraded mode. One method is to first carry out the allocations in normal mode utilizing a fraction f of the disk bandwidth, and then consider the effect of disk failures one by one to ensure that no disk affected by the failure of other disks is overloaded. This approach is costly, since it would require multiple iterations with successively lower values of f. It is also not feasible once VAs are allocated, since it would require VA reallocations. Since VA loads increase over time, a VA causing overload would be reallocated to an HDA with sufficient available capacity. Since carrying out allocations with the just-mentioned method is costly, we instead allocate VAs as if they are operating in degraded mode, i.e., as if they have experienced a single disk failure. Only single disk failures are considered for kDFT VAs with k > 1, since in practice the additional check blocks deal with media failures [25]. An HDA with a single disk failure is not expected to operate in degraded mode for lengthy periods of time, since the rebuild process in HDA is carried out at the level of VDs rather than disks, and the rebuild of highly active VAs can be prioritized. The load increase is expected to be small when the VAs sharing disk space are not active at the same time, although this effect might be taken into account in the allocation step.
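A minimal Python sketch ties together the width computation of Equations (5)-(7) and the degraded-mode admission check just described; ρ_max, v_max, N, and the disk capacity are assumed values patterned on Section V-B:

```python
import math

# Sketch of the VA width computation, Equations (5)-(7), and of the
# degraded-mode admission check. rho_prime is the VA load from Eqs. (1)-(4),
# V is the RAID0-equivalent size in GB, k the number of check blocks;
# rho_max, v_max_gb, N, and C_n are assumed values.
def va_width(rho_prime, V, k=1, N=12, rho_max=1.0 / 20, v_max_gb=2.2):
    w_bandwidth = math.ceil(rho_prime / rho_max)     # Eq. (5)
    w_capacity = math.ceil(V / v_max_gb) + k         # Eq. (6)
    return min(max(w_bandwidth, w_capacity), N)      # Eq. (7)

def vd_fits(U_bw, U_cap, x_degraded, c_gb, C_n=9.17):
    # A VD is admitted on a disk only if the disk's bandwidth and capacity
    # utilizations stay below 1 with the VD's degraded-mode load included.
    return U_bw + x_degraded < 1.0 and U_cap + c_gb / C_n < 1.0

print(va_width(rho_prime=0.35, V=0.75))   # 7 VDs for these illustrative inputs
```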
VAs are associated with different loads in different periods, e.g., a VA which has a heavy load in one period may have a light load in another. Allocations for successive periods are attempted, preferably starting with the periods with the heaviest loads. This allocation is then checked against the VA loads in all other periods. Alternatively, VA allocation may be carried out separately for the different time periods and the most stringent allocation, i.e., the one with the fewest VAs, adopted.

The surviving disk in a RAID1 carries the read load, which would otherwise be shared by two disks, so that this load is doubled, while the write load remains the same. Both disks of VA_i are allocated with the increased load. We use RAID1/F1 to denote a RAID1 with a single disk failure:

ρ^RAID1/F1 = Λ_i r_i x_SR + Λ_i w_i x_SW.  (8)
Fig. 2. Allocation vectors for three VDs on a physical disk drive. The x- and y-axes represent the disk bandwidth and capacity utilizations, respectively.
The disk utilization due to read requests in a RAID5 with a single failure is given below, following the discussion in [18]. Using W = W_i to simplify the notation, the arrival rate per disk is λ_i = Λ_i/W. In degraded mode the read load on surviving disks is twice the load in normal mode, since each disk, in addition to processing its own read requests, is also accessed by fork-join requests to reconstruct data blocks on the failed disk:

ρ_read^RAID5/F1 = 2 λ_i r_i x_SR.  (9)
There are three cases for updates, which lead to the following equations (λ′_i = λ_i/(W − 1)):

ρ_write^RAID5/F1 = λ′_i w_i [2(W − 2) x_RMW + 2 x_SW + (W − 2) x_SR].  (10)

For clustered RAID5 with parity group size G ≤ W and α = (G − 1)/(W − 1) we have the following loads [18], but the decision tree method in this paper can be easily extended to RAID6/7:

ρ_read^CRAID5/F1 = λ′_i r_i (W + G − 2) x_SR = λ_i r_i (1 + α) x_SR.  (11)

ρ_write^CRAID5/F1 = λ′_i w_i [2(W − 2) x_RMW + 2 x_SW + (G − 2) x_SR].  (12)

IV. DATA ALLOCATION BACKGROUND

We model disks and allocation requests as vectors in two dimensions, where the x-axis is the disk access bandwidth and the y-axis is its capacity [8]. In our discussion the disk bandwidth is determined by the maximum disk access rate to small, randomly placed disk blocks. The transfer time for small blocks is negligibly short compared to the mean positioning time (the sum of the mean seek time and mean rotational latency). Data allocation is modeled as vector addition. Taking into account disk space is straightforward, but the effect of additional allocations on disk bandwidth is determined by assuming an average seek time, although automatic methods to improve disk locality have been proposed (in single disk environments). The sum of all vectors should not exceed the disk vector ⃗D = (X, C) in either dimension, where X denotes the maximum disk bandwidth and C the disk capacity, as shown by the diagonal in Figure 2. Three allocations are shown in the figure. Better allocations are achieved if the end-point of all allocation vectors remains close to the diagonal, since in this manner full disk bandwidth and capacity utilization can be attained, but this is not possible in all cases.

The disk vectors associated with the N HDA disks are ⃗D_n = (X_n, C_n), 1 ≤ n ≤ N. Different values of X_n and C_n apply to heterogeneous disk drives [19], [20]. The allocation requests to the disks are at the level of VDs. Let VD_j, 1 ≤ j ≤ W_i, denote the W_i VDs of VA_i. The resource requirements of VD_j are specified as ⃗d_j = (x_j, c_j), where x_j is the access rate and c_j the space requirement of the jth VD. The values of x_j, 1 ≤ j ≤ W_i, may differ for heterogeneous disk drives, but the same method is applicable. Let U_n^x and U_n^c denote the current utilization of the bandwidth and capacity of the nth disk in the array. Given that J_n denotes the set of VD allocations at the nth disk, we have:

U_n^x = Σ_{j∈J_n} x_j / X_n,   U_n^c = Σ_{j∈J_n} c_j / C_n.  (13)

The allocation of VAs onto the HDA continues until one of the VDs of a VA cannot be allocated, i.e., disk bandwidth or capacity is exceeded. The residual bandwidth and capacity at the disks are initialized as X_n^r = X_n, C_n^r = C_n, 1 ≤ n ≤ N, and updated as X_n^r = X_n^r − x_j and C_n^r = C_n^r − c_j. A VD allocation at the nth disk is successful if X_n^r ≥ 0 and C_n^r ≥ 0. Given the rapidly increasing disk capacities and the slow growth in disk bandwidth, especially for random disk accesses, the bandwidth of newer disks may be exhausted at a very low disk capacity utilization [20]. Disk utilization is a more robust measure than disk bandwidth, in which case the disk bandwidth and capacity utilizations are initialized as U_n^x = 0, U_n^c = 0, 1 ≤ n ≤ N. Allocations are taken into account by U_n^x = U_n^x + x_j/X_n and U_n^c = U_n^c + c_j/C_n, with the constraints U_n^x < 1 and U_n^c < 1. Given the loads in degraded mode for RAID1 and RAID5 VAs, which were given in Section III-C, and their disk space requirements, for RAID1 we have x_j = ρ′_i/2 and c_j = V′_i/2 = V_i, and for RAID0/5/6/7 we have x_j = ρ′_i/W_i and c_j = V′_i/W_i.

The following methods are considered for allocating VDs on an array with N disk drives.
Round-Robin (RR): Allocate VDs on consecutive disk drives with wrap-around, i.e., modulo N. This method has the advantage that it equalizes the number of allocated VDs across the disks, but does not equalize bandwidth or capacity
utilizations. This method can also be referred to as Next Fit.
Random: Allocate VDs on disk drives randomly, until an unsuccessful allocation is encountered. Further attempts at random allocation resulted in very few additional successful allocations, so this option was not pursued.
Best-Fit: Select the disk with the minimum remaining bandwidth, or equivalently the maximum disk utilization.
First-Fit: Starting with the disk with the smallest index, a VD is allocated on the first disk that can hold it.
Worst-Fit: Allocate requests on the disk with the minimum (bandwidth) utilization, provided that the disk capacity constraint is satisfied.
Given the bandwidth and capacity utilizations U_n^x and U_n^c of the nth disk according to Equation (13), two more sophisticated allocation methods minimize the following objective functions.
Min-F1: Allocate VDs on disk drives such that the maximum disk bandwidth or capacity utilization across the disks is minimized:

F1 = max_{1≤n≤N} {U_n^x, β U_n^c}.  (14)

Min-F2: Minimize the weighted sum of the variances of the disk bandwidth and capacity utilizations:

F2 = Var_{1≤n≤N}(U_n^x) + β Var_{1≤n≤N}(U_n^c).  (15)

β ≥ 0 is an emphasis factor for disk capacity utilization; the sensitivity to its value is investigated in Section V-D. Var(·) is the variance computed over all N disks. Note that the RR and Random methods take into account neither disk bandwidth nor capacity utilization. The Best-Fit, First-Fit, and Worst-Fit methods take into account disk bandwidth utilization, while Min-F1 and Min-F2 take into account both disk bandwidth and capacity utilization.

V. DATA ALLOCATION EXPERIMENTS

This section is organized as follows. In Section V-A we describe the experimental method used to evaluate the aforementioned data allocation policies. In Section V-B we describe the assumptions made in the experimental study. In Section V-C we summarize the results of the experimental study comparing the allocation methods. The sensitivity of the results with respect to parameter settings is studied in Section V-D. We conclude with an assessment of the effect of clustered RAID5 on the number of allocations in Section V-E.

A. Description of the Experiment

We are interested in maximizing the number of VA allocations on a single disk array with N disks: I_R1&R5 = I_R1 + I_R5, where I_R1 and I_R5 are counters for allocated RAID1 and RAID5 VAs. VAs become available for allocation one at a time and are processed in strict FCFS order. Optimizations possible when allocating in batches are an area of further investigation. The input parameters are the fraction of RAID1 versus RAID5 allocation requests in the input stream and the size and bandwidth requirements of the VAs, which are sampled from different distributions for RAID1 and RAID5, depending on whether the workload is balanced, bandwidth-bound, or capacity-bound.

We consider VA allocations in degraded mode of operation, as if one of the disks allocated to a VA has failed. In degraded mode of operation surviving disks may be highly loaded, but this is less so in HDA for two reasons: (i) each VA sharing space on a disk contributes a small fraction of the disk bandwidth; (ii) few of the VAs sharing disk space may be active when the disk fails. The load on surviving disks in RAID5 and RAID1 is at its highest in the initial period of rebuild, but performance improves as rebuild progresses, e.g., due to read redirection for read requests [13]. The allocation experiment is specified by Algorithm 1. In summary, allocation requests for VAs are generated and the counters I_R1 and I_R5 are incremented based on the RAID level of the allocated VA, until a VA allocation is unsuccessful, i.e., not all of its VDs can be placed. Alternatively, VA allocations could be continued on a new HDA. Another possibility is to set aside the VA and attempt additional VA allocations, until the number of unsuccessful allocations exceeds a threshold. Since VAs have different loads in different periods, rather than using the peak load per VA, we may take load variability into account in the allocation step in further investigations.

Algorithm 1 Simulation to estimate the efficiency of VA allocation methods.
Initialize VA allocation counts I_R1 = I_R5 = 0 and VA index i = 1.
1) Generate the RAID level ℓ for VA_i according to the fraction of RAID1 (ℓ = 1) versus RAID5 (ℓ = 5) VAs. For u uniformly distributed in (0, 1), if u ≤ f_1 then VA_i is a RAID1, otherwise it is a RAID5.
2) Generate the size V_i of VA_i based on the size distribution for RAID level ℓ.
3) The access rate to VA_i is Λ_i = κ_ℓ V_i, where κ_ℓ differs according to the RAID level and whether the workload is bandwidth-bound, capacity-bound, or balanced.
4) Calculate the load on the VA's disks in both normal and degraded mode using the appropriate expressions for disk loads in Sections III-B and III-C.
5) For RAID5 allocation requests determine the VA width W_i based on the load in normal mode and the thresholds for maximum disk bandwidth (ρ_max) and capacity (v_max) using Equation (7). Set W_i = 2 for RAID1 VAs.
6) Determine whether all VDs of VA_i can be allocated successfully, satisfying the disk space and bandwidth constraints in degraded mode. If not, set I_R1&R5 = I_R1 + I_R5 and stop.
7) Set I_Rℓ = I_Rℓ + 1 based on the RAID level ℓ of the allocated VA_i. Increment the disk bandwidth and capacity utilizations (or decrement the residual bandwidth and capacity), set i = i + 1, and return to Step 1.

B. Assumptions and Parameter Settings

We use the following parameter settings in the experiments. An HDA supporting RAID1 and RAID5 with N = 12 IBM model 18ES drives is considered.¹ Specifications for this disk drive are summarized in Table 4 in [24], which is based on two pessimistic assumptions: FCFS scheduling and disk accesses uniformly distributed over all disk blocks. Disk characteristics were extracted from the web site of the Parallel Data Laboratory (PDL), but a truly high capacity disk was only posted in 2007.² We assume that the workload consists of accesses to 4 KB blocks, which are uniformly distributed over all disk blocks. We ignore delays at the DAC (disk array controller) and the disk controller, since they are negligible compared to the disk access time, which is the sum of the mean seek, latency, and transfer times: x_SR = x_seek + x_lat + x_xfer. The mean seek time x_seek = 7.16 ms used in this study is based on the pessimistic assumption of FCFS scheduling. SATF (Shortest Access Time First) scheduling, which results in a significant improvement in disk access time, is the default disk scheduler in some disk drives, but a more sophisticated scheduler, such as YFQ [5], is required for HDA. There is also the issue of improved data allocation, which can be done dynamically based on observed access patterns as described in [9]. The mean rotational latency for small data block accesses is one half of the disk rotation time: x_lat ≈ T_rot/2 ≈ 4.16 ms. The transfer time for 4 KB blocks is a weighted sum over all tracks [24]. The maximum disk bandwidth for SR requests to 4 KB blocks is 1000/x_SR ≈ 87 accesses per second. It is slightly lower for write requests due to the Head Settling Time (HST): x_SW = x_SR + T_HST. For RMW accesses x_RMW = x_SR + T_rot.

The size of the data portion of VAs is assumed to be exponentially distributed with mean V̄_1 = 256 MB for RAID1 and V̄_5 = 768 MB for RAID5 VAs. The VA sizes obtained by sampling are rounded to multiples of 256 KB. The metadata space requirement for HDA discussed in [19] is expected to be small, so this space is ignored in this study. We set v_max to 1/50 of the capacity of all disks and ρ_max to 1/20 of the bandwidth of each disk. We set β = 1 based on the sensitivity analysis for the Min-F1 and Min-F2 allocation methods. The access rate per GB for RAID1 VAs is ten times higher than the rate for RAID5 VAs, i.e., κ_1 = 10 κ_5. While RAID1 arrays are three times smaller on the average, the rate of accesses to RAID1 arrays is 3.3 times higher than to RAID5 arrays on the average. We consider three VA workloads with different access rates, expressed in I/Os per second (IOPS), for RAID5 VAs in normal mode:
1) Bandwidth-bound: κ_5 = 8.5 IOPS per GB.
2) Balanced: κ_5 = 3.3 IOPS per GB.
3) Capacity-bound: κ_5 = 2.1 IOPS per GB.
We consider three cases for the composition of read and write requests: (i) all requests are reads (r = 1), (ii) 75% of requests are reads (r = 0.75), and (iii) 50% of requests are reads (r = 0.5). The parameters used in the experimental study are summarized in Table I.

¹ http://www.storage.ibm.com/hdd/prod/ultrastar.htm
² http://www.pdl.cmu.edu/DiskSim/diskspecs.html

TABLE I
PARAMETERS USED IN THE EXPERIMENTAL STUDY.

Number of disks                 N = 12
RAID0/5/6/7                     k = 0/1/2/3
RAID1 allocations fraction      f_1
RAID5 allocations fraction      f_5 = 1 − f_1
Per GB rate to RAID5            κ_5
Per GB rate to RAID1            κ_1 = 10 κ_5
Size of VA_i                    V_i
Mean RAID1 size                 V̄_1 = 256 MB
Mean RAID5 size                 V̄_5 = 768 MB
Load for VA_i                   ρ′_i
Fraction of (logical) reads     r
Fraction of (logical) writes    w = 1 − r
Read/write ratio                R:W
Maximum bandwidth per VD        ρ_max
Maximum capacity per VD         v_max
RAID5 VA width                  W
Parity group size               G
Declustering ratio              α = (G − 1)/(W − 1)
Capacity emphasis factor        β ≥ 0
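A minimal Python sketch of the allocation loop of Algorithm 1 with Min-F1 disk selection (Equation 14); generate_request, normal_mode_load, va_width, and x_SR are the hypothetical helpers sketched earlier, β = 1 and the disk capacity follow Section V-B, and the degraded-mode load is approximated by doubling the read component:

```python
# Sketch of Algorithm 1 with greedy per-VD Min-F1 selection (Eq. 14).
# U_bw[n], U_cap[n] are the bandwidth and capacity utilizations of disk n.
# The degraded-mode load approximation below matches Eq. (8) for RAID1 and
# Eq. (9) for RAID5 reads; it understates RAID5 writes (Eq. 10).
N, beta = 12, 1.0
C_disk = 9.17                      # GB per disk (IBM 18ES)
U_bw = [0.0] * N
U_cap = [0.0] * N

def min_f1_disk(x_util, c_gb, used):
    best, best_f1 = None, None
    for n in range(N):
        if n in used:              # at most one VD of a VA per disk
            continue
        bw, cap = U_bw[n] + x_util, U_cap[n] + c_gb / C_disk
        if bw >= 1.0 or cap >= 1.0:
            continue
        f1 = max(bw, beta * cap)   # Eq. (14), evaluated after the tentative add
        if best_f1 is None or f1 < best_f1:
            best, best_f1 = n, f1
    return best

def allocate_va(level, V, Lam, r):
    rho = normal_mode_load(level, Lam, r)
    rho_degraded = rho + Lam * r * x_SR          # doubled read component
    W = 2 if level == 1 else va_width(rho, V, k=1)
    V_eff = 2 * V if level == 1 else V * W / (W - 1)
    x_j, c_j = rho_degraded / W, V_eff / W       # per-VD load and space
    used = set()
    for _ in range(W):
        n = min_f1_disk(x_j, c_j, used)
        if n is None:
            return False                          # allocation fails, stop
        used.add(n)
    for n in used:                                # commit only if all VDs fit
        U_bw[n] += x_j
        U_cap[n] += c_j
    return True
```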
C. Comparison of the Performance of Allocation Methods

The number of allocated RAID1 and RAID5 VAs is the key performance metric in evaluating data allocation methods. The number of allocated VAs in the two categories follows the fractions of requests in the input stream, since the allocations are carried out in strict FCFS order. Tables II, III, and IV are based on 100%, 75%, and 50% read requests in degraded mode. In each case the bandwidth-bound, balanced, and capacity-bound workloads are considered. The same pseudo-random sequence was used in the experiments to generate the same synthetic sequence of allocation requests for a fair comparison.

TABLE II
COMPARISON OF THE ALLOCATION METHODS WITH RAID1:RAID5 = 1:3 AND r = 1 IN DEGRADED MODE.

           |  Bandwidth-Bound Allocations |  Balanced Allocations        |  Capacity-Bound Allocations
Method     |  Best  R1    R5    R1&R5     |  Best  R1    R5    R1&R5     |  Best  R1    R5     R1&R5
Min-F1     |  71    23.2  72.0  95.2      |  87    28.2  84.6  112.8     |  87    34.2  105.3  139.5
Min-F2     |  71    23.2  72.0  95.2      |  73    27.3  84.6  111.9     |  83    33.4  102.3  135.7
Worst-Fit  |  56    21.8  71.6  93.4      |  14    23.7  75.2  98.9      |  9     29.7  93.4   123.1
Best-Fit   |  54    20.3  68.7  89.0      |  12    22.0  70.1  92.1      |  8     27.6  90.2   117.8
RR         |  19    16.4  50.6  67.0      |  10    21.1  66.4  87.5      |  8     30.8  95.1   125.9
First-Fit  |  10    16.0  48.5  64.5      |  0     18.7  56.8  75.5      |  0     19.3  59.7   79.0
Random     |  10    13.7  47.2  60.9      |  6     20.2  61.7  81.9      |  2     28.0  86.9   114.9

TABLE III
COMPARISON OF THE ALLOCATION METHODS WITH RAID1:RAID5 = 1:3 AND r = 0.75 IN DEGRADED MODE.

           |  Bandwidth-Bound Allocations |  Balanced Allocations        |  Capacity-Bound Allocations
Method     |  Best  R1    R5    R1&R5     |  Best  R1    R5    R1&R5     |  Best  R1    R5     R1&R5
Min-F1     |  82    17.3  64.0  81.3      |  98    23.2  72.6  95.8      |  80    33.0  104.0  137.0
Min-F2     |  75    17.3  64.0  81.3      |  69    23.1  73.5  96.6      |  83    32.1  101.0  133.1
Worst-Fit  |  66    16.5  59.6  76.1      |  10    21.3  66.9  88.2      |  7     29.0  92.0   121.0
Best-Fit   |  63    15.0  58.8  73.8      |  9     19.6  62.4  82.0      |  6     27.1  89.0   116.1
RR         |  12    12.8  45.6  58.4      |  11    16.8  56.6  73.4      |  6     30.0  92.1   122.1
First-Fit  |  16    13.7  44.7  58.5      |  0     17.9  53.8  71.7      |  0     18.1  58.0   76.1
Random     |  12    14.0  38.7  52.7      |  7     17.8  53.6  71.4      |  2     27.0  85.0   112.1

TABLE IV
COMPARISON OF THE ALLOCATION METHODS WITH RAID1:RAID5 = 1:3 AND r = 0.5 IN DEGRADED MODE.

           |  Bandwidth-Bound Allocations |  Balanced Allocations        |  Capacity-Bound Allocations
Method     |  Best  R1    R5    R1&R5     |  Best  R1    R5    R1&R5     |  Best  R1    R5     R1&R5
Min-F1     |  78    15.3  57.6  72.9      |  98    21.7  66.6  88.3      |  84    33.0  104.0  137.0
Min-F2     |  75    15.3  55.7  71.1      |  71    20.9  67.4  88.3      |  90    32.1  101.0  133.1
Worst-Fit  |  63    14.7  51.9  66.6      |  6     19.9  63.8  83.7      |  3     29.0  92.0   121.0
Best-Fit   |  61    13.3  51.2  64.5      |  5     18.2  59.5  77.7      |  3     27.1  89.0   116.1
RR         |  13    10.7  39.4  50.1      |  10    14.0  47.5  68.7      |  6     30.0  92.1   122.1
First-Fit  |  16    12.0  39.6  51.6      |  0     16.9  51.8  61.5      |  0     18.1  58.0   76.1
Random     |  12    11.2  32.9  44.1      |  3     15.2  47.2  62.4      |  2     26.1  83.2   109.3

The allocation experiments were repeated one hundred times, so as to obtain the average number of allocated VAs over these iterations. Increasing the number of iterations yielded similar results, so lengthier experiments are not reported here. The key metric in the comparison is the number of times an allocation method performed best, but it follows from the tables that there were many ties, where more than one method provided the best allocation. The following conclusions can be drawn from the tables.
1) The number of VAs allocated in normal mode (not reported here due to space limitations) is almost double the number of those in degraded mode for r = 1. This is because the read load on surviving disks in degraded mode is double the load in normal mode.
2) Min-F1 and Min-F2 are consistently the best methods in terms of the number of allocations in all configurations, although Min-F1 outperforms Min-F2. Both disk bandwidth and capacity need to be considered for robust
resource allocation.
3) The Worst-Fit and Best-Fit methods provide good allocations for bandwidth-bound workloads in normal and degraded modes, although the number of allocations is lower than for Min-F1 and Min-F2. Otherwise, their number of allocations is rather poor.
4) First-Fit, Random, and RR are the worst among the considered allocation methods.
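The per-disk utilization statistics reported in Tables V-VII below can be obtained along the following lines; this is a minimal sketch in which U_bw and U_cap are the per-disk utilization lists maintained by the allocation loop sketched earlier:

```python
from statistics import mean, pstdev

# Summarize disk bandwidth and capacity utilizations at the end of a run,
# as reported in Tables V-VII (values in percent).
def utilization_summary(U_bw, U_cap):
    return {
        "bandwidth_avg": 100 * mean(U_bw),
        "bandwidth_std": 100 * pstdev(U_bw),
        "capacity_avg": 100 * mean(U_cap),
        "capacity_std": 100 * pstdev(U_cap),
    }
```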
TABLE V
DISK UTILIZATIONS WITH BANDWIDTH-BOUND WORKLOAD IN DEGRADED MODE (r = 1, RAID1:RAID5 = 1:3).

           |  Bandwidth (%)   |  Capacity (%)    |  No. of VAs
Method     |  Avg    Std      |  Avg    Std      |  R1   R5    Total
Min-F1     |  90.7   1.6      |  51.3   2.6      |  23   72    95
Min-F2     |  90.4   2.3      |  51.3   2.2      |  23   72    95
RR         |  63.8   20.1     |  36.8   8.4      |  16   51    67

TABLE VI
DISK UTILIZATIONS WITH BALANCED WORKLOAD IN DEGRADED MODE (r = 1, RAID1:RAID5 = 1:3).

           |  Bandwidth (%)   |  Capacity (%)    |  No. of VAs
Method     |  Avg    Std      |  Avg    Std      |  R1   R5    Total
Min-F1     |  89.9   1.6      |  95.3   0.6      |  28   85    113
Min-F2     |  89.6   1.9      |  94.4   0.7      |  27   84    111
RR         |  70.4   17.7     |  74.6   13.1     |  21   66    87

TABLE VII
DISK UTILIZATIONS WITH CAPACITY-BOUND WORKLOAD IN DEGRADED MODE (r = 1, RAID1:RAID5 = 1:3).

           |  Bandwidth (%)   |  Capacity (%)    |  No. of VAs
Method     |  Avg    Std      |  Avg    Std      |  R1   R5    Total
Min-F1     |  47.8   0.8      |  99.2   0.6      |  34   105   139
Min-F2     |  47.7   0.5      |  98.4   0.5      |  33   102   135
RR         |  45.0   11.7     |  94.8   3.7      |  31   95    126

Tables V, VI, and VII compare the average bandwidth and capacity utilizations over all disks, and the number of RAID1 and RAID5 VAs allocated, with RAID1:RAID5 = 1:3 and r = 1 in degraded mode, for Min-F1, Min-F2, and the less efficient RR allocation method, for the bandwidth-bound, balanced, and capacity-bound workloads, respectively. The following observations can be made. Min-F1 and Min-F2 attain high disk bandwidth utilization for bandwidth-bound and balanced workloads, while disk capacity utilizations are low for bandwidth-bound workloads, since the bandwidth bound is reached first. The reverse is true for capacity-bound allocations. Min-F1 and Min-F2 minimize the variation of disk utilizations and this results in an increase in the number of VA allocations. Poor allocation methods have high standard deviations of disk bandwidth and capacity utilization. We plotted disk bandwidth and capacity utilizations with
bandwidth-bound, balanced, and capacity-bound workloads for the Min-F1, Min-F2, and Round-Robin allocation methods (not given here due to space limitations). It is observed that for Min-F1 and Min-F2 all disks are saturated roughly equally, with respect to disk bandwidth, capacity, or both, for the bandwidth-bound, capacity-bound, and balanced workloads, respectively. Little or no improvement in allocations is therefore expected from more sophisticated allocation methods.
D. Sensitivity Analysis

The parameter β emphasizes capacity versus bandwidth utilization for the Min-F1 and Min-F2 allocation methods. We obtained the number of allocations for β = 0, 0.5, and ≥ 1 with N = 12 disks, RAID1:RAID5 = 1:3, and a fraction of read requests r = 1. VA sizes are exponentially distributed with V̄_1 = 256 MB for RAID1 and V̄_5 = 768 MB for RAID5, κ_1 = 10 κ_5, v_max = 1/50, and ρ_max = 1/20. We consider all three cases: bandwidth-bound, balanced, and capacity-bound workloads. The following conclusions can be drawn from this study. For balanced and capacity-bound workloads the number of allocated VAs increases by more than 10% when β is increased from zero to one, but there is little improvement for β > 1. The effect on the bandwidth-bound workload is less significant. Note that a large β would result in the effect of bandwidth utilizations being ignored. We have used β = 1.0 in the experimental results reported in this study.

We study the sensitivity with respect to ρ_max and v_max with the same parameters as before, setting β = 1. A more careful and costly study would consider all three control parameters together. The Min-F1 allocation method is used in this experiment, because it is one of the two best methods, as shown in Section V-C. We consider the bandwidth-bound, balanced, and capacity-bound workloads. The following conclusions can be drawn from Table VIII. ρ_max affects the number of VDs for both balanced and bandwidth-bound workloads, but has no effect on capacity-bound workloads, since the width is determined by v_max rather than ρ_max. If we consider ρ_max = 1/20 and a capacity-bound system, the increase in the number of allocations can be explained by the fact that for v_max = 1/25, 1/50, and 1/100 the mean disk capacity utilization was 96.5%, 98.5%, and 99.2%, respectively, while the standard deviation of the capacity utilization was quite small in all cases. The improvement in allocation efficiency is due to the reduction in the sizes of the allocation requests. The average RAID5 width was determined to be 4.7, 5.6, and 7.3 for ρ_max = 1/10, 1/20, and 1/40, respectively.

TABLE VIII
EFFECT OF ρ_max AND v_max ON VA ALLOCATIONS. ROWS CORRESPOND TO ρ_max AND COLUMNS TO v_max; THE ENTRIES ARE THE TOTAL NUMBER OF VAs ALLOCATED IN DEGRADED MODE FOR A GIVEN ρ_max AND v_max.

           |  Bandwidth-bound        |  Balanced               |  Capacity-bound
ρ_max      |  1/25   1/50   1/100    |  1/25   1/50   1/100    |  1/25   1/50   1/100
1/10       |  46     46     46       |  74     93     123      |  88     104    135
1/20       |  49     49     49       |  77     96     125      |  88     104    135
1/40       |  50     50     50       |  82     99     128      |  88     104    135

E. Clustered RAID5

Given that G is the parity group size and W ≤ N is the width of a VA, we study the effect of the declustering ratio α = (G − 1)/(W − 1) on the number of clustered RAID5 arrays allocated. We shorten the discussion by postulating only read accesses (r = 1), since the increase in disk loads in degraded mode, given by 1 + α at the disks on which the VA is allocated, is highest in this case. Disk loads in RAID5 disk arrays with read and write requests, with and without clustering, are given in Section III-C and are based on [18].

For a given disk drive and the associated workload the Capacity/Bandwidth Ratio (CBR) is γ_d = C x_SR, where C is the disk capacity and 1/x_SR is the maximum disk bandwidth. Note that γ_d is the inverse of the Bandwidth/Space Ratio (BSR), which has been utilized in performance studies of storage systems for Video-on-Demand (VOD). The capacity of IBM 18ES disk drives is C = 9.17 GB and their maximum bandwidth is 1000/x_SR ≈ 87 accesses per second, so that γ_d = 9.17/87 = 0.105. This corresponds to the slope of the diagonal in Figure 2. The CBR for RAID5 VAs is denoted by γ_c(α). More RAID5 VAs can be allocated when γ_c ≈ γ_d, i.e., when successive allocation vectors follow the diagonal in Figure 2, since then neither the disk capacity nor the disk bandwidth is a bottleneck. The RAID5 access rate per GB for the bandwidth-bound workload is set to κ_5 = 8.5 accesses per second. The capacity/bandwidth ratio of the workload for clustered RAID5 is given in Table IX. For RAID5 without clustering γ_c(1) < γ_d, but γ_c(0.25) ≈ γ_d. To illustrate the workings of this table, note that the bandwidth 15.22 for α = 1 is twice the bandwidth in normal mode, i.e., 7.61. The load increase for α = 0.125 is then 7.61 × 1.125 = 8.56. Without clustering the size of the data portion of the VA is V_D = 0.87 × 11/12 ≈ 0.8 GB. For α = 0.125, G = 1 + α(W − 1) ≈ 2.375. A RAID5 with G = 2 has the same level of redundancy as RAID1 and in practice would be treated as RAID1, but this improvement is ignored at this point. The size of the clustered VA is then V(0.125) = 0.8 × (1 + 1/2.375) ≈ 1.14 GB.

TABLE IX
CHANGE OF THE CAPACITY/BANDWIDTH RATIO γ_c(α) VERSUS THE DECLUSTERING RATIO α IN CLUSTERED RAID5, FOR VAs WITH 8.5 ACCESSES PER SECOND PER GB AND r = 1.

α        G     Capacity    Bandwidth    γ_c(α)
0.125    2     1.14        8.56         0.133
0.25     4     1.01        9.51         0.106
0.375    5     0.96        10.46        0.091
0.5      7     0.92        11.42        0.081
0.625    8     0.90        12.37        0.073
0.75     9     0.89        13.32        0.067
0.875    11    0.88        14.27        0.061
1        12    0.87        15.22        0.057

TABLE X
COMPARISON OF THE RELATIVE NUMBER OF ALLOCATIONS WITH AND WITHOUT CLUSTERING IN DEGRADED MODE WITH RESPECT TO THE BASE CASE, RAID5 WITH W = N (SET TO 1).

        |  W = N  |  W ≤ N            |  G ≤ W
R/W     |         |  Alloc.   W       |  Alloc.   W      γ
1:0     |  1      |  1.45     5.8     |  1.57     3.9    0.106
3:1     |  1      |  1.30     5.7     |  1.49     2.03   0.086
1:1     |  1      |  1.22     5.6     |  1.34     2.8    0.06

The following conclusions can be drawn:
1) If the disk bandwidth is the bottleneck resource then more allocations can be made with smaller values of α, since the bandwidth requirement per VA is lower in degraded mode.
2) As α decreases the bottleneck resource may change from bandwidth to capacity. This is because as G = α(N − 1) + 1 gets smaller, the capacity overhead, which is 1/G, increases. The number of allocations increases until the capacity becomes the bottleneck resource.
3) If disk capacity is the bottleneck resource then more allocations can be made with larger values of α, because the capacity overhead per VA is lowest for G = N.
We next determine the relative number of allocations with and without clustering for a bandwidth-bound workload, using the Min-F1 method and β = 1. For each allocated VA the best α is computed so as to bring the capacity/bandwidth ratio (γ_c) of the VA close to γ_d. Once the value of α is determined, the new width is calculated accordingly. The following conclusions can be drawn from Table X: (i) clustered RAID5 arrays increase the number of VA allocations significantly when all requests are reads (r = 1); (ii) clustering has less effect for higher fractions of write requests.

VI. CONCLUSIONS

We have described HDA, proposed several data allocation methods, and reported experimental results with synthetic traces to compare the efficiency of several single-pass data allocation methods. In addition to the cost savings due to consolidating multiple RAID arrays into one, HDA balances the disk load across arrays, resulting in improved overall performance. In addition, disk bandwidth is not wasted on data not requiring a high RAID level. The two allocation methods which take into account both disk access bandwidth and capacity outperform the other methods over a wide range of parameters for VA allocation requests. These two methods also result in disks with fully utilized bandwidth, capacity, or both bandwidth and capacity for bandwidth-bound, capacity-bound, and balanced workloads, respectively. We conjecture that little or no improvement in allocation efficiency can be attained with more sophisticated allocation methods for VDs requiring a small fraction of disk resources.

REFERENCES
[1] G. A. Alvarez et al., "Minerva: An automated resource provisioning tool for large-scale storage systems", ACM Trans. Computer Systems 19(4): 483-518 (2001).
[2] E. Anderson, R. Swaminathan, A. Veitch, G. A. Alvarez, and J. Wilkes, "Selecting RAID levels for disk arrays", Proc. USENIX Conf. on File and Storage Technologies (FAST'02), Monterey, CA, Jan. 2002, 189-201.
[3] E. Anderson, S. Spence, R. Swaminathan, M. Kallahalla, and Q. Wang, "Quickly finding near-optimal storage designs", ACM Trans. Computer Systems 23(4): 337-374 (2005).
[4] E. Borowsky et al., "Using attribute-managed storage to achieve QoS", Proc. Int'l Workshop on Quality of Service, Columbia University, New York, NY, June 1997.
[5] J. L. Bruno, J. C. Brustoloni, E. Gabber, B. Özden, and A. Silberschatz, "Disk scheduling with quality of service guarantees", Proc. Int'l Conf. on Multimedia Computing and Systems (ICMCS), Vol. 2, Florence, Italy, June 1999, 400-405.
[6] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, "RAID: High-performance, reliable secondary storage", ACM Computing Surveys 26(2): 145-185 (1994).
[7] EMC, http://www.emc.com/products/family/emc-centera-family.htm.
[8] R. A. Hill, "System for managing data storage based on vector-summed size-frequency vectors for data sets, devices, and residual storage on devices", US Patent 5345584, 1994.
[9] W. W. Hsu, A. J. Smith, and H. C. Young, "The automatic improvement of locality in storage systems", ACM Trans. Computer Systems 23(4): 424-473 (2005).
[10] N. Joukov et al., "RAIF: Redundant array of independent filesystems", Proc. IEEE 24th Conf. on Mass Storage Systems and Technologies (MSST'07), San Diego, CA, Sept. 2007, 199-214.
[11] D. Kenchemmana-Hosekote, D. He, and J. L. Hafner, "REO: A generic RAID engine and optimizer", Proc. USENIX Conf. on File and Storage Technologies (FAST'07), San Jose, CA, Feb. 2007, 261-276.
[12] L. Kleinrock, Queueing Systems, Vol. 1: Theory / Vol. 2: Computer Applications, Wiley-Interscience, New York, NY, 1975/76.
[13] R. R. Muntz and J. C.-S. Lui, "Performance analysis of disk arrays under failure", Proc. 16th Int'l Conf. on Very Large Data Bases (VLDB'90), Brisbane, Queensland, Australia, Aug. 1990, 162-173.
[14] D. Nagle, D. Sereneyi, and A. Mathews, "The Panasas ActiveScale storage cluster - Delivering scalable high bandwidth storage", Proc. ACM/IEEE Supercomputing SC'2004 Conf., Pittsburgh, PA, Nov. 2004.
[15] L. Tian, D. Feng, et al., "PRO: A popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems", Proc. USENIX Conf. on File and Storage Technologies (FAST'07), San Jose, CA, Feb. 2007, 277-290.
[16] A. Thomasian and J. Menon, "RAID5 performance with distributed sparing", IEEE Trans. Parallel and Distributed Systems 8(6): 640-657 (1997).
[17] A. Thomasian, "Reconstruct versus read-modify writes in RAID", Information Processing Letters 93(4): 163-168 (2005).
[18] A. Thomasian, "Access costs in clustered RAID", The Computer Journal 48(6): 702-713 (2005).
[19] A. Thomasian and C. Han, "Heterogeneous disk array architecture and its data allocation policies", Proc. Int'l Symp. on Performance Evaluation of Computer and Telecomm. Systems (SPECTS'05), Cherry Hill, NJ, July 2005, 617-624.
[20] A. Thomasian, B. A. Branzoi, and C. Han, "Performance evaluation of a heterogeneous disk array architecture", Proc. 13th IEEE Symp. on Modeling, Analysis, and Simulation of Computer and Telecomm. Systems (MASCOTS'05), Atlanta, GA, Sept. 2005, 517-520.
[21] A. Thomasian, "Mirrored disk routing and scheduling", Cluster Computing 9(4): 475-484 (2006).
[22] A. Thomasian, "Shortcut method for reliability comparisons in RAID", Journal of Systems and Software 79(11): 1599-1605 (2006).
[23] A. Thomasian and M. Blaum, "Mirrored disk reliability and performance", IEEE Trans. Computers 55(12): 1640-1644 (2006).
[24] A. Thomasian, G. Fu, and C. Han, "Performance of two-disk failure-tolerant disk arrays", IEEE Trans. Computers 56(6): 799-814 (2007).
[25] A. Thomasian and M. Blaum, "Two disk failure tolerant disk arrays: Organization, operation, and coding methods", ACM Trans. Storage 5(3): Article 7 (2009).
[26] A. Thomasian and J. Xu, "RAID level selection for heterogeneous disk arrays", Cluster Computing 14(2): 115-127 (2011).
[27] B. Welch et al., "Scalable performance of the Panasas parallel file system", Proc. USENIX Conf. on File and Storage Technologies (FAST'08), San Jose, CA, Feb. 2008, 17-33.
[28] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan, "The HP AutoRAID hierarchical storage system", ACM Trans. Computer Systems 14(1): 108-136 (1996).