PPM: A Partitioned and Parallel Matrix Algorithm to Accelerate Encoding/Decoding Process of Asymmetric Parity Erasure Codes

Shiyi Li, Qiang Cao*, Shenggang Wan, Wenhui Zhang and Changsheng Xie
Wuhan National Laboratory for Optoelectronics
Huazhong University of Science and Technology
*Corresponding Author:
[email protected]

Xubin He*, Pradeep Subedi
Department of Electrical and Computer Engineering
Virginia Commonwealth University
[email protected]
Abstract—Erasure codes are widely deployed in storage systems, and the encoding/decoding process is a common operation in erasure-coded systems. The parity-check matrix method is a general approach employed by erasure codes to conduct the encoding/decoding process. However, this process is serial and incurs high computational cost in its matrix operations, and hence yields low encoding/decoding performance. These disadvantages are especially pronounced for some recently proposed erasure codes, including the SD, PMDS, and LRC codes. To address this issue, we present an optimization algorithm, called the Partitioned and Parallel Matrix (PPM) algorithm, which accelerates the encoding/decoding processes of these codes by partitioning the parity-check matrix, parallelizing the encoding/decoding operations, and optimizing the calculation sequence. Experimental results show that PPM can speed up the encoding/decoding process of these codes by up to 210.81%.

Index Terms—Optimization Algorithm; Fault Tolerance; Erasure Codes; Storage System; Computational Cost; Parallelism;
I. INTRODUCTION

Storage systems in modern datacenters have widely adopted erasure codes to guarantee high data availability under frequent component failures, since such systems comprise a large number of disks and nodes [1]–[4]. Traditional RAID-6 coding schemes, including EVENODD [5] and RDP [6], can tolerate any two simultaneous disk failures. Erasure codes such as RS [7], Cauchy RS [8], and STAR [9] are able to tolerate multiple concurrent disk failures and are thus deployed in large storage systems. All the aforementioned erasure codes share a common feature: each parity block is generated from the same number of original data blocks. If two parity blocks are calculated from the same number of blocks, we call them symmetric parity. If all parity blocks are symmetric, we call such codes Symmetric Parity erasure codes in this paper. Thus, all the aforementioned erasure codes are symmetric parity erasure codes. Symmetric parity erasure codes are designed to protect storage systems against a single class of failures, such as device failures. However, storage systems are prone to a variety of failures such as sector, disk, machine, and rack failures,
whose durations, probabilities, and protection granularities are totally different. Thus, recent erasure codes with specialized parity designs were proposed to ensure efficient resilience to these various failures. In single-machine environments, storage systems are vulnerable to both device failures and sector failures; one can design device-level redundancy to protect against device failures and sector-level redundancy to protect against sector failures. To tackle severe partial disk failures, including latent sector errors [3], [10], [11] and data corruption [12], [13], in addition to the conventional fault tolerance against complete disk failures, researchers have presented new erasure codes, such as the SD code [14] and the PMDS code [15], which avoid dedicating device-level redundancy to sector failures and thereby save space. In clouds, transient data unavailability (data unavailable with no permanent data loss) accounts for 90% of data center failure events and triggers degraded reads [4], [16]–[18]; one has to design special parity to handle degraded reads. Therefore, Microsoft and Facebook recently proposed LRC codes ([17] and [18], respectively), which use local parity to reduce disk I/O, network overhead, and degraded-read latency. These codes are widely adopted in practical storage systems such as WAS [19], Amazon's Elastic Compute Cloud [20], and Facebook's cluster [18]. Different from symmetric parity erasure codes, not all parity blocks in these codes are symmetric; we denote these codes as Asymmetric Parity erasure codes in this paper. With the widespread application of erasure codes, research interest in asymmetric parity erasure codes has grown significantly. Asymmetric parity erasure codes perform better than traditional symmetric parity erasure codes under these various failures. However, current asymmetric parity erasure codes such as SD, PMDS, and LRC still adopt the traditional parity-check matrix method [21], [22] to conduct the encoding/decoding process. This method encodes the parity blocks or recovers the lost data/parity blocks by multiplying the large parity-check matrix by all surviving data/parity blocks, which is a strictly serial and coarse-grained process. According to our analysis in Section II, some faulty blocks in asymmetric parity erasure codes are independent faulty blocks (defined in Section II-B) and thus can be directly recovered in parallel.
TABLE I
SYMBOLS AND TERMINOLOGIES

Symbol | Description
n      | the number of strips/chunks in a stripe
r      | the number of sectors in a strip
s      | the number of additional coding sectors in SD code
H      | the parity-check matrix
R_H    | the number of rows of H
C_H    | the number of columns of H
u(M)   | the number of nonzero coefficients in matrix M
C      | the number of mult_XORs() (defined in Section II-B), used to evaluate the computational cost of the encoding/decoding process
C_i    | the number of mult_XORs() when using the i-th calculation sequence
This finding provides a potential opportunity to effectively partition the large parity-check matrix into several parts and recover the lost data in parallel. Furthermore, the new fine-grained and parallel process exposes multiple calculation sequences with different computational costs, so an optimized calculation sequence exists and can be obtained to further reduce the computational cost. To this end, we propose an optimization algorithm to improve the encoding/decoding performance of asymmetric parity erasure codes, called the Partitioned and Parallel Matrix algorithm (or PPM for short). PPM automatically partitions the whole parity-check matrix into several individual sub-matrices, determines an optimal calculation sequence with minimum computational cost, launches threads to process the independent sub-matrices (defined in Section III-A), and finally merges all recovered faulty data to process the remaining sub-matrix (defined in Section III-A). Compared to the traditional encoding/decoding algorithm, PPM not only exploits the independence of faulty data, but also benefits from parallelism and optimizes the calculation sequence to reduce the computational cost. In this paper, we make the following contributions:
• We propose PPM, an optimization algorithm that improves encoding/decoding performance by partitioning the parity-check matrix, parallelizing the encoding/decoding process, and optimizing the calculation sequence. PPM decreases the computational cost and exploits the opportunities for computational parallelism in the encoding/decoding processes.
• We integrate the PPM algorithm into the encoding/decoding processes of the SD, PMDS, and LRC codes and evaluate its performance impact. The experimental results show that it achieves up to 210.81% improvement in encoding/decoding speed over SD code.
The rest of this paper is organized as follows. Section II presents the necessary background and the motivation of our work. The design of the PPM algorithm is detailed in Section III. Section IV presents our experiments and evaluates the performance improvement achieved by PPM. Section V discusses related work. Finally, we conclude this paper in Section VI.
II. BACKGROUND AND MOTIVATION

In this section, we provide some background and the key observations that motivate our work and facilitate the presentation of PPM in subsequent sections. To facilitate the discussion, we summarize the symbols used in this paper in Table I.

A. Symmetric vs. Asymmetric Parity Erasure Codes

Erasure codes are widely used in storage systems to protect data against component failures. The key idea of erasure codes is to generate m parity strips from k data strips. The collection of n = k + m strips that encode together is called a stripe. One class of codes, called Maximum Distance Separable (MDS) codes, has the property that if any m strips fail, the original data can be reconstructed (such a code is called an (n, k)-MDS code). The entire system can be viewed as a collection of stripes, and each stripe is encoded independently; thus, we concentrate on one stripe of a storage system in this paper. Each strip comprises several basic blocks. While we refer to the basic blocks as sectors, each may actually span multiple physical sectors.

Various erasure codes have been presented, and they can be classified into different categories in terms of storage cost, parity layout, and so on. We categorize erasure codes from a different perspective in this paper. In an erasure code, if each parity block is calculated from k blocks, then all parity blocks are symmetric in terms of the calculation; we denote the codes whose parity blocks are all symmetric as Symmetric Parity erasure codes. Symmetric parity erasure codes are inefficient for today's failure patterns. Transient data unavailability accounts for 90% of data center failure events [4], [16]–[18]. Reads are degraded when upper-layer users request these unavailable data: the request triggers a repair process, and once the unavailable data are recovered, the results are returned to the users. This procedure is referred to as a degraded read. Symmetric parity erasure codes such as Reed-Solomon codes have high repair cost in terms of disk I/O and network traffic when dealing with degraded reads. In addition, concurrent complete disk failures and latent sector errors occur frequently in large-scale data centers; Plank et al. [14] have noted that simultaneous disk failures combined with additional sector errors are how today's storage systems actually fail. For symmetric parity erasure codes, it is overkill to dedicate entire disks to tolerating the failure of one or a few sectors under such combinations of disk and sector failures. Thus, some new erasure codes, such as SD, PMDS, and LRC, were proposed recently to overcome these shortcomings. Different from symmetric parity erasure codes, these new codes design specialized parity for dedicated failures, so not all parity blocks are calculated from k blocks; as a consequence, parity blocks are categorized into at least two types. As shown in Figure 1, each parity block of the (6, 4)-MDS code is calculated from 4 data blocks.
[Fig. 1. Storage systems with two types of erasure coding. (a) Two typical storage systems with symmetric parity erasure codes, both erasure-coded by a (6, 4)-MDS code: k data disks and m coding disks, with k data blocks and m parity blocks per stripe of r rows. (b) Two typical storage systems with asymmetric parity erasure codes: SD^{2,2}_{6,4}(8|1, 42, 26, 61) on top, with s coding sectors in addition to the m coding disks, and (4, 2, 2)-LRC on the bottom, with l local parity blocks in addition to the m parity blocks. The parameterization of the SD code SD^{2,2}_{6,4}(8|1, 42, 26, 61) is explained in Section II-B.]
In contrast, the (4, 2, 2)-LRC code employed in cloud storage systems dedicates 2 parity blocks (called local parity) to handling degraded reads, and hence its parity blocks are categorized into global parity and local parity. Global parity is similar to traditional parity, so each global parity block is calculated from 4 data blocks, whereas each local parity block is calculated from only 2 data blocks. In single-machine storage systems erasure-coded with SD^{2,2}_{6,4}(8|1, 42, 26, 61), 2 additional blocks are devoted to coding, and hence parity blocks are categorized into disk parity and sector parity. Disk parity is the traditional parity and is calculated from 4 data blocks, whereas each sector parity block is calculated from 22 blocks. In summary, these new erasure codes design specialized parity for dedicated failures and hence have asymmetric parity in terms of the calculation. We denote these codes as Asymmetric Parity erasure codes in this paper.

B. Traditional Encoding/Decoding Process

Although researchers have presented these new asymmetric parity erasure codes, they have not designed optimized encoding/decoding methods for them. The SD, PMDS, and LRC codes all use the parity-check matrix method [21], [22] to drive the encoding/decoding process. We therefore take SD code as an example to outline this process (since the encoding process of an erasure code is a special case of the decoding process, we only analyze the decoding process in this paper). We parameterize the SD code as SD^{m,s}_{n,r}(w|a_0, ..., a_{m+s-1}), where n is the total number of disks, m is the number of coding disks, s is the number of coding sectors, r is the number of rows per stripe, w is the word size (the addition and multiplication operations are over the finite field GF(2^w)), and a_0, ..., a_{m+s-1} are the coding coefficients. The traditional decoding process takes the following 4 steps:

Step 1: Derive the parity-check matrix H directly from the definition. Assume H is an R_H-row, C_H-column matrix; for SD code, R_H = m * r + s and C_H = n * r. Each column of H corresponds to a dedicated sector in the stripe: column i * n + j of H corresponds to the sector b_{i*n+j} in row i and column j, 0 ≤ i < r, 0 ≤ j < n.
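As a concrete illustration of Step 1's dimensions and indexing, the following C sketch computes R_H, C_H, and the column index of a given sector; the struct and function names are illustrative assumptions rather than the authors' implementation.

```c
/* Dimensions and column indexing of the SD parity-check matrix H. */
struct sd_params {
    int n;   /* total number of disks               */
    int r;   /* number of rows (sectors) per strip  */
    int m;   /* number of coding disks              */
    int s;   /* number of additional coding sectors */
};

int sd_rows(const struct sd_params *p) { return p->m * p->r + p->s; }  /* R_H */
int sd_cols(const struct sd_params *p) { return p->n * p->r; }         /* C_H */

/* Column of H corresponding to sector b_{i*n+j}: stripe row i, disk j. */
int sd_column_of(const struct sd_params *p, int i, int j)
{
    return i * p->n + j;   /* 0 <= i < r, 0 <= j < n */
}
```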
Step 2: Derive two new matrices F and S from H. The columns that correspond to the faulty blocks are extracted from H to create the matrix F; the remaining columns are extracted from H to construct the matrix S.

Step 3: Invert the matrix F to create the matrix F^{-1}.

Step 4: Denote B as the column vector containing all blocks, BF as the column vector containing all faulty blocks, and BS as the column vector containing all surviving blocks. If there is no faulty block in the stripe, the product of H and B is zero (H * B = 0). Thus, BF is equal to F^{-1} times S times BS. The process from Step 2 to Step 4 is denoted as matrix decoding.

Figure 2 shows the decoding process of SD^{1,1}_{4,4}(8|1, 2). The disk array consists of 4 disks, and the five equations that define the code are shown in the middle of the figure. There are 16 sectors in the example stripe, including 5 faulty sectors (b2, b6, b10, b13, and b14) and 11 surviving sectors.

By analyzing the traditional decoding process of asymmetric parity erasure codes, we find that the traditional encoding/decoding algorithm treats all faulty blocks as a whole, recovers them together, and executes the decoding process serially: when a faulty block is recovered, it is recovered together with all other faulty blocks. Actually, not all faulty blocks need to be handled this way. Some faulty blocks can be recovered independently from the surviving blocks, or only need to be recovered together with some of the other faulty blocks. In this paper, we denote the faulty blocks that do not need to be recovered together with all other faulty blocks as independent faulty blocks, and the remaining faulty blocks as dependent faulty blocks. The independent faulty blocks can be recovered first and processed in parallel; then all the recovered independent faulty blocks participate in recovering the remaining faulty blocks. As shown in Figure 2, the 5 faulty sectors b2, b6, b10, b13, and b14 are recovered together in Step 4. Actually, the three faulty sectors b2, b6, and b10 are independent faulty sectors and can be recovered independently from the surviving sectors.
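To make Step 2 concrete, the following C sketch splits the columns of H into F and S given the indices of the erased sectors; Steps 3 and 4 would then invert F and compute F^{-1} * S * BS with a Galois Field library. The row-major layout and function names are illustrative assumptions, not the authors' implementation.

```c
/* Return 1 if column col appears in erased[0..num_erased-1]. */
static int is_erased(int col, const int *erased, int num_erased)
{
    for (int e = 0; e < num_erased; e++)
        if (erased[e] == col)
            return 1;
    return 0;
}

/*
 * Step 2: H is an rh x ch row-major matrix over GF(2^w); F receives the
 * columns of the faulty sectors, S receives the remaining columns.
 */
void split_columns(const int *H, int rh, int ch,
                   const int *erased, int num_erased,
                   int *F, int *S)
{
    for (int r = 0; r < rh; r++) {
        int fc = 0, sc = 0;                    /* next free column in F / S */
        for (int c = 0; c < ch; c++) {
            if (is_erased(c, erased, num_erased))
                F[r * num_erased + fc++] = H[r * ch + c];
            else
                S[r * (ch - num_erased) + sc++] = H[r * ch + c];
        }
    }
}
```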
[Fig. 2. The encoding/decoding process of the traditional encoding/decoding algorithm, illustrated on SD^{1,1}_{4,4}(8|1, 2): Step 1 derives H from the encoding equations (H * B = 0); Step 2 forms F * BF = S * BS with BF^T = (b2 b6 b10 b13 b14) and BS^T = (b0 b1 b3 b4 b5 b7 b8 b9 b11 b12 b15); Step 3 inverts F; Step 4 computes BF = F^{-1} * S * BS.]
The two faulty sectors b13 and b14 are dependent faulty sectors: they depend not only on each other but also on all the other faulty blocks b2, b6, and b10. Thus, the three faulty sectors b2, b6, and b10 can be recovered first and processed in parallel, and then used to recover the remaining two faulty sectors b13 and b14.

Next, we analyze the computational cost of the traditional decoding algorithm. All the addition and multiplication operations shown in this paper are arithmetic over GF(2^w), in which linear arithmetic can be decomposed into the basic operation mult_XORs() [23]. Similar to [24], we define mult_XORs(d0, d1, a) as an operation that first multiplies a region d0 of bytes by a w-bit constant a in the Galois Field GF(2^w), and then XOR-sums the product into the target region d1 of the same size. For example, R = a0 * d0 + a1 * d1 + a2 * d2 can be decomposed into three mult_XORs() (assuming R is initialized to zero): mult_XORs(d0, R, a0), mult_XORs(d1, R, a1), and mult_XORs(d2, R, a2). Clearly, fewer mult_XORs() mean lower computational cost. To evaluate the computational cost of the encoding/decoding process, we count its number of mult_XORs() (per stripe), denoted as C. As shown in Figure 2, the computational cost of the encoding/decoding process equals the cost of calculating the product of F^{-1}, S, and BS. To calculate F^{-1} * S * BS, there are two calculation sequences. One is to first compute S * BS and then multiply F^{-1} by this product; this sequence is used in [25] and denoted as the Normal sequence in this paper. The other is to first compute the matrix product F^{-1} * S and then multiply the result by BS; this sequence is denoted as the Matrix First sequence in this paper and corresponds to the generator-matrix encoding/decoding method [26] [27]. The computational cost of the former sequence is denoted as C_1, which is determined by the number of nonzero coefficients in the matrices F^{-1} and S.
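The following C sketch illustrates mult_XORs() and the decomposition above for w = 8; the scalar gf_mul() over GF(2^8) (polynomial 0x11d) and the explicit region-length parameter are illustrative assumptions, not the SIMD-accelerated arithmetic of [23].

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8): carry-less multiply reduced mod 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return p;
}

/* mult_XORs(d0, d1, a): multiply region d0 by constant a, XOR into d1. */
static void mult_XORs(const uint8_t *d0, uint8_t *d1, uint8_t a, size_t len)
{
    for (size_t i = 0; i < len; i++)
        d1[i] ^= gf_mul(d0[i], a);
}

/* R = a0*d0 + a1*d1 + a2*d2 decomposes into three mult_XORs() calls. */
static void example(uint8_t *R, const uint8_t *d0, const uint8_t *d1,
                    const uint8_t *d2, size_t len,
                    uint8_t a0, uint8_t a1, uint8_t a2)
{
    for (size_t i = 0; i < len; i++)
        R[i] = 0;                 /* R is initialized as zero */
    mult_XORs(d0, R, a0, len);
    mult_XORs(d1, R, a1, len);
    mult_XORs(d2, R, a2, len);
}
```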
The computational cost of the latter sequence is denoted as C_2, which is determined by the number of nonzero coefficients in the matrix F^{-1} * S. If we denote u(M) as the number of nonzero coefficients in a matrix M, we have (the cost of the matrix-by-matrix multiplication itself is negligible, since w ≤ 4 bytes while sizeof(block) ≥ 512 bytes, and is thus ignored in the analysis):

C_1 = u(F^{-1}) + u(S),   C_2 = u(F^{-1} * S)

Generally, C_1 does not equal C_2. For example, as shown in Figure 2, C_1 = 35, while the product F^{-1} * S computed from the matrices F^{-1} and S of Figure 2 (a 5 x 11 matrix) contains 31 nonzero coefficients.
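A minimal sketch of this comparison, assuming row-major int matrices; the function names are illustrative, not the authors' implementation.

```c
/* u(M): the number of nonzero coefficients in a rows x cols matrix M. */
static int u(const int *M, int rows, int cols)
{
    int nz = 0;
    for (int i = 0; i < rows * cols; i++)
        if (M[i] != 0)
            nz++;
    return nz;
}

/*
 * Given F_inv (k x k), S (k x q) and their precomputed product FS (k x q),
 * return 1 if the matrix-first sequence (cost C2) is cheaper than the
 * normal sequence (cost C1).
 */
static int prefer_matrix_first(const int *F_inv, const int *S, const int *FS,
                               int k, int q)
{
    int c1 = u(F_inv, k, k) + u(S, k, q);   /* normal sequence       */
    int c2 = u(FS, k, q);                   /* matrix-first sequence */
    return c2 < c1;
}
```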
Thus, C_2 = 31 < C_1 in this example. Consequently, different calculation sequences result in different computational costs during matrix decoding.

We draw two observations from analyzing the traditional encoding/decoding process of asymmetric parity erasure codes. First, some faulty blocks are independent faulty blocks and thus can be directly recovered in parallel. Second, the computational cost can be reduced by choosing a specific calculation sequence. Based on these observations, we present our PPM algorithm.

III. DESIGN OF PPM

Having described the two observations that guide the design of the PPM algorithm, we now describe in detail how PPM exploits parallelism and optimizes the calculation sequence to reduce the computational cost.
[Fig. 3. The encoding/decoding process of the PPM algorithm on SD^{1,1}_{4,4}(8|1, 2) with T = 3 threads. Step 1 derives H; Step 2 builds the log table (i, t_i, l_i) = {(0, 1, (2)), (1, 1, (6)), (2, 1, (10)), (3, 2, (13, 14)), (4, 5, (2, 6, 10, 13, 14))} and partitions H into H0, H1, H2, and H_rest; Step 3 (Steps 3.1-3.3) decodes H0, H1, and H2 in parallel on threads 0-2, recovering BF0 = b2, BF1 = b6, and BF2 = b10; Step 4 (Steps 4.1-4.3) decodes H_rest to recover BF_rest = (b13, b14).]
A. Independence Exploitation and Partition

As mentioned in Section II-B, some faulty blocks are independent faulty blocks. In this subsection we describe how to exploit this independence; once the independent faulty blocks are found, the partition is conducted by extracting the corresponding rows from H.

To exploit the independence, we introduce a data structure called the Log Table. We use H(i, j) to represent the element in row i and column j of H. Each row of the log table has the form (i, t_i, l_i), where i is the row number in H, t_i is the number of nonzero elements located in the columns that correspond to faulty blocks, and l_i = (j_1, ..., j_{t_i}) lists the column numbers of those t_i elements. Thus, the log table has R_H rows. As shown in Figure 3, among the columns corresponding to faulty sectors, row 0 has only one nonzero element, H(0, 2), located in column 2, which corresponds to the faulty sector b2. Thus, t_0 = 1 and l_0 = (j_1) = (2), and the first row of the log table is (0, 1, (2)).

Now we exploit the independence based on the log table. For each row i of the log table, 0 ≤ i < R_H, if t_i = 1, the faulty block b_{j_1} is an independent faulty block; therefore, row i is extracted from H to create a new matrix that recovers the independent faulty block b_{j_1} on its own. The new matrix is denoted as an independent sub-matrix in this paper. However, if t_i = f > 1, we check whether there are f − 1 other rows r_0, ..., r_{f−2} satisfying t_{r_0} = ... = t_{r_{f−2}} = t_i and l_{r_0} = ... = l_{r_{f−2}} = l_i, where 0 ≤ r_0, ..., r_{f−2} < R_H and 1 < f < R_H. If so, the f faulty blocks (b_{j_1}, ..., b_{j_f}) are independent faulty blocks; each faulty block in this group depends only on the other f − 1 faulty blocks in the same group. The f rows are extracted from H to create an independent sub-matrix that recovers the f faulty blocks on its own. After traversing all rows, p independent sub-matrices have been generated, and all remaining rows of H are used to create the matrix H_rest, which recovers the remaining faulty blocks and is denoted as the remaining sub-matrix in this paper. Thus, H is partitioned into p + 1 sub-matrices in total, and p of them (denoted H_0, ..., H_{p−1}) are independent sub-matrices that can be decoded in parallel.
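The following C sketch builds the log table and flags rows that belong to independent sub-matrices; the data layout, the MAX_COLS bound, and the function names are illustrative assumptions (and, as in the paper, the resulting square sub-matrix is assumed to be invertible), not the authors' implementation.

```c
#define MAX_COLS 1024

struct log_row {
    int t;               /* t_i: nonzeros located in faulty columns       */
    int l[MAX_COLS];     /* l_i: column indices of those nonzero elements */
};

/* Build one log-table entry per row of the rh x ch matrix H;
 * erased[j] is 1 when column j corresponds to a faulty block. */
void build_log_table(const int *H, int rh, int ch,
                     const int *erased, struct log_row *tab)
{
    for (int i = 0; i < rh; i++) {
        tab[i].t = 0;
        for (int j = 0; j < ch; j++)
            if (erased[j] && H[i * ch + j] != 0)
                tab[i].l[tab[i].t++] = j;
    }
}

/* Two log rows describe the same group of faulty blocks. */
static int same_group(const struct log_row *a, const struct log_row *b)
{
    if (a->t != b->t)
        return 0;
    for (int k = 0; k < a->t; k++)
        if (a->l[k] != b->l[k])
            return 0;
    return 1;
}

/* indep[i] = 1 when at least t_i rows of H (including row i) share row i's
 * faulty-column pattern: those rows give t_i equations in the same t_i
 * unknowns and form one independent sub-matrix. */
void mark_independent(const struct log_row *tab, int rh, int *indep)
{
    for (int i = 0; i < rh; i++) {
        int matches = 0;
        for (int j = 0; j < rh; j++)
            if (same_group(&tab[i], &tab[j]))
                matches++;
        indep[i] = (tab[i].t > 0 && matches >= tab[i].t);
    }
}
```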
As illustrated in Figure 3, based on the failure scenario and the parity-check matrix H, we can generate the 5-row log table. Because t_0 = t_1 = t_2 = 1, the three faulty sectors b2, b6, and b10 are all independent faulty sectors; accordingly, rows 0, 1, and 2 are extracted from H to create three independent sub-matrices H0, H1, and H2. No row other than row 3 matches t_i = 2, l_i = (13, 14), and no row other than row 4 matches t_i = 5, l_i = (2, 6, 10, 13, 14), 0 ≤ i < 5. Thus, rows 3 and 4 cannot be extracted and are used to create the remaining sub-matrix H_rest. The parity-check matrix H is therefore partitioned into four sub-matrices in total, and p equals 3. Because the partition may give rise to all-zero columns in some sub-matrices, such all-zero columns are removed from the sub-matrices.

The aforementioned approach is a general method. For a particular erasure code such as SD, PMDS, or LRC, the method for exploiting the independence is much simpler. Taking SD code as an example, we only need to count the number of faulty sectors v in each row i of the stripe, 0 ≤ i < r. If v = m, the v faulty sectors are independent faulty sectors, and the corresponding rows of H are extracted to create an independent sub-matrix that recovers them. After traversing all rows, the remaining rows of H are used to create the remaining sub-matrix H_rest, which recovers the remaining faulty sectors.

B. Calculation Sequences

Now we analyze the computational cost of PPM.
[Fig. 4. The computational cost of the various calculation sequences C_i, 1 ≤ i ≤ 4, with C_1 as the baseline: the ratios C_2/C_1, C_3/C_1, and C_4/C_1 for s = 1, 2, 3 and varying n (r = 16, z = 1).]

[Fig. 5. The value of C_4/C_1 for different values of z (s = 3, r = 16); only when s = 3 can z range from 1 to 3.]
The computational cost of the encoding/decoding process is the total cost of the p + 1 matrix decoding operations that decode the p independent sub-matrices H_0, ..., H_{p−1} and the remaining sub-matrix H_rest. Thus, it equals the cost of calculating the p + 1 products F_0^{-1} * S_0 * BS_0, ..., F_{p−1}^{-1} * S_{p−1} * BS_{p−1}, and F_rest^{-1} * S_rest * BS_rest. As shown in Figure 3, the computational cost of decoding SD^{1,1}_{4,4}(8|1, 2) equals the cost of calculating the 4 products F_0^{-1} * S_0 * BS_0, F_1^{-1} * S_1 * BS_1, F_2^{-1} * S_2 * BS_2, and F_rest^{-1} * S_rest * BS_rest.

As mentioned in Section II-B, different calculation sequences result in different computational costs during matrix decoding. There are 2 calculation sequences for decoding the large parity-check matrix H in the traditional encoding/decoding process; because PPM partitions H into p + 1 sub-matrices, there are 2^{p+1} calculation sequences for decoding the p + 1 sub-matrices. Since, in each of the p independent sub-matrices H_0, ..., H_{p−1}, all elements in the columns corresponding to the independent faulty blocks are nonzero, we can derive the inequality u(F_i^{-1}) + u(S_i) > u(F_i^{-1} * S_i): all p matrix decoding operations on the independent sub-matrices have lower computational cost with the matrix-first sequence. Thus, the candidate calculation sequences are reduced to the following 2:

C_3 = Σ_{i=0}^{p−1} u(F_i^{-1} * S_i) + u(F_rest^{-1} * S_rest),
C_4 = Σ_{i=0}^{p−1} u(F_i^{-1} * S_i) + u(F_rest^{-1}) + u(S_rest)

For a given instance of an erasure code, the parity-check matrix H is determined, so the computational cost can be obtained by numerical analysis. For example, if the s additional faulty sectors of an SD code are located in z rows, we can conclude (the four equations below were derived from the simulation results of Figures 4–6, by printing the number of nonzero elements in each matrix and summing them) that:

C_1 = n*r*(m+s) + m*(m*r+s)*(z−1) + m^2*(r−z)
C_2 = (n*r − (m*r+s))*(m*z+s) + m*(n−m)*(r−z)
C_3 = (n*r − (m+s))*(m*z+s) + m*(n−m)*(r−z)
C_4 = n*r*(m+s) + m*(m*z+s)*(z−1) − m^2*(r−z)

with 4 ≤ n ≤ 24, 4 ≤ r ≤ 24, 1 ≤ m ≤ 3, 1 ≤ s ≤ 3, and 1 ≤ z ≤ s. We can see that

C_1 − C_4 = m^2 * (z+1) * (r−z) > 0,   C_3 − C_2 = m * (r−1) * (m*z+s) > 0.

Thus, C_2 and C_4 are the smaller values among C_1, C_2, C_3, and C_4, and we only need to compare the values of C_2 and C_4 and choose the smaller one.
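A minimal sketch of this selection, evaluating the closed-form costs above and choosing between the two remaining sequences; the struct and function names are illustrative assumptions. For the Figure 2 example (n = 4, r = 4, m = 1, s = 1, z = 1) it yields C_1 = 35, C_2 = 31, and C_4 = 29, consistent with the analysis above.

```c
struct sd_costs {
    long c1, c2, c3, c4;   /* costs of the four calculation sequences */
};

/* Closed-form sequence costs for an SD code whose s additional faulty
 * sectors are located in z rows. */
struct sd_costs sd_sequence_costs(long n, long r, long m, long s, long z)
{
    struct sd_costs c;
    c.c1 = n * r * (m + s) + m * (m * r + s) * (z - 1) + m * m * (r - z);
    c.c2 = (n * r - (m * r + s)) * (m * z + s) + m * (n - m) * (r - z);
    c.c3 = (n * r - (m + s)) * (m * z + s) + m * (n - m) * (r - z);
    c.c4 = n * r * (m + s) + m * (m * z + s) * (z - 1) - m * m * (r - z);
    return c;
}

/* Returns 4 when the partitioned matrix-first cost C4 is cheapest, else 2. */
int pick_sequence(long n, long r, long m, long s, long z)
{
    struct sd_costs c = sd_sequence_costs(n, r, m, s, z);
    return (c.c4 <= c.c2) ? 4 : 2;
}
```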
[Fig. 6. The value of C_4/C_1 for different values of r.]
However, when C_2 is chosen (C_4 > C_2), the decoding process merely changes the normal sequence into the matrix-first sequence to obtain the optimization and still uses the large parity-check matrix H to conduct the decoding. In fact, we evaluated the computational cost of SD code under various configurations and failure scenarios and found that the probability of C_4 > C_2 is only around 5%; moreover, when C_4 > C_2, n is usually 4 or 5 and never more than 9. Thus, we generally choose C_4 except for these special cases, which arise only when n is small (no more than 9); the reduction in computational cost compared with the traditional method C_1 is then C_1 − C_4 = m^2 * (z+1) * (r−z). In the example shown in Figure 2, the computational cost is reduced by (C_1 − C_4)/C_1 = 17.14%.

Figure 4 presents the computational cost of the various calculation sequences for different values of n for SD code when r = 16 and z = 1. As shown in Figure 4, C_4 has the smallest value in most cases. We observe that the values of C_2/C_1, C_3/C_1, and C_4/C_1 become larger as n increases and also as s increases, and that they increase more quickly as m increases. The average value of C_4/C_1 is 85.78% (ranging from 47.97% to 98.06%), a large difference between C_1 and C_4. Besides, the value of C_4/C_1 becomes larger as m decreases (see Figure 5, where 1 < z ≤ s). We have similar observations for 4 ≤ r ≤ 24, as shown in Figure 6: the value of C_4/C_1 decreases as either z or r increases. Therefore, PPM can reduce the computational cost by optimizing the calculation sequences.

C. Parallelism

As mentioned in Section III-A, the partition operation generates p independent sub-matrices that correspond to the independent faulty blocks. The p independent sub-matrices H_0, ..., H_{p−1} can be decoded in parallel to recover the corresponding independent faulty blocks, and the remaining sub-matrix H_rest is decoded after the p matrix decoding operations have finished.
As shown in Figure 3, the three independent sub-matrices H0, H1, and H2 are decoded in parallel to recover the three independent faulty blocks b2, b6, and b10; after these three matrix decoding operations are completed, the remaining sub-matrix H_rest is decoded to recover the remaining two faulty blocks b13 and b14. Thus, the parallelism of the PPM algorithm depends on the value of p:
• Case 1: p = 0. No independent sub-matrix is created and H_rest = H; parallelism cannot be triggered.
• Case 2: p = 1. Only one independent sub-matrix is created, so parallelism cannot be triggered either.
• Case 3: 1 < p < R_H. The degree of parallelism is p. There are two further subcases, depending on whether H_rest is null:
  • Case 3.1: H_rest = NULL, i.e., there are no dependent faulty blocks.
  • Case 3.2: H_rest ≠ NULL, the common case processed by PPM.
• Case 4: p = R_H. Every faulty sector is an independent faulty sector, independent of all other faulty sectors, and H_rest = NULL; this yields the maximum parallelism.
We focus on case 3.2 in this paper. The time cost of decoding sub-matrix H_i with matrix-first-sequence decoding is denoted as c_i, 0 ≤ i < p, and H_max denotes the sub-matrix with the maximum decoding time cost among the p independent sub-matrices H_0, ..., H_{p−1}, i.e., c_max = max{c_0, ..., c_{p−1}}, 0 ≤ max < p. Ideally, the saved time cost is Σ_{i=0}^{p−1} c_i − c_max. In practice, some additional time is spent on creating multiple threads and conducting the matrix multiplications, but this overhead is relatively low when the sector size is large. Our experimental results for PPM in Section IV take the overhead introduced by multi-threading into account. We also restrain the number of threads T (T ≤ p) to avoid thread overloading.
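The following C sketch shows case 3.2 with T worker threads (thread t handles the sub-matrices whose index i satisfies i mod T = t) and the remaining sub-matrix decoded afterwards. decode_submatrix() stands in for one matrix decoding operation; all types and names here are illustrative assumptions, not the authors' implementation.

```c
#include <pthread.h>

struct submatrix;                               /* F_i, S_i, BS_i, BF_i, ... */
void decode_submatrix(struct submatrix *sm);    /* assumed to exist          */

struct worker_arg {
    struct submatrix **subs;   /* the p independent sub-matrices     */
    int p;                     /* number of independent sub-matrices */
    int tid;                   /* worker id, 0 <= tid < T            */
    int T;                     /* number of worker threads           */
};

/* Worker tid decodes sub-matrices tid, tid+T, tid+2T, ... (i mod T == tid). */
static void *worker(void *v)
{
    struct worker_arg *a = v;
    for (int i = a->tid; i < a->p; i += a->T)
        decode_submatrix(a->subs[i]);
    return NULL;
}

void ppm_parallel_decode(struct submatrix **subs, int p,
                         struct submatrix *rest, int T)
{
    pthread_t th[T];
    struct worker_arg args[T];

    for (int t = 0; t < T; t++) {
        args[t] = (struct worker_arg){ subs, p, t, T };
        pthread_create(&th[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < T; t++)          /* wait for the p parallel decodes */
        pthread_join(th[t], NULL);

    decode_submatrix(rest);              /* then decode H_rest              */
}
```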
D. Partitioned and Parallel Matrix Algorithm

Based on the preceding description, the key idea of the PPM algorithm can be summarized as follows: the large parity-check matrix is partitioned into p + 1 sub-matrices based on the independence exploitation; T (T ≤ p) threads are triggered to process the p independent sub-matrices in parallel; the recovered independent faulty blocks are then used to recover the remaining faulty blocks by decoding the remaining sub-matrix H_rest; and the calculation sequence is optimized for each matrix decoding operation. Thus, the decoding process of PPM can be summarized as follows:

Step 1: Derive the parity-check matrix H.
Step 2: Based on the failure scenario and H, generate a log table, and then partition H into p + 1 sub-matrices based on the log table.
Step 3: Arrange T (T ≤ p) threads to decode the p independent sub-matrices in parallel, conducting a matrix decoding operation with the matrix-first sequence for each independent sub-matrix H_i, 0 ≤ i < p: derive two new matrices F_i and S_i from H_i (Step 3.1), invert F_i to create F_i^{-1} (Step 3.2), and calculate the product of F_i^{-1}, S_i, and BS_i using the matrix-first sequence (Step 3.3).
Step 4: Once the T parallel matrix decoding operations are completed, the recovered independent faulty sectors participate in recovering the remaining faulty sectors by decoding the remaining sub-matrix H_rest with the normal sequence. When all faulty sectors are recovered, the decoding process is finished. Because the faulty sectors recovered in Step 3 can be used to recover the remaining faulty sectors, we only extract the columns corresponding to the remaining faulty sectors from H_rest to create the matrix F_rest; the remaining columns of H_rest are used to create S_rest.

We again take SD^{1,1}_{4,4}(8|1, 2) as an example to describe the decoding process of PPM, as shown in Figure 3; the failure scenario is unchanged. As mentioned in Section III-A, SD code has a much simpler method to exploit the independence. Algorithm 1 describes the decoding process of PPM for SD code.

In summary, the PPM algorithm achieves performance improvement in two ways. First, it exploits the parallelism of multiple concurrent threads to perform decoding operations. Second, it reduces the computational cost by partitioning the matrix and optimizing the calculation sequences.

Algorithm 1: PPM decoding algorithm for SD code.
Step 1: Given an instance of SD^{m,s}_{n,r}(w|a_0, ..., a_{m+s−1}), derive the parity-check matrix H directly from the definition; H is an (m*r + s)-row, (r*n)-column matrix.
Step 2: Partition H into p + 1 sub-matrices; the p independent sub-matrices among them trigger T threads to decode them in parallel.
  Let p ← 0; Let string ← NULL;
  Create the remaining sub-matrix H_rest;
  for i = 0 to r − 1 do
    Let c ← 0;
    for j = 0 to n − 1 do
      if NULL == (string = read(b_{i*n+j})) then c++;
    if c = m then
      Use rows m*i, m*i + 1, ..., m*i + m − 1 of H to create the independent sub-matrix H_p;
      Arrange thread (p mod T) to decode H_p using matrix-first-sequence matrix decoding;
      p++;
    else
      Use rows m*i, m*i + 1, ..., m*i + m − 1 of H to fill up the remaining sub-matrix H_rest.
Step 3: Once the T threads decoding the p independent sub-matrices H_0, ..., H_{p−1} have finished, decode the remaining sub-matrix H_rest using normal-sequence matrix decoding.
Matrix Decoding:
  Substep 1: Create two new matrices S_p and F_p: the columns of H_p that correspond to the faulty sectors form F_p, and the remaining columns form S_p.
  Substep 2: Invert the matrix F_p to create F_p^{-1}.
  Substep 3: Calculate the product of F_p^{-1}, S_p, and BS_p; the result is the value of the recovered faulty sectors.
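For concreteness, the row scan of Algorithm 1 could be expressed in C roughly as follows; sector_is_faulty(), make_independent_submatrix(), and append_rows_to_rest() are hypothetical helpers, and the snippet only sketches the partition step, not the full decoding.

```c
/* Hypothetical helpers assumed to exist elsewhere. */
int  sector_is_faulty(int sector_index);
void make_independent_submatrix(const int *H, int first_row, int m,
                                int p, int T);   /* also assigns thread p % T */
void append_rows_to_rest(const int *H, int first_row, int m);

/* Scan the r stripe rows: rows with exactly m faulty sectors become
 * independent sub-matrices, all other rows fill H_rest. Returns p. */
int sd_partition(const int *H, int n, int r, int m, int T)
{
    int p = 0;
    for (int i = 0; i < r; i++) {
        int c = 0;                              /* faulty sectors in row i */
        for (int j = 0; j < n; j++)
            if (sector_is_faulty(i * n + j))
                c++;
        if (c == m)
            make_independent_submatrix(H, m * i, m, p++, T);
        else
            append_rows_to_rest(H, m * i, m);
    }
    return p;
}
```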
IV. PERFORMANCE EVALUATION

We have detailed the design of PPM and evaluated its reduction in computational cost from a theoretical perspective in the last section. In this section, we evaluate PPM with real experiments.
We implement PPM on the SD, PMDS, and LRC codes to optimize their encoding/decoding performance and evaluate the improvement; the optimized encoding/decoding performance is then compared with that of RS code. SD, LRC, PMDS, and RS are all based on Galois Field arithmetic over w-bit symbols, so we make some modifications to the open-source SD encoder and decoder [25] to implement them. PPM is written in C and implemented in the encoder and decoder of these codes. All experiments employ Intel's SIMD instructions to accelerate the encoding/decoding performance [23]. We run our performance tests on three machines equipped with an Intel E5-2603 (1.80 GHz, 10 MB L3 cache, 4 cores), an i7-3930K (3.20 GHz, 12 MB L3 cache, 6 cores), and an E5-2650 (2.00 GHz, 20 MB L3 cache, 8 cores), respectively. All three CPUs have a 256 KB L2 cache and support SSE4.2. Each decoding operation is run 10 times and the average results are shown in the following figures. Since PMDS code is a subset of SD code, the experimental results of SD code also reflect those of PMDS code.

When a disk fails, the other disks in the array may encounter latent sector errors at the same time [3], [6], [14], so we concentrate on simultaneous failures of disks and sectors. If the number of faulty disks m' is smaller than m, or the number of faulty sectors s' is smaller than s, the decoding process resembles the worst case of decoding SD^{m',s'}_{n,r}(w|a_0, ..., a_{m+s−1}); so we only test the worst case in this paper. We use a random integer generator [28] to simulate the m faulty disks (m random numbers in 0..n−1) and the s additional faulty sectors (the surviving sectors are labeled from 0 to (n−m)*r − 1, and s random numbers are drawn in 0..(n−m)*r − 1). The s additional faulty sectors can reside in z (1 ≤ z ≤ s) rows. We have tested the performance of PPM for different z and found that PPM always achieves nearly the same improvement for different values of z, so we only show the results for z = 1 in this paper. For SD code, the degree of parallelism p equals r − z. To avoid thread overloading, restrain the additional cost of parallelism, and limit CPU overhead, we let T ≤ min{4, core count}; in fact, when T exceeds the core count, the improvement decreases.

Figure 7 shows the improvement of PPM under different values of T. When T is smaller than the core count, varying T also represents the effect of different numbers of cores on PPM. When m > 1, the improvement of PPM increases with T up to the core count and then reverses; PPM achieves the maximum improvement when T equals the core count. When m = 1, PPM achieves the maximum improvement when T = 2, and the improvement decreases as T increases further; this is because the parallel part occupies only a small portion of the whole encoding/decoding process (no more than 50%) and the additional cost is relatively high when m = 1. Besides, the improvement of PPM is more sensitive to the configuration of the SD code. Figure 8 shows the performance improvement of PPM for SD code for different values of n and r.
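A minimal sketch of generating such a worst-case failure scenario; the paper used an online random integer generator [28], while rand() and the function names here are purely illustrative.

```c
#include <stdlib.h>

/* Pick k distinct random integers in [0, limit) into out[]. */
static void pick_distinct(int k, int limit, int *out)
{
    for (int i = 0; i < k; i++) {
        int v, dup;
        do {
            v = rand() % limit;
            dup = 0;
            for (int j = 0; j < i; j++)
                if (out[j] == v)
                    dup = 1;
        } while (dup);
        out[i] = v;
    }
}

/* m random faulty disks out of n, plus s additional faulty sectors among
 * the (n - m) * r surviving sectors (labeled 0 .. (n-m)*r - 1). */
void simulate_failures(int n, int r, int m, int s,
                       int *faulty_disks, int *faulty_sectors)
{
    pick_distinct(m, n, faulty_disks);
    pick_distinct(s, (n - m) * r, faulty_sectors);
}
```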
[Fig. 7. The performance improvement (improvement ratio) of PPM under different values of T, for n ∈ {6, 11, 16, 21}, s = 1, 2, 3, and m = 1, 2, 3 (stripe size = 32 MB, r = 16, z = 1, 4-core E5-2603 CPU).]

[Fig. 8. The performance improvement of PPM for SD code (labeled opt-SD in this paper) for different values of n and r, with s = 1, 2, 3, compared against SD without PPM and against RS code with w = 8, 16, 32 (stripe size = 32 MB, T = 4, z = 1, 4-core E5-2603 CPU).]
The encoding/decoding performance of SD code becomes lower as either m or s increases (approaching the speed of disk I/O), so optimizing the encoding/decoding performance becomes more critical. PPM improves the decoding speed by 61.09% on average (ranging from 8.22% to 210.81%). We note that the performance improvement becomes smaller as n or s increases and as m or r decreases, which is consistent with the numerical analysis in Section III-B. The jagged lines in these figures result from switching between GF(2^8), GF(2^16), and GF(2^32) [14]. Besides, we also test GF(2^8), GF(2^16), and GF(2^32) for RS code (note that all RS code results shown in the figure are for m + 1). The results show that the speed of opt-SD with m is competitive with that of RS code with m + 1, especially when either n or m is large. We also evaluate the impact of the stripe size on the improvement of PPM, as shown in Figure 9. The effect of multi-threading overhead on PPM decreases as the stripe size increases, resulting in a steady improvement when the stripe size is larger than 8 MB. To verify that the efficiency of PPM is independent of CPU performance, we conduct the experiment on three kinds of CPUs, the i7-3930K, E5-2650, and E5-2603, and find that PPM achieves similar improvement on all three, as shown in Figure 10. We also apply PPM to LRC code to observe the performance improvement.
[Fig. 9. The performance improvement of PPM for SD code for different stripe sizes (2 MB to 128 MB), with m = 1, 2, 3 and s = 1, 2, 3 (n = 16, r = 16, T = 4, z = 1, 4-core E5-2603 CPU).]

[Fig. 10. The performance improvement (improvement ratio) of PPM for SD code on different CPUs (stripe size = 32 MB, r = 16, T = 4, z = 1).]
Figure 11 shows the performance improvement of PPM for LRC code for different storage costs (ranging from 1.1 to 1.7) with a fixed stripe size (32 MB) and a fixed strip size (64 MB). The improvement ranges from 16.28% to 36.71%. However, the performance improvement of PPM for LRC code is smaller than that for SD code, because the latter achieves a higher degree of parallelism and thus benefits more from parallelism than the former. Nevertheless, PPM achieves performance improvement on both of these asymmetric parity erasure codes. PPM does not introduce any extra resources to conduct the parallelism; it utilizes resources that are already available. Even using only two threads to conduct parallel decoding, PPM still achieves significant improvement (46.29% on average, ranging from 8.45% to 178.38%). Thus, PPM does not cause thread overloading and does not consume much CPU overhead. The extra power consumption of PPM is also not high (our tests show it is no more than two watts), but power/energy is not the focus of this paper, so we did not perform a detailed evaluation (also due to the page limit). Besides, as demonstrated in Section III-B, PPM can achieve performance improvement even without triggering parallelism; thus PPM can also gain benefits in CPU overhead and power/energy efficiency by reducing the computational cost. In summary, PPM consistently achieves performance improvement for different values of n, m, r, s, z, and T, different stripe sizes, and different CPUs for both SD code and LRC code, up to 210.81%.
[Fig. 11. The performance improvement of PPM for LRC code for different storage costs (1.1 to 1.7), with fixed stripe size (32 MB) and fixed strip size (64 MB); the CPU is the E5-2603.]
V. RELATED WORK

Speeding up the encoding/decoding process of erasure codes is a critical research area that has received much attention [5] [6] [8] [23] [27] [24], because the encoding/decoding procedure is common and its speedup can reduce latency. Encoding/decoding is triggered when reconstruction, full-stripe writes, or failures occur [6] (failures happen in bursts [1]–[4]), when extents are sealed [17], and when users read data protected by a non-systematic code (such as F-MSR [29]). Besides, guaranteeing high-speed data encoding and decoding for erasure coding has become more important with the rapid development and wide adoption of storage class memories (SCM) such as PCM (phase change memory) [30] [31] [32] and STT-RAM [33], [34]. Compared to the speed of such high-performance storage components, the computation time spent on data encoding and decoding is no longer trivial. This is also the reason behind researchers' recent focus on improving the encoding/decoding performance of existing erasure codes [5] [6] [8] [35] [23]. These techniques improve the encoding/decoding speed of symmetric parity erasure codes. The four schemes in [5] [6] [8] [24] present new erasure codes to reduce the computational cost of the encoding/decoding process. Anvin's scheme [35] is a special optimization of RS encoding for RAID-6. The scheme presented in [23] uses Intel's SIMD instructions to accelerate the multiply operation over Galois Fields and is integrated into our scheme in our experiments. However, little work has been done to analyze the matrix encoding/decoding process and improve the encoding/decoding performance of asymmetric parity erasure codes. To the best of our knowledge, PPM is the first general algorithm that optimizes the encoding/decoding performance of asymmetric parity erasure codes.

Exploiting parallelism during the encoding/decoding process has been a longstanding goal of storage system design. Many parallel algorithms, such as block-level parallelism [36]–[38], disk-level parallelism [39], [40], and equation-oriented parallelism [41], have been presented. Different from these algorithms, PPM is a matrix-oriented parallelism algorithm and extracts independent sub-matrices based on the independent faulty blocks. The equation-oriented parallelism algorithm presented in [41] is closest to PPM, but it is designed for Cauchy RS code (a kind of symmetric parity erasure code) and is only discussed in theory, whereas PPM is a general algorithm designed for asymmetric parity erasure codes. Besides, PPM also reduces the computational cost in addition to exploiting parallelism.
VI. CONCLUSIONS
In this paper, we categorize erasure codes into symmetric parity erasure codes and asymmetric parity erasure codes in terms of whether each parity block is calculated by the
same number of blocks or not. We observe that some state-of-the-art asymmetric parity erasure codes still use the traditional parity-check matrix to conduct the encoding/decoding process. By analyzing this process, we find that the existence of independent faulty blocks provides a chance to exploit parallelism and reduce the computational cost, which motivates us to present an optimization algorithm for these codes. The optimization algorithm, called the Partitioned and Parallel Matrix (PPM) algorithm, improves the encoding/decoding speed by decreasing the computational cost and exploiting the parallelism of the encoding/decoding operations. When PPM is employed in the SD, PMDS, and LRC codes, it achieves up to 210.81% improvement in encoding/decoding speed across different configurations, thread numbers, stripe sizes, and CPUs.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their insightful comments. This research is sponsored by the National Basic Research 973 Program of China under Grant No. 2011CB302303, the National Natural Science Foundation of China under Grant No. 61300046, the U.S. National Science Foundation (NSF) under Grant Nos. CNS-1218960 and CNS-1320349, the National High Technology Research and Development Program of China (863 Program) under Grant No. 2013AA013203, and the Fundamental Research Funds for Central Universities, HUST (Grant No. 2013KXYQ003). This work is also supported by the Key Laboratory of Data Storage System, Ministry of Education. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] E. Pinheiro, W.-D. Weber, and L. A. Barroso, "Failure Trends in a Large Disk Drive Population," in Proc. of USENIX FAST, 2007.
[2] B. Schroeder and G. A. Gibson, "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" in Proc. of the 5th USENIX Conference on File and Storage Technologies, February 2007.
[3] L. N. Bairavasundaram and G. R. G. et al., "An Analysis of Latent Sector Errors in Disk Drives," in Proc. of ACM SIGMETRICS, 2007.
[4] D. Ford and F. L. et al., "Availability in Globally Distributed Storage Systems," in Proc. of USENIX OSDI, 2010.
[5] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures," IEEE Transactions on Computers, vol. 44, no. 2, February 1995.
[6] P. Corbett and B. E. et al., "Row-Diagonal Parity for Double Disk Failure Correction," in Proc. of USENIX FAST, 2004.
[7] I. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial and Applied Mathematics, 1960.
[8] J. B. et al., "An XOR-Based Erasure-Resilient Coding Scheme," International Computer Science Institute, Tech. Rep. TR-95-048, 1995.
[9] C. Huang and L. Xu, "STAR: An Efficient Coding Scheme for Correcting Triple Storage Node Failures," in Proc. of USENIX FAST, 2005.
[10] I. Iliadis, R. Haas, X. Hu, and E. Eleftheriou, "Disk Scrubbing Versus Intra-Disk Redundancy for High-Reliability RAID Storage Systems," in Proc. of ACM SIGMETRICS, 2008.
[11] B. S. et al., "Understanding latent sector errors and how to protect against them," in Proc. of USENIX FAST, 2010.
[12] L. N. Bairavasundaram and G. R. G. et al., "An Analysis of Data Corruption in the Storage Stack," in Proc. of USENIX FAST, 2008.
[13] W. Jiang and C. H. et al., "Are Disks the Dominant Contributor for Storage Failures?
A Comprehensive Study of Storage Subsystem Failure Characteristics,” in Proc. of USENIX FAST, 2008.
[14] J. S. P. et al., “SD Codes: Erasure Codes Designed for How Storage Systems Really Fail,” in Proc. of USENIX FAST, 2013. [15] M. Blaum, J. L. Hafner, and S. Hetzler, “Partail-MDS codes and their application to RAID type of architectures,” IBM Research Report, Tech. Rep. RJ10498, February 2012. [16] O. Khan and R. B. et al., “Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads,” in Proc. of USENIX FAST, 2012. [17] C. Huang and H. S. et al., “Erasure coding in Windows Azure Storage,” in Proceedings of USENIX Annual Technical Conference, 2012. [18] M. S. et al., “XORing Elephants: Novel Erasure Codes for Big Data,” in Proceedings of the VLDB Endowment, vol. 6, no. 5, March 2013. [19] B. C. et al., “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency,” in Proceeding of ACM SOSP, 2011. [20] “Amazon EC2,” http://aws.amazon.com/ec2/. [21] W. W. Peterson, E. J. Weldon, and Jr, “Error-Correcting Codes, Second Edition,” in MIT Press, Cambridge, 1972. [22] F. J. MacWilliams and N. J. A. Sloane, “The Theory of Error-Correcting Codes, Part I,” in North-Holland Publishing Company, New York, 1977. [23] J. S. P. et al., “Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions,” in Proc. of USENIX FAST, 2013. [24] M. Li and P. P. C. Lee, “STAIR Codes: A General Family of Erasure Codes for Tolerating Device and Sector Failures in Practical Storage Systems,” in Proc. of USENIX FAST, 2014. [25] J. S. Plank, “Open Source Encoder and Decoder for SD Erasure Codes,” in Technical Report UT-CS-13- 704, January 2013. [26] J. Plank, “The RAID-6 Liberation Codes,” in Proc. of the 6th USENIX Conference on File and Storage Technologies, February 2008. [27] J. S. Plank and J. L. et al., “A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage,” in Proc. of USENIX FAST, 2009. [28] RANDOM.ORG, “Random Integer Generator,” in http://www.random.org/integers/, 2014. [29] Y. Hu, H. C. Chen, P. P. Lee, and Y. Tang, “NCCloud: Applying Network Coding for the Storage Repair in a Cloud-of-Clouds,” in Proc. of USENIX FAST, 2012. [30] B. C. L. et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” in Proceedings of the 36th International Symposium on Computer Architecture, 2009. [31] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main Memory System Using Phase-Change Memory Technology,” in Proceedings of the 36th International Symposium on Computer Architecture, 2009. [32] H. K. et al., “Evaluating Phase Change Memory for Enterprise Storage Systems: A Study of Caching and Tiering Approaches,” in Proceedings of USENIX FAST, Santa Clara, CA, 2014. [33] X. Guo, E. Ipek, and T. Soyata, “Resistive computation: Avoiding the power wall with low-leakage, stt-mram based computing,” in Proceedings of ACM IEEE ISCA, 2010. [34] C. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, “Relaxing Non-Volatility for Fast and Energy-Efficient STT-RAM Caches,” in Proceedings of the The 17th IEEE Symposium on High Performance Computer Architecture (HPCA-17), February 2011. [35] H. P. Anvin, “The mathematics of RAID-6,” in http://kernel.org/pub/linux/kernel/ people/hpa/raid6.pdf, 2011. [36] J. Menon and D. Mattson, “Distributed Sparing in Disk Arrays,” in Proceedings of the 37th international conference on COMPCON, San Francisco, California, USA, Feb 1992, pp. 410–421. [37] R. Hou, J.Menon, and Y. 
Patt, “Balancing I/O Response Time and Disk Rebuild Time in a RAID5 Disk Array,” in Proc. of the 26th Hawaii International Conference on System Sciences, Kihei, HI, January 1993. [38] J. Lee and J. Lui, “Automatic Recovery from Disk Failure in ContinuousMedia Servers,” IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 5, pp. 499–515, May 2002. [39] M. Holland, G. A. Gibson, and D. P. Siewiorek, “Fast, On-Line Failure Recovery in Redundant Disk Arrays,” in Proc. of The Twenty-Third International Symposium on Fault-Tolerant Computing, June 1993. [40] L. Tian and D. F. et al., “PRO: A Popularity-based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems,” in Proc. of USENIX FAST, 2007. [41] P. Sobe, “Parallel Reed/Solomon Coding on Multicore Processors,” in Proc. of 2010 International Workshop on Storage Network Architecture and Parallel I/Os, 2010.