2014 IEEE International Conference on Big Data
A New Zigzag MDS Code with Optimal Encoding and Efficient Decoding
Jun Chen, Hui Li, Hanxu Hou, Bing Zhu, Tai Zhou, Lijia Lu, and Yumeng Zhang
Shenzhen Eng. Lab. of Converged Networks Technology, Institute of Big Data Technology
Shenzhen Graduate School, Peking University, Shenzhen, 518055, China
Abstract—Distributed file systems have emerged in recent years as an efficient solution for storing the large amounts of data produced anytime and anywhere. In order to guarantee data reliability, it is necessary to introduce redundancy into the storage system. Compared to simple replication, practical systems are increasingly adopting erasure codes for better storage efficiency. However, traditional erasure codes, such as maximum-distance-separable (MDS) codes, are designed over a large finite field, which inevitably hinders their wide deployment. In this paper, we propose a new family of MDS codes with high computational efficiency. More specifically, only XOR operations are used in the encoding process to generate the parity blocks. Upon failure of a storage node, we use the efficient Zigzag decoding method to recover the failed blocks, so that the code achieves optimal encoding and efficient decoding. Furthermore, we implement the proposed codes in a distributed file system, and the results show the high performance of the new codes.

Index Terms—Distributed file systems; MDS codes; Optimal encoding; Efficient decoding; Zigzag decoding.

I. INTRODUCTION

We are now in the age of Big Data, in which the amount of data grows at a tremendous speed. Such data can no longer be stored on a single node, so distributed file systems with a large number of nodes are designed to store it. For example, Google [1] and Facebook [2] run distributed file systems that consist of a large number of inexpensive and individually unreliable machine nodes connected by a large distributed network. However, node failures in distributed file systems are the "norm rather than the exception" [1]. If a machine node in the system breaks down, the data on that node is lost permanently. For the sake of data recovery, erasure codes can be used in the distributed file system to protect the data against failed machine nodes [3]. Keeping multiple replicas is the simplest way to protect files against multiple node failures, but too many replicas require a large amount of storage space. In order to make full use of the parity, a maximum distance separable (MDS) code [5] is a desirable erasure code. For arbitrary k and m, an MDS code splits a file into k source blocks of L bits each and adds m redundant blocks to form n = k + m blocks in total; the file can then be recovered from any k of the n blocks.

Reed-Solomon codes [6] are a famous class of MDS codes that can tolerate multiple block failures. However, they operate over a large Galois field, which leads to high encoding and decoding complexities. Compared with the original Reed-Solomon (RS) code, the Cauchy Reed-Solomon (CRS) code [7] converts the Galois field arithmetic into a sequence of bit XOR operations and improves the performance. Due to its high performance, the CRS code is widely used in distributed file systems. Nevertheless, the encoding and decoding complexities of the CRS code are both O(Lk log(k + m)) per parity block [7], which is still higher than that of an optimal MDS code. Therefore, some MDS codes with optimal encoding complexity have been proposed; MDS array codes are one such family. A common property of MDS array codes is that the encoding and decoding procedures use only simple XOR operations and thus are more computationally efficient [8]. In 1995, the EVENODD code was presented as the first MDS array code for two node failures. Most parity-based codes use row-diagonal parity (RDP) in different forms. The EVENODD code [9] uses the RDP idea for a (p−1) × (p+2) array, with p a prime. The RDP code [10], which can recover from any two node failures, uses a (p−1) × p array, with p a prime. An extension to multiple parity nodes using the RDP method was obtained in [11]. Furthermore, the STAR code [12] and a new MDS array code [4] were developed for three or more parity blocks. However, when two or more nodes fail, both RS codes and array codes share a common problem: a complex linear system must be solved to decode the original data, which makes the decoding complexity very high. To achieve efficient decoding, a Zigzag-decodable code was introduced in [13], and a necessary and sufficient condition for its MDS property is given in [18]. This code requires only fast operations over GF(2) in both encoding and decoding, so it is a promising replacement for traditional MDS codes. In this code, the length of a parity block is slightly larger, with o bits of overhead, where o = m(k − 1). Its disadvantage is that the overhead grows quickly as m and k increase. We address this problem and propose a new Zigzag-decodable code whose overhead can be much smaller than that of the original code without losing the high performance of encoding and decoding.
The outline of this paper is as follows. In Section II, we introduce the original Zigzag-decodable code. Sections III and IV present our new, improved Zigzag-decodable code. Experimental results are provided in Section V, followed by the conclusion and future work in Section VI.

II. ORIGINAL ZIGZAG-DECODABLE CODE

In the construction of the original Zigzag-decodable code, a file is split into k source blocks s_i, 1 ≤ i ≤ k. Each block has L bits and can be represented by the polynomial

s_i(z) = \sum_{j=0}^{L-1} s_{i,j} z^j,   (1)

where s_{i,j} is the (j+1)-th bit of s_i. The i-th parity block (1 ≤ i ≤ m) is given by

c_i(z) = \sum_{j=1}^{k} v_{i,j} s_j(z).   (2)
Considering the source blocks and parity blocks together, the corresponding matrix form is

\begin{bmatrix} s(z) \\ c(z) \end{bmatrix} = \begin{bmatrix} I_k \\ V(z) \end{bmatrix} s(z) =: A(z) s(z),   (3)

where I_k is the k × k identity matrix and V(z) is an m × k Vandermonde matrix whose (i, j)-th element is v_{i,j} = z^{i(j-1)}. For example, if k = 3 and n = 6, we can generate the encoding matrix A(z) as

A(z) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & z & z^2 \\ 1 & z^2 & z^4 \\ 1 & z^3 & z^6 \end{bmatrix}.   (4)
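To make the shift-and-XOR arithmetic concrete, the following minimal Python sketch (our own illustration, not the authors' implementation) encodes the parity blocks of (4) from an exponent matrix. Each block is stored as an integer whose bit j holds the coefficient of z^j, so multiplying by z^a (shifting the bit sequence by a positions) becomes an integer left shift, and the GF(2) addition becomes XOR.

from functools import reduce
from operator import xor
import random

def encode_parities(src, exps):
    # parity i is the XOR of the source blocks shifted by the exponents in row i
    return [reduce(xor, (s << e for s, e in zip(src, row)), 0) for row in exps]

k, m, L = 3, 3, 8
exps = [[(i + 1) * j for j in range(k)] for i in range(m)]   # exponents i(j-1): [[0,1,2],[0,2,4],[0,3,6]]
src = [random.getrandbits(L) for _ in range(k)]              # k source blocks of L bits each
parities = encode_parities(src, exps)
# every parity block fits in L + m(k-1) bits, which anticipates the overhead in (5) below
assert all(c.bit_length() <= L + m * (k - 1) for c in parities)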
Here, z^a s_j(z) denotes the right shifting of s_j(z) by an offset of a bits (a zero bits are prepended). Hence, the corresponding blocks can be represented graphically as in Fig. 1.

Fig. 1: The original Zigzag-decodable MDS code (n = 6, k = 3)

This Zigzag-decodable code is an MDS code, and any k blocks are sufficient to recover the source data. Instead of solving a linear system by Gaussian elimination, the decoding can be carried out by an efficient Zigzag decoding method. Before running the Zigzag decoding method, the surviving source blocks must be subtracted off the remaining parity blocks. After subtracting off the determined bits, it is proved that there is always at least one exposed bit among the parity blocks, which can be read off directly.

Fig. 2: The decoding of the Zigzag-decodable code (n = 6, k = 3). The numbers represent the order in which the bits are obtained by Zigzag decoding.

Fig. 2 illustrates the decoding of the Zigzag-decodable code (k = 3, n = 6). Suppose all the source blocks are failed and the 3 parity blocks are available; we need to decode the source blocks from the 3 parity blocks. First of all, s_{1,0}, s_{1,1}, s_{1,2} can be determined from c_3 directly. Next, c_2 offers the value c_{2,2} = s_{1,2} ⊕ s_{2,0}, which gives s_{2,0} = c_{2,2} ⊕ s_{1,2}. Each step exposes one more bit to be determined, and the next several bits are obtained from s_{1,3} = c_{3,3} ⊕ s_{2,0}, s_{2,1} = c_{2,3} ⊕ s_{1,3}, and s_{3,0} = c_{1,2} ⊕ s_{1,2} ⊕ s_{2,1}. In summary, the bits of s_i are always read from c_{m+1-i}, and the blocks are visited in the order s_1, s_2, s_3, s_1, s_2, s_3, ...; these rules are what give the decoding its low complexity.

For the sake of Zigzag decoding, there is some overhead in the parity blocks. In the construction of the original Zigzag-decodable code, there are at most

o = m(k − 1)   (5)

overhead bits in the parity blocks. Unfortunately, when m and k get larger, the overhead increases quickly. In order to save space and store more data, we improve this code so that the overhead is much lower.
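Before moving on to the new construction, the peeling procedure just described can be written generically. The sketch below (our own Python illustration, not code from [13]) repeatedly looks for a parity bit position to which exactly one unknown source bit contributes, reads that bit off, and continues; for the example above it recovers the bits in the zigzag order s_1, s_2, s_3, s_1, ...

import random

def zigzag_decode(parities, exps, L):
    # parities[i]: int whose bit t is bit t of parity c_i, with surviving source
    #              blocks already XORed off; exps[i][j]: shift of failed block j in c_i
    r = len(exps[0])                        # number of failed source blocks
    s = [0] * r
    known = [[False] * L for _ in range(r)]
    solved = 0
    while solved < r * L:
        progress = False
        for i, c in enumerate(parities):
            for t in range(L + max(exps[i])):
                # failed-source bits contributing to bit t of parity i
                contrib = [(j, t - exps[i][j]) for j in range(r)
                           if 0 <= t - exps[i][j] < L]
                unknown = [(j, u) for j, u in contrib if not known[j][u]]
                if len(unknown) != 1:
                    continue                # 0 unknowns: nothing to learn; >1: not exposed yet
                j, u = unknown[0]
                bit = (c >> t) & 1
                for jj, uu in contrib:
                    if (jj, uu) != (j, u):
                        bit ^= (s[jj] >> uu) & 1
                s[j] |= bit << u
                known[j][u] = True
                solved += 1
                progress = True
        if not progress:
            raise ValueError("no exposed bit found")
    return s

# the code of (4) with k = 3: all three source blocks lost, three parities available
L, exps = 6, [[0, 1, 2], [0, 2, 4], [0, 3, 6]]
src = [random.getrandbits(L) for _ in range(3)]
par = [0, 0, 0]
for i in range(3):
    for j in range(3):
        par[i] ^= src[j] << exps[i][j]
assert zigzag_decode(par, exps, L) == src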
III. NEW CODE CONSTRUCTION

In this section, we introduce the new Zigzag-decodable code, which can be viewed as an improvement of the original Zigzag-decodable code. To reduce the overhead, we replace the Vandermonde matrix V(z) in (3) by an extended Vandermonde matrix B(z), whose (i, j)-th element is

b_{i,j} = z^{p_i + (i-q)(j-1)},   (6)

where

q = \lceil m/2 \rceil,   (7)

and p_i is the minimum non-negative integer that makes all the exponents p_i + (i − q)(j − 1) non-negative. We have

p_i = \begin{cases} (q − i)(k − 1), & i \le q \\ 0, & i > q \end{cases}.   (8)

Fig. 3: The parity blocks of the new Zigzag-decodable code (n = 6, k = 3)

Meanwhile, B(z) is the product of a diagonal matrix and a Vandermonde matrix:

B(z) = \mathrm{Diag}(z^{p_1}, z^{p_2}, \ldots, z^{p_m}) V(z),   (9)
where

V(z) = \begin{bmatrix}
(z^{1-q})^0 & (z^{1-q})^1 & \cdots & (z^{1-q})^{k-1} \\
(z^{2-q})^0 & (z^{2-q})^1 & \cdots & (z^{2-q})^{k-1} \\
\vdots & \vdots & & \vdots \\
(z^{m-q})^0 & (z^{m-q})^1 & \cdots & (z^{m-q})^{k-1}
\end{bmatrix}.   (10)
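As a quick sanity check of (6)-(8), the short Python sketch below (our own illustration) computes the exponent of every b_{i,j} and verifies that all exponents are non-negative and that the largest one equals ⌊m/2⌋(k − 1); for k = 3, m = 3 it produces the rows [2, 1, 0], [0, 0, 0], [0, 1, 2], i.e., one decreasing, one constant, and one increasing parity block.

import math

def new_code_exponents(k, m):
    # exponent of b_{i,j} is p_i + (i - q)(j - 1), with q and p_i as in (7)-(8)
    q = math.ceil(m / 2)
    p = [(q - i) * (k - 1) if i <= q else 0 for i in range(1, m + 1)]
    exps = [[p[i - 1] + (i - q) * (j - 1) for j in range(1, k + 1)]
            for i in range(1, m + 1)]
    assert min(min(row) for row in exps) >= 0                      # every shift is non-negative
    assert max(max(row) for row in exps) == (m // 2) * (k - 1)     # largest shift = floor(m/2)(k-1)
    return exps

print(new_code_exponents(3, 3))   # [[2, 1, 0], [0, 0, 0], [0, 1, 2]]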
The i-th parity block is defined as

c_i(z) = \sum_{j=1}^{k} b_{i,j} s_j(z).   (11)

For example, if k = 3 and n = 6, then q = 2 and {p_1, p_2, p_3} = {2, 0, 0}, and the matrix in (9) is

B(z) = \begin{bmatrix} z^2 & z & 1 \\ 1 & 1 & 1 \\ 1 & z & z^2 \end{bmatrix}
     = \begin{bmatrix} z^2 & & \\ & 1 & \\ & & 1 \end{bmatrix}
       \begin{bmatrix} 1 & (1/z)^1 & (1/z)^2 \\ 1 & 1 & 1 \\ 1 & z & z^2 \end{bmatrix}.   (12)
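For the k = 3, n = 6 example above, a short check in the same int-as-bit-vector convention used earlier (our own illustration) confirms that each parity block of the new code needs at most L + 2 bits, where 2 = ⌊m/2⌋(k − 1) will turn out to be the overhead of the new code:

from functools import reduce
from operator import xor
import random

exps = [[2, 1, 0], [0, 0, 0], [0, 1, 2]]    # exponents of the entries of B(z) in (12)
L = 8
src = [random.getrandbits(L) for _ in range(3)]
parities = [reduce(xor, (s << e for s, e in zip(src, row)), 0) for row in exps]
assert all(c.bit_length() <= L + 2 for c in parities)   # at most 2 overhead bits per parity block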
The corresponding blocks are shown in Fig. 3. It can be seen that the encoding matrix is different from that of the original Zigzag-decodable code. In the original code, the encoding matrix satisfies the increasing-difference property [13]; in particular, the exponent of b_{i,j'} is strictly larger than that of b_{i,j} whenever j' > j. The new code, in contrast, allows the exponent of b_{i,j'} to be no larger than that of b_{i,j} for j' > j. Furthermore, according to how the exponents vary with j, the parity blocks can be categorized into three types: increasing blocks, whose exponents strictly increase with j; decreasing blocks, whose exponents strictly decrease with j; and the constant block, whose exponents are all equal to 0. For example, in Fig. 3, c_3(z) is increasing, c_1(z) is decreasing, and c_2(z) is constant. By contrast, all the parity blocks of the original Zigzag-decodable code are increasing.

After introducing the construction of the new Zigzag-decodable code, we now present some of its properties.

Property 1. The encoding complexity is O(Lk) per parity block.

Since each element of B(z) has only one term b_{i,j} = z^{p_i + (i-q)(j-1)}, adding a source block into a parity block requires shifting it by p_i + (i − q)(j − 1) positions and performing L bit-wise XOR operations. Therefore each parity block takes about Lk bit operations of shifting and XOR, so the overall encoding complexity is O(Lk) per parity block, which is the optimal encoding complexity for an MDS code.

Property 2. The overhead of the new Zigzag-decodable code is at most 50% of that of the code in [13].

For the sake of Zigzag decoding, this new code also has some overhead. The overhead is the maximum value of p_i + (i − q)(j − 1), which is

o' = \lfloor m/2 \rfloor (k − 1).

Compared with (5) for the original Zigzag-decodable code, we have

o'/o = \frac{\lfloor m/2 \rfloor (k − 1)}{m(k − 1)} \le \frac{1}{2}.   (13)

This shows that our new code saves at least 50% of the overhead space compared with the original Zigzag-decodable code.
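To put (13) into numbers (our own illustrative values), the snippet below tabulates the overhead of the two constructions for a few (k, m) pairs; with m = 6, for instance, the original code needs o = 6(k − 1) overhead bits while the new one needs o' = 3(k − 1), exactly half.

def overhead_original(k, m):      # o = m(k - 1), equation (5)
    return m * (k - 1)

def overhead_new(k, m):           # o' = floor(m/2)(k - 1)
    return (m // 2) * (k - 1)

for k, m in [(3, 3), (10, 6), (30, 6)]:
    o, o_new = overhead_original(k, m), overhead_new(k, m)
    print(f"k={k}, m={m}: o={o}, o'={o_new}, ratio={o_new / o:.2f}")   # ratio never exceeds 0.5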
IV. DECODING THEOREMS

We have mentioned that the proposed new code is an MDS code. This section gives the proof of this MDS property and the decoding theorems.

Theorem 1. Among the n blocks, any k blocks can recover the original source blocks.

Proof. The given k blocks consist of two classes: first, source blocks, whose indices are P = {P_1, P_2, ..., P_p} with 1 ≤ P_i ≤ k; second, parity blocks, whose indices are Q = {Q_1, Q_2, ..., Q_q} with 1 ≤ Q_i ≤ m. According to the construction, we have

\begin{bmatrix} s_P(z) \\ c_Q(z) \end{bmatrix} = \begin{bmatrix} I_P \\ B_Q(z) \end{bmatrix} s(z),   (14)

where I_P consists of the rows of I_k indexed by P and B_Q(z) consists of the rows of B(z) indexed by Q.
Define R = {1, 2, ..., k} \ P, so that s_R(z) are the failed source blocks. Subtracting s_P(z) off c_Q(z) leaves a linear combination of s_R(z), denoted c_{Q,R}(z). In matrix form,

c_{Q,R}(z) = B_{Q,R}(z) s_R(z).   (15)
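As a concrete instance of (15) (our own illustrative failure pattern, not one taken from the paper), take the k = 3, m = 3 code of (12) and suppose s_1 survives while s_2 and s_3 fail, with the parity blocks c_1 and c_3 available. Then P = {1}, Q = {1, 3}, R = {2, 3}, and after subtracting s_1(z) from both parities,

\begin{bmatrix} c_{1,R}(z) \\ c_{3,R}(z) \end{bmatrix}
= \begin{bmatrix} z & 1 \\ z & z^2 \end{bmatrix}
  \begin{bmatrix} s_2(z) \\ s_3(z) \end{bmatrix},

whose determinant z^3 + z = z(z + 1)^2 is a non-zero polynomial over GF(2), so this particular B_{Q,R}(z) is invertible and s_2(z), s_3(z) can be solved for. The general argument is as follows.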
B_{Q,R}(z) is a sub-matrix of B(z) and, like B(z) itself, is the product of a diagonal matrix and a generalized Vandermonde matrix:

B_{Q,R}(z) = \mathrm{Diag} \cdot V_{Q,R}(z).   (16)

Here V_{Q,R}(z) is a generalized Vandermonde matrix whose determinant is non-zero, so it is invertible [14], and the invertible diagonal matrix then makes B_{Q,R}(z) invertible. Therefore, the failed source blocks s_R(z) can be solved for by the matrix method.

Besides the matrix method, this new code can also be decoded by a Zigzag method.

Theorem 2. Among the n blocks, any k blocks can recover the original source blocks by a Zigzag method.

Proof. We have categorized the parity blocks into three types. Note that the number of increasing blocks is m − q = ⌊m/2⌋, the number of decreasing blocks is q − 1 = ⌈m/2⌉ − 1, and there is exactly one constant block. The decoding of our new Zigzag-decodable code depends on how many surviving blocks of each type there are.

Case 1: all the surviving parity blocks are increasing. The Zigzag decoding scheme is the same as the one in [13].

Case 2: all the surviving parity blocks are decreasing. This is the mirror image of Case 1: if we relabel s_i(z) as s_{k+1-i}(z), the parity blocks become increasing and the same Zigzag decoding scheme applies.

Case 3: the surviving parity blocks are a mixture of increasing and decreasing blocks. If there are n_i increasing and n_d decreasing blocks, the n_i increasing blocks help obtain the first bits of the first n_i failed source blocks by Zigzag decoding, while the n_d decreasing blocks obtain the first bits of the last n_d failed source blocks at the same time. The first iteration obtains the first bit of these failed blocks, and in the same way the next bit is generated in the next iteration.

Case 4: a constant block is among the surviving parity blocks. Since the n_i increasing and n_d decreasing blocks determine the first bits of the first n_i and the last n_d failed blocks, the remaining undetermined bit is obtained from the constant block.

For example, in Fig. 4 there is one increasing block c_3, one decreasing block c_1, and one constant block c_2. In the first iteration, s_{3,0} can be read directly from the decreasing block c_1; at the same time, s_{1,0} can be read from the increasing block c_3; finally, in the constant block c_2, s_{1,0} ⊕ s_{2,0} ⊕ s_{3,0} = c_{2,0} yields s_{2,0}. In the next iteration, after subtracting off the determined bits, s_{1,1}, s_{3,1}, and s_{2,1} are determined in the same way. The process is repeated until all the undetermined source bits have been determined.

Fig. 4: Decoding of the new Zigzag-decodable code

Property 3. The decoding complexity of the new code is O(Lk) per failed source block.

The whole decoding scheme consists of two parts: first, subtracting the surviving source blocks off the parity blocks, which can be finished with Lr(k − r) bits of XOR operations, where r is the number of failed source blocks; second, running the Zigzag decoding scheme. According to [13], the complexity of the second part is O(Lr^2). The overall decoding complexity is O(Lr(k − r)) + O(Lr^2), which is approximately O(Lrk), i.e., O(Lk) per failed source block, an efficient decoding complexity.

Property 4. If the number of failed blocks is fixed, the decoding time for a given file decreases as k increases.

When decoding a given file, assume the first part takes T_1 = aLr(k − r) seconds and the second part takes T_2 = bLr^2 seconds. In the second part, obtaining one bit requires reading several pieces of data and performing some computation, which takes several times as long as a single XOR operation; as a result, the constant factor a < b. The total decoding time is T = T_1 + T_2 = Lkr(a + (b − a)r/k) = Br(a + (b − a)r/k) seconds, where B = Lk is the size of the given file. If the number of failed blocks r is fixed, then since b − a > 0, the decoding time T decreases as k increases.
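Property 4 can be illustrated numerically. The sketch below (our own, with purely hypothetical per-bit costs a and b) evaluates T = Br(a + (b − a)r/k) for a fixed file size B and a fixed number of failures r, and shows the monotone decrease in k.

def decode_time(B, k, r, a, b):
    # T = T1 + T2 = a*L*r*(k - r) + b*L*r^2 with L = B/k, as in Property 4
    L = B / k
    return a * L * r * (k - r) + b * L * r * r

a, b = 1e-9, 3e-9            # hypothetical per-bit costs of the two decoding phases (a < b)
B, r = 8e9, 6                # a 1 GB file (in bits) with r = 6 failed blocks
for k in (6, 10, 20, 30):
    print(k, round(decode_time(B, k, r, a, b), 1))   # strictly decreasing in k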
V. EXPERIMENTS AND RESULTS

It has been shown that the new Zigzag-decodable code has optimal encoding and efficient decoding and can therefore contribute to a high-performance distributed file system. It is well known that the CRS code has high performance and can recover a file when multiple blocks fail. In order to examine the performance of our new code, we built a distributed file system based on both our new Zigzag-decodable code and the CRS code, where the CRS code uses the Jerasure erasure coding library [15]. The experimental system is based on Hadoop 0.22.0 and contains 36 machine nodes; each node has a 500 GB hard disk, 4.0 GB of memory, and an Intel i5 2.5 GHz CPU, and runs Ubuntu 12.04.
Fig. 5 shows the encoding time of the new Zigzag-decodable code and the CRS code. To generate m = 6 parity blocks for a file of size 1 GB, the new Zigzag-decodable code takes almost 6 s to encode as k ranges from 6 to 30. On the other hand, even though m = 6 is fixed, the encoding time of CRS for the 1 GB file increases as k grows from 6 to 30. As mentioned before, the encoding complexity of CRS is O(Lk log(k + m)) per parity block, while the new Zigzag-decodable code has an encoding complexity of O(Lk) per parity block. Here L is the size of one source block, so the size of the whole file, Lk, is a constant. Hence, for the new Zigzag-decodable code, the encoding cost per parity block depends only on the file size, and the encoding time per parity block stays steady for any k. Compared with the new Zigzag-decodable code, the encoding complexity of CRS has an extra factor log(k + m), which makes the encoding time increase as k + m increases.

Fig. 5: Encoding time of CRS and the new Zigzag-decodable code, for m = 6 and 6 ≤ k ≤ 30
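The flat curve in Fig. 5 can be reproduced qualitatively on a single machine. The toy sketch below (our own, and not the authors' Hadoop/Jerasure setup) encodes a fixed-size bit string with the new code for several values of k; since the work per parity block is about Lk = B bit operations regardless of k, the measured time stays roughly constant.

import random
import time
from functools import reduce
from operator import xor

def new_code_exps(k, m):
    q = (m + 1) // 2                                     # q = ceil(m/2)
    p = [(q - i) * (k - 1) if i <= q else 0 for i in range(1, m + 1)]
    return [[p[i - 1] + (i - q) * (j - 1) for j in range(1, k + 1)]
            for i in range(1, m + 1)]

B_bits, m = 8 * 10**6, 6                                 # a toy "file" of 8 Mbit, 6 parity blocks
for k in (6, 12, 24):
    L = B_bits // k
    src = [random.getrandbits(L) for _ in range(k)]
    exps = new_code_exps(k, m)
    start = time.perf_counter()
    parities = [reduce(xor, (s << e for s, e in zip(src, row)), 0) for row in exps]
    print(k, f"{time.perf_counter() - start:.4f} s")     # roughly constant in k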
Similarly, the decoding complexity of the CRS code is O(Lk log(k + m)) per failed source block, while that of our new Zigzag-decodable code is O(Lk). Fig. 6 compares the decoding performance of CRS and the new Zigzag-decodable code.

Fig. 6: Decoding time of the CRS code and the new Zigzag-decodable code. (a) Decoding when m = 6, k = 10, and a variable number of source blocks fail. (b) Decoding when m = 6, k varies, and 6 source blocks fail.

In Fig. 6(a), a file of size 1 GB is encoded with k = 10 and m = 6, and the decoding time is measured for different numbers of failed blocks. As Fig. 6(a) shows, both codes need more time when more blocks fail, but the new Zigzag-decodable code performs better and saves almost 10% of the decoding time. In Fig. 6(b), a file of size 1 GB is encoded with a variable number k of source blocks and m = 6 parity blocks; the x-axis is the number of source blocks k, and the y-axis is the decoding time (in seconds) when 6 source blocks fail. As k gets larger, the CRS code takes more time to decode, while the new Zigzag-decodable code takes less time, in accordance with Property 4. In particular, when k > 10, the new Zigzag-decodable code takes much less time than the CRS code to recover a file. Both in theory and in experiments, therefore, our new Zigzag-decodable code has higher performance than the CRS code. The CRS code, which is widely used in today's distributed file systems, can be expected to be replaced by higher-performance codes, and our new Zigzag-decodable code is one of them.

VI. CONCLUSION AND FUTURE WORK

MDS codes have been used in distributed file systems to reduce storage space as well as to recover files when multiple nodes fail. One widely used MDS code, the RS code, requires complex computation over a large Galois field and has been replaced by the CRS code, which avoids the large finite field to enable faster computation. However, the CRS code still requires much more computation than array codes. In order to achieve faster computation, we have developed a new MDS code. Without a large finite field, our new code needs only fast XOR operations to encode the parity blocks and uses a Zigzag decoding method to recover failed blocks, which achieves optimal encoding and efficient decoding. Furthermore, our new code has been implemented in a distributed file system to examine its performance. From our observations, the new Zigzag-decodable code has optimal encoding and efficient decoding in theory, which makes both encoding and decoding faster than with the CRS code, especially when k is large. Our new Zigzag-decodable code also requires less overhead, and hence less storage space, than the code in [13]. However, we do not claim that this is the minimum overhead for this class of Zigzag-decodable codes; an optimal solution may still be waiting to be found.

To regenerate one failed block, the RS code, the CRS code, and the new Zigzag-decodable code all have to access almost all the surviving data, and none of them can reduce the I/O cost. To save bandwidth, regenerating codes [16] and some other array codes, such as the optimal-rebuilding codes [17], have been introduced to reduce the I/O cost. However, most of them
use RS theory to generate the parity blocks and therefore do not have optimal encoding or decoding. In the future, we hope to combine our new Zigzag-decodable code with regenerating-code theory to develop a new code that achieves the lower bound on repair bandwidth as well as optimal encoding and decoding complexities.
VII. ACKNOWLEDGMENT

This work is supported by the National Basic Research Program of China (973 Program) (No. 2012CB315904), the National Natural Science Foundation of China (No. NSFC61179028), the Natural Science Foundation of Guangdong Province (No. NSFGD S2013020012822), and the Basic Research Program of Shenzhen (No. SZ JCYJ20130331144502026).

REFERENCES
[1] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5, ACM, 2003, pp. 29-43.
[2] N. Ellison, C. Steinfield, and C. Lampe, "The benefits of Facebook friends: social capital and college students' use of online social network sites," Journal of Computer-Mediated Communication, vol. 12, no. 4, pp. 1143-1168, 2007.
[3] L. N. Bairavasundaram, G. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "An analysis of data corruption in the storage stack," in FAST 2008: 6th USENIX Conference on File and Storage Technologies, San Jose, February 2008.
[4] H. Hou, K. Shum, M. Chen, and H. Li, "New MDS array code correcting multiple disk failures," in Proceedings of the IEEE Global Communications Conference (GLOBECOM 2014), Austin, Texas, USA, December 8-12, 2014.
[5] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes, North Holland, Amsterdam, 1977.
[6] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial and Applied Mathematics, vol. 8, pp. 300-304, 1960.
[7] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman, "An XOR-based erasure-resilient coding scheme," Technical Report TR-95-048, International Computer Science Institute, August 1995.
[8] M. Blaum, P. G. Farrell, and H. C. A. van Tilborg, "Chapter on Array Codes," in Handbook of Coding Theory, V. S. Pless and W. C. Huffman, Eds., to appear.
[9] M. Blaum, J. Brady, J. Bruck, and J. Menon, "EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures," IEEE Transactions on Computers, vol. 44, no. 2, pp. 192-202, Feb. 1995.
[10] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, "Row-diagonal parity for double disk failure correction," in Proceedings of the 3rd USENIX Conference on File and Storage Technologies, San Francisco, CA, USA, Mar. 2004, pp. 1-14.
[11] M. Blaum and R. M. Roth, "New array codes for multiple phased burst correction," IEEE Transactions on Information Theory, vol. 39, no. 1, pp. 66-77, January 1993.
[12] C. Huang and L. Xu, "STAR: an efficient coding scheme for correcting triple storage node failures," in Proceedings of the 4th USENIX Conference on File and Storage Technologies, San Francisco, Dec. 2005, pp. 197-210.
[13] C. W. Sung and X. Gong, "A Zigzag-decodable code with the MDS property for distributed storage systems," in Proc. IEEE ISIT, Istanbul, Turkey, Jul. 2013, pp. 341-345.
[14] F. R. Gantmacher, The Theory of Matrices, vol. 2, New York: Chelsea Pub., 1977.
[15] J. S. Plank, S. Simmerman, and C. D. Schuman, "Jerasure: A library in C/C++ facilitating erasure coding for storage applications - Version 1.2," Tech. Rep. CS-08-627, University of Tennessee, August 2008.
[16] A. G. Dimakis, P. B. Godfrey, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," in Proc. IEEE INFOCOM, Anchorage, Alaska, May 2007.
[17] I. Tamo, Z. Wang, and J. Bruck, "Zigzag codes: MDS array codes with optimal rebuilding," IEEE Transactions on Information Theory, vol. 59, no. 3, March 2013.
[18] H. Hou, K. Shum, M. Chen, and H. Li, "BASIC regenerating code: binary addition and shift for exact repair," in Proc. IEEE International Symposium on Information Theory (ISIT), 2013.