IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, VOL. 13, NO. 3, SEPTEMBER 2013
Letters

Reducing the Cost of Single Error Correction With Parity Sharing

Pedro Reviriego, Salvatore Pontarelli, Juan Antonio Maestro, and Marco Ottavi

Abstract—Error correction codes (ECCs) are commonly used to protect memory devices from errors. The most commonly used codes are a simple parity bit and single-error-correction (SEC) codes. A parity bit enables single-bit error detection, whereas a SEC code can correct one-bit errors. A SEC code requires more additional bits per word and also more complex decoding that impacts delay. A tradeoff between both schemes is the use of a product code based on a combination of two parity bits. This approach reduces the memory overhead at the expense of a more complex access procedure. In this letter, an alternative scheme based on the use of parity sharing is proposed and evaluated. The results show that the new approach significantly reduces the memory overhead and is also capable of correcting single-bit errors.
Index Terms—Error correction codes, parity sharing, soft errors.
I. INTRODUCTION

Error correction codes (ECCs) are widely used to protect memories from errors [1]. Two common configurations are used in memories to mitigate errors: a per-word parity bit and a per-word single error correction (SEC) code. Both configurations enable memory accesses with word granularity. The parity bit can detect single bit errors while the SEC code can correct single bit errors. Obviously, the cost of a SEC code is larger, as several parity check bits are required per word and the decoding and correction logic is complex [2].

In some cases, it is of interest to enhance at the system level the protection of a memory that already has a parity bit. This can be done using product codes, which are formed by combining two codes [3]. For example, in [4], a product code was constructed by using a memory word as a parity of a block of words. In this way, a per-word code was combined with a per-block code, forming a product code. When the codes that form the product code are a simple parity bit, they enable the correction of single bit errors. Therefore, SEC can be implemented in a memory that has a parity bit by using a word as a parity of a block of words. This is illustrated in Fig. 1.

Such a scheme requires a memory overhead that depends on the size of the block and can thus be easily adjusted. However, the use of large blocks that reduces the memory overhead also reduces reliability. When the probability of a bit error is $P_e$, the probability $P_l$ of an uncorrectable error in a block of $l$ bits can be approximated, when $P_e$ is much smaller than one, by

\[ P_l \cong \binom{l}{2} (P_e)^2 = \frac{l(l-1)}{2} (P_e)^2 \tag{1} \]
Manuscript received April 23, 2013; accepted July 2, 2013. Date of publication July 9, 2013; date of current version August 30, 2013. This work was supported by the Spanish Ministry of Science and Education under Grant AYA2009-13300-C03. This letter is part of a collaboration in the framework of COST ICT Action 1103 “Manufacturable and Dependable Multicore Architectures at Nanoscale.” P. Reviriego and J. A. Maestro are with the Universidad Antonio de Nebrija, 28040 Madrid, Spain (e-mail:
[email protected];
[email protected]). S. Pontarelli and M. Ottavi are with the University of Rome “Tor Vergata,” 00133 Rome, Italy (e-mail:
[email protected]; ottavi@ing.uniroma2.it). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TDMR.2013.2272484
Fig. 1. Illustration of a product code to implement SEC.
which clearly grows with the block size $l$. Comparing a block of $l$ bits with $b$ blocks of $k$ bits such that $b \times k = l$, the ratio of uncorrectable error probabilities is approximately

\[ \frac{P_l}{P_{k,b}} \cong \frac{l(l-1)}{k(k-1)\,b} = \frac{l-1}{k-1} \cong b \tag{2} \]
which shows an increase in the probability that is approximately linear with the block size. For example, merging $b = 4$ blocks of $k = 64$ bits into a single block of $l = 256$ bits gives a ratio of $255/63 \approx 4.05$.

In this letter, an alternative scheme to achieve SEC in memories is proposed. The approach is based on the use of parity sharing, which is also formed by a combination of two codes. Parity sharing has been proposed for communications and storage applications [5], [6] and is commonly based on Reed-Solomon codes [7]. In our case, the codes used are a parity bit and a SEC code. The proposed parity sharing scheme enables SEC in the memory with a lower memory overhead than the existing product code solution. The implementation of the encoding and decoding is also simple.

The rest of the letter is organized as follows. Section II provides an overview of parity sharing. In Section III, the proposed scheme is introduced. Then, in Section IV, it is evaluated in terms of memory overhead and memory access complexity. Finally, the conclusions of the work are presented in Section V.

II. PARITY SHARING

Parity Sharing (PS) is illustrated in Fig. 2. It can be observed that a per-word code ECC1 is used, but not all the parity check bits are stored in the memory. Instead, some of those parity check bits are themselves coded with a different code ECC2, and only the parity check bits obtained are stored in memory [5]. This reduces the memory overhead.

To correct errors, parity sharing works as follows. First, the ECC1 words are decoded treating the missing parity bits as erasures. Once that is done, the vertical ECC2 words are decoded. Finally, the ECC1 words are decoded a second time, now with all the bits available. The main idea is that the stored parity check bits of ECC1 will be sufficient in most cases to correct the errors and erasures in a word. In the few cases in which they are not (there are too many errors in a given word), the second code ECC2 allows the unstored parity bits of ECC1 to be recovered, so that more errors can be corrected in the second decoding.
1530-4388 © 2013 IEEE
Fig. 2. Illustration of the general parity sharing scheme.
As mentioned before, PS is commonly used with Reed-Solomon (RS) codes [6], [7]. This results in a complex decoding procedure and powerful error correction capabilities. For memories, in many cases, the correction of single bit errors is sufficient and decoding has to be performed with low delay. Therefore, existing PS schemes are not directly applicable to memory protection.

III. PROPOSED SCHEME

To achieve single bit error correction, a simplified Parity Sharing (PS) scheme is proposed. The per-word code ECC1 is a parity extended Hamming code [8] and the vertical code ECC2 is a simple parity bit. For the per-word code, only the parity bit is stored. The scheme is illustrated in Fig. 3. It can be implemented in a memory that has a parity bit by storing the ECC2 bits in additional words. The benefit compared with a product code is that only a fraction of a word is needed per block, thus reducing memory overhead.

Fig. 3. Illustration of the proposed parity sharing scheme.

The decoding procedure for the proposed PS scheme is as follows:

1) Read the word and check the parity bit. If there is no error, the data can be used directly. If there is an error, go to step 2.
2) Read all the words in the block, check the parity bit and reconstruct the Hamming parity bits. If other errors are detected, the errors cannot be corrected; otherwise proceed to step 3.
3) Using the ECC2 parity bits, reconstruct the Hamming parity bits for the erroneous word found in step 1.
4) Decode the erroneous word using the reconstructed Hamming code.

When there is no error in the per-word parity bit, the decoding stops in step 1 and only one read operation is needed. Steps 2–4 are needed only when an error is detected. Since soft errors are rare events, the vast majority of the accesses will be error free and will require only one memory read. Therefore, the average access time will be close to that of a memory protected with a per-word parity bit only. The same applies to the product code in Fig. 1. This idea of performing error detection first and proceeding to the more complex phases of decoding only when there are errors has already been used for other more advanced ECCs such as Euclidean Geometry [9] and Difference Set codes [10].

The PS scheme has a more direct impact when writing to the memory. In that case, the parity check bits for ECC2 have to be updated and therefore several memory accesses are required. A read-before-write operation is needed: the previous value and the ECC2 parity bits are read first, and then the new value and the updated ECC2 parity bits are written.

The decoding procedure described is capable of correcting single bit errors in a data block, as a single error is first detected by the word parity bit and then the Hamming parity check bits are reconstructed so that the error is finally corrected.

The number of parity check bits required to protect a block of $b$ words, each with $k$ data bits, using the proposed PS scheme is $b + \log_2(k) + 1$ when $k$ is a power of two. This number is obtained by adding the bits required for:

• A parity bit per word.
• The parity bits for the Hamming code bits, which for a word of $k$ data bits are $\log_2(k) + 1$ when $k$ is a power of two.

This compares with $b \cdot (\log_2(k) + 1)$ bits when a per-word Hamming code is used. A product code using vertical and horizontal parity bits requires $b + k$ bits, which is also larger.

IV. IMPLEMENTATION COST
As discussed in the previous section, the proposed scheme can correct single bit errors. For that, a standard per-word SEC code or a product code that combines two parities can also be used. Therefore, to understand the advantages and drawbacks of each of the alternatives, a comparison of their implementation cost in terms of memory overhead and memory access complexity is presented in the following.

The memory overhead is a key cost factor as it determines the effective memory capacity. To put the different options in perspective, Table I summarizes the required parity bits for each of the alternatives considered for different block sizes and data word-lengths. It can be observed that the proposed PS scheme enables significant reductions in the number of required bits compared to both a per-word SEC code and a product code. The savings versus the product code increase for larger data word-lengths. Therefore, the proposed scheme can be an interesting option to minimize the number of additional memory bits required.

TABLE I. REQUIRED NUMBER OF PARITY CHECK BITS

It should be mentioned that the reliability decreases with the block size, as discussed in the introduction. Therefore, the comparison with SEC is not a like-for-like comparison, as SEC will have a lower probability of uncorrectable errors. This effect increases approximately linearly with the block size, so the parameter $b$ provides a direct estimate of the impact on reliability.

A second factor to assess the implementation cost is the complexity of the memory access procedures. Obviously, the per-word SEC is the simplest option. In that case, a read or write to a word requires only one memory access. For the product code, reads require only one access when no error is detected. However, when errors are detected, all the words in the block have to be accessed to correct the error. The write accesses require a read-before-write operation, as the vertical parity word has to be updated using the old value and the value that is going to be written into the memory.

The situation for the proposed parity sharing scheme is similar. For reads, only one access is required when no error is detected. When errors are detected, the whole block needs to be accessed. For writes, read-before-write operations are also needed. Therefore, the PS scheme presented has a similar access complexity to that of product codes. We remark that the impact of this drawback is strictly dependent on the memory technology and on the memory application. For example, Phase Change Memories already use read-before-write operations to reduce energy consumption [11], while massive data storage usually requires consecutive memory accesses and involves whole code words, thus not requiring read-before-write operations. The same applies to instruction caches and other memories whose contents are rarely overwritten.

V. CONCLUSION AND FUTURE WORK

In this letter, a scheme to implement single bit error correction using Parity Sharing has been presented. The proposed scheme enables significant reductions in the number of parity check bits compared to the use of product codes or a per-word Single Error Correction code. The number of accesses required to write and read a word from the memory is the same as that needed for a product code. Therefore, the scheme is an interesting alternative to single error correction based on the use of product codes. The most direct practical application of the scheme is the implementation of single bit error correction on a memory that has a per-word parity bit.

REFERENCES
[1] C. L. Chen and M. Y. Hsiao, “Error-correcting codes for semiconductor memory applications: A state-of-the-art review,” IBM J. Res. Develop., vol. 28, no. 2, pp. 124–134, Mar. 1984.
[2] M. Y. Hsiao, “A class of optimal minimum odd-weight column SEC-DED codes,” IBM J. Res. Develop., vol. 14, no. 4, pp. 395–401, Jul. 1970.
[3] S. Lin and D. J. Costello, Error Control Coding, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2004.
[4] P. Reviriego, C. Argyrides, J. A. Maestro, and D. K. Pradhan, “Improving memory reliability against soft errors using block parity,” IEEE Trans. Nucl. Sci., vol. 58, no. 3, pp. 981–986, Jun. 2011.
[5] O. Collins, “Exploiting the cannibalistic traits of Reed–Solomon codes,” IEEE Trans. Commun., vol. 43, no. 11, pp. 2696–2703, Nov. 1995.
[6] M. K. Cheng and P. H. Siegel, “List-decoding of parity-sharing Reed–Solomon codes in magnetic recording systems,” in Proc. IEEE Int. Conf. Commun., Jun. 2004, vol. 2, pp. 640–644.
[7] G. C. Cardarilli, S. Pontarelli, M. Re, and A. Salsano, “Analysis of errors and erasures in parity sharing RS CODECS,” IEEE Trans. Comput., vol. 56, no. 12, pp. 1721–1726, Dec. 2007.
[8] R. W. Hamming, “Error correcting and error detecting codes,” Bell Syst. Tech. J., vol. 29, pp. 147–160, Apr. 1950.
[9] H. Naeimi and A. DeHon, “Fault secure encoder and decoder for nanomemory applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 4, pp. 473–486, Apr. 2009.
[10] S. Liu, P. Reviriego, and J. A. Maestro, “Efficient majority logic fault detection with difference-set codes for memory applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 148–156, Jan. 2012.
[11] J. Chen, R. C. Chiang, H. H. Huang, and G. Venkataramani, “Energy-aware writes to non-volatile main memory,” ACM SIGOPS Oper. Syst. Rev., vol. 45, no. 3, pp. 48–52, Dec. 2012.