Des Autom Embed Syst (2011) 15:289–310 DOI 10.1007/s10617-011-9078-2
HVD: horizontal-vertical-diagonal error detecting and correcting code to protect against soft errors Mostafa Kishani · Hamid R. Zarandi · Hossein Pedram · Alireza Tajary · Mohsen Raji · Behnam Ghavami
Received: 1 November 2009 / Accepted: 11 April 2011 / Published online: 15 May 2011 © Springer Science+Business Media, LLC 2011
Abstract This paper presents a high-level error detection and correction method called HVD code to tolerate multiple bit upsets (MBUs) occurring in memory cells. The proposed method uses parity codes in four directions over a data part to assure the reliability of memories. The method is very powerful in error detection, while its error correction coverage is also acceptable given its low computing latency. HVD code is useful for applications in which high error detection coverage is very important, such as memory systems; it can also be used in combination with other protection codes that have high correction coverage and low detection coverage. The proposed method is evaluated using more than one billion multiple fault injection experiments: multiple bit flips were randomly injected into different segments of a memory system, and the fault detection and correction coverages were calculated. Results show that 100% of the injected faults can be detected, and we prove that this method can correct up to three bit upsets. Hardware implementation issues are investigated to show the tradeoffs between different implementation parameters of the HVD method.

Keywords Error detection · Error correction · Soft errors · Multiple bit upsets · Vulnerability

M. Kishani · H.R. Zarandi · H. Pedram · A. Tajary · M. Raji · B. Ghavami
Department of Computer Engineering and Information Technology, Amirkabir University of Technology, Tehran, Iran
e-mail: [email protected]
1 Introduction

Much progress in scaling process technologies down to the deep-submicron (DSM) domain has paved the way for a significant increase in the integration level and performance of modern VLSI chips; the integration of complex Systems-on-a-Chip (SoCs) is now a reality. As process technology scales down to a few nanometers, high-density, low-cost, high-performance integrated circuits, characterized by high operating frequencies, low voltage levels and small noise margins, are increasingly susceptible to temporary faults [2]. Moreover, single-event upsets (SEUs) and single-event transients (SETs) generated by atmospheric neutrons and alpha particles severely impact field-level product reliability, not only for memories but also for logic in very deep sub-micron technologies. When these particles hit the silicon bulk, they create minority carriers which, if collected by the source/drain diffusions, can change the voltage level of a node.

Although SEUs are the major concern in space and terrestrial applications, multiple bit upsets (MBUs) and multiple event transients (METs) have also become an important problem in memory design, for the following reasons: (1) technology scaling increases the error rate [3, 9], so the probability of having multiple errors increases; (2) MBUs and METs can be induced by direct ionization or by nuclear recoil after the passage of a high-energy ion [6]; (3) experiments on memories under proton and heavy-ion fluxes [4, 7] show that the probability of having multiple errors grows as the memory size increases. Achieving acceptable reliability levels for modern VLSI chips is therefore a key issue. Unfortunately, packaging and shielding cannot effectively protect against soft errors, since these may be caused by neutrons which easily penetrate packages [8].

In order to maintain a good level of reliability, it is necessary to protect memory cells using protection codes. Error detection and correction codes play a key role in memory protection and reliability improvement. These codes are usually implemented in hardware, but require an extended memory bus architecture to accommodate parity bits and additional encoding/decoding circuitry. The reliability issue can also be addressed with forms of redundancy other than hardware redundancy, because hardware redundancy schemes such as duplication or triple modular redundancy are expensive [17]. In low-cost satellite projects, the reliability of a system can be improved by cost-effective software error detection and correction code schemes [5]. An important advantage of a software-based code scheme is that it can be changed dynamically to match the observed behavior of the memory devices. This is particularly beneficial when, as is often the case in low-cost satellite design, there is little chance to test the devices prior to flight [18]. Various types of error detection and correction codes are used in computers and communication systems, depending on the constraints imposed by the application. For satellite applications, for example, Hamming codes and different types of parity codes are used to protect memory devices, while complex codes are not applied due to time constraints and limited on-board resources [5].

In this paper, a high-level error detection and correction method to protect against soft errors is proposed. The method is based on parities for each row, each column, and each diagonal in the slash and backslash directions.
This method, named HVD, provides a very high detection coverage and can correct up to three upsets in a data block. To show the tradeoffs between implementation parameters, a large data block is partitioned into smaller data segments, and experiments investigate the effect of data segment granularity within a data block. The results show that the HVD method detects 100% of the faults injected into the protected block, and we prove that it can correct up to three upsets. Although the proposed method
requires more memory overhead than previous methods, it improves the fault detection and correction coverage using parity codes that have low computing latency and implementation complexity. To determine the error detection coverage of the proposed method, more than one billion multiple fault injection experiments were performed; the results show that all of the injected faults can be detected. We also prove, by means of six lemmas, that this method can correct up to three upsets. To the best of our knowledge, no previously published high-level technique corrects all triple faults while detecting such a large number of multiple faults in memories, as presented in this paper.

The rest of this paper is organized as follows. Section 2 provides related work. The proposed method is described in Sect. 3: it is shown how the parity bits in the different dimensions are generated, and the detection and correction algorithms are introduced. Section 4 discusses the physical implementation issues of the proposed method. Section 5 gives an example of the correction process proposed by the method. An experimental study of the proposed method based on fault injection is provided in Sect. 6, together with a comparison between HVD and previous protection codes. Finally, Sect. 7 concludes the paper.
2 Related work

Golay codes are an important example of cyclic codes. The binary linear code G23 is a (23, 12, 7) code consisting of $2^{12} = 4096$ distinct codewords, each 23 bits long and with minimum Hamming distance 7. G23 (23, 12, 7) is widely used in space applications [14] and is, in fact, an example of a perfect binary code [19]. It can perform error detection and correction through table lookup in a directory [10, 11]. The code associates 12 data bits with 11 parity bits; since the Hamming distance is 7, all error patterns of up to three errors can be corrected.

We use bit overhead and code rate to compare different error detection and correction methods. Bit overhead is the ratio of the number of parity bits to the number of data bits; code rate is the ratio of the number of data bits to the number of bits in the codeword. A better method therefore has a lower bit overhead and a higher code rate. For G23:

$$\text{Bit Overhead} = \frac{\text{number of parity bits}}{\text{number of data bits}} = \frac{11}{12} = 91.67\% \quad (1)$$

$$\text{Code Rate} = \frac{\text{number of data bits}}{\text{number of codeword bits}} = \frac{12}{23} = 52.17\% \quad (2)$$
BCH codes [12] belong to the class of linear cyclic block codes. For any integer $m \ge 3$ and $t < 2^{m-1}$, there exists a primitive BCH code characterized by the following parameters:

$$n = 2^m - 1, \qquad n - k \le mt, \qquad d_h \ge 2t + 1 \quad (3)$$

For $m = 5$ and $t = 3$, there exists the BCH (31, 16, 7) code, which has triple error correction capability. To avoid unnecessary detail, the interested reader is referred to [12, 15]. The code rate and bit overhead for BCH (31, 16, 7) are as follows:
$$\text{Bit Overhead} = \frac{15}{16} = 93.75\% \quad (4)$$

$$\text{Code Rate} = \frac{16}{31} = 51.61\% \quad (5)$$
Reed-Solomon [13] codes are a special case of BCH codes. A Reed-Solomon code is an $[n, k, n-k+1]$ code; in other words, it is a linear block code of length $n$ with dimension $k$ and minimum Hamming distance $n - k + 1$. The error-correcting ability of a Reed-Solomon code is determined by its minimum distance or, equivalently, by $n - k$, the measure of redundancy in the block. If the locations of the erroneous symbols are not known in advance, a Reed-Solomon code can correct up to $(n-k)/2$ erroneous symbols, i.e., half as many errors as there are redundant symbols added to the block. For $n = 63$ and $k = 57$, Reed-Solomon can correct up to 3 upsets in the data. The code rate and bit overhead for RS (63, 57, 7) are as follows:

$$\text{Bit Overhead} = \frac{7}{57} = 12.28\% \quad (6)$$

$$\text{Code Rate} = \frac{57}{63} = 90.47\% \quad (7)$$
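For concreteness, the two comparison metrics can be computed in a few lines of Python (a sketch we added for illustration; the parameter triples follow the figures quoted in this section and in Table 3):

```python
def bit_overhead(data_bits: int, parity_bits: int) -> float:
    """Bit overhead = number of parity bits / number of data bits."""
    return parity_bits / data_bits

def code_rate(data_bits: int, codeword_bits: int) -> float:
    """Code rate = number of data bits / number of codeword bits."""
    return data_bits / codeword_bits

# (data bits, parity bits, codeword bits)
codes = {
    "Golay(23,12,7)": (12, 11, 23),
    "BCH(31,16,7)":   (16, 15, 31),
    "HVD(64)":        (64, 50, 114),
}
for name, (k, p, n) in codes.items():
    print(f"{name}: overhead = {bit_overhead(k, p):.2%}, rate = {code_rate(k, n):.2%}")
```

Running this reproduces the percentages in (1)-(5) and the HVD row of Table 3.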
Rectangular parity codes are an extension of simple parity codes. By assuming a logical organization of the memory bits in rectangular form and calculating a parity check bit at the end of each row and each column, a two-dimensional parity-check matrix is obtained. In a rectangular parity code, if any data bit changes, the corresponding row parity bit and column parity bit change as well. This scheme easily detects single bit upsets by observing changes in the vertical and horizontal parity bits, and it can detect, and in some cases correct, multiple bit upsets in the data part using simple algorithms. It should be noted, however, that rectangular parity codes cannot correct, or even detect, certain arrangements of multiple bit upsets (see the sketch at the end of this section), and if an error occurs in the parity bits themselves, the scheme may have problems detecting and correcting it.

In [20] as well as [16], parity codes are used in 3 dimensions, and it is shown that the method can detect 5 bit errors and correct 5 bit errors. That method assumes errors occur only in the data bits; situations where an error occurs in the parity bits are not considered. The structure proposed in [20] is shown in Fig. 1, where we have injected 3 errors (the red (darker) bits). No syndrome appears in the directions of v4 and h5, because there are two bit flips in each of those directions, but d9 indicates a syndrome in its direction, so these errors cannot be corrected. To correct such an error, our method uses a second diagonal parity direction: the intersection of the two diagonal parity directions locates the error in the data. Note that our aim is to correct errors in the data bits; to correct the parity bits themselves, one additional parity bit is needed for each direction. The main differences between the two methods are: (1) we consider errors that can occur in the parity bits, which are not considered in [20]; (2) the method proposed in [20] can detect up to 5 bit errors, whereas experimental results show that our method detects 100% of multiple errors.
Fig. 1 Errors in structure proposed in [20]
(3) Our correction method differs from that of [20]: the correction method in [20] is based on three parity directions, whereas ours uses four parity directions plus an additional parity bit for each direction.
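As noted earlier in this section, a plain rectangular (two-dimensional) parity code cannot even detect some error arrangements. The following minimal sketch (our illustration, not from the paper) demonstrates the classic case of four upsets at the corners of a rectangle, which leave every row and column parity unchanged:

```python
def rect_parities(block):
    """Row and column parities of a 2D bit array."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols

data = [[0] * 8 for _ in range(8)]
before = rect_parities(data)
for r, c in [(1, 2), (1, 6), (5, 2), (5, 6)]:  # corners of a rectangle
    data[r][c] ^= 1                            # inject four upsets
assert rect_parities(data) == before           # every syndrome stays zero
```

Each affected row and column sees an even number of flips, so no syndrome is raised; the diagonal directions added by HVD are what break such patterns.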
3 The proposed error detection and correction method

The proposed detection and correction method is called HVD code, since parity bits are applied to the rows, columns and two diagonals of a data part. In addition to the horizontal (H) and vertical (V) parity bits, we use diagonal (D) parity bits in two directions, as shown in Fig. 2. In order to increase the detection ability, an additional parity bit is computed over the calculated parity bits of each dimension. In our HVD code implementation, h, v, d and d′ denote the number of errors in the horizontal, vertical, slash and backslash lines, respectively, whereas h1, v1, d1 and d′1 denote the position of the first error in the horizontal, vertical and the two diagonal parity lines, respectively.
Fig. 2 The horizontal, vertical, slash & backslash diagonal dimension parity scheme in HVD
3.1 Detection algorithm

In order to detect bit upsets in a codeword, all of the mentioned types of parity bits are recalculated on the data part of the codeword at the receiver side. Since the different types of parity bits can be computed independently (e.g., the vertical parity bits can be calculated while the horizontal parity bits are being computed), encoding, and consequently the detection procedure, can be parallelized. This property is useful in real-time and high-speed applications. Moreover, as soon as the first difference between the received parity bits and the recalculated ones is found, the detection algorithm stops; there is no need to compare the remaining calculated parity bits with the received ones.
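To make the encoding and detection steps concrete, the following minimal Python sketch (our illustration; the paper implements this in hardware) computes the four parity directions of Fig. 2 for an n × n data block and performs the comparison of Sect. 3.1. The extra per-direction parity bits over the parity lines are omitted here for brevity:

```python
from typing import List

def hvd_parities(block: List[List[int]]):
    """Return (horizontal, vertical, slash, backslash) parity lists."""
    n = len(block)
    h = [0] * n              # one parity bit per row
    v = [0] * n              # one parity bit per column
    d = [0] * (2 * n - 1)    # slash diagonals: constant r + c
    b = [0] * (2 * n - 1)    # backslash diagonals: constant r - c + (n - 1)
    for r in range(n):
        for c in range(n):
            bit = block[r][c]
            h[r] ^= bit
            v[c] ^= bit
            d[r + c] ^= bit
            b[r - c + n - 1] ^= bit
    return h, v, d, b

def detect(block, stored):
    """Recompute all parities and compare with the stored ones (Sect. 3.1).
    Stops at the first mismatch, as any() short-circuits."""
    return any(new != old for new, old in zip(hvd_parities(block), stored))

# Example: a single flipped bit raises syndromes in all four directions.
data = [[0] * 8 for _ in range(8)]
stored = hvd_parities(data)
data[2][5] ^= 1
assert detect(data, stored)
```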
3.2 Correction algorithm

In this section we describe how the data array is corrected after a probable error has been detected. First, suppose that up to three upsets occurred, and only on the data part that we need to correct. After a codeword is received, the parity bits are computed again. By comparing the calculated parity bits with the received ones, any difference in the vertical, horizontal or either diagonal dimension is found and sets the corresponding syndrome bit to one. As a result, four arrays are produced; for example, the elements of the horizontal array are the numbers of the rows that contain bit upsets. Next, every two arrays are chosen and all ordered pairs of their elements are produced. Each ordered pair yields at most one candidate bit [21]; note that for some ordered pairs no candidate bit can be determined [21]. These situations are shown in Fig. 3. In the case of Fig. 3(b), the two parity bits are faulty, so the backslash and horizontal lines do not intersect and there is no candidate bit.

Fig. 3 The created candidate bit and no candidate bit of the intersection

Based on Lemma 1, all of the real bit upsets are included in the candidate bits.

Lemma 1 All of the real bit upsets are included in the candidate bits.

Proof Assume that some bit upset is not included in the candidate bits. This means that the bit is not the intersection point of any ordered pair of elements of the error arrays. First assume that this bit upset is visible through a syndrome bit in one dimension. Since the other three dimensions then have no intersection with this dimension, there must be a bit upset in each of those dimensions that masks the dimension in the parity bit calculations. Hence there exist at least four bit upsets, contradicting the assumption of three bit upsets in the codeword. In the other case, none of the dimensions of the bit upset appears among the probable error dimensions. This implies a bit upset in each of the four dimensions that masks the dimensions in the parity bit calculations, i.e., five bit upsets, again contradicting the assumption of three bit upsets.

In the next phase of the correction algorithm, the elimination process of the candidate bits begins. During the elimination process it is determined whether an eliminated candidate bit is a real error bit or not. First, recall that there are at most three bit upsets and all of them are in the data bits (after each iteration of the loop, one bit upset is found and removed from the rest of the algorithm's procedure), so there are at most three errors in each dimension. Suppose that in the horizontal dimension there are bit upsets in three rows; then the candidate bits outside these rows should be eliminated, because they clearly cannot be real bit upsets. In the next phase, the candidate bits all four of whose dimensions appear in the error arrays are sought; in other words, the candidates whose row, column, first diagonal and second diagonal are all found as dimensions that include bit upsets. Lemma 2 shows that these candidate bits are the real bit upsets.
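The candidate-bit generation just described can be sketched as follows (our reconstruction; the index conventions for the diagonals are our assumption): every pair of mismatching parity lines from two different directions intersects in at most one cell.

```python
from itertools import combinations

def line_cells(direction, index, n):
    """All (row, col) cells of one parity line of an n-by-n block."""
    if direction == 'h':   # row `index`
        return {(index, c) for c in range(n)}
    if direction == 'v':   # column `index`
        return {(r, index) for r in range(n)}
    if direction == 'd':   # slash diagonal: r + c == index
        return {(r, index - r) for r in range(n) if 0 <= index - r < n}
    if direction == 'b':   # backslash diagonal: r - c + n - 1 == index
        return {(r, r - index + n - 1) for r in range(n)
                if 0 <= r - index + n - 1 < n}

def candidate_bits(syndromes, n):
    """Intersect every pair of syndrome lines from different directions.
    `syndromes` maps a direction ('h','v','d','b') to a set of line indices."""
    candidates = set()
    for dir1, dir2 in combinations('hvdb', 2):
        for i in syndromes.get(dir1, ()):
            for j in syndromes.get(dir2, ()):
                # two straight lines meet in at most one cell
                candidates |= line_cells(dir1, i, n) & line_cells(dir2, j, n)
    return candidates

# Example: syndromes caused by a single upset at (2, 5) in an 8x8 block
syn = {'h': {2}, 'v': {5}, 'd': {7}, 'b': {4}}
print(candidate_bits(syn, 8))   # {(2, 5)}
```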
Lemma 2 If all four dimensions of a candidate bit are found in the error arrays, that bit is a real bit upset.

Proof Suppose that this candidate bit is not a real bit upset. Then there must be four other bit upsets that mask this candidate bit in those dimensions. This means there are at least four bit upsets in the codeword, a contradiction.

After finding, correcting and eliminating these bits, the algorithm starts again from the beginning, and the error arrays are recomputed, taking into account that one bit upset has been found and that only the data part contains bit upsets. After this phase, the candidate bits that are alone in one dimension are sought; for example, in one of the found rows there is only one candidate bit. Lemma 3 shows that if there is only one candidate bit in a dimension, that bit is a real bit upset.

Lemma 3 If a candidate bit is alone in one row of the horizontal dimension (i.e., there is no other candidate in the same row of the horizontal dimension) and the syndrome bit of that row is set to one, the candidate bit is faulty and must be corrected.

Proof Assume there is a faulty bit in that row of the horizontal dimension, so the syndrome bit of the row indicates it. On the other hand, the candidate bit, say A, is the only candidate bit in the row, so A must be a bit upset. Note that at this step it is assumed that there is no bit upset in the parity bits. Clearly, A is correct if the syndrome bit of that row is zero.

Lemma 4 shows that there is always a candidate bit that can be selected for elimination.

Lemma 4 There is always a candidate bit that can be selected to be eliminated.

Proof There is at least one dimension with more than one and at most three syndrome bits set in its parity bit array. Without loss of generality, assume the horizontal dimension has three syndrome bits set to one, so each of the three bit upsets lies in a separate row of the horizontal dimension. Let candidate bit A be the last candidate bit in column order. Three situations may occur; they are shown in Fig. 4.

Fig. 4 Different states in finding CCB

Suppose candidate A cannot be eliminated. Then, in states 1 and 2, there must be more than one bit upset in the dimensions d1 and d2; such an upset could be at point x or y. But the column numbers of x and y are greater than that of A, contradicting the assumption. In state 3, a faulty bit B must lie in the same column as A; B can then be selected as the candidate bit, and the situation reduces to state 1 or 2.

Finally, all data bit upsets are found and the algorithm terminates. In the same way, Lemma 5 shows that if there are two data bit upsets and parity syndromes in three dimensions, both faulty data bits can be corrected; similarly, if there is one data bit upset and parity syndromes in two dimensions, the bit upset can be corrected.

In the remainder of this section it is assumed that errors may also happen in the parity bits. Consider the matrix E below:

$$E = \begin{pmatrix} h & v & d & d' \\ P_h & P_v & P_d & P_{d'} \end{pmatrix} \quad (8)$$

where h is the number of syndrome bits in the rows and Ph indicates whether there is any error in the row parity bits; the other directions are treated in the same way.

Lemma 5 It is possible to find two error bits using three dimensions.

Proof If there is one bit upset in the parity bits of the horizontal dimension, the syndrome bit of this dimension is eliminated and any remaining faulty bit can be found by the syndrome bits of the other two dimensions. If the two remaining bit upsets are in the parity bits of two different dimensions, the corresponding Pi's indicate this, and the data part is clearly correct. If both bit upsets are in the parity bits of one dimension, the matrix E will be

$$E = \begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

without considering the order of the columns. The dimension whose Pi equals one is ignored; based on the other dimensions there is no candidate bit, and there is no bit upset in the data part.

The number of bit upsets in the parity bits can be zero, one, two or three. The first case was discussed in the previous part. If there is one upset in the parity bits, one of the Pi's shows that there is at least one bit upset in the parity bits of one of the dimensions (without loss of generality, assume the bit upset occurred in the horizontal dimension). To find the other bit upsets in the data part, the syndrome bits of that specific dimension (h) are ignored. Two faulty bits and syndrome bits in three dimensions therefore remain. If the two other bit upsets are in the data part, Lemma 5 shows that both of them can be detected and then corrected. So assume that both remaining upsets are in the parity bits. If the two bit upsets are in two different dimensions of the parity bits, their corresponding Pi's reveal them and the data is clearly correct. If an error is in one dimension of the parity bits, the syndrome bits of that dimension are ignored as shown before, and finally syndrome bits in two dimensions and one error in the data part remain, which can be detected and corrected as Lemma 6 proves.

Lemma 6 It is possible to find one error bit using two dimensions.

Proof It is possible to find a single error in the data part when two of its dimensions appear in the matrix E. If the bit upset is in the parity bits, there is one faulty line in one dimension, no candidate bit exists, and the data part is therefore correct (Fig. 5). If the bit upset is in the data part, there is one syndrome bit in each dimension, and the intersection of the dimensions is the real faulty bit (Fig. 6).

Now suppose the two remaining errors are in the parity bits of one dimension; several arrangements may then occur, as shown in Fig. 7, corresponding (up to column order) to the matrices

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 3 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \quad (9)$$
Fig. 5 Bit upset on parity bits
Fig. 6 Bit upset on data part
Fig. 7 Different arrangements of three upsets

In states a and b there is no intersection of syndrome bits from different dimensions, and the algorithm states that the data is correct. In states c and d, the candidate bit all four of whose dimensions appear in the error arrays is eliminated, which returns us to state a or b; there is then no intersection, so the algorithm again concludes that the data is correct. In state e the intersection point is bit X, and the algorithm finds it easily. State f is a special state, because there are four dimensions in the error arrays whose intersection is not a single point. The algorithm treats it as an exceptional case: here, during the elimination process, if there is a candidate bit three of whose dimensions appear in the error arrays, that bit is considered a real bit upset. If there are three bit upsets in the parity bits, one of the Pi's certainly equals one; as mentioned, that dimension is ignored, and Lemma 5 shows that all of the bit upsets can be found in this specific case.
4 Physical implementation

Different implementation parameters of the HVD method, such as data segment size and bit overhead, are evaluated to provide tradeoffs between these parameters.

4.1 Granularity parameter

To show the tradeoff between implementation parameters, a 1024 × 1024 block of data is considered, and the HVD method is used to correct errors in this block. There are several possible
configurations for implementing the proposed method on a given data block. In the first configuration, HVD is applied to the entire block. In another, the block is divided into four separate blocks, called data segments, and the proposed method is applied to each of them; in further configurations, the number of separate blocks increases. As Table 1 shows, the bit overhead rate increases with the number of segments. Figure 8 shows the error correction rate as the number of errors increases, for different numbers of segments; the correction rate of a configuration increases as the number of segments increases. The correction rate is the number of correctable cases divided by the total number of cases; it is a value between 0 and 1, and a correction rate of 1 means that all possible errors in the data bits can definitely be corrected. As an example, if the data block is partitioned into 64-bit segments, 43 errors can be corrected in about 70% of situations. Considering Table 1 and Fig. 8, there is a tradeoff between the overhead and the correction rate of the proposed method; a suitable configuration can be determined based on this tradeoff and on the application in which the HVD coding will be applied, as the sketch following Fig. 8 illustrates.
Table 1 Overhead of each block configuration

Configuration (# of segments)    Bit overhead (parity bits / data bits)
1                                0.006
4                                0.012
16                               0.023
64                               0.047
256                              0.095
1024                             0.191
Fig. 8 Correction rate vs number of errors for different block configurations
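The trend in Table 1 can be approximated with a short sketch (our approximation, not the authors' exact accounting: an m × m segment is charged m row, m column and 2m − 1 parities per diagonal direction; the per-direction extra parity bits are omitted, which accounts for the small residual difference at fine granularities):

```python
def hvd_parity_bits(m: int) -> int:
    """Approximate parity bits for one m-by-m HVD segment."""
    return m + m + (2 * m - 1) + (2 * m - 1)   # = 6m - 2

for segments in [1, 4, 16, 64, 256, 1024]:
    m = 1024 // int(segments ** 0.5)           # segment side length
    overhead = segments * hvd_parity_bits(m) / (1024 * 1024)
    print(f"{segments:5d} segments: bit overhead = {overhead:.3f}")
```

The printed values (0.006, 0.012, 0.023, 0.047, 0.093, 0.186) closely track Table 1.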
4.2 HVD implementation

The proposed implementation of the HVD coding scheme is shown in Fig. 9: a 114-bit codeword consisting of 64 data bits and 50 additional parity bits. The slash and backslash parities are asymmetric, and the number of XORs per parity bit ranges from 0 (the first line) to 14 (the middle line). This architecture was chosen to obtain better error correction coverage. Note that the overhead of a codeword depends on the number of additional bits relative to the whole memory block size; in order to reduce the relative overhead of the proposed method, the tradeoff discussed in the previous subsection should be taken into account. The memory is therefore partitioned into an appropriate number of segments and the codeword is calculated per segment. On the other hand, in applications with block-based read and write operations, the error detection and correction methods and the corresponding hardware are implemented for the specific memory segment sizes. Figure 10 shows the general structure of the hardware used for calculating and updating the parity bits. The proposed hardware calculates the parity bits required for block read and write operations in just one cycle, so the performance overhead of the method is significantly low.
Fig. 9 Parity bits for a given memory block
Fig. 10 General hardware structure for the proposed method
As this hardware is shared by all of the memory blocks, its area overhead is negligible compared to the other hardware structures used for error detection and correction codes. In a read operation, all parity bits in the horizontal, vertical, slash and backslash directions must be checked. We assume that the granularity of memory access is a block, so the numbers of parity bits to be calculated in block read and block write operations are the same. The number of XOR gates needed is shown in Table 2, together with the dynamic energy per calculation and the leakage power of the proposed method in 45 nm technology. We assume that dedicated hardware calculates the horizontal, vertical, slash and backslash diagonal parities simultaneously in just one cycle.
Table 2 The number of XOR gates used for a given memory block

Parity direction            Symbol   Number of XOR gates        Dynamic energy per calculation (J)   Leakage power (W)
Horizontal parity           h        7 × 8 + 7 = 64             5.05725 × 10^-14                     4.01892 × 10^-6
Vertical parity             v        7 × 8 + 7 = 64             5.05725 × 10^-14                     4.01892 × 10^-6
Slash diagonal parity       d        7×6/2 + 6×5/2 + 14 = 50    3.95098 × 10^-14                     3.13979 × 10^-6
Backslash diagonal parity   d′       7×6/2 + 6×5/2 + 14 = 50    3.95098 × 10^-14                     3.13979 × 10^-6
Meanwhile, in cases where the parity calculation delay can be tolerated, simpler hardware (i.e., hardware with fewer XOR gates, at a delay penalty) can be used to reduce the leakage power overhead. To estimate the dynamic and leakage energy consumption of the proposed method, the Synopsys HSPICE simulator with the 45-nm Physical IP and Standard Cell Development kit is used [22].

4.2.1 Read and write procedure

When a block is read, all the parity bits in the horizontal, vertical and diagonal directions are calculated and compared with the stored parity bits. If a mismatch is detected, the correction mechanism corrects the affected block, and the corrected block is then read. When a block is written, the parity bits in all directions are calculated and written in the appropriate place. As described, the computation is done in one cycle using combinational XOR structures, which reduces the performance overhead of the method at the expense of some hardware cost (see the sketch at the end of this section).

4.3 Correction method implementation

The correction method of HVD is a step-by-step algorithm; it is explained in Sect. 5 with an example. Since the correction procedure of the proposed method is complicated, HVD is efficient for systems with back-up facilities: because the detection rate is near 100%, they can retransmit instead of starting a correction process.
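A minimal sketch of the block read/write flow of Sect. 4.2.1 (our illustration; `hvd_parities` is the encoding helper sketched in Sect. 3, and `correct` stands in for the correction algorithm of Sect. 3.2 as a hypothetical helper):

```python
def write_block(memory, addr, block):
    """Store the block together with its freshly computed HVD parities."""
    memory[addr] = (block, hvd_parities(block))   # hvd_parities: Sect. 3 sketch

def read_block(memory, addr):
    """Recompute parities on read; run correction only on a mismatch."""
    block, stored = memory[addr]
    if hvd_parities(block) != stored:    # any syndrome in any direction
        block = correct(block, stored)   # Sect. 3.2 (hypothetical helper)
        memory[addr] = (block, stored)
    return block
```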
5 An example of correction

In this section we consider a codeword with 3 errors and correct it. The codeword architecture with the errors (the red (dark) squares) is shown in Fig. 11; the architecture was explained in the previous section. The steps of the correction are as follows.

5.1 Step 1: regenerate parity bits

After regenerating the parity bits, we compare them with the original ones; any difference indicates an error in the bits that generate that parity. Figure 12 shows the mismatching parity bits: the parity bits of the horizontal and vertical directions are shown in the architecture, and the diagonal parity bits are shown on its right side.
Fig. 11 The codeword architecture with 3 errors
Fig. 12 Codeword with parity codes
Fig. 13 Candidate bits
5.2 Step 2: mark candidate bits

Each parity bit has a corresponding line of codeword bits that generate it. Consider the lines of the mismatching parity bits: when two such lines intersect in a bit, that bit is a candidate bit. To find all candidate bits, every pair of lines is intersected. The candidate bits of the example are shown in Fig. 13 as numbered rectangles.

5.3 Step 3: refine candidate bits to find errors

To refine the candidate bits, we use Lemma 3 (sketched in code below). For example, consider candidate 1: it is the only candidate on the line of h1 (parity 1 of the horizontal direction), and h1 is not a mismatching parity. Therefore candidate 1 is not an error, and we remove it from the candidate bits. Candidates 2, 12 and 16 can be removed in the same way; the resulting architecture is shown in Fig. 14. Now consider candidate bits 3, 5, 15 and 13: they are the only bits in directions d4, d5, d12 and d13, respectively. There is a mismatching parity on d4, and candidate 3 is the only candidate in that direction, so candidate 3 is an error. But d5, d12 and d13 are not mismatching parity bits, so candidates 5, 15 and 13 are not errors and should be removed; the remaining candidate bits are shown in Fig. 15. Candidate bits 8 and 14 are alone in the directions of d6 and h7, respectively; d6 and h7 are mismatching parity bits, so candidate bits 8 and 14 are errors. The status of the architecture at this stage is shown in Fig. 16. Now consider candidate bit 4 and mismatching parity bit h3: since candidates 3 and 4 are the only candidate bits on the line of h3 and candidate 3 is an error, candidate 4 is not an error and should be removed. The other non-error bits can be removed similarly; the final architecture is shown in Figs. 16 and 17.
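The refinement rule applied repeatedly above (Lemma 3) can be sketched as a single pass (our illustration; `lines` lists the four parity lines through a cell, using the diagonal indexing assumed in the Sect. 3 sketches):

```python
def lines(cell, n):
    """The four parity lines (direction, index) through a cell."""
    r, c = cell
    return [('h', r), ('v', c), ('d', r + c), ('b', r - c + n - 1)]

def refine_once(candidates, syndromes, n):
    """One pass of the Lemma 3 rule. `candidates` is a set of (row, col);
    `syndromes` maps a direction to the set of mismatching line indices.
    Returns (confirmed errors, still-undecided candidates)."""
    errors, undecided = set(), set()
    for cell in candidates:
        decided = False
        for direction, idx in lines(cell, n):
            alone = all(other == cell or (direction, idx) not in lines(other, n)
                        for other in candidates)
            if alone:
                # lone candidate on a line: error iff the line's syndrome is set
                if idx in syndromes.get(direction, set()):
                    errors.add(cell)
                decided = True
                break
        if not decided:
            undecided.add(cell)
    return errors, undecided
```

Re-running the pass after each removal reproduces the step-by-step refinement of Figs. 14-17.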
Fig. 14 Refining candidate bits, step 1
Fig. 15 Refining candidate bits, step 2
Fig. 16 Refining candidate bits, step 3

Fig. 17 Refining candidate bits, final step

6 Experimental results

Fault injection is an important method for estimating the fault detection coverage of different error correcting codes. In order to estimate the fault detection coverage of the proposed method, we used fault injection at the logic simulation level.
To measure the error detection coverage, we used a 64-bit data block, so the codeword was 114 bits. Ten million Monte Carlo simulations were performed for each number of faults: first one fault was injected per codeword, and the process was then repeated for fault counts from 2 to 114. For each simulation we then checked whether the method reports an error.
Results showed that in all simulations the HVD code reported an error; this means that the error detection coverage is 100%. Formally, let

$$F_{det}: \text{fault detection coverage} \quad (10)$$

$$F_{det} = \Pr\{\text{detect fault} \mid \text{there is at least one fault}\} \quad (11)$$

$$F_{det} = \sum_{i=0}^{n} \Pr\{\text{detect fault} \mid \text{there are exactly } i \text{ faults}\} \cdot \Pr\{\text{there are exactly } i \text{ faults}\} \quad (12)$$

where $n$ is the number of codeword bits. From the binomial theorem we have:

$$\Pr\{\text{there are exactly } i \text{ faults}\} = \binom{n}{i} (P_{fail})^{i} (1 - P_{fail})^{n-i} \quad (13)$$

The experimental results showed that

$$\forall i \le n: \quad \Pr\{\text{detect fault} \mid \text{there are } i \text{ faults}\} = 1 \quad (14)$$

so that

$$F_{det} = \sum_{i=0}^{n} \Pr\{\text{there are exactly } i \text{ faults}\} \quad (15)$$

$$F_{det} = \sum_{i=0}^{n} \binom{n}{i} (P_{fail})^{i} (1 - P_{fail})^{n-i} \quad (16)$$

$$F_{det} = (1 - P_{fail} + P_{fail})^{n} = 1 \quad (17)$$
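For reference, a scaled-down sketch of the described Monte Carlo experiment (our illustration, 10^5 trials instead of ten million; the 114-bit codeword layout follows Sect. 4.2, and the per-direction extra parity bits are included so that parity-bit upsets are also observable):

```python
import random

def hvd_parities(block):                      # as in the Sect. 3 sketch
    n = len(block)
    h, v = [0] * n, [0] * n
    d, b = [0] * (2 * n - 1), [0] * (2 * n - 1)
    for r in range(n):
        for c in range(n):
            bit = block[r][c]
            h[r] ^= bit; v[c] ^= bit
            d[r + c] ^= bit; b[r - c + n - 1] ^= bit
    return h, v, d, b

def encode(block):
    """Parity part: four directions plus one extra parity bit per direction."""
    h, v, d, b = hvd_parities(block)
    extra = [sum(line) % 2 for line in (h, v, d, b)]
    return h + v + d + b + extra

def trial(n=8, faults=3):
    block = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
    flat = [bit for row in block for bit in row] + encode(block)  # 114 bits
    for pos in random.sample(range(len(flat)), faults):
        flat[pos] ^= 1                        # inject distinct bit flips
    data = [flat[r * n:(r + 1) * n] for r in range(n)]
    return encode(data) != flat[n * n:]       # any mismatch => detected

trials = 100_000
print(sum(trial(faults=3) for _ in range(trials)), "of", trials, "detected")
```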
Fault correction coverage of the proposed method is 100% for three bit upsets, in addition to complete correction coverage for one and two bit upsets. This was thoroughly proved while describing the correction algorithm, by means of the lemmas established there; hence we do not use simulation to support this claim where the proofs of the lemmas already establish it.

Here we compare the HVD code with other triple-bit error correcting codes, based on the bit overhead and code rate defined in Sect. 2; the comparison is presented in Table 3 and illustrated in Fig. 18. As can be seen, the proposed HVD code is better with respect to the bit overhead metric than the Golay and BCH codes, and worse than the Reed-Solomon coding scheme. As explained in the previous section, the error detection and correction procedure of HVD takes only one cycle, so its delay overhead is very low, equal to or less than that of every other coding scheme.

Table 3 Comparison of triple bit error correcting codes

Error correcting method   # of data bits   # of parity bits   # of codeword bits   Bit overhead (%)   Code rate (%)
Golay(23, 12, 7)          12               11                 23                   91.67              52.17
BCH(31, 16, 7)            16               15                 31                   93.75              51.61
RS(64, 57, 7)             57               7                  64                   12.28              90.47
HVD(64)                   64               50                 114                  78.12              56.14
Fig. 18 Comparison of triple bit error correcting codes
The Golay coding scheme can perform error detection and correction through table lookup in a directory [10, 11]; for example, G23 (23, 12, 7) associates 12 data bits with 11 parity bits to form a set of $2^{12} = 4096$ codewords, each 23 bits long. HVD, in contrast, is implemented with XOR gates; the number of XOR gates is 228 for a 114-bit codeword. The hardware implementation cost of the lookup-based Golay scheme is therefore higher than that of the proposed method.

The BCH (63, 45) code is triple error correcting; using this code, up to 6 errors can be detected. The related generator polynomial has degree 18. To encode a block of bits (45 bits in this case), a number of zeros equal to the degree of the generator polynomial (18 in this case) is appended to it, and the result is divided by the generator polynomial using binary arithmetic; the remainder (18 bits in this case) plus the information constructs the complete codeword. To test the codeword for errors, it is divided by the generator polynomial; the remainder is zero if there are no errors. Thus, to encode a block of 63 bits, 63 × 18 = 1134 XOR and shift operations are required, and if an ordinary sequential divider is used, the encoding delay is 64 + 18 = 82 cycles. Determining whether an error exists is done by dividing the codeword by the generator polynomial again. As explained, the encode and decode processes of BCH codes have a high delay compared with the HVD encoding and decoding delay, which is just one cycle. Besides, the encoding/decoding power is considerable because of the large number of shifts and XORs required (63 × 18 = 1134 in this case); in general, the number of required shifts and XORs is a factor of n², whereas in the HVD method the number of XORs, and thus the power consumption, is a factor of n. The error correction procedure of BCH is much more complex than encoding. In general, BCH error correction involves three steps: 1. compute the syndrome from the received codeword; 2. find the error location polynomial from a set of equations derived from the syndrome; 3. use the error location polynomial to identify the errant bits and correct them. But since error correction is needed only in rare cases, we avoid comparing it and assume a software implementation of error correction for all methods.

Similar to BCH, Reed-Solomon has a complex hardware implementation compared with the HVD coding method, and its performance overhead is larger than one cycle per read or write operation. Although its bit overhead and code rate are better than those of the other methods, its performance and area overheads are larger than those of the proposed method.
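The encode/check procedure described for BCH can be sketched with generic GF(2) polynomial division (our illustration with a toy degree-3 generator, not the actual degree-18 polynomial of BCH (63, 45)):

```python
def gf2_remainder(bits, gen):
    """Remainder of bits(x) / gen(x) over GF(2); bit lists, MSB first."""
    rem = list(bits)
    for i in range(len(rem) - len(gen) + 1):
        if rem[i]:                       # leading term set: subtract (XOR) gen
            for j, g in enumerate(gen):
                rem[i + j] ^= g
    return rem[-(len(gen) - 1):]

def encode_cyclic(message, gen):
    """Systematic encoding: append deg(g) zeros, divide, attach remainder."""
    parity = gf2_remainder(message + [0] * (len(gen) - 1), gen)
    return message + parity

gen = [1, 0, 1, 1]                       # x^3 + x + 1 (toy generator)
cw = encode_cyclic([1, 0, 1, 1, 0, 0, 1], gen)
assert not any(gf2_remainder(cw, gen))   # zero remainder => no error detected
```

Each shift of a sequential divider corresponds to one iteration of the outer loop, which is where the n² shift/XOR cost quoted above comes from.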
Table 4 Dynamic energy per code calculation and leakage power of HVD and BCH

Method   Dynamic energy per calculation (J)   Leakage power (W)
HVD      1.80164 × 10^-13                     1.43174 × 10^-5
BCH      1.79216 × 10^-12                     9.82798 × 10^-6

Table 5 Total code calculation power of HVD and BCH

Method   Total code calculation power (W)
HVD      0.00012
BCH      0.00107
6.1 Power consumption comparison

In this section we report the dynamic and leakage energy consumption of HVD compared to the BCH code; since the error correction capability of BCH is the same as that of HVD, BCH is chosen for comparison. We used the Synopsys HSPICE simulator with the 45-nm Physical IP and Standard Cell library [22] to estimate dynamic and leakage energy consumption. Table 4 shows the dynamic energy per code calculation and the leakage power of the HVD and BCH methods: the dynamic energy per code calculation of BCH is about 10 times higher than that of HVD, while the two methods show competitive leakage power for the code calculation hardware. Assume that the target system works at a 3 GHz frequency and that, on average, 20% of all operations are memory operations, so the target system needs 600,000,000 memory accesses per second. Assuming a cache miss rate of 1%, it needs 594,000,000 cache accesses per second. For such a system, the total code calculation power of HVD and BCH is as shown in Table 5: the total code calculation power of HVD is just 11% of that of BCH. Note that BCH (63, 45), which is compared with HVD, contains just 45 data bits; normalizing the number of bits (i.e., calculating power per bit), the HVD code calculation power is just 7% of that of BCH.
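The figures in Table 5 follow from Table 4 by simple arithmetic, as this short check (our calculation under the stated assumptions) shows:

```python
# 3 GHz, 20% memory operations, 1% cache miss rate -> 5.94e8 accesses/s
accesses_per_s = 3e9 * 0.20 * (1 - 0.01)

for name, (dyn_energy_j, leak_w) in {
    "HVD": (1.80164e-13, 1.43174e-5),
    "BCH": (1.79216e-12, 9.82798e-6),
}.items():
    # total power = dynamic energy per calculation x access rate + leakage
    total = dyn_energy_j * accesses_per_s + leak_w
    print(f"{name}: {total:.5f} W")    # ~0.00012 W and ~0.00107 W
```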
7 Conclusions and future work

This paper presented a high-level error detection and correction method called HVD code. This type of protection code uses parity codes in four directions of a data block. Based on the experimental results, all injected multiple bit upsets were detected, and up to 3-bit errors can be corrected. Although the fault correction coverage of the proposed method equals that of the Golay code, the Golay code imposes significant area and power consumption and offers considerably lower error detection coverage than the proposed method. Moreover, the Golay, BCH and Reed-Solomon codes require substantial computation to detect and correct bit errors. For these reasons, HVD code is useful in applications for which high error detection coverage is very important. Of course, this code can also be used in combination with other code schemes that have high correction coverage but low detection coverage in comparison with HVD code.
References

1. Hazucha P, Svensson C (2000) Impact of CMOS technology scaling on the atmospheric neutron soft error rate. IEEE Trans Nucl Sci 47(6):2586–2594
2. International Technology Roadmap for Semiconductors (2002) http://public.itrs.net/
3. Ferreyra PA, Marques CA, Ferreyra RT, Gaspar JP (2005) Failure map functions and accelerated mean time to failure tests: new approaches for improving the reliability estimation in systems exposed to single event upsets. IEEE Trans Nucl Sci 52(1):494–500
4. Karlsson J, Liden P, Dahlgren P, Johansson R, Gunneflo U (1994) Using heavy-ion radiation to validate fault-handling mechanisms. IEEE Micro 14:8–23
5. Imran M (2006) Using COTS components in space applications. Master's thesis, Delft University of Technology
6. Hentschke R, Marques R, Lima F, Carro L, Susin A, Reis R (2002) Analyzing area and performance penalty of protecting different digital modules with Hamming code and triple modular redundancy. In: International symposium on integrated circuits and systems design, pp 95–100
7. Reed R (1997) Heavy ion and proton induced single event multiple upsets. In: IEEE nuclear and space radiation effects conference, pp 2224–2229
8. Seifert N, Moyer D, Leland N, Hokinson R (2001) Historical trend in alpha-particle induced soft error rates of the Alpha microprocessor. In: Proceedings of the 39th annual IEEE international reliability physics symposium, pp 259–265
9. Argyrides C, Zarandi HR, Pradhan DK (2007) Multiple upsets tolerance in SRAM memory. In: International symposium on circuits and systems, New Orleans, LA, May 2007
10. Rubinoff M: N-dimensional codes for detecting and correcting multiple errors. Commun ACM 545–551
11. http://www.eccpage.com/golay23.c
12. Berlekamp ER (1968) Algebraic coding theory. McGraw-Hill, New York
13. Fill TS, Gulak PG (2002) An assessment of VLSI and embedded software implementations for Reed-Solomon decoders. In: Signal processing systems
14. http://mathworld.wolfram.com/GolayCode.html
15. Lin S, Costello DJ Jr (1983) Error control coding: fundamentals and applications. Prentice-Hall, Englewood Cliffs. ISBN 0-13-283796-X
16. Thirunavukkarasu U, Babu Anne N, Latifi S (2004) Three and four-dimensional parity-check codes for correction and detection of multiple errors. In: International conference on information technology: coding and computing, p 480
17. Shirvani PP, Saxena NR, McCluskey EJ (2000) Software-implemented EDAC protection against SEUs. IEEE Trans Reliab 3:273–284
18. Underwood CI, Oldfield MK (2000) Observations on the reliability of COTS-device-based solid state data recorders operating in low-earth orbit. IEEE Trans Nucl Sci 47:647–653
19. Togneri R, deSilva CJS (2002) Fundamentals of information theory and coding design. Discrete mathematics and its applications. CRC Press, New York. ISBN 1-58488-310-3
20. Pflanz M, Walther K, Galke C, Vierhaus HT (2004) On-line techniques for error detection and correction in processor registers with cross-parity check. J Electron Test Appl. doi:10.1023/A:1025165712071
21. Yoon DH, Erez M (2009) Memory mapped ECC: low-cost error protection for last level caches. ACM SIGARCH Comput Archit News 37(3):116–127
22. http://www.synopsys.com/home.aspx