Software Implemented Fault Tolerance through Data Error Recovery


Goutam Kumar Saha, Member ACM
Mail to: CA-2/4B, CPM Party Office Road, Baguiati, Deshbandhu Nagar, Kolkata-700059, WB, INDIA. [email protected]

This paper examines how a new software-implemented data error recovery scheme can be more effective than conventional Error Correction Codes (ECC) during the execution time of an application. The proposed algorithm is three times faster than conventional software-implemented ECC, and because of its simplicity, application program designers can easily implement it in their fault-tolerant applications at no extra hardware cost. The proposed software-implemented scheme for execution-time data-error detection and correction relies on three-fold replication of the application data set as the basis for fault tolerance.

1. Introduction

In this proposed scheme, we use an error-masking scheme as well as a data recovery scheme that corrects corrupted data, based on a triplicated data model. The approach is suitable for any application that refers to a set of data during its computation, whose correctness is essential for producing correct output, and that has enough memory space to accommodate the data in triplicate. Three images of an application's lookup data are kept in system memory, in different memory segments. During execution, the corresponding byte is fetched from each of the three images and the three copies are compared to check the validity of the fetched data byte. If at least two of the fetched bytes have the same contents, that value is considered fault-free; otherwise, the byte is faulty.

The proposed scheme is compared with other available error checking and correcting codes. ECC codes are commonly used in solid-state memory systems for online error detection and correction. CRC codes check the validity of data over unreliable communication links. BCH codes are another cyclic coding scheme for error detection and correction. RS codes are block-based error correction codes commonly used for mass storage media such as CD, DVD, etc. The proposed scheme aims to supplement conventional ECC codes. An online, low-cost, software-implemented scheme for detecting and recovering data or byte errors is always useful for low-cost reliable application systems, and the simplicity of a scheme is equally important for system engineers who must employ it in their applications. Here, we focus on software-implemented detection and correction of an application's data-byte errors through an extended triple modular redundancy (TMR)-type scheme and error masking. The technique is applicable to any table-lookup scheme.

Short-duration noises often cause random data corruption in the primary memory of a computing machine; Electrical Fast Transients (EFT), Electrostatic Discharge (ESD) and Electromagnetic Pulses (EMP) are examples of such noises. A scientific application often computes erroneous results if it reads bad data from a table. We often take it for granted that our program code and data banks are absolutely correct while designing software for an application, but this is not always true, because high-speed processing units are frequently affected by short-duration noises [1-4].
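As an illustration of the fetch-and-vote step just described, the following is a minimal C sketch. The image names, the error handler and the array layout are illustrative assumptions, not details given in the paper.

#include <stdint.h>
#include <stddef.h>

/* Three images of the same lookup table, assumed to be kept in
   different memory segments (illustrative declarations). */
extern uint8_t img1[], img2[], img3[];

/* Assumed application-specific handler for the no-majority case. */
extern void error_routine(size_t f);

/* Fetch the byte at offset f from the three images and return the
   two-out-of-three majority value; an error in one copy is masked. */
uint8_t voted_read(size_t f)
{
    uint8_t b1 = img1[f], b2 = img2[f], b3 = img3[f];
    if (b1 == b2 || b1 == b3)
        return b1;        /* b1 agrees with at least one other copy */
    if (b2 == b3)
        return b2;        /* b1 is the single corrupted copy        */
    error_routine(f);     /* all three copies differ: no majority   */
    return b1;
}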


2. Conventional ECC

We all know the various conventional error detection and correction schemes, such as checksums, Hamming codes, Cyclic Redundancy Checks (CRC), the Weighted Sum Code (WSC) [14], Bose-Chaudhuri-Hocquenghem (BCH) codes and Reed-Solomon (RS) codes. None of them is free from limitations [5,6,7].

A single parity check can detect only an odd number of single-bit errors; any even number of bit errors remains undetected, so it is inadequate for detecting an arbitrary number of errors. CRC is useful for detecting multiple bit errors. CRC implementations normally use shift register circuits, which divide polynomials and find the remainder, along with modulo-2 adders and multiplier circuits. A CRC receiver fails to detect errors when aliasing occurs, i.e., when the remainder is erroneously zero. CRC also has high time redundancy, which is why it is normally hardware based; software-based CRC implementation is impractical in a real-time application. An h-bit CRC is generated by G(X) = (X+1)P(X), where P(X) is a primitive polynomial of degree h-1. Such a CRC has a maximum code-word length of 2^(h-1) - 1 bits and minimum distance d = 4, i.e., all double and all odd errors are detected. This code can also detect any error burst of length h bits or less; in other words, its burst-error-detecting capability is b = h. Although the CRC, WSC and Fletcher checksum can be extended to arbitrary length, their minimum distances all reduce to 2. A Hamming code provides only single-bit error correction and double-bit error detection. In a typical checksum, n bytes are XORed and the result is stored in the (n+1)th byte; if this byte itself is corrupted by transients, or if an even number of errors occurs in the same bit positions, the errors remain undetected (a short demonstration follows below).

The Bose, Chaudhuri and Hocquenghem (BCH) codes form a large class of error correcting codes; the Hamming and Reed-Solomon (RS) codes are two special cases of this powerful error correcting technique. Hamming, BCH and RS codes have nice mathematical structures, but there are limitations when it comes to code lengths, and these codes incur very high time redundancy when implemented in software. When the stored check bits themselves become erroneous, they do not match the computed check bits and the block code fails. In general, checksum schemes fail when the data is corrupted enough to transform into another valid code word (the distance of the code). The proposed technique, by contrast, is capable of detecting byte errors and correcting them with an affordable redundancy in both time and memory space. The proposed scheme is three times faster than CRC and BCH, although its space redundancy is higher than that of these popular codes. Interested readers may refer to [7-12] for related work on fail-safe fault tolerance through error detection.
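As a minimal demonstration of this checksum limitation (a C sketch with illustrative data values, not taken from the paper), two bit flips in the same bit position of different bytes leave a simple XOR checksum unchanged:

#include <stdint.h>
#include <stdio.h>

/* Typical checksum: XOR of n bytes, stored as the (n+1)th byte. */
static uint8_t xor_checksum(const uint8_t *d, unsigned n)
{
    uint8_t c = 0;
    for (unsigned i = 0; i < n; i++)
        c ^= d[i];
    return c;
}

int main(void)
{
    uint8_t data[4] = { 0x12, 0x34, 0x56, 0x78 };
    uint8_t check = xor_checksum(data, 4);

    data[0] ^= 0x01;   /* a transient flips bit 0 of byte 0 ...    */
    data[2] ^= 0x01;   /* ... and bit 0 of byte 2: an even change  */

    /* The recomputed checksum still matches the stored check byte,
       so the double error goes undetected. */
    printf("%s\n", xor_checksum(data, 4) == check ? "undetected"
                                                  : "detected");
    return 0;
}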

3. The Software-Implemented Scheme

The steps of the algorithm for data-byte error detection and correction are stated in Algorithm DCDE.

Algorithm DCDE

Procedure: The following basic steps show how the DCDE scheme works.


/* The procedure verifies the corresponding three data bytes at an offset, say f, of the three images of the application data and, if any byte error has occurred, repairs the corrupted byte. The starting addresses of the three images are known. */

Step 1. Set S = size of an image in bytes. /* The size of an image in bytes is known */

Step 2. Set f = 0. /* Initialize the memory offset f */

Step 3. B12 = I1f .XOR. I2f /* The bytes at offset f in the two images I1 and I2 are XORed */

Step 4. If B12 .EQ. 0, Then: No error. /* Go to Step 5, i.e., program control passes beyond the outer End If of Step 4, to scan the next byte of the three images */
        Else: B13 = I1f .XOR. I3f
            If B13 .EQ. 0, Then: I2f = I2f .XOR. B12 /* The bytes at I1f and I3f are correct but I2f is bad, so the erroneous byte at I2f is corrected */
            Else: B23 = I2f .XOR. I3f
                If B23 .EQ. 0, Then: I1f = I1f .XOR. B12 /* The bytes at I2f and I3f are correct but I1f is bad, so the erroneous byte at I1f is repaired */
                Else: Call ERROR /* All three bytes at offset f in the three images are corrupted, i.e., a memory hardware problem or permanent error, so the ERROR routine is called to repair from the master copy */
                {End of If structure}
            {End of If structure}
        {End of If structure}

Step 5. Set f = f + 1. /* Offset f is incremented by one to point to the next byte */
        If f < S, Then: Go to Step 3. /* Scan the next byte for error detection and repair thereof */
        Else: Return. /* On completion of the scanning and repair of the entire set of data (in triplicate), from the 0th byte through the (S-1)th byte of the 1st, 2nd and 3rd images, program control goes back to the main application. */
        {End of If structure}

[End of Algorithm DCDE]
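The following C sketch is one possible realization of Algorithm DCDE, assuming the three images are byte arrays of equal size and that ERROR is an application-specific routine that restores bytes from a master copy; the function and variable names are illustrative, not from the paper.

#include <stdint.h>
#include <stddef.h>

/* Assumed application-specific recovery from a master copy,
   invoked when all three copies at offset f disagree. */
extern void error_routine(size_t f);

/* DCDE: scan the three images i1, i2, i3 of s bytes each and repair
   any single corrupted copy by XORing, as in Steps 3 to 5. */
void dcde(uint8_t *i1, uint8_t *i2, uint8_t *i3, size_t s)
{
    for (size_t f = 0; f < s; f++) {   /* Steps 2 and 5            */
        uint8_t b12 = i1[f] ^ i2[f];   /* Step 3                   */
        if (b12 == 0)
            continue;                  /* Step 4: I1f equals I2f,
                                          treated as no error      */
        if ((i1[f] ^ i3[f]) == 0)
            i2[f] ^= b12;              /* I1f, I3f good: fix I2f   */
        else if ((i2[f] ^ i3[f]) == 0)
            i1[f] ^= b12;              /* I2f, I3f good: fix I1f   */
        else
            error_routine(f);          /* permanent/hardware error */
    }
}

Note that the repair is pure XOR arithmetic: when I1f and I3f agree, i2[f] ^ b12 equals i2[f] ^ i1[f] ^ i2[f], which is i1[f], the correct value.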

4. Discussion

It is assumed that the starting addresses of the three images of the set of application data are I1, I2 and I3, respectively. When the offset f is 0 (its initial value), the address I10 denotes the starting address of the first image I1, because I10 has the value (I1 + 0), i.e., starting address plus offset. In general, if Im is the starting address of the mth image, the address of the byte at offset f is given by equation (1).

Imf = Im + f …… (1)

Again, if any one byte out of the three corresponding bytes of the three images at an offset f is corrupted, the DCDE routine can repair the corrupted byte by XORing. The affected byte is detected by comparing the three bytes at the same offset, as shown at Step 3 and Step 4 of Algorithm DCDE. If the contents of two corresponding bytes are the same, the XOR result is zero, i.e., 00000000 or 00H; otherwise, the result is a nonzero value. In general, if the byte contents of the mth and nth images at an offset f are Imf and Inf, and these two values are not corrupted, then the following equation holds.

Imf .XOR. Inf = 0 …… (2a)

Otherwise, if the two byte contents differ, equation (2a) is not satisfied, i.e.,

Imf .XOR. Inf ≠ 0 …… (2b)

The possibility that transients inadvertently alter two bytes at distant locations into the same corrupted value, so that equation (2b) fails to hold, is almost zero. In other words, the chance of byte errors remaining undetected is

1/(2^8) * 1/(2^8) = 2^-16 …… (3)

This method is capable of detecting even all-8-bit errors, i.e., even an entirely corrupted byte can be detected, and it can likewise repair all 8-bit errors. If, say, the byte Inf is corrupted to In*f, while the byte contents Imf and Iof at offset f of the two images Im and Io remain the same, then by comparing the three corresponding bytes of the three images we can detect that the byte Inf is corrupted (as shown at Step 3 and Step 4 of Algorithm DCDE). The corrupted byte is repaired in the following way, which applies even when all 8 bits of a byte are in error. Say Imf = 1001 1101 and the corrupted bit pattern at Inf is, say, In*f = 0110 0010. Then the result (in Smn) of XORing these two bit patterns will be

Smn = Imf .XOR. In*f = 1111 1111

Now the bit pattern of (In*f .XOR. Smn) will be 1001 1101, i.e., In*f is repaired.
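The arithmetic of this repair can be checked with a few lines of C, using the values from the example above (the variable names are illustrative):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint8_t imf      = 0x9D;         /* Imf  = 1001 1101             */
    uint8_t inf_star = 0x62;         /* In*f = 0110 0010 (corrupted) */
    uint8_t smn      = imf ^ inf_star;
    assert(smn == 0xFF);             /* Smn  = 1111 1111             */
    inf_star ^= smn;                 /* In*f .XOR. Smn               */
    assert(inf_star == imf);         /* repaired to 1001 1101        */
    return 0;
}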

If there is no error in the program and data code, then the following equation will be satisfied.

Imf = Inf = Iof …… (4a)

The chance that the three corrupted bytes of the three images at the same offset satisfy equation (4a) is negligibly small, because the effects of transients on memory and registers are very random and independent in nature:

1/(2^8) * 1/(2^8) * 1/(2^8) = 2^-24 …… (4b)

In other words, the chance that three bytes at different locations, holding the same bit pattern for a particular value, are simultaneously altered (by the random effects of transients) to a common corrupted value so as to satisfy equation (4c) is negligibly small.

Im*f = In*f = Io*f …… (4c)

Again, the chance of a particular one-byte value stored at the same offset in the three images being altered to three different values (of different bit patterns) is 1/(2^24), or 2^-24. Such a disastrous effect indicates a possibility of memory hardware or permanent errors, and the ERROR routine is invoked for the necessary recovery. In other words, the possibility of invoking the ERROR routine (shown at Step 4) is negligibly small. The scheme is thus very effective for detecting and recovering application-data errors during computation, over the life cycle of the application. After detecting and repairing the entire application data, program control goes back to the main application. Even a totally corrupted image can be repaired by this technique, one byte after another. The space redundancy of the proposed technique is about three; however, given the downward trend in hardware prices, this much space redundancy is easily affordable in many applications where space and time constraints are not stringent, and the slightly higher time redundancy can likewise be afforded on readily available high-speed machines. The proposed technique is capable of detecting and repairing any number of soft errors (not reproducible) as well as permanent errors during the run time


of an application.

Fault detection is inversely proportional to the upper limit of the variable No_of_Calls_to_DCDE, i.e.,

FD ∝ 1 / No_of_Calls_to_DCDE …… (5)

Thus, depending on the potential threat of transients, the upper limit of No_of_Calls_to_DCDE, i.e., the frequency of calls to the DCDE routine from the various points of an application program, can be changed to meet the application designer's requirements, as sketched below.
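As an illustrative sketch (the table size, the call site and the call frequency below are assumptions, not values from the paper), an application might invoke the dcde routine from Section 3 at designer-chosen points:

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 1024u   /* hypothetical lookup-table size */

extern uint8_t img1[TABLE_SIZE], img2[TABLE_SIZE], img3[TABLE_SIZE];
extern void dcde(uint8_t *i1, uint8_t *i2, uint8_t *i3, size_t s);

void application_loop(void)
{
    unsigned iteration = 0;
    for (;;) {
        /* ... application work that reads the triplicated table ... */

        /* Call DCDE more often when the transient threat is higher;
           here, one scan-and-repair pass every 100 iterations. */
        if (++iteration % 100u == 0u)
            dcde(img1, img2, img3, TABLE_SIZE);
    }
}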

5. Effectiveness of the Proposed Scheme Compared to ECC

The various block codes such as BCH, Hamming and RS codes have nice mathematical structures, but there are limitations when it comes to code lengths. A bounded-distance algorithm is usually employed in decoding block codes, and it is generally hard to use soft-decision decoding for them. Such codes are also not always adequate for on-chip DRAM applications: their encoding and decoding circuits employ a linear feedback shift register (LFSR), which, if used in a DRAM chip, introduces very high access delay. Nor are such codes suitable for detecting and correcting a byte that gets entirely corrupted, and their software implementations suffer from very high time redundancy.

The code rate (RC) of BCH (n=63, k=45) is (45/63) or 0.71, where k is the number of information bits and n is the code length (k + r), r being the number of parity check bits, whereas the proposed approach's RC is (8/24) or 0.33. The code efficiency of BCH (63,45) is E = number of correctable bits (t) / code length (n) = 3/63, or about 5%, whereas the proposed scheme's code efficiency is (8/24) or 33.33%. In a BCH code, the number of bits that can be corrected in an n-bit code word is t = r/m, where m = log2(n+1); in other words, the number of parity bits required to correct t bits is r = t * m = t * log2(n+1). Thus, in order to correct all 8 bits, the BCH code length n becomes 56 bits with k = 8, so in the worst case, when all bits of a byte get corrupted, the space redundancy of BCH (56,8) becomes completely unaffordable. In such a case of entire-byte corruption, the BCH code has its least code efficiency of (8/56) or 14%, whereas the proposed scheme retains a code efficiency of 33.33%. The minimum distance of BCH codes is related to t by the inequalities

2t + 1 ≤ dmin (odd); 2t + 2 ≤ dmin (even).

In other words, the Hamming code, where n = 2^r - 1, is seen to be a BCH code with m = r, so that t = 1. The RS code can also correct burst errors: with 6 check symbols, up to 3 symbol errors, and therefore 3 bit errors, can be corrected. The implementation complexity of the coding and decoding schemes of block codes increases with code length and with their capability to detect and correct errors, and the required computational operations increase as the number of parity symbols increases.

It is seen that the software operation count per byte of CRC-16 (generated by binary polynomials such as X^16 + X^15 + X^2 + 1) is 46.2, whereas the operation count per byte of the proposed scheme is only 15. It is seen [13] that the software operation count of a CRC is tCRC(s,n) = 5.5ns + 3n*g(s), where a code word consists of n tuples or polynomials and each tuple has s bits; when s = 8 and n = 64, then ns = 512 bits and g(8) = 52. The ratio of the byte operation counts of the CRC and the proposed scheme (including detection and correction of all eight bits of an information byte) is then 46.2 / 15 = 3.08 (for error detection only, the ratio is 46.2 / 5 = 9.2, where 5 is the proposed scheme's per-byte operation count for error detection through byte comparison alone). In other words, the proposed software scheme is three times faster than the software-implemented CRC code with 16 check bits for error detection and recovery of an entire information byte, and nine times faster for error detection alone. Its code efficiency is also higher than that of the BCH code.

5.1 The Overhead Comparison

We all know that the major drawback of error detection and fault tolerance by software means is the increase in execution time and memory area overhead. In a study of a simple Bubble-sort program over 60 integer values, the observed overhead factors are listed in Table 1. It is observed that the proposed software scheme leads to better performance.

Table 1. Overhead factors for a Bubble-sort program of 60 integer values.

Program Approach                   Overhead Factor
CRC (non-distributed)              > 10
Hamming                            > 10
Proposed Software Scheme (SF)      < 2.5
