An Efficient Technique to Tolerate MBU Faults in Register File of Embedded Processors

M. A. Abazari, M. Fazeli, S. G. Miremadi
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
{abazari, m_fazeli}@ce.sharif.edu, [email protected]

Abstract

This paper presents a cost-efficient technique to protect the register file of embedded processors against multiple bit faults. The proposed technique provides a high level of protection against MBUs while incurring low area and power overheads. It is motivated by the fact that not all data stored in the register file occupy the full register width. The key idea is to exploit these unused bits for fault detection and correction: based on the number of free bits available in a register, an appropriate error detection/correction code is employed. To this end, three bits are added to each register to indicate the register data width, that is, the number of free bits in the register. Parity bits and Hamming codes are used to detect and correct errors. The proposed technique is extensively evaluated on an ARM embedded processor IP core using fault injection experiments. The fault injection results show that the proposed technique can detect 99% of errors, 22% of which can also be corrected, in the presence of errors of up to 16 bits. The proposed technique imposes about 7% area and 1% power overheads. The remarkable fault detection/correction capability it provides makes the technique a viable solution for coping with multiple bit faults in spite of these overheads.

I. Introduction

Soft errors caused by high-energy particles have long been known as a serious reliability challenge in space applications. As technology shrinks toward the nanometer era, soft errors are also becoming a serious reliability issue for ground-level applications [7] [8]. It should be noted that packaging is not an effective solution for protecting chips against soft errors, as the packaging material itself emits alpha particles that can cause soft errors [9]. In addition, neutrons from cosmic rays can easily penetrate the package and produce soft errors [6]. Until recently, Single Bit Upsets (SBUs) were regarded as the main consequence of particle strikes in digital circuits, but in nano-scale technologies a particle strike can alter the state of two or more adjacent memory cells, resulting in Multiple Bit Upsets (MBUs) [16]. Recent studies show that the probability of an MBU is 10% in 50 nanometer technology [15]. Based on these facts, the design of systems that tolerate MBUs is of decisive importance.

Power dissipation has traditionally been a serious problem in embedded systems because these systems have tight constraints on size, weight, and battery life. Power dissipation has now also become an important issue for high-performance systems, even when they are connected to a permanent power supply, mainly because of its negative effect on reliability and cost. Power dissipated in the processor produces heat that must be removed from the chip surface. Heat removal has not been a major concern until now because appropriate cooling equipment could effectively remove the heat generated in the chip. Nanometer technologies, however, are denser and require costlier cooling. Additionally, the rise in chip temperature due to power dissipation causes silicon failures such as electromigration, junction exhaustion, and gate disruption [14]. Therefore, power dissipation is an important issue not only in embedded systems but also in high-performance systems.

The register file is a critical part of the processor from both the reliability and the power dissipation points of view. This is because the register file keeps data for long periods of time and is accessed frequently by the processor core [10] [17]. Therefore, a fault occurring in the register file propagates to other parts of the processor and, in most cases, causes the processor to malfunction. Because it is accessed very frequently, the register file also consumes a large amount of power: it has been reported that the register file consumes between 10% and 25% of the processor energy [11] [12] [13].

Parity and error correction codes (ECCs) are commonly used techniques for protecting the register file. Parity is an effective technique for error detection; for error correction, however, a recovery mechanism such as rollback recovery is required. Using a recovery mechanism at the system level causes large performance and power consumption overheads. Although full ECC protection can detect and correct MBUs, it has two main problems: 1) Protecting the entire register file with ECCs incurs a high power overhead, as data must be encoded on each write and decoded on each read. This problem is more serious for ECCs designed to correct MBUs. 2) Common error correction codes such as SEC-DED are not able to detect and correct MBUs.

In this paper, we propose an efficient technique to cope with MBUs in the register file. The proposed technique is motivated by the fact that not all data stored in the register file require the entire register width. Based on this fact, the free upper bits of a register can be used to store error detection or correction codes. Depending on the number of unused bits available in a register, the proposed technique employs an appropriate code that can detect or correct as many bit flips as possible. Simulation results show that the proposed technique provides a high level of reliability in the presence of MBUs.

The rest of the paper is organized as follows: Section 2 presents background and related work. Section 3 explains the motivation for the proposed method. Section 4 presents the proposed method for protecting the register file. Section 5 reports the overheads of the method and the conclusion.

II. Background and related works

Basically, there are two types of faults: soft and hard. Hard faults are caused by physical defects and cannot be removed by overwriting data or by other fault removal methods. Soft faults are faults that do not cause a permanent defect; if the system is restarted, the fault disappears. This paper focuses on radiation-induced faults, i.e., SEUs and MBUs.

Many fault-tolerant techniques have been proposed for the register file in the literature. Duplication with comparison and Triple Modular Redundancy (TMR) are among the well-known hardware redundancy techniques. Information redundancy techniques such as SEC-DED and parity codes are also widely used for protecting the register file against soft errors. Another method is two-rail checking, in which the original and duplicated circuits operate with complementary outputs. Semi-duplication, or time duplication, is another method, in which a circuit executes the same operation twice. Hamming codes are usually used for single error correction or double error detection. The Berger code is a well-known method for detecting single or multiple unidirectional bit errors in combinational circuits. One of the simplest methods of error detection is parity, which can detect a single erroneous bit. Cross parity has been proposed for detecting multiple bit errors as well; it is based on computing at least three parity vectors: vertical, horizontal, and diagonal [2]. Register duplication assigns an unused register to hold a copy of an active one [3] [5] [10]. In-register duplication distinguishes full and half-full registers and protects each register accordingly [5]. The ParShield architecture proposed in [4] uses parity together with a table that stores error detection and correction codes.

III. Motivation

This section discusses the motivation behind the proposed method.

Narrow value operand

Our observation from many experiments with different benchmarks is that many registers contain narrow-value operands, i.e., the effective width of the data is less than the register width. Therefore, there is a lot of unused space in a register that can be used to protect it. As shown in Table 1, only 1% of registers use their whole width for storing data. Based on this observation, we protect each register according to its unused space. The more unused space there is, the better the error detection and correction; if there is little or no unused space, errors can only be detected.

Table 1- Data width ratio in register file

Register status    Ratio
16 bits empty      36%
6 bits empty       63%
Full               1%

As mentioned above, the proposed method is based on generating and comparing parity and Hamming codes. This hybrid method provides a powerful error detection capability (even 31-bit errors in a 32-bit architecture). If enough bits are available, errors of up to 3 bits can be corrected. The implementation of the method is explained in the following.


Until now, most of the methods proposed for register file protection have focused on SEUs. With technology scaling and shrinking feature sizes, processors are more vulnerable to high-energy particles, and even a single particle can cause an MBU. We therefore propose a new idea to detect adjacent or non-adjacent MBUs and to correct them where possible. In the proposed method, the more adjacent the erroneous bits are, the higher the detection and correction capability.


IV. Proposed idea for register file protection

We protect each register separately. This increases the space overhead but provides more effective protection for the register file. Figure 1 shows an overview of the proposed method. The operation of the method in the read and write phases is discussed below.

Write operation

As shown in Figure 1, during a write the status of the data is detected first and saved in the status register. A combinational circuit then generates the parity and Hamming code for the data and stores this code in the unused bits of the register. This completes the generation and storage of the error detection and correction code.

Read operation

During a read, the error detection and correction code is regenerated according to the status register and compared with the code stored in the register. A mismatch means that an error has occurred. If enough code bits are available the error is corrected; otherwise it is only detected.

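Figure 1- Proposed method schematic (blocks: status detector; EDC/ECC generation phase with 32-bit EDC, 26-bit ECC, and 16-bit ECC generators; EDC/ECC check phase with error detection unit, 1-bit and 3-bit error correction, and error flag)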

Status register

This section explains how the 3-bit status register is generated. We define three states for a register. In the first state, 16 bits of the register are used and 16 bits are unused. In the second state, 26 bits hold useful data and 6 bits are empty. In the third state, there are no unused bits in the register, i.e., the data width is the same as the register width. Table 1 shows the ratio of each state in the register file. It is clear from Table 1 that only a small percentage of registers are full; the others have unused bits that can be used to protect the register. Note that although the 3-bit status register could encode 8 states, only 3 states are used. This is because, in the third state, the data register is full and has no space for parity, so the remaining status register bits are used to store the parity bits.
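The state selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_register` is hypothetical, and the sketch assumes unsigned values whose unused bits are leading zeros, which the paper does not spell out (for example, it does not say how sign-extended negative values are treated).

```python
def classify_register(value):
    """Return the state (1, 2 or 3) of a 32-bit register value.

    Assumption: the value is unsigned and "unused" bits are leading zeros.
    """
    if not 0 <= value < (1 << 32):
        raise ValueError("expected an unsigned 32-bit value")
    width = value.bit_length()  # effective data width in bits
    if width <= 16:
        return 1  # 16 bits free for the code, status 000
    if width <= 26:
        return 2  # 6 bits free for the code, status 001
    return 3      # register full, parity kept in the status bits


for sample in (0x1234, 0x01234567, 0xDEADBEEF):
    print(hex(sample), "-> state", classify_register(sample))
```

In hardware this decision would be made by the status detector from the upper bits of the written value; the software version is only meant to make the three thresholds explicit.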

State 1: the status register is set to 000, and the 16 unused bits are used to store the error detection and correction code. This code is able to detect errors of up to 29 bits and correct errors of up to 3 bits. The code is generated as follows. Every 5 data bits, each at a distance of 2 bits from the next, are placed in one group. For this group, a 4-bit Hamming code and a parity bit are generated and stored next to the data, giving 5 bits of detection and correction code for every 5-bit group. This is repeated for the next two groups, yielding a 15-bit code that is stored next to the data. The method is summarized in Figure 2.

When a 3-bit adjacent error occurs, each erroneous bit is detected by the parity and corrected by the Hamming code of its group. When a 6-bit adjacent error occurs, every group receives two errors; such an error cannot be found by parity but can be found by the Hamming code. In the case of a 15-bit adjacent error, there are 3 erroneous bits in each group, which can be detected by parity, and so on up to 29-bit errors. The situation for non-adjacent errors is the same.

Figure 2- 16-bit error detection and correction code generation (status 000; data fed to a Hamming generator)

State 2: the status register is set to 001, and the 6 unused bits are used to store the error detection and correction code. This code is able to detect errors of up to 31 bits and correct a 1-bit error. The code is generated as follows. A 5-bit Hamming code and a parity bit are generated for the whole 26-bit data and stored next to the data. The generation procedure is summarized in Figure 3.

Figure 3- 26-bit error detection and correction code generation (status 001; 26-bit data fed to a Hamming generator)

State 3: the first bit of the status register is set to 1, and the 2 remaining bits of the status register are used to store the error detection code. First, two interleaved 16-bit groups are formed, and then a parity bit is assigned to each group. In this manner, 3-bit adjacent errors and some non-adjacent errors can be detected. Note that the probability of a full register being hit by an MBU is low; simulation results confirm this claim. Figure 4 shows how this code is generated.

Figure 4- 32-bit error detection code generation (status bit 1; 32-bit data)
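The State 3 scheme (two interleaved 16-bit groups with one parity bit each) is simple enough to illustrate directly. The following is a minimal sketch, not the authors' circuit: the function names are hypothetical, and the interleaving is assumed to be "even-indexed bits in one group, odd-indexed bits in the other", which the text does not state explicitly.

```python
EVEN_MASK = 0x55555555  # bits 0, 2, 4, ... (assumed first group)
ODD_MASK = 0xAAAAAAAA   # bits 1, 3, 5, ... (assumed second group)


def parity(word):
    """Return 1 if the number of set bits in word is odd, else 0."""
    return bin(word).count("1") & 1


def state3_code(data):
    """Two parity bits, one per interleaved 16-bit group of a 32-bit word."""
    return parity(data & EVEN_MASK), parity(data & ODD_MASK)


def state3_error(data, stored_code):
    """True if the recomputed parity pair differs from the stored pair."""
    return state3_code(data) != stored_code


word = 0xDEADBEEF
code = state3_code(word)                          # computed on write
print(state3_error(word, code))                   # clean read
print(state3_error(word ^ (0b111 << 8), code))    # 3 adjacent bit flips
print(state3_error(word ^ (0b1111 << 8), code))   # 4 adjacent bit flips
```

Under this assumed interleaving, a 3-bit adjacent error puts two flips in one group and one in the other, so one parity bit changes and the error is flagged. A 4-bit adjacent error puts two flips in each group and escapes detection in this sketch, which illustrates why State 3 offers detection for only some error patterns.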

Combining these three states in a combinational circuit with minimum area overhead yields the detection and correction code generator. This circuit first generates the status register according to the data width, then generates the detection and correction code and places it in the right position. As shown in Figure 1, when the register file is read, the detection and correction code is first regenerated according to the status register. This value is then compared with the value embedded in the register, and an error bit is raised if they differ. We then check whether the error can be corrected; if so, it is corrected with the help of the in-register Hamming code. To correct an error, the generated Hamming code is XORed with the in-register Hamming code and the result is fed to a decoder. The decoder outputs are connected to the register bits, so the erroneous bit is corrected automatically. Figure 5 shows a corrector containing a 5-to-32 decoder.

Figure 5- Error correction mechanism (labels: check bits C16, C8, C4, C2, C1 and P16, P8, P4, P2, P1; register; 5-to-32 decoder with enable input ENB)
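The correction step (XOR the regenerated Hamming bits with the stored ones, then decode the result to a bit position) can be sketched for the 26-bit case of State 2. This is an illustrative software model under stated assumptions, not the authors' hardware: it uses the textbook Hamming(31,26) position numbering, the names `DATA_POS`, `check_bits`, and `correct` are hypothetical, and the extra overall parity bit of State 2 is left out for brevity.

```python
# Codeword positions 1..31: powers of two are reserved for the 5 check bits,
# the remaining 26 positions carry data bits 0..25 (textbook numbering).
DATA_POS = [p for p in range(1, 32) if p & (p - 1)]


def check_bits(data):
    """5-bit Hamming check value: XOR of the positions of all set data bits."""
    check = 0
    for bit_index, position in enumerate(DATA_POS):
        if (data >> bit_index) & 1:
            check ^= position
    return check


def correct(data, stored_check):
    """Model of the read path: syndrome = regenerated XOR stored check bits."""
    syndrome = check_bits(data) ^ stored_check
    if syndrome == 0:
        return data, "no error"
    if syndrome in DATA_POS:
        # The decoder selects the data bit to flip back.
        return data ^ (1 << DATA_POS.index(syndrome)), "data bit corrected"
    return data, "check bit error"  # the fault hit a stored check bit


value = 0x2ABCDEF                      # fits in 26 bits
stored = check_bits(value)             # generated on write
damaged = value ^ (1 << 13)            # single bit flip in the data
repaired, status = correct(damaged, stored)
print(hex(repaired), status)
```

The syndrome plays the role of the decoder input in Figure 5: a non-zero value names the position of the flipped bit. With the additional parity bit used in the paper, a single error can also be told apart from a double error, which this sketch does not attempt.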

V. Results

The experimental tools and results are presented below.

Tools

We simulated the method on the register file of an ARM processor and used ModelSim for error injection and result observation. Power Compiler was used to obtain the power consumption and area overhead. We used benchmarks from the MiBench suite [18].

Types of error

To evaluate reliability, we define 4 error types:
1- Detected and corrected: an error that is detected and then corrected by the proposed method.
2- Detected: an error that can only be detected by the proposed method.
3- Masked: an error that is overwritten by the application before being read, occurs in an unused register, or is otherwise masked automatically.
4- Undetected: an error that cannot be detected by the proposed method.

The last type of error is dangerous, and its probability of occurrence should be reduced. The proposed method prevents this kind of error very well. The simulation results are reported in Figures 6 and 7, which give the percentage of each error type defined above for each benchmark. As shown in Figure 6, the proposed method is able to detect and correct 30% of one-bit errors; the remaining 70% of one-bit errors are masked automatically, and there are no undetected errors. The situation is the same for two-bit and three-bit errors, i.e., there are no undetected errors. In Figure 6, each column shows the percentage of an error type for a specific number of erroneous bits. Finally, as shown in Figure 7, for errors of at most 16 bits (1 to 16 bits) we can detect 30% and correct 4% of errors, with only 0.5% of errors left undetected. In Figure 7, each group of columns corresponds to a benchmark, and each column shows the percentage of an error type for 1-bit to 16-bit errors. The results are reported with a 95% confidence level and a 3% margin of error.

Comparison

In Table 2, the proposed method is compared with duplication with comparison, TMR, and the unprotected register file in terms of area and power. According to this table, the area overhead of the proposed method is lower than that of the other methods.

Table 2- Comparison of the proposed method with duplication and TMR in terms of area and power overheads

                               Power (mW)        Area (mm2)
ARM (original code)            191.8450          1034899
Duplication with comparison    201.1288 (5%)     1331371 (29%)
TMR                            210.3614 (10%)    1608595 (55%)
Proposed method                193.7410 (1%)     1109475 (7%)

From the power consumption point of view, the overhead of the proposed method is also lower than that of the other methods. These low overheads are coupled with a high detection and correction capability.
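The percentages in Table 2 follow from the raw figures as relative increases over the unprotected ARM core. A short Python check, with hypothetical variable names and the numbers copied from the table, reproduces them:

```python
# Raw figures from Table 2; the unprotected ARM core is the baseline.
BASELINE = {"power": 191.8450, "area": 1034899}
DESIGNS = {
    "Duplication with comparison": {"power": 201.1288, "area": 1331371},
    "TMR": {"power": 210.3614, "area": 1608595},
    "Proposed method": {"power": 193.7410, "area": 1109475},
}


def overhead_percent(value, baseline):
    """Relative increase over the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


for name, figures in DESIGNS.items():
    power = overhead_percent(figures["power"], BASELINE["power"])
    area = overhead_percent(figures["area"], BASELINE["area"])
    print(f"{name:28s} power +{power:.1f}%  area +{area:.1f}%")
```

Rounded to whole percentages, the computed values agree with the entries given in parentheses in the table.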

Figure 6- Error detection and correction rate for basic-math, fft, quick-sort, and string-search benchmarks by 1000 error injection (four panels: Basic Math, FFT, Quick Sort, String Search; horizontal axis: 1-bit, 2-bit adjacent, and 3-bit adjacent errors; vertical axis: percentage of injected errors, 0% to 100%)

                                      basicmath   bitcnts   fft      qsort    stringsearch
Detected & Corrected Error (DCE)      5.4%        3.6%      5.1%     3.7%     3.9%
Detected Unrecoverable Error (DUE)    31.5%       22.9%     32.4%    23.5%    24.5%
Masked                                62.6%       73.0%     62.0%    72.3%    71.1%
Silent Data Corruption (SDC)          0.5%        0.5%      0.5%     0.5%     0.5%

Figure 7- Error detection and correction rate for 1 to 16 bits error by 1000 error injection
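As a cross-check of the Figure 7 data, the per-benchmark percentages can be averaged and summed. The sketch below uses hypothetical names and simply assumes an unweighted mean over the five benchmarks; the paper does not say how its summary figures were aggregated.

```python
# Percentages read from Figure 7, per benchmark.
FIGURE_7 = {
    "basicmath":    {"DCE": 5.4, "DUE": 31.5, "Masked": 62.6, "SDC": 0.5},
    "bitcnts":      {"DCE": 3.6, "DUE": 22.9, "Masked": 73.0, "SDC": 0.5},
    "fft":          {"DCE": 5.1, "DUE": 32.4, "Masked": 62.0, "SDC": 0.5},
    "qsort":        {"DCE": 3.7, "DUE": 23.5, "Masked": 72.3, "SDC": 0.5},
    "stringsearch": {"DCE": 3.9, "DUE": 24.5, "Masked": 71.1, "SDC": 0.5},
}


def category_mean(category):
    """Unweighted mean of one error category over all benchmarks."""
    values = [row[category] for row in FIGURE_7.values()]
    return sum(values) / len(values)


for name, row in FIGURE_7.items():
    # The four categories of each benchmark should cover all injections.
    assert abs(sum(row.values()) - 100.0) < 1e-6, name

for category in ("DCE", "DUE", "Masked", "SDC"):
    print(f"{category:7s} {category_mean(category):6.2f}%")

detected = category_mean("DCE") + category_mean("DUE")
print(f"detected (DCE + DUE): {detected:.2f}%")
```

Under this reading, about 4% of injected errors are corrected and about 31% are detected overall, with 0.5% left as silent data corruption, which is in line with the values quoted in the text for 1-bit to 16-bit errors.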

Conclusion

In this paper, a fault-tolerant method has been proposed for the register file of embedded systems. By taking the data width into account, the method protects the register file against MBUs with minimum area and power overheads. The basic idea of the method is to use parity and Hamming codes for error detection and correction. On average, the method can detect 99% of errors and can detect and correct 20% of errors, while imposing area and power overheads of 7% and 1%, respectively. These overheads are reasonable given the method's high error detection and correction capability.

Acknowledgment

We would like to express our gratitude to S.N. Ahamdian, R. Eftekhari, and M. Ibrahimi for their valuable technical advice, which has improved the quality of this paper.

References

[1] Fujiwara E., Code Design for Dependable Systems: Theory and Practical Applications, Tokyo Institute of Technology, 2006.
[2] Pflanz M., Walther K., Galke C., Vierhaus H.T., On-Line Error Detection and Correction in Storage Elements with Cross-Parity Check, IBM Deutschland Entwicklung GmbH, Processor Development Dept., Germany, Brandenburg Technical University of Cottbus, Computer Science Dept., Germany, 2002.
[3] Memik G., Chowdhury M.H., Mallik A., Ismail Y.I., Engineering Over-Clocking: Reliability-Performance Trade-Offs for High-Performance Register Files, Electrical and Computer Engineering Dept., Northwestern University, IEEE, 2005.
[4] Montesinos P., Liu W., Torrellas J., Using Register Lifetime Predictions to Protect Register Files Against Soft Errors, Department of Computer Science, University of Illinois at Urbana-Champaign, 2007.
[5] Kandala M., Zhang W., Yang L.T., An Area-Efficient Approach to Improving Register File Reliability Against Transient Errors, Dept. of Electrical and Computer Engineering, Southern Illinois University, Carbondale, IL 62901, Dept. of Computer Science, St. Francis Xavier University, Antigonish, Canada, 2007.
[6] Hazucha P., Svensson C., Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate, IEEE Trans. on Nuclear Science, Vol. 47, No. 6, pp. 2586-2594, December 2000.
[7] Cohen N., Sriram T.S., Leland N., Moyer D., Butler S., Flatley R., Soft Error Considerations for Deep-Submicron CMOS Circuit Applications, Int'l. Electron Devices Meeting, Washington, DC, 1999, pp. 315-318.
[8] Keyes R.W., Fundamental Limits of Silicon Technology, Proc. of IEEE, Vol. 89, No. 3, Mar. 2001, pp. 227-339.
[9] Zhou K., Mohanram K., Gate Sizing to Radiation Harden Combinational Logic, IEEE Trans. on CAD, Vol. 25, No. 1, pp. 155-166, January 2006.
[10] Memik G., Kandemir M.T., Ozturk O., Increasing Register File Immunity to Transient Errors, Europe Conference and Exhibition, IEEE, 2005.
[11] Balkan D., Sharkey J., Ponomarev D., Ghose K., Selective Writeback: Reducing Register File Pressure and Energy Consumption, IEEE Trans. on Very Large Scale Integration Systems, Vol. 16, No. 6, pp. 650-661, June 2008.
[12] Aggarwal A., Franklin M., Energy Efficient Asymmetrically-Ported Register Files, Proc. of the Int'l Conference on Computer Design (ICCD), San Jose, California, 2003, pp. 2-6.
[13] Balasubramonian R., Dwarkadas S., Albonesi D., Reducing the Complexity of the Register File in Dynamic Superscalar Processor, Proc. of the IEEE/ACM Int'l Symposium on Microarchitecture, MICRO, 2001, pp. 237-248.
[14] Ponomarev D., Kucuk G., Ergin O., Ghose K., Isolating Short-Lived Operands for Energy Reduction, IEEE Trans. on Computers, Vol. 53, No. 6, pp. 697-709, June 2004.
[15] Seifert N., Slankard P., Kirsch M., Narasimham B., Zia V., Brookreson C., Vo A., Mitra S., Gill B., Maiz J., Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices, Proc. of 44th Annual International Reliability Physics Symposium, San Jose, 2006, pp. 217-225.
[16] Dutta A., Touba N. A., A Low Cost Code-Based Methodology for Tolerating Multiple Bit Upsets in Memories, IEEE Workshop on System Effects of Logic Soft Errors, Apr. 2007.
[17] Montesinos P., Liu W., Torrellas J., Using Register Lifetime Predictions to Protect Register Files Against Soft Errors, IEEE Transactions on Dependable and Secure Computing (IEEE TDSC), To Appear, 2008.
[18] Guthaus M. R., Ringenberg J. S., Ernst D., Austin T. M., Mudge T., Brown R. B., MiBench: A Free, Commercially Representative Embedded Benchmark Suite, IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.
