False Error Vulnerability Study of On-line Soft Error ... - Springer Link

1 downloads 0 Views 853KB Size Report
Apr 15, 2010 - Abstract With technology scaling, vulnerability to soft errors in random logic is increasing. There is a need for on-line error detection and ...
J Electron Test (2010) 26:323–335 DOI 10.1007/s10836-010-5153-z

False Error Vulnerability Study of On-line Soft Error Detection Mechanisms Kiran Kumar Reddy · Bharadwaj S. Amrutur · Rubin A. Parekhji

Received: 23 October 2008 / Accepted: 11 March 2010 / Published online: 15 April 2010 © Springer Science+Business Media, LLC 2010

Abstract With technology scaling, vulnerability to soft errors in random logic is increasing. There is a need for on-line error detection and protection for logic gates even at sea level. The error checker is the key element for an on-line detection mechanism. We compare three different checkers for error detection from the point of view of area, power and false error detection rates. We find that the Double Sampling checker (used in Razor), is the simplest and most area and power efficient, but suffers from very high false detection rates of 1.15 times higher than the actual error rates. We also find that the alternate approaches of Triple Sampling and Integrate & Sample method can be designed to have zero false detection rates, but at an increased area, power and implementation complexity. The Triple Sampling method has about 1.74 times the area and 1.83 times the power as compared to the Double Sampling method and also needs a complex clock generation scheme. The Integrate & Sample method needs about 6% more power and is 0.58 times the area of Double Sampling. It comes with more stringent implementation constraints as it requires detection of small voltage swings. We also analyse for Double Transient Faults (DTFs) and show

Responsible Editor: S. Kajihara K. K. Reddy (B) · B. S. Amrutur Department of ECE, IISc, Bangalore, India e-mail: [email protected] B. S. Amrutur e-mail: [email protected] R. A. Parekhji Texas Instruments (India) Pvt. Ltd., Bangalore, India e-mail: [email protected]

that all the methods are prone to DTFs, with Integrate & Sample method being more vulnerable. Keywords On-line testing · Soft error detection · False errors

1 Introduction Radiation-induced soft errors on large-scale integrated circuits are becoming increasingly problematic as device sizes are scaled down, operating voltages are reduced, and node capacitances shrink [6]. Memories are more vulnerable to soft errors due to small transistor sizes and storage capacitances. But recently, combinational logic is also becoming susceptible to soft errors due to technology scaling. Soft errors are caused by the charge deposition induced by collision of high energy radiation particles like neutrons and alpha particles with the substrate in semiconductor materials. Neutrons are produced by collision of cosmic rays with atmospheric particles [15], while alpha rays are emitted by the impurities in the packaging materials and interconnect [10]. While both neutrons and alpha particles deliver parasitic charge, the charge deposited by neutrons (25–150 fC/μm) is much greater in magnitude than alpha-particles (4–16 fC/μm) [1]. These soft error strikes create Transient Fault (TF) which can be detected only when the chip is functionally operating. This motivates the need for on-line checking. Many on-line checking methods have been proposed in the literature like hardware redundancy, coding and time redundancy. The important parameters for an error checking circuit are its actual error detection capability and false error rejection capability.

324

Ideally it should detect all actual errors and reject all false errors. A TF creates an error in the system only when it is latched by a flip-flop in the system. This is the actual error. There can also be an error indication by the checker even when the TF is not latched by the system, called the false errors. When the checker indicates an error, the erroneous part of the system is isolated from the healthy part of the system to prevent the error from corrupting the rest of the system. The failed operation is re-executed, requiring additional overhead due to execution time, isolation. In case there is an error indication, we cannot distinguish between an actual error and false one, and re-execution is, therefore, always performed. Too high a false error detection rate will lead to very significant degradation in system performance. The problem of false errors (referred to as No-Harm Alarm in [12]) is first investigated in [12]. The study compares the NHAR (No-Harm Alarm Robustness) of on-line code checkers. It proposes a possible approach to design NHAR checkers for coding based on-line checkers. In this paper, we compare on-line checkers based on Time Redundancy [7, 11] (Double Sampling, Triple Sampling, Integrate and Sample Method) for false error detection vulnerability, area, power and implementation complexity. An alternate checker for soft error strikes is proposed in [9], where they use a circuit to monitor currents in the bulk. Soft error strikes lead to current transients in the bulk node which are then detected. However in this approach, all soft error strikes above a certain level get detected, regardless of whether the associate transient faults actually lead to an error or not. Consequently, we expect the false alarm rate to be much higher for these class of detectors compared to time redundancy based detectors. Hence we restrict our discussion to only time redundancy based detectors in this paper. In Section 2, we give an overview of time redundancy based on-line detection mechanisms. In Section 3, we introduce the problem of false errors. We then describe the simulation setup used to characterize error checkers in Section 4. We provide some analysis and discuss simulation results in Section 5. We discuss the effects of multiple glitches (in the same flip-flop) on error checkers in Section 6. Finally, conclusions are drawn in Section 7.

2 Overview of On-line Error Detecting Mechanisms Consider a single pipeline stage shown in Fig. 1 where error detection circuit (EDC) is appended to the flip-

J Electron Test (2010) 26:323–335

Fig. 1 Data path block under consideration

flop to be protected. Different methods have been proposed for on-line testing of soft errors, delay and crosstalk faults in combinational logic. The basic idea behind all these methods is that they try to distinguish between a fault free and a faulty system. For error recovery based on time redundancy, this is done by comparing combinational output, Co , with the latched value, F Fo , over a period where Co is expected to be stable. Typically, Co is stable in the interval Tclk to Tclk + Tcmin , where Tcmin is the contamination delay of the logic. Usually any comparison over Tcmin is difficult. Hence, Tcmin is increased to a larger Tstab , called stability window, such that the comparison of Co and F Fo can be easily done during this period. Hence this error detection comes with the overhead of having to buffer all the fast paths in the combinational logic to have a delay of at least Tstab . This class of detection methods can be implemented in more than one form depending upon the way comparison is done between Co and F Fo in the stability window. Based on the type of comparison, which can be continuous or discontinuous, there are three methods namely Double Sampling, Triple Sampling, Integrate & Sample method. 2.1 Double Sampling The Double Sampling method is based on the Razor technique [3] that has been proposed for dynamic voltage scaling in order to operate the circuit at lowest possible supply voltage that meets the speed specifications. It compares two samples of the combinational output Co , one sampled at the positive clock edge (F Fo ) and the other at the end of the stability period (F F2 ) as shown in Fig. 2. A mismatch between the two is used to detect a delay error due to inadequate supply (Fig. 4b). The error checker outputs from each flip-flop are combined in a big OR tree. The same structure can also be used to detect soft errors. Figure 4a schematically shows how a soft error somewhere in the logic block causes a glitch at Co during the sampling phase, thus leading to an error which gets detected by the checker. While this technique can detect any delay error (Fig. 4b) with minimal extra

J Electron Test (2010) 26:323–335

325

a

b

Fig. 2 Double sampling (Razor) method

hardware when used for detection of soft errors, it can result in a considerable number of false detects.

Fig. 4 Soft error (a) and delay error (b) cases for double sampling & triple sampling methods

2.2 Triple Sampling

2.3 Integrate & Sample

Since soft error in combinational logic is a transient disturbance, it can be detected by averaging the difference of the combinational output and the flip-flop output over the stability window. An approximation to this averaging can be achieved via a small enhancement to the technique of Triple Sampling method [11]. In this approach, in addition to samples taken in Double Sampling method (F Fo , F F2 ), another sample is taken at the middle of the stability window (F F1 ) using an additional sampling clock,Clk_d. We introduce an error checker by doing an XOR of the flop output(F Fo ) with the majority value of the three samples as shown in Fig. 3. The waveforms for the operation of this circuit under a soft error condition is shown in Fig. 4a. This approach needs an extra clock, Clk_d, at the center of stability window. The error checker outputs from each flip-flop are combined in a big OR tree. This method can detect soft error and delay errors up to a size Tstab . 2

The basic principle of Integrate & Sample method is the comparison of Co and F Fo over a period(Tstab ) where the combinational output is expected to be stable under error-free operation. This method is a generalization of class of methods referred to as Current Sensing Comparator in [18] and Modified Stability Checker in [13]. The comparison is done using a current integrator that integrates difference between Co and F Fo over the stability period (Tstab ef f ). This is implemented by using an Integrate & Sample cell as shown in Fig. 5. The output node (Sense) is pre-charged during the negative half cycle and the circuit is enabled by the Clk during the positive half cycle. When Clk is high and there is a mismatch between the two inputs (Co and F Fo ) the XOR structure creates a pull down current. This pull

Fig. 3 Triple sampling method

Fig. 5 Integrate & Sample [13] block

326

J Electron Test (2010) 26:323–335

PVT variations and hence leads to a more robust implementation. For this discussion, we use a fixed reference level for the sense amplifier. In case of actual error (Cycle 2 in Fig. 7), the Sense line discharges so that the difference between the Dummy and Sense line exceeds the reference value (Vthresh ) and triggers an error. In case of false error (Cycle 1 in Fig. 7), the difference between Dummy and Sense lines is less than Vthresh and hence there is no error. The Dummy line is static in both the cases. Fig. 6 Dynamic ORing structure of Integrate & Sample method [13]

Voltage (V)

Voltage (V)

Voltage (V)

Voltage (V)

Voltage (V)

down current is integrated onto a capacitor (Cint ) to give a voltage drop from Vdd on the Sense node which is proportional to the time interval for which Co and F Fo are different. Under no error condition, Co and F Fo are same for almost the entire duration of Tstab resulting in negligible voltage signal. However, under actual error condition, Co and F Fo are different for almost the whole Tstab leading to a large voltage signal. The circuit is implemented in a differential manner in order to protect it against common mode noise using two XOR structures. The dummy XOR’s footer transistor is grounded (Fig. 5). The error checker output from all such Integrate & Sample blocks are combined using a dynamic OR structure which also provides for the capacitor for current integration as shown in Fig. 6. The reference generator is connected to the dummy line and it provides the reference voltage against which comparison is made. A simple implementation uses a fixed reference voltage of Vdd [13]. In the Appendix, we will describe another reference generator which creates an adaptive reference voltage which tracks the

1 0.5

CYCLE 1

CYCLE 2

clk

0 4.5

5

5.5

6

6.5

7

7.5

8

8.5

5

5.5

6

6.5

7

7.5

8

8.5

5

5.5

6

6.5

7

7.5

8

8.5

5.5

6

6.5

7

7.5

8

8.5

1 0.5 0 4.5

C

o

1 0.5 0 4.5 1 0.5

FF

o

Dummy Sense

0 4.5

5

0 4.5

There is a real error in the system only if there is a soft error induced glitch around the positive edge of the system flip-flop. However error checkers can flag an error in response to a soft error induced glitch, even though it might not be latched in the system flops. In this section we will do a simple calculation to estimate the false error rates for the simplest technique of Double Sampling (Fig. 2) and show that it can lead to a large number of false errors under certain conditions. The double sampling technique [3] works by comparing the samples at the positive and negative edge of the clock. It imposes a constraint on the minimum delay of the combinational logic, Tcmin (contamination delay) to be at least half a cycle so that there are no transitions in the positive half cycle. A soft error glitch at the positive edge gets latched onto system flipflop. Being a transient disturbance it has to die down after few hundreds of pico seconds. The sample taken at the negative edge of the cycle has therefore the correct value. Hence, a mismatch between two samples indicates an actual error (Fig. 8). But an error is also flagged when there is glitch at the negative edge of clock. This is not an actual error as it is not get latched into the system flip-flop. False errors occur because of the comparison used in error detection (Fig. 8). We will next estimate the false error rate for a Razor based system using published data for Soft Error Rate (SER). SER due to neutrons increases by orders of magnitude from the ground level to higher altitudes. Let us examine the SER at the ground levels and at an altitude of 60,000 ft. where the neutron flux is maximum. The authors in [17] propose a simple model for the estimation neutron induced SER. SER(Qc ) = BGR(d, Qc ).N.Vs .C

1 0.5

3 Problem of False Errors

(3.1)

Error 5

5.5

6

6.5

7

7.5

8

8.5

time (ns)

Fig. 7 Actual and false errors in Integrate & Sample method [13]

where BGR is the burst generation rate, N is the neutron flux, Vs is the sensitive volume (sensitive area x sensitive depth), C is the collection efficiency, and Qc is

J Electron Test (2010) 26:323–335

327

the critical charge of the soft error. Using this formula, they calculate the SERneutron for 350 nm process @3V for a drain area of around 36 λ2 to be 6 × 10−12 errors/ h, where λ is half the minimum channel length. We will use this number to extrapolate and estimate the SER for a unit inverter of NMOS/PMOS size of 6λ/12λ in a 65 nm process. With a typical diffusion extension of 5λ, the diffusion area at the output of this unit inverter is 90 λ2 . The critical charge, Qc , the BGR and hence the SERneutron for this unit inverter in the 350 nm process at a reduced voltage of 1V, is also 1.5 × 10−12 errors/ h since Qc ∝ Areajunction × Voltage. The authors in [14] show that the SER at 65 nm is approximately 0.25 times that of SER at 350 nm at 1V. They also show that at 65 nm the rates due to the alpha and neutron induced SER are same at ground levels and hence, SERα = SERneutron = 1.5 × 10−12

errors hr

(3.3)

where N f lops is the number of flops in the system. Consider a large ASIC with a cycle time of around Tcycle = 40F O4 and N f lops = 4million. Typically Tpmin is about 3 FO4 (fanout-4) inverter delays. For such an ASIC in a 65 nm process, Eq. 3.3 gives a F IT f alse = 900 and a Mean Time to False Error (MTFE) = 126 years at ground level. At an altitude of 60,000 ft., the Neutron flux increases 230 times that at ground level to

Fig. 8 False error (a) and actual error (b) waveforms

SER60,000 f t. = (SERα + 230SERneutron )ground

(3.4)

This leads to an MTFE at 60,000 ft. of only 200 days. We can extend the above calculations to Razor-II [2], which uses a latch and transition detector. There a false error occurs if there is pulse anywhere in the entire positive half cycle. Hence Eq. 3.3 gets modified as Error Rate f alse = 0.5 ∗ SER ∗ N f lops

(3.5)

Repeating the above calculation gives MTFE of 49 years at ground level and 79 days at an altitude of 60,000 ft. Note that from Eq. 3.3, we can see that the false error rate increases and hence MTFE decreases for high performance chips like microprocessors which have much smaller cycle times.

(3.2)

and the total SER is the sum of the two SERs as they are independent. Consider a system with cycle time of Tcycle and let Tpmin be the minimum sustainable pulse. Lets assume that the soft error strike can occur at any point within the clock cycle and it creates a pulse of width no more than Tpmin which gets propagated to the sampling flipflop. The probability of this pulse causing a false error (see Fig. 8) in the Razor scheme is Error Rate f alse = Tpmin /Tcycle ∗ SER ∗ N f lops

4,580 cm−2 h−1 [16]. However the alpha flux will remain unchanged and hence

4 Simulation Setup When a particle strikes a sensitive node in the circuit, it produces a current pulse with a rapid rise time, but a gradual fall time. The shape of the pulse is modeled by a one-parameter function [4] as shown in Eq. 4.1  2Q t −t Iser (t) = √ (4.1) eT πT T Here Q refers to the amount of charge collected due to the particle strike, T is the time constant for the charge collection process and is a property of the CMOS process used for the device. The probability distribution of the collected charge Q is extracted from the data given in [5]. The current pulse produced by a particle strike results in a voltage pulse at output node of the device. We study the false error rates of different checker schemes by comparing their operation with that of an ideal error checker which detects all actual errors and does not flag any false error. Here the checkers are assumed to be error-free. We use two copies of the circuit, where one (circuit under test) is modified by incorporating pulsed current sources at each node, to inject the desired current pulse as in Eq. 4.1 to mimic a soft error event.By controlling the times when these pulsed sources are turned on and off, we can simulate the times at which a soft error even occurs. The other circuit block is the ideal error free circuit which gives the correct result (See Fig. 9). The figure shows a simple circuit of four inverters just for illustration purposes, with a current source at each circuit node. The circuit under test block also has the error checker attached

328

J Electron Test (2010) 26:323–335

Fig. 9 Datapath under test, b i = 1 if node i has SEU

to its output flip-flop. The ideal checker just compares the output of the circuit under test (F Fo ) with that  of the ideal error-free circuit (F Fo ), thus flagging an error only when there is an actual error and does not flag any false errors. Comparing the error results of the ideal checker and actual checker, four conditions arise as shown in Table 1. The circuits are simulated for the worst case charge injection. A set of 4,000 errors are injected into the circuit at 1000 different injection points each clock for the four nodes (N1−4 ) in the circuit under test. The errors are injected randomly at one of the four injection points, at a random time within each clock cycle. The same simulation framework is used for all the three types of error checkers. The design of the checkers for guaranteed error detection depends on the maximum soft error size (δmax ) it can tolerate. The injected charge at the circuit node depends on the impinging particle charge, energy, angle of attack and the collection area of the node. The injected charge in turn creates a voltage pulse of width δ which depends on: 1. Drive strengths of the pull-up and pull down paths of the gate driving that circuit node, 2. The total charge induced, Q due to the particle strike. 3. The load capacitance, Cnode of the circuit node The width of the output pulse developed due to the soft error strike is maximum,δmax , when the drive strength and the load capacitance are minimum and the charge induced, Q, is the maximum tolerable value used by the designer. Since our circuit under test is a FO4

inverter chain, we have simulated the weakest gate in the chain to find δmax . In 130 nm technology, we find that the maximum pulse width of soft error (δmax ) is approximately 200ps for a maximum injected charge of 120 fC. This is obtained through circuit simulation using the current injection model of Eq. 4.1. Note that δmax is fixed for a given standard cell library and a given maximum injected charge. While accurate transistor level simulation can be done for the simple chain of four inverters shown in Fig. 9, this becomes computationally infeasible for larger circuits. Hence, we have developed a switch level library of gates in terms of which bigger benchmark circuits are synthesized. In general, any logic gate with function f (in1, in2....) is implemented as a ideal logic function followed by pull-up and pull-down resistors connecting the output of the cell to power and ground respectively as shown in Fig. 10. The pull-up and pulldown resistor along with the output capacitor model the delay of the logic gate. A current source (Q,T) (Eq. 4.1) is appended at the output of the gate to simulate the induced charge of the soft error strike. For example, a two-input Nand gate can be implemented by  followed by ,pull-up and pullits logic function ( A.B) down resistor with a capacitor that models the delay of the Nand gate, and a current source that models the induced charge of soft error strike . A switch level library consisting of NAND2/3/4, NOR2/3/4, XOR2, NOT1 gates is built. ISCAS’85 benchmark circuits like C17,C432,C499 are synthesized in terms of the switchlevel library. The simulation flow is explained in the Fig. 11. The final netlist is replaced in the Circuit Under Test in the Datapath shown in Fig. 9. The Error Free Reference Circuit is generated by removing the pulsed current source at the output of the switch-level cells. Since a full logic level function is implemented in the simulated circuit, logical masking is captured in this framework [8]. Also since the delays of the gates are captured via the RC model at the output of the gate, electrical masking effects like suppression of short pulses is also captured. We implement flops using

Table 1 Definition of different types of injected errors Error (Ideal checker)

Error (Actual checker)

Type of error

0 0 1 1

0 1 0 1

No error False error Undetected error Actual error

Fig. 10 Leaf cell in switch level library

J Electron Test (2010) 26:323–335

329

Number of errors

150

100

50

0 0.5

Fig. 11 Simulation flow for soft error estimation

Actual errors False errors Undetected errors

1 1.5 2 2.5 3 3.5 Ratio of Tstab to maximum soft error size

4

Fig. 13 Error statistics for double sampling method (C17, switch level, Qc = 120fC)

transistor-level circuits and realistic clock waveforms. Hence latching window effects are also captured and only pulses which occur in the setup-hold window of the flop are captured by the flop.The Soft error strikes are simulated by a randomly triggering the pulsed current sources at one of the circuit nodes at random instants in each clock cycle. The input vector to the benchmark circuits are selected randomly for each cycle, and hence an average behaviour over a number of input vectors is obtained. The Integrate & Sample [13] is modeled by current source, which is enabled by difference between Co and F Fo (Fig. 12) and an enable signal (Clk), that discharges the Sense node. The comparator has an offset of δV and triggers an error if the Vsense < Vthresh − δV . 2 The voltage Vsense , must be at least Vthresh + δV for 2 it not to trigger an error (Vsense = Vdd − CIonint t, Sense node is initially precharged to vdd). We have used a comparator with a threshold voltage (Vthresh ) of 150mV and δV=100mV which are nominal and conservative values. The change in voltage of the node Vsense is given by: Ion Tint Cint

Every error detection mechanism has a tendency to flag false errors due to inherent nature of comparison involved. In Double Sampling method as long as Tstab > δmax all the errors are detected. But it also gives false errors when there is a transition on the combinational output (Co ) around the second latching edge as shown in Fig. 8. Assuming uniform distribution of time of soft error strike and uniform distribution of delays from the node of soft error strike to the output node, there will always be soft error pulses around the negative clock. The values of F Fo and F F2 are different and hence Double Sampling method gives an error even though there is no error.

150

(4.2) Number of errors

Vsense =

5 False Error Analysis and Simulation Results

100

Actual errors False errors Undetected errors 50

0 0.5

Fig. 12 Integrate and sample method [13]

1 1.5 2 2.5 3 3.5 Ratio of Tstab to maximum soft error size

Fig. 14 Error statistics for triple sampling method (C17, switch level, Qc = 120fC)

330

J Electron Test (2010) 26:323–335

Number of errors

150

condition on Ion is put by the offset voltage of the comparator. Ion (Tstab ef f − δmax ) > δV Cint

Actual errors False errors Undetected errors

100

where Tstab ef f = Tstab − (δmax − Tsetup ), which is the actual period available for integration shown in Fig. 4a. In order to prevent the false soft error created glitches (of duration δmax ) from triggering an error, the condition that must be satisfied is:

50

0

(5.1)

1

1.5

2

2.5

3

3.5

4

4.5

5

Ion δmax δV < Vthresh − Cint 2

5.5 8

x 10

I on /C int in I&S method

Fig. 15 Different errors as function of the ratio level, Qc = 120fC)

Ion Cint

(5.2)

Similarly, for the actual errors to trigger the error the condition is given by:

(C17, switch

Ion Tstab ef f δV > Vthresh + Cint 2

This puts a condition on the charging current Ion

Figure 13 shows the error detection statistics for worst case injected charge of Q = 120 fC for C17 benchmark circuit simulated at switch level. All the actual errors are detected for Tstab > δmax , but the false errors are around 0.67 times of actual errors. In this figure (and the next two figures), the X-axis is the ratio of Tstab to δmax since Tstab is an independent parameter under the control of the circuit designer. In case of Triple Sampling method, a majority operation of three samples separated by a minimum separation of δmax always gives correct value and avoids any false errors i.e. Tstab > 2 ∗ δmax . But this comes with overhead of generating and distributing an extra clock (Clk_d) to do an additional sampling in the middle of the stability window. Figure 14 shows that under worst case charge injection, the undetected errors decrease while both the actual and false errors increase till 2.25*δmax for the Triple Sampling method. In Integrate & Sample method, the false errors can be eliminated by proper choice of Ion . The first

Table 2 Comparison of three methods (Double, Triple, Integrate & Sample method) for Tclk = 1.5ns and Tstab = 750 ps for STF(Single Transient Fault)

A actual errors, U undetected errors, F false errors

(5.3)

(Vthresh + δV ) (Vthresh − Ion 2 < < Tstab ef f Cint δmax

δV ) 2

(5.4)

Figure 15 clearly shows that there is a range of currents for complete detection and false error rejection as predicted by Eq. 5.4. The Double and Triple Sampling methods have been simulated for Tstab = T2clk for comparison with the Integrate & Sample method. The value of Ion satisfying Eq. 5.4 is chosen. All comparisons are made for worst case induced charge of Q = 120 fC (Table 2). The simple inverter chain circuit has been simulated at the transistor level whereas the remaining benchmarks (C17, C432, C499) are simulated at the switch level. The Triple Sampling method has higher area (approx. 1.7 times) and power overhead (approx. 1.83 times) than Double Sampling method because of the one extra flip-flop and the majority voter. The false

Item

Double sampling (Tstab = T2clk ) A U F

Triple sampling (Tstab = T2clk A U

FO4 C17 (6 gates) C432 (160 gates) C499 (202 gates) Area Power

153 146 148 278 1 1

153 146 148 278 1.74 1.83

0 0 0 0

177 124 125 212

0 0 0 0

F 0 0 0 0

I&S (Ion = 35μA) A U 153 146 148 278 0.583 1.06

0 0 0 0

F 0 0 0 0

J Electron Test (2010) 26:323–335

error to correct error ratio is around 1.15 for the Double Sampling in comparison to 0 for the Triple Sampling method and Integrate & Sample method. The Integrate & Sample method consumes around 6% more power and occupies 0.58X the area of the Double Sampling method. The areas were estimated using Cadence Virtuoso Layout Editor and power in Cadence Spectre simulator. Note that the overheads for generation and distribution of extra clock for Triple Sampling have not been included in this analysis. The variability in the extra clock that is generated is also critical as it can increase the Tstab which in turn increases the area of the combinational block, in order to satisfy condition (Tstab > 2 ∗ δmax ). Arguments for the existence of false error detection in the Double Sampling method are independent of the circuit under test as seen from the results. As discussed earlier, the Double Sampling method flags a false error if the soft error event causes a glitch at the input of the extra latch around the end of the Tstab window. Integrate & Sample and the Triple Sampling checkers can be designed such that their false error detection can be made to be zero.

6 Performance of Checkers in Multi-glitch Scenario With technology scaling, the probability that a single particle strike leading to glitches on multiple nodes, leading to Multiple Transient Faults (MTFs), is increas-

Fig. 16 Undetected error in double and triple sampling methods for DTFs

331

Fig. 17 DTFs

False error in double and triple sampling methods for

ing [19]. Multiplicity of two being a reasonable case to study [12]. These multiple glitches can either manifest themselves on separate flip-flops or on the same flipflop. The probability of the latter increases in circuits with high fan-in. There is no problem in the operation of the checkers in case of MTFs on separate flip-flops.

Fig. 18 Undetected and false error in Integrate & Sample method [13] for DTFs

332

J Electron Test (2010) 26:323–335

Table 3 Comparison of three methods (Double, Triple, Integrate & Sample method) for Tclk = 1.5ns and Tstab = 750 ps for DTF(Double Transient Fault) Item

Double sampling (Tstab = T2clk ) A U

C17(6 gates)

705

0

F

Triple sampling (Tstab = T2clk A U

790

705

0

F

Integrate & Sample (Ion = 35μA) A U

F

0

678

5

27

A actual errors, U undetected errors, F false errors

However, when two glitches occur at the same input node of flip-flop, there is finite probability of the all the checkers not being able to detect an error. Multiple errors impact the checkers as follows: 1. Double Sampling: Errors can get masked when two pulses appear on the same net one at positive edge and other at negative edge (Fig. 16). The false error rate increases as the frequency of the pulses increases. 2. Triple Sampling: Errors can get masked here also if there are glitches at more than one sampling edge (Fig. 16). False errors are also possible when there are glitches around the 2nd and 3rd sampling edges (Fig. 17). 3. Integrate & Sample method: In Integrate & Sample method, multiple errors decrease the available Tstab ef f interval by δmax . Due to decrease in the Tstab ef f , there is not enough sense voltage developed to give an error and hence error gets masked (Fig. 18). False errors can also occur if there are multiple glitches in the positive half cycle so that there is enough sense voltage to give an error (Fig. 18). Increasing the current will not help as there must be time difference between the interval developed due to the false (δmax ) and multiple errors (Tstab ef f − δmax ). Thus, all the three checkers can miss an actual error, as well as show higher false error rates. C17 is analyzed for DTFs (Double Transient Faults) where the soft error strikes two nodes in the same cycle. The results (Table 3) shows that only Integrate & Sample method has undetected and false errors and Double and Triple sampling methods have no undetected errors and Triple sampling method has no false errors. This can be explained by the fact that two pulses occurring, one at the positive edge and other anywhere in the positive half cycle leading to undetected error for Integrate & Sample method, is certainly more probable than the two pulses occurring exactly at the sampling edges in case of Double and Triple sampling methods. Though our Monte Carlo simulation did not generate the latter case, it can occur, though at a reduced probability.

Large hold times are required in-order to enable Double and Triple Sampling to detect all DTFs. The separation between the two sampling edges must be at least 2 ∗ δmax + Tsep (where Tsep is maximum separation between the pulses) in order to ensure there are no undetected errors and false errors in case of DTFs (Fig. 16). Another solution is to take more samples. For example, five samples are required for detecting two simultaneous errors with certainty. Integrate & Sample method can be made robust to DTFs by increasing Tstab ef f by δmax . The increase in the stability window for the Integrate & Sample is less compared to Double and Triple Sampling. In the Integrate & Sample method, the increase is merely δmax irrespective of the separation between the pulses, where as in the Double and Triple Sampling methods, the increase is directly proportional to Tsep .

7 Conclusion In this paper we have presented a comparative study of three different error checkers for online testing for soft error detection for on-line testing. Our study is based on SPICE and switch-level simulations of the circuit under test, with models of current sources mimicking soft error strikes at potentially every node of the circuit. This framework also includes the ideal circuit under test without any errors as well as the ideal error checker against which the three error checkers are compared. All the three error checkers can be designed to detect all actual errors due to single transient faults. Double Sampling checker has a large false error detection rate, while the Integrate & Sample and Triple Sampling checker can be designed to have zero false errors. In terms of the implementation complexity, the Double Sampling technique is the simplest and can be easily integrated into existing digital design flows. But it suffers from a high false error detection rate of 1.15x the actual error detection rate for our simple test circuit (FO4 chain).

J Electron Test (2010) 26:323–335

The Triple Sampling method has 1.74 times the area and 1.83 times the power of the Double Sampling method, but can be designed to have zero errors. However the generating of an additional sampling edge in the center of the stability window is additional problem to solve. The Integrate & Sample method can also be engineered to have zero false error rates. It occupies about 0.58X the area and consumes 6% more power than the Double Sampling method. It does not need an extra sampling edge unlike the Triple Sampling method. However, it is a semi-analog technique and its integration with existing digital flows needs to be investigated further. We have described a robust implementation of this technique. Our estimates show that the false error rates at ground level are sufficiently low for small chips and hence Double Sampling technique will be an adequate technique for detecting soft errors in most chips. However for large high performance chips, operating at high altitudes, the false error rates for the Double Sampling technique may become unacceptably high and one may motivate the need for either the Triple Sampling checker or the Integrate & Sample checker, both of which can be designed for zero false errors. All the three methods are vulnerable to DTFs on same flip-flop. Among the three methods, Integrate & Sample method is most vulnerable. While all the three techniques can be engineered to detect DTFs, the design complexity will increase.

333

Fig. 20 Actual and false errors in Integrate & Sample method

fall mid-way between the Sense voltages developed due to actual (sensea) and false (sensef) soft error cases as shown in Fig. 19. This can be achieved by using two Integrate & Sample cells, of half the width of that on the Sense line, onto the Dummy line with each having the integrating interval of δmax and Tstab ef f as shown in Fig. 21. If Ion is the on current of an Integrate & Sample cell of width W, then Integrate & Sample cell of width W/2 has an on current of approximately I2on . In such a case using Eq. 4.2. The voltage across the Sense node for actual error is Vsensea =

Ion Tstab ef f Cint

(7.1)

Similarly, the voltage across the Sense node for false error is Appendix: Robust implementations of Integrate & Sample Using a sense amplifier with a fixed reference will not be practical in view of the variations in the Integrate & Sample cell. The solution to this is to generate a reference on-chip so that it can track the variations. Instead of using a constant Dummy line, the Dummy can be made to discharge at a rate which allows it to

Vsensef =

Ion δmax Cint

(7.2)

The voltage developed across Dummy node is given by Vdummy ∼ ∼

Ion 2

Cint

Tstab ef f +

Ion 2

Cint

δmax

Vsensef + Vsensea 2

(7.3)

The process of generation of Dummy reference level is explained in the Fig. 19. Figure 20 shows the working

Table 4 Conditions for Monte Carlo simulation

Fig. 19 Generation of dummy reference level for Integrate & Sample method

Parameter

Mean

Std. dev.

Vthp Vthn toxp toxn L Cint

−250 mV 216 mV 2.86 nm 2.73 nm 130 nm 100 fF

8.33 mV 7.2 mV 95.3 μm 91 μm 4.33 nm 3.33 fF

334

J Electron Test (2010) 26:323–335

References

Fig. 21 Reference generator for Integrate & Sample method shown in Fig. 6

of Integrate & Sample method in case of a false error (Cycle 1) and an actual error (Cycle 2). In order to validate the correct working of the above explained behavior, the setup is analyzed using Monte Carlo simulation. The circuit is simulated by varying the process parameters like Vth (threshold Voltage), tox (oxide thickness), L (channel length) for all the transistors in the Integrate & Sample scheme and circuit parameters like Cint (Table 4) The Fig. 21 shows the low margin (dummy-sensea) versus the high margin (sensef-dummy) simulated at four operating temperatures of 0◦ C, 27◦ C, 50◦ C, 100◦ C. Correct operation requires both these margins to remain positive. The sense amplifier outputs show the expected working of the Integrate & Sample method for the entire temperature range across all the MonteCarlo simulations (Fig. 22).

0.26 temp=0C temp=27C temp=50C temp=100C

sensef−dummy(in V)

0.24 0.22 0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.04

0.06

0.08 0.1 0.12 dummy−sensea(in V)

0.14

0.16

Fig. 22 Margins in Integrate & Sample scheme, Monte Carlo simulations for worst case process for four different temperatures

1. Baumann R (2001) Soft errors in advanced semiconductor devices part I: the three radiation sources. IEEE Trans Device Mater Reliab 1:17–22 2. Blaauw D, Kalaiselvan S, Lai K, Ma WH, Pant S, Tokunaga C, Das S, Bull D (2008) Razor II: in situ error detection and correction for PVT and SER tolerance. In: IEEE international solid-state circuits conference 3. Ernst D, Das S, Lee S, Blaauw D, Austin T, Mudge T, Kim NS, Flautner K (2003) Razor: circuit-level correction of timing errors for low-power operation. IEEE Micro 24(6):10–20 4. Freeman LB (1996) Critical charge calculations for a bipolar SRAM array. IBM J Res Develop 40(1):119–129 5. Harminder D, Sylvester D, Blaauw D (2005) Gate-level mitigation techniques for neutron-induced soft error rate. In: International symposium on quality electronic design, pp 175–180 6. Hazucha P, Karnik T, Maiz J, Walstra S, Bloechel B, Tschanz J, Dermer G, Hareland S, Armstrong P, Borkar S (2003) Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-μm to 90-nm generation, international electron devices meeting, pp 523– 526 7. Kiran M, Amrutur B, Parekhji R (2008) False error study of on-line soft error detection mechanisms. In: International online testing symposium 8. Kunal Ganeshpure AS, Kundu S (2007) On accelerating softerror detection by targeted pattern generation, ISQED 9. Lisboa CA, Kastensmidt FL, Henes Neto E, Wirth G, Carro L (2007) Using built-in sensors to cope with long duration transient faults in future technologies. In: Proc IEEE Int’l Test Conf, pp 1–10 10. May TC, Woods MH (1979) Alpha-particle-induced soft error in dynamic memories. IEEE Trans Electron Devices 26(1):29 11. Nicolidis M (1999) Time redundancy based soft error tolerance to rescue nanometer technologies. In: VLSI test symposium, pp 86–94 12. Rossi D, Omana M, Metra C, Pagni A (2006) Checker no-harm alarm robustness. In: International online testing symposium, pp 10–12 13. Satish Y, Amrutur B, Parekhji R (2007) Modified stability checking for on-line error detection. In: International conference on VLSI design, pp 787–792 14. Seifert N, Slankard P, Kirsch M, Narasimham B, Zia V, Brookreson C, Vo A, Mitra S, Gill B, Maiz J (2006) Radiation induced soft error rates of advanced CMOS bulk devices. In: International reliability physics symposium, pp 217–225 15. Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L (2002) Modeling the effect of technology trends on the soft error rate of combinational logic. In: International conference on dependable systems and networks, pp 389–398 16. Taber A, Normand E (1993) Single event upset in avionics. IEEE Trans Nucl Sci 40(2):120–125 17. Tosaka Y, Kanata H, Satoh S, Itakura T (1999) Simple method for estimating neutron-induced soft error rates based on modified BGR Model. IEEE Electron Device Lett 20(2):89–91 18. Tsiatouhas Y, Arapoyanni A, Nikolos D, Haniotakis Th (2002) A hierarchical architecture for concurrent soft error detection based on current sensing. International online testing workshop, p 56

J Electron Test (2010) 26:323–335 19. Ziegler JF, Curtis HW, Muhlfeld HP, Montrose CJ, Chin B (1996) IBM experiments in soft fails in computer electronics (1978–1994). IBM J Res Develop 40(1):3–18

Kiran Kumar Reddy got his B.E in ECE from Vasavi college of Engineering, Hyderabad, India in 2006 and M.tech in Microelectronics from IISc, Bangalore, India in 2008. He is currently working as Electrical Design Engineer in Cypress Semiconductor, Bangalore, India.

Bharadwaj S. Amrutur got his B.Tech in CS&E from IIT Bombay at Mumbai, India in 1990 and a MS and Ph.D in EE from Stanford University in 1994 and 1999 respectively. He

335 is currently working as Assistant Professor, ECE Dept, IISc Bangalore. Rubin A. Parekhji is a Senior Member of Technical Staff at Texas Instruments, Bangalore (India), driving DFT and test methodology, technology and automation into different TI products for various applications. He has published papers and delivered tutorials in several international conferences, and has been on the technical program committee for some of them. He is on the editorial board of two referred journals, and has contributed a chapter on “Embedded Core and System-onchip Testing” in Springer’s new book “Advances in Electronic Testing: Challenges and Methodologies.” He has been a visiting faculty member at Indian Institute of Technology, Bombay (India), and Indian Institute of Science, Bangalore, (India) , and has guided several master’s students. He has a Ph.D. degree from IIT Bombay.