Microelectronics Reliability 54 (2014) 1613–1626
Contents lists available at ScienceDirect
Microelectronics Reliability journal homepage: www.elsevier.com/locate/microrel
Fault-tolerant TMR and DMR circuits with latchup protection switches V. Petrovic´, G. Schoof, Z. Stamenkovic´ ⇑ IHP, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany
a r t i c l e
i n f o
Article history: Received 5 December 2013 Received in revised form 1 April 2014 Accepted 1 April 2014 Available online 26 April 2014 Keywords: Single event effect Fault tolerance Triple and double modular redundancy Latchup protection switch ASIC design
a b s t r a c t The paper presents CMOS ASICs which can tolerate the single event upsets (SEUs), the single event transients (SET), and the single event latchup (SEL). Triple and double modular redundant (TMR and DMR) circuits in combination with SEL protection switches (SPS) make the base of the proposed approach. The SPS had been designed, characterized, and verified before it became a standard library cell. A few additional steps during logic synthesis and layout generation have been introduced in order to implement the redundant net-lists and power domains as well as to place the latchup protection switches. The approach and accompanying techniques have been verified on the example of a shiftregister and a middleware switch processor. Ó 2014 Elsevier Ltd. All rights reserved.
1. Introduction Malfunctions of electronic devices due to the single event effects, being an effect of radiation, are observed not only in cosmic and airborne equipment, but also in mainstream applications. Together with the progressing integration and scaling of the electronic chips their susceptibility to errors increases. The current space microelectronics development and high energy physics research require shorter design time and cheaper solutions for the radiation and fault tolerant ASIC designs [1–3]. The main idea is to provide ASICs capable of correct and reliable functioning in the radiation environment using the standard semiconductor technologies and design flow. Therefore, the design of the advanced fault-tolerant digital integrated circuits needs scientific research and progress concerning three important aspects: a. Analysis of irradiation effects and circuit faults. b. Development of fault models and simulation test benches. c. Design of fault-tolerant circuits and systems. Definition and description of the basic irradiation effects and fault mechanisms can be found in the literature [4–6]. The thesis referenced in [7] provides an overview of different radiation environments and investigates interaction mechanisms between energetic particles and the matter. It also introduces a new ⇑ Corresponding author. Tel.: +49 3355625726. E-mail addresses:
[email protected] (V. Petrovic´), schoof@ ihp-microelectronics.com (G. Schoof),
[email protected] (Z. Stamenkovic´). http://dx.doi.org/10.1016/j.microrel.2014.04.001 0026-2714/Ó 2014 Elsevier Ltd. All rights reserved.
semi-empirical model for estimating the electronic stopping force of heavy ions in solids. The most common irradiation effect is a single event effect (SEE) induced by a cosmic particle strike. Three common types of SEE are known: single event upset (SEU), single event transient (SET) and single event latchup (SEL). A single event upset causes the change of state in a storage element. It affects the memory cells and sequential logic. A single effect transient causes a short impulse (and wrong logic state) at the combinational logic output. The wrong logic state will propagate in case that it appears during the active clock edge. On the other hand, a single event latchup causes the excessive current flow through a NPNP structure in CMOS circuits. The single event latchup, compared to the SEU and SET, is a potentially destructive state [8–11]. In order to analyze the final impact of the single event effects and provide the high fault coverage at circuit and system levels, it is necessary to develop practical, accurate, and simple simulation fault models. A good fault model should keep the correct function of a circuit (NAND, NOR, flip-flop, etc.) in case of no fault and cause the circuit malfunction in case of a fault. Beside the logic function, the most relevant modeling parameters are the effect duration, the minimal time of current discharge, and the current pulse intensity. The easiest way to simulate a fault in the logic circuit is use of a fault injector: XOR or XNOR gate. Fault injectors can be added once the logic synthesis has been completed and a design net-list has been generated. Computer parsers are used for automatic insertion of the fault injectors [12–15]. Simulation fault-injection is the most used approach for validation of the fault-tolerant systems. The proposed fault models can be used during behavioral and net-list simulation. The models are usually described in a hardware description language (VHDL or Verilog).
1614
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
We can clearly separate the known SEU, SET, and SEL faulttolerant techniques into the two categories: circuit level techniques [16] (hardened-cell design [17], triple modular redundancy (TMR) [18], double modular redundancy (DMR) [19], and error detection and correction for memories [20]) and layout level techniques [21] (enclosed layout transistors, guard rings, and trench isolation). There are also many mitigation techniques based on the expensive technology changes [21]. When it comes to SEU and SET, the most common fault-tolerant techniques are TMR and DMR. The triple modular redundant circuit consists of three identical modules and a 3-input majority voter. The voter’s function is to pass through the major input value to the output. As we speak about digital circuits, the modules are memory elements such as flip-flops or latches. The main disadvantage of this technique is that the system fails in case of a faulty voter. Therefore, a new triple voting logic was developed to complete the circuit redundancy. Each of the three voters is fed from outputs of all three memory modules. This technique is known in the literature as the full triple modular redundancy. A detailed analysis of the triple modular redundancy was presented in [22]. The higher redundancy provides the higher fault tolerance and circuit reliability but increases the chip area, energy consumption, and costs. Therefore, the goal is to trade-off between the redundancy level and the reliability requirements taking into account the prospective application. In order to reduce the high hardware overhead produced by TMR and keep the design reliability high, the double modular redundancy with self-voting was developed [19]. A novel double/triple modular redundancy (DTMR) technique for dynamically reconfigurable devices tested on a finite state machine was presented in [23]. The SEL mitigation techniques can be classified in the three main groups: a. First approach uses the current sensors at board-level to detect the excessive current induced by latchup. The power supply of the affected device is switched off and, after a long enough period of time, reestablished again. This approach suffers from a serious drawback: the circuit state is destroyed and cannot be recovered. b. Second approach [24] is based on introduction of an epitaxial buried silicon layer and reduction of the well resistivity. However, this modification incurs additional costs and may impact circuit performance (breakdown voltage, for example). c. Third approach [25] uses guard rings (additional N-type and P-type regions) that break the parasitic bipolar transistor structure. This solution is very efficient but can result in excessive circuit area and, therefore, price. A new SEL mitigation scheme that combines error correcting codes with intelligent power line implementation was presented in [26]. It prevents the circuit damage and corrects errors caused by latchup in a transparent manner. The area cost is low and the scheme can also be used to correct SEUs and reduce power dissipation. In spite of the huge research efforts and respectable scientific achievements, there are still challenges regarding the use of commercial ASIC technologies in space and safety-critical applications. The most important are: a. The radiation-hardened technologies are expensive (qualification and quality requirements are severe) and commercially not attractive (small volume production). b. There is no standard integrated framework of circuit design techniques that provides simultaneous SEU, SET, and SEL fault-tolerance.
c. There is no standard design automation flow of faulttolerant digital ASICs and SOCs. This paper presents a design flow for fault-tolerant ASIC that is based on the redundant circuits with latchup protection and additional implementation steps during logic synthesis and layout generation. The proposed approach protects ASICs from the SEU, SET, and SEL faults by combining and integrating different faulttolerant techniques. The design methodology has been described in [27]. The following paper sections describe a SEL protection switch (Section 2), redundant (TMR and DMR) circuits with latchup protection (Section 3), and test results of the implemented faulttolerant circuits (Section 4). The conclusions are summarized in Section 5.
2. SEL protection switch Instead of using the external current sensors and expensive technological modifications, we have introduced a single event latchup protection switch (SPS) which can be integrated with standard cells and control their supply current. The proposed SEL protection switch [28,29] is shown in Fig. 1. The function of this circuit is to switch off the power supply in case of the excessive supply current. A current sensor/driver (P5 transistor) is used for supplying the power to the logic that needs to be protected from latchup. The P5 transistor operates in the linear (ohmic) region. It must provide enough current during normal operation (as a driver) and survive a potential latchup (as a sensor). The three terminals of the circuit from Fig. 1 have the following functions: ‘‘Poff’’ is used to switch off the controlled section and activate the power saving mode. ‘‘TSTOP’’ comes from a power network controller and is used to reconnect the power supply to the standard cells that have safely recovered after the latchup. ‘‘TSTART’’ informs the power network controller that the latchup has been detected in the corresponding standard cells. When all the circuits operate correctly, both inputs of the NOR gate (P1, P2, N2, and N3 transistors) are at logic 0 and the output (B node) is therefore at logic 1. The C node is set to logic 0 because the P4 transistor is off (there is no current through the R1 pull-down resistor). This activates the P5 transistor which drives enough current (and voltage) at the VDD_a output. Logic 1 at VDD_a deactivates the P6 transistor and the A node is set to logic 0. The TSTART output stays at logic 1 that indicates no latchup condition. In case of the excessive supply current, the output voltage of the P5 transistor increases what, in turns, causes the voltage drop of the transistor drain (VDD_a output). As soon as the VDD_a voltage is under the threshold voltage, the feed-back line activates the P6 transistor and the A node goes to logic 1. It activates the N0 transistor that sets the TSTART output to logic 0 and informs a power network controller that the latchup has occurred. On the other side, the output of the NOR gate (B node) is set to logic 0 and the P4 transistor is activated. The TSTOP input is set to logic 0 by the power network controller. Therefore, after the latchup has been detected, the C node goes to logic 1 and forces the VDD_a output to logic 0 (the power supply is cut off). After a programmed time, the protection phase is finished and the power network controller sends a pulse to the TSTOP input. The P3 transistor is deactivated and there is no current through the R1 pull-down resistor. The C node is again at logic 0 and the P5 transistor can supply the connected logic. If the controlled section (circuit) needs to be switched off for a longer period of time, the power network controller sets the Poff input to logic 1. The switch had been characterized and verified before it became a standard cell of our technology library [30]. Fig. 2 presents
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1615
Fig. 1. SPS circuit schematic [28,29].
Fig. 2. SPS timing diagram.
a timing diagram of the SPS input and output signals. The TRIGGER signal (generated externally) initiates the latchup. The time required to set the VDD_a output to ground level is defined as a power off time (POFT). This time added to the activation time of the P6 and N0 transistors is (necessary to activate TSTART) makes a latchup recognition time (LRT). Once the SPS has been in protection mode, it can be waked up by sending a positive pulse to the TSTOP input. If the total time necessary to deactivate the P3 transistor and activate the P5 transistor is a power on time (PONT), by adding the total time required for deactivation of the P6 and N0 transistors to the PONT, we define a protection deactivating time (PDT). The minimal TSTOP pulse duration must be longer than the PDT. Fig. 3 shows a transient response (timing of the drain current and voltage) of the P5 transistor after the latchup has occurred and the SPS has turned to protection mode. Timing parameters of the fabricated switch have been measured by a digital storage oscilloscope. Afterwards, the measured parameter values have been corrected due to the parasitic capacitances and resistances of cables and connectors. These values are shown in Table 1. A burn off test (destruction) of the current sensor/driver (P5 transistor) has also been performed to prove its resistivity to latchup. The transistor burn off is caused by thermal breakdown. The schematic of a test circuit used for burn off tests is presented in Fig. 4.
Circuit output (the ‘‘out’’ pin) is grounded. Logic ‘0’ at the ‘‘control’’ pin activates the PMOS sensor/driver transistor. The output current rises until the thermal breakdown occurs. The time that takes between transistor activation and transistor breakdown is a transistor burn off time. Table 2 presents the burn off times of three test transistors which can be used as current sensors/drivers. The simulated worst-case power consumption of the SPS is 750 lW in the normal condition and 3.16 mW in the latchup condition. However, the measured power consumption is about 500 lW in the normal condition and goes up to 1.25 mW when the latchup occurs. The difference between power consumption values in normal and latchup conditions comes from a voltage drop across the R1 pull-down resistor. Timing measurements are done in order to characterize the SPS circuit as a standard cell for use in an automated design flow. The measured SPS cell signals are depicted in Fig. 5. The latchup TRIGGER signal activates the reed-relay used as a short-circuit switch and synchronizes the measurement equipment. The VDD_a, TSTOP, and TSTART signals are already described in details. The Poff signal was active all the time. Three groups of test circuits are produced in standard IHP 0.25 lm process [30], in order to prove the functionality of SPS cells. The irradiation tests of SPS cells were performed in the Radiation Effects Facility at the Cyclotron Institute of Texas
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1616
Fig. 3. Drain current and voltage of the P5 transistor after the latchup occurrence.
Table 1 SPS timing parameters.
3. Redundant circuits with latchup protection
POFT
LRT
PONT
PDT
120 (ps)
440 (ps)
1.42 (ns)
1.71 (ns)
Redundancy is always required for a fault-tolerant design. The higher redundancy, the better protection, but also the larger chip area, power consumption, and cost are. Therefore, it is necessary to trade-off between the redundancy rate and the cost for each and every application. However, the redundant circuits (TMR and DMR) need to be protected from the latchup condition too. This requires significant modifications of the standard TMR and DMR circuit designs. We have proposed the SEL protection switch and the independent redundant power domains as a foundation of the latchup protected redundant circuits [29,31]. 3.1. TMR circuit with latchup protection
Fig. 4. Burn off test circuit.
Table 2 Burn off time of test transistors (the length of transistor channel is 0.25 lm). Width of transistor channel (lm) tburn-off (ns)
83.32 21
166.64 30
249.96 38
A&M University in College Station, Texas (USA). The SEL tests used ion energies up to 74.8 MeV cm2/mg in different thermal conditions. The ion cocktail was composed of different radiation sources: 4Ne (2.6 MeV cm2/mg), 40Ar (8 MeV cm2/mg), 63Cu (18.7 MeV cm2/mg), 63Cu (26.4 MeV cm2/mg – tilt 45°), 84Kr (26.6 MeV cm2/mg), 84Kr (37.6 MeV cm2/mg – tilt 45°), 141Pr (56 MeV cm2/mg), 181Ta (74.8 MeV cm2/mg), 197Au (82.8 MeV cm2/mg), and 197Au (117.1 MeV cm2/mg – tilt 45°). The effective linear energy transfer (LET) was in range of 2.6 MeV cm2/mg up to 82.8 MeV cm2/mg. All the SPS cells operated correctly during the irradiation tests.
A modified version of the TMR circuit, which includes the latchup protection, is based on the redundant power supply lines and the supply current control. The proposed protection technique can handle destructive latchup effects without losing the circuit functionality. Therefore, when the latchup occurs in one of three power domains, other two power domains are not affected. Fig. 6 presents a TMR circuit with three SEL protection switches which control the three independent power domains (VDD_a, VDD_b, and VDD_c). No additional voter is required within a redundant circuit. Only the output pad needs a control signal generated by an additional voter. This additional voter is supplied by any of the three redundant power domains. In case of the latchup (high supply current) in one of the three power domains, the corresponding switch breaks off the supply line of this domain but the other two power domains (and redundant circuits) continue with normal operation. The outputs of the ‘‘switched off’’ circuit are set to the low logic level. Since one of the three redundant modules is switched off, this TMR circuit operates as a DMR circuit and it stays so as far as the latchup protection is active. The problem can arise if a radiation particle hits the operating part of digital circuit and induces a transient or an upset. In this case, the presented circuit cannot operate properly any more. A TMR circuit with power redundancy operates properly if: a. One or two of three power domains are switched off. b. Only one flip–flop fails due to an upset.
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1617
Fig. 5. SPS signals and controlled voltage (VDD_a) during latchup protection phase.
Fig. 6. TMR circuit with the latchup protection switches.
c. Only one voter fails due to a transient. It means that an additional logic (named ‘‘Data Keep’’) is necessary to compare the logic states at all the three voter inputs with the states at the flip-flop input and output during an active clock edge.
Fig. 7 presents a functional diagram of the ‘‘Data Keep’’ logic. It observes the logic states and supply conditions of the redundant circuits. This logic compares the redundant flip-flop outputs, power supply relevant signals, and input and output of the flip-flop belonging to the same power domain to secure the TMR failurefree operation.
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1618
Fig. 7. ‘‘Data Keep’’ logic for a TMR circuit.
It is important to notice that the outputs of sequential elements are dependent on the power supplies (depicted by red and blue dotted lines in Fig. 7). The most critical case is when a SET occurs near an active clock edge. There are two possible faults: the metastability or wrong logic state (upset) of the flip-flop. Both can induce the circuit’s failure. The ‘‘Pulse’’ signal enables a latch component to save the data during an active clock edge. In order to save the correct value, the signals of the other redundant circuits are used to control the latch. Therefore, the ‘‘Data Snapshot’’ block is a circuit composed of a latch and combinational logic. In case of the latchup, the ‘‘Extended Voter’’ enables the logic value coming from the ‘‘Data Snapshot’’ to propagate. The ‘‘MUX’’ defines the correct value at the third input of the ‘‘Extended Voter’’ in case that one of the redundant domains is switched off. If both redundant power domains are switched off, the ‘‘Pre-Data Out’’ block bypasses the primary flip-flop output to the main output. The logic states of the TMR ‘‘Data Keep’’ are presented in Table 3. It operates in three different power modes: – Only the primary power domain is active. – Both redundant and primary power domains are active. – Primary and one redundant power supply are active. ‘‘X’’ denotes a do not care state, FF1 is the primary flip-flop, FF2 and FF3 are redundant flip-flops, and ‘‘–’’ denotes an inactive output.
Table 3 Logic states of the TMR ‘‘Data Keep’’. Active power domains
Input data primary flip-flop (latch)
FF1 output
FF2 output
FF3 output
Main output
Primary
X
FF1
–
–
FF1
Primary and both redundant
X X X X X X X X
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
0 0 0 1 0 1 1 1
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
– – – – – – – –
0 0 0 1 0 1 1 1
Primary and one redundant
The probability that ‘‘Data Keep’’ logic works without any fault (PDK0) is described by:
PDK0 ¼ P3FF PV PDS PEV PPDO
ð1Þ
where PFF is the fault-free probability of a flip-flop, PV is the faultfree probability of a voter, PDS is the fault-free probability of ‘‘Data Snapshot’’, PEV is the fault-free probability of ‘‘Extended Voter’’, and PPDO is the fault-free probability of ‘‘Pre-Data Out’’. The circuit still operates correctly if one flip-flop is faulty. The following equation describes this case:
PDK1FF ¼ ð1 PFF ÞP2FF PDS PV PEV PPDO
ð2Þ
If the main voter is faulty and all the other components are fault-free, the ‘‘Data Keep’’ still operates correctly. This case is covered by:
PDK1V ¼ ð1 PV ÞP3FF PDS PEV PPDO
ð3Þ
The ‘‘Extended Voter’’ should be designed to handle possible radiation effects in silicon devices. The ‘‘Pre-Data Output’’ is used to select the redundant power domains. It is also used to provide filtering of the transient effects, coming from the ‘‘Extended Voter’’, before the logic state is defined at the outputs of the sequential element. The next equation describes analytically an interesting case when one flip-flop and the main voter are faulty at once:
PDK1FF1V ¼ ð1 PFF Þð1 PV ÞPFF PDS PEV PPDO
ð4Þ
The failure-free probability of the ‘‘Data Keep’’ logic can be calculated using the following equation (all the fault-free probabilities are expressed in terms of the fault-free probabilities of the flip-flop and voter, the fault-free probability of ‘‘Data Snapshot’’ is identical to the fault-free probability of a flip-flop (PDS = PFF), the ‘‘Extended Voter’’ fault-free probability is identical to the voter fault-free probability (PEV = PV), and the fault-free probability of ‘‘Pre-Data Out’’ is ideal (PPDO = 1))
PDK ¼ PDK0 þ 3PDK1FF þ P DK1V þ 3P DK1FF1V ¼ P4FF P2V þ 3ð1 PFF ÞP3FF P2V þ ð1 PV ÞP4FF PV þ 3ð1 PFF Þð1 PV ÞP2FF P V
ð5Þ
After defining the failure-free probability of the ‘‘Data Keep’’ logic, it is possible to calculate the failure-free probability of a TMR circuit with latchup protection. As we have all the required faultfree probabilities, the failure-free probability of a TMR circuit with the ‘‘Data Keep’’ logic can be calculated as:
PTMR ¼ P3FF P3V P3DK þ 3P2FF P2V P2DK ð1 PFF PV PDK Þ ¼ 3P 2FF P2V P 2DK 2P3FF P 3V P3DK
ð6Þ
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
3.2. Self-voting DMR circuit with latchup protection In order to reduce the hardware overhead produced by the TMR and to keep the design reliability high, the self-voting DMR proposed in [19] is used. Self-voting is based on a 3-input majority voter configured to vote on the two external input values and its own output value. However, a circuit with the self-voting, without modifications in the feed-back line, has low failure-free probability. The reason is high circuit sensitivity on the transients occurring near an active clock edge. The setup/hold time margins are in this situation violated and flip-flops are unable to save the correct state (Fig. 8). This is also a potential upset initialized by combinational logic. In case that a transient effect occurs in the combinational logic during an active clock edge, flip-flops save the different logic states. This is also potential upset initialized by combinational logic. Depending on the previous logic state, the fault can propagate through the circuit causing the failure. To mitigate the described effect and handle the transients with longer duration than the flip-flop setup/hold time, it has been recommended to use the logic state of the flip-flop input as the third input of the voter during the setup/hold time [32]. After the setup/ hold time is over, the voter output can be again used for voting until the next active clock edge. In order to provide the latchup protection, the DMR circuit with self-voting needs additional components: the latchup protection switches and control logic (named ‘‘Data Keep’’). Fig. 9 shows a DMR circuit with the ‘‘Data Keep’’ logic which controls the circuit operation in the latchup protection phase. The DMR circuit includes two SEL protection switches which control the two independent power domains (VDD_a and VDD_b). No additional voter is required within a redundant circuit. Only the output pad needs a control signal generated by an additional voter. This additional voter is supplied by any of the two redundant power domains. The SEL protection switch cuts off the supply line of the power domain (and redundant circuit) which suffers the latchup. It means that the redundancy is lost during the latchup protection phase. The ‘‘Data Keep’’ observes the logic states of all cells in a power domain as well as the status of the redundant power domain. In more detail, it observes the logic states of both flip-flop outputs, the output of the primary voter, and the input of the primary flip-flop. The most critical scenario is a transient which occurs around an active clock edge. In this case, there are two possible consequences: the metastability or the wrong logic state (upset) of a flip-flop. Therefore, the ‘‘Data Keep’’ compares the logic states at the flip-flop outputs (primary and redundant) with input data of the primary flip-flop. In case that a transient occurs within the setup-hold time, the flip-flop saves the wrong logic state. The ‘‘Pulse’’
1619
block provides enough time to save the stable logic state at the flipflop data input. Dependent on the saved logic state, the logic states of both flip-flops and the logic state at the primary voter, the ‘‘Data Keep’’ provides the accurate logic state at the third input of the primary voter. The current status of the redundant power domain must be taken into account too. An important advantage of the presented approach is configurability. In case that the system need not a full protection against the radiation effects (terrestrial operation, for example), it is possible to reduce the power consumption by selecting a single mode operation. The single mode operation disables the protection against all the three radiation effects – transients, upsets, and latchup. In this mode, the ‘‘Data Keep’’ ignores the signals which come from the redundant circuit. The ‘‘Data Keep’’ logic consists of a multiplexer, memory element (latch), and voter. A functional diagram of ‘‘Data Keep’’ logic is presented in Fig. 10. During the clock transition, a short pulse is generated to control a multiplexer and send data from the flip-flop input to the third voter input. In case of a SET during this short period of time, regardless of the wrong state of the flip-flop output, the third voter input receives the correct logic value. This approach works fine for transient pulses of a few hundreds of picoseconds, what is limited by the flip-flop setup/hold time. Table 4 shows the logic states of the DMR ‘‘Data Keep’’ logic for two different power modes: – Only the primary power domain is active. – Both power domains are active. ‘‘X’’ denotes a do not care state, FF1 is the primary flip-flop, FF2 is the redundant flip-flop, and ‘‘–’’ denotes an inactive output. In case that both power domains (primary and redundant) are switched off, all the nodes of the redundant circuit will be set at the low logic state. The ‘‘Data Snapshot’’ block is used to provide the correct logic state during the complete clock cycle. The saved logic state is dependent on the logic states of the primary digital circuit as well as redundant flip-flop. The ‘‘Extended Voter’’ is used to provide the voting procedure based on the four digital states. It also provides more accuracy and, therefore, a higher fault-free probability. A standard TMR voter is used because of the low glitch probability and availability as a standard cell. The ‘‘Pre-Data Out’’ block is used to select the type of circuit protection. In case that the circuit is in the single mode, the data will be taken directly from the primary flip-flop without involving any dependency on the redundant circuit. In case that the system operates in the protection mode with activated redundancy, the ‘‘Pre-Data Out’’ receives the data which
Fig. 8. Fault propagation in a DMR circuit with self-voting.
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1620
Fig. 9. Self-voting DMR circuit with the latchup protection switches.
Fig. 10. ‘‘Data Keep’’ logic for a DMR circuit.
Table 4 Logic states of the DMR ‘‘Data Keep’’. Active power domains
Input data primary flip-flop (Latch)
FF1 output
FF2 output
In case that a memory element (flip-flop) is faulty, ‘‘Data Keep’’ still operates correctly and the failure-free probability is:
Main output
Primary
X
FF1
–
FF1
Primary and redundant
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
0 0 0 1 0 1 1 1
PDK1FF ¼ ð1 PFF ÞPFF PDS PV PEV PPDO
When a voter is faulty, ‘‘Data Keep’’ operates correctly only if all the other components are fault-free:
PDK1V ¼ ð1 PV ÞP2FF PDS PEV PPDO
PDK0 ¼
ð9Þ
Finally, even if a flip-flop and a voter are faulty at the same time, the ‘‘Data Keep’’ block provides the correct logic state:
PDK1FF1V ¼ ð1 PFF Þð1 PV ÞPFF PDS PEV PPDO
is dependent on the primary and redundant circuits. It is important to note that in this circuit type, a ‘‘voting loop’’ approach has been implemented. In case that a fault occurs in the ‘‘Pre-Data Out’’ block, the main voter detects an error and the ‘‘Extended Voter’’ immediately corrects the error. On the other hand, if an error occurs in the main voter, the ‘‘Extended Voter’’ should correct it. The probability that the complete ‘‘Data Keep’’ block operates correctly is given by:
P2FF PDS PV PEV PPDO
ð8Þ
ð7Þ
ð10Þ
Having assumed that the fault-free probability of ‘‘Data Snapshot’’ is identical to the fault-free probability of a flip-flop (PDS = PFF), the ‘‘Extended Voter’’ fault-free probability is identical to the voter fault-free probability (PEV = PV), and the fault-free probability of ‘‘Pre-Data Out’’ is ideal (PPDO = 1), the failure-free probability of the ‘‘Data Keep’’ block is defined by:
PDK ¼ PDK0 þ 2PDK1FF þ P DK1V þ 2P DK1FF1V ¼ P2FF PFF P V PV þ 2ð1 PFF ÞPFF PFF PV PV þ ð1 PV ÞP2FF PFF PV þ 2ð1 PFF Þð1 P V ÞPFF PFF P V
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1621
¼ P3FF P2V þ 2ð1 PFF ÞP2FF P2V þ ð1 PV ÞP3FF PV þ 2ð1 PFF Þð1 PV ÞP2FF PV ¼ ð2 PFF ÞP2FF PV
ð11Þ
Now, the failure-free probability of a DMR circuit with latchup protection can be estimated by covering all the cases of the circuit correct operation. The ideal case is when all the components have no faults:
P0 ¼ P2FF P2V P2DK
ð12Þ
In case that one memory element is faulty, the DMR circuit still operates correctly:
P1FF ¼ ð1 PFF ÞPFF P2V P2DK
ð13Þ
The following equations describe the cases when the voter or feed-back line (‘‘Data Keep’’) is faulty:
P1V ¼ ð1 P V ÞPV P2FF P2DK
ð14Þ
P1DK ¼ ð1 PDK ÞPDK P2V P2FF
ð15Þ
When both main voters are faulty, the feed-back circuits can take over this task:
P2V ¼ ð1 P V Þ2 P2DK P 2FF
ð16Þ
Finally, the failure-free probability of a DMR circuit with latchup protection is defined by the following equation:
PDMR ¼ P0 þ 2P1FF þ 2P1V þ 2P 1DK þ P2V ¼ P2FF P2V P2DK þ 2ð1 PFF ÞPFF P 2V P2DK þ 2ð1 PV ÞPV P 2FF P2DK þ 2ð1 PDK ÞPDK P2V P2FF þ ð1 PV Þ2 P2DK P2FF ¼ P2FF P2DK þ 2P2FF P2V PDK þ 2PFF P2V P2DK 4P2FF P2V P2DK
Fig. 11. Failure-free probabilities of the TMR and DMR circuits with latchup protection.
circuit with latchup protection is lower than the one of the standard TMR circuit. The main reason for the failure-free probability decrease is the additional ‘‘Data Keep’’ logic. On the other hand, the failure-free probability of DMR circuit with latchup protection is higher in comparison with the one of the standard DMR circuit. In the DMR circuit with latchup protection, the ‘‘Data Keep’’ logic is implemented in the voter feed-back line. It immediately corrects the transient faults. Additionally, one voter provides the input for the other. This is not the case in the TMR circuit with latchup protection where the serial voting is implemented. If a transient fault occurs in the TMR extended circuit, it will not be corrected. The previous analysis explains why the DMR with latchup protection outperforms the TMR with latchup protection in terms of the failure-free probability.
ð17Þ 4. Test circuits design, implementation and verification
3.3. Comparative analysis of redundant circuits Table 5 presents the calculated SEU and SET failure-free probabilities for the DMR and TMR circuits with latchup protection using Eqs. (6) and (17). The DP parameter quantifies the difference between the failure-free probabilities of the TMR and DMR circuits. These probabilities are only calculated for the five realistic and one ideal values of PFF and the two realistic values of PV. Fig. 11 compares the failure-free probabilities of the TMR and DMR circuits as functions of the fault-free probability of a flip-flop. The voter fault-free probability is almost ideal (PV = 0.99). The standard TMR provides always the higher circuit reliability than the standard DMR. However, the failure-free probability of the TMR
Table 5 Calculated failure-free probabilities for redundant circuits. PFF
PV
PTMR
PDMR
DP
0.95
0.99 0.95
0.9591 0.9044
0.9789 0.9472
0.0198 0.0428
0.96
0.99 0.95
0.9715 0.9215
0.9852 0.9570
0.0137 0.0355
0.97
0.99 0.95
0.9817 0.9370
0.9904 0.9661
0.0087 0.0291
0.98
0.99 0.95
0.9897 0.9508
0.9945 0.9743
0.0048 0.0235
0.99
0.99 0.95
0.9954 0.9629
0.9976 0.9817
0.0022 0.0188
0.99 0.95
0.9988 0.9733
0.9995 0.9882
0.0007 0.0149
1
Fault-tolerant ASICs can be implemented using standard design automation tools [33–35] and introducing a few additional steps in the design flow. An extra step compared to the standard design flow is necessary to generate a new net-list including redundant cells and voters. The other two extra steps (definition of the power domains and placement of the SPS standard cells) have to be made in the layout phase [27]. Table 6 presents features of the primary, TMR, and DMR circuits implemented in standard IHP 0.25 lm process [30]. The area and power estimations include the corresponding area and power consumption of a power network controller. These basic circuits are verified by a fault injection and simulation technique described in [36]. The DMR circuit from Fig. 9 has been implemented and tested in radiation environment. The measured signals of DMR circuit with an integrated SPS are depicted in Fig. 12. Duplication (for DMR) or triplication (for TMR) of the original net-list is performed using a parsing script which can be included in the synthesis script as an optional subprogram. The parser is generated by a set of the GnuWin tools [37]. It takes care of the following:
Table 6 Implementation features of the primary, TMR, and DMR circuits. Implemented circuits
Circuit area (lm2)
Power at 100 MHz (lW)
Flip-flop TMR DMR
155.23 1563.23 1303.94
6.881 155.719 128.213
1622
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
Fig. 12. Measured signals of the DMR circuit during latchup.
Fig. 13. Illustration of the cell placement and power routing.
a. The names of redundant instances in the design net-list must be unique. Identical (redundant) instances from different power domains are marked after the power domain which they belong to. b. The inputs of a TMR voter have to be connected to the outputs of all three redundant flip-flops. c. Two inputs of a DMR voter have to be connected to the outputs of both redundant flip-flops and the third input has to be connected to the voter’s output (directly or by multiplexer). Memories must include the protection bits (parity bits, Hamming code, etc.) and be attached to the EDAC logic in order to tolerate SEUs. Duplication of memories and EDACs is necessary
for the SEL protection [32]. A simple modification of the technology file suffices in case of redefinition of the power domains. After the power planning has been finished, the placement of SEL protection switches (SPS) can be performed. The SPS cells are automatically placed under the crossover points of the power stripes and cell rows (where the filler cells are usually placed). A SPS cell (or a group of SPS cells) protects only one power domain and corresponding redundant logic. SPS cells may not be placed under all power crossover points. The maximal current needed for driving the connected logic defines the required (minimal) SPS output current and the total number of SPS cells in a design. Fig. 13 presents the illustration of a DMR circuit with placed SPS cells. A SPS cell controls the power supply of all the cells in a redundant row section (for example, the SPS_1 controls the power
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1623
Fig. 14. Power network controller.
Fig. 15. Schematic (left) and layout (right) of the TMR 4-bit shift-register.
domain of FF1_a, SP1_a, FF2_a, SP2_a, FF3_a, and SP3_a). Similar SPS placement can be done in case of the generation of a TMR layout. After the SPS cells have been placed and connected to the corresponding power domains, the design flow continues with the standard steps: cell placement, clock tree synthesis, and routing with all the required timing optimizations. The flow ends by the design rule and layout-versus-schematic checks. As it is already mentioned in Section 2, the power network controller (PNC) switches the circuit between different operational modes (active, latchup protected, or power off) and defines their duration. It consists of three components: a programmable counter, a status/control unit, and a control register. A block diagram of the power network controller is shown in Fig. 14. The programmable counter sets the period of latchup protection mode depending on the chosen technology. The status/control unit
is composed of control interfaces which communicate with the SPS cells via an input signal (TSTART) and two output signals (TSTOP and Poff). If the latchup has been detected, the TSTART signal activates the corresponding control interface which has to save the current counter value. When the counter reaches again the saved value (the protection phase is over), the control interface activates the TSTOP signal. Consequently, the SPS cell switches again on the power supply of the corresponding row section. In this way, it is possible to detect multiple latchup effects. The control register generates the Poff signal and can be accessed by the system processor. The power network controller has an independent clock. Some more complex circuits have been designed, implemented, and verified by SPICE simulations. In order to prove the proposed concept and protection techniques, two 4-bit shift-register circuits equipped with the latchup protection have been designed,
1624
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
Fig. 16. Schematic (left) and layout (right) of the DMR 4-bit shift-register.
Fig. 17. Timing simulations of the TMR shift-register signals.
implemented, and tested. Circuit schematic and layout of the TMR shift-register are shown in Fig. 15. Circuit schematic and layout of the DMR shift-register are shown in Fig. 16. The circuits have been verified on a testbench that forces the checkerboard test pattern at the shift-register inputs. Figs. 17 and 18 present SPICE timing simulations of the shift-register signals when one or all redundant sections suffer latchup. The signal names are defined after the flip-flop names from Figs. 15 and 16. The SPS 1 and SPS 3 cells control the a section. This section is switched off after the latchup has been triggered. The b section (controlled by the SPS 2 and SPS 4 cells) and the c section (controlled by the SPS 5 and SPS 6 cells) keep working properly. Of course, the DMR shift-register has just two sections (a and b) and four protection switches (SPS 1 to SPS 4). After all the sections have been switched off and, after some time, switched
on again, the circuit logic states are reset and the shifting process continues. The other two circuits (TMR and DMR 8-bit shift-registers) have been implemented and simulated for sake of comparison. Table 7 presents the circuit area and power consumption of the nonredundant, TMR, and DMR designs. The proposed fault-tolerant DMR circuits have been used for implementation of a middleware (MW) switch processor described in [38]. The fault-tolerant version of this processor includes a DMR net-list, EDAC logic, and latchup protection switches (SPS). Fig. 19 shows a chip layout of the DMR-based MW switch processor including the power network controller. The PNC has been implemented using the IHP’s radiation hardened technology [30]. In order to easy routing, the PNC is placed around the design core. Features of the implemented processor are presented in Table 8.
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
1625
Fig. 18. Timing simulations of the DMR shift-register signals.
Table 7 Implementation details of 8-bit shift-registers.
Table 8 Implementation details of the standard [38] and DMR MW switch processor.
Design
Circuit area (lm2)
Power at 100 MHz (lW)
Non-redundant TMR DMR
1636.9 18832.5 12065.7
97.2 344.5 257.7
MW switch processor
Core area (mm2)
Core area with PNC (mm2)
Chip area (mm2)
Power at 80 MHz (mW)
Maximal frequency (MHz)
Standard DMR
28.87 86.49
– 110.25
44.62 144.25
347 1063
100 80
Fig. 19. Chip layout of the DMR MW switch processor.
1626
V. Petrovic´ et al. / Microelectronics Reliability 54 (2014) 1613–1626
The SPS cells do not involve any area overhead as they are placed under power stripes instead of the filler cells. The DMR MW switch processor chip integrates 21335 SPS cells in order to provide safe driving (current supply) of the connected logic. 5. Conclusion The standard design flow has been modified to provide the protection of digital ASICs against the known single event effects (SEU, SET, and SEL) occurring in radiation environment. The modifications of the standard flow (generation of a redundant net-list, definition of the power domains, and placement of the SPS cells) are minor and fully automated. The SPS cells have been characterized and verified by measurements in radiation environment. The results, based on the redundant circuit measurements and the shift-register simulations, have shown that the SEL protection switches integrated with TMR or DMR circuits seem to be a promising solution when it comes to the design of fault-tolerant circuits and systems. The fault-tolerant DMR middleware switch processor has been designed and implemented to prove the proposed concept and protection techniques. The fault injection and simulation results have shown no failure of the processor in all test cases. However, one needs to keep in mind a high price paid for achieving the highest level of fault tolerance: chip area overhead of three to eight times, power increase of two to four times, and maximal frequency reduction of 20–40% depending on the CMOS technology node. There are two reasons for the area and power overhead: one is an additional control logic used to recover the logic state of a redundant circuit after the latchup protection phase is over (the ‘‘Data Keep’’ which consists of a latch, voter, and multiplexer) and the other is a power network controller (PNC) which controls the redundant power domains and the SPS cells. The future work will be focused on improvements of the TMR and DMR circuits and testing of the DMR MW switch processor in a radiation environment. References [1] Dressnandt N, Newcomer M, Rohne O, Passmore S. Radiation hardness: design approach and measurements of the ASDBLR ASIC for the ATLAS TRT. In: IEEE nuclear science symposium conference record, Norfolk, USA; 2002. p. 151–5. [2] Tarrillo J, Chipana R, Chielle E, Kastensmidt FL. Designing and analyzing a SpaceWire router IP for soft errors detection. In: Proc 12th Latin American Test Workshop, Porto de Galinhas, Brazil; 2011. p. 1–6. [3] Jung Y-K. Fault-recovery non-FPGA-based adaptable computing system design. In: Proc second NASA/ESA conference on adaptive hardware and systems, Edinburgh, UK; 2007. p. 709–16. [4] Holmes-Siedle A, Adams L. Handbook of radiation effects. 2nd ed. New York: Oxford University Press; 2002. [5] Nicolaidis M, editor. Soft errors in modern electronic systems. New York: Springer; 2011. [6] Iniewski K, editor. Radiation effects in semiconductors. Boca Raton: CRC Press; 2011. [7] Javanainen A. Particle radiation in microelectronics. PhD thesis, Department of Physics, University of Jyväskylä, Finland, 2012. [8] Troutman RR. Latchup in CMOS technology: the problem and its cure. Boston: Kluwer Academic Publishers; 1986. [9] Voldman SH. Latchup. Chichester: John Wiley & Sons; 2007.
[10] Weste NHE, Harris DM. CMOS VLSI design: a circuits and systems perspective. 4th ed. Boston: Addison-Wesley; 2011. [11] Ye L, Xiaohan G, Weiwei X, Zhiliang H, Killat D. An experimental extracted model for latchup analysis in CMOS process. In: Proc 8th IEEE international conference on ASIC, Hunan, China; 2009. p. 1035–8. [12] Mei-Chen H, Tsai TK, Iyer RK. Fault injection techniques and tools. IEEE Comput 1997;30:75–82. [13] Benso A, Prinetto P. Fault injection techniques and tools for embedded systems reliability evaluation. Dordrecht: Kluwer Academic Publishers; 2003. [14] Sheng W, Xiao L, Mao Z. An automated fault injection technique based on VHDL syntax analysis and stratified sampling. In: Proc 4th IEEE international symposium on electronic design, test, and applications, Hong Kong, 2008. p. 587–91. [15] Sharma A, Singh B. Simulation of fault injection of microprocessor system using VLSI architecture system. In: Proc IEEE TENCON 2009, Singapore; 2009. p. 1–5. [16] Lacoe RC. Improving integrated circuit performance through the application of hardness-by-design methodology. IEEE Trans Nucl Sci 2008;55:1903–25. [17] Baze MP, Buchner SP, McMorrow D. A digital CMOS design technique for SEU hardening. IEEE Trans Nucl Sci 2000;47:2603–8. [18] Lyons RE, Vanderkulk W. The use of triple-modular redundancy to improve computer reliability. IBM J Res Dev 1962;6:200–9. [19] Teifel J. Self-voting dual-modular-redundancy circuits for single-eventtransient mitigation. IEEE Transa Nucl Sci 2008;55:3435–9. [20] Lala PK. A single error correcting and double error detecting coding scheme for computer memory systems. In: Proc 18th IEEE international symposium on defect and fault tolerance in VLSI systems, Cambridge, USA; 2003. p. 235–41. [21] Maki GK, Yeh P-S. Radiation tolerant ultra low power CMOS microelectronics: technology development status. In: Proc NASA earth science technology conference, Hyattsville, USA; 2003. p. 1–4. [22] Rollins N, Wirthlin M, Caffrey M, Graham P. Evaluating TMR techniques in the presence of single event upsets. In: Proc international conference on military and aerospace programmable logic devices, Washington, DC, USA; 2003. p. 1– 5. [23] Shinohara K, Watanabe M. A double or triple module redundancy model exploiting dynamic reconfigurations. In: Proc NASA/ESA conference on adaptive hardware and systems, Noordwijk, Netherlands; 2008. p. 114–21. [24] Estreich DB, Ochoa A, Dutton RW. An analysis of latch-up prevention in CMOS IC’s using an epitaxial-buried layer process. In: Proc IEEE international electron devices meeting, Washington, DC, USA; 1978. p. 230–4. [25] Troutman RR. Epitaxial layer enhancement of N-well guard rings for CMOS circuits. IEEE Electron Dev Lett 1983;4:438–40. [26] Nicolaidis M. A low-cost single-event latchup mitigation scheme. In: Proc 12th IEEE international on-line testing symposium, Lake Como, Italy; 2006. p. 111– 8. [27] Petrovic V, Ilic M, Schoof G, Stamenkovic Z. Design methodology for fault tolerant ASICs. In: Proc 15th IEEE international symposium on design and diagnostics of electronic circuits and systems, Tallinn, Estonia; 2012. p. 8–11. [28] Petrovic V, Ilic M, Schoof G, Stamenkovic Z. Integrated single event latchup protection for asics used in space applications. In: Proc 21th telecommunications forum TELFOR, Belgrade, Serbia; 2013. p. 624–7. [29] Petrovic V, Schoof G, Stamenkovic Z. Redundant circuits with latchup protection. In: Proc 20th IEEE international conference on electronics, circuits, and systems, Abu Dhabi, UAE; 2013. p. 117–20. [30] Leibniz Institute for High Performance microelectronics – IHP, Frankfurt (Oder), Germany. . [31] Stamenkovic Z, Petrovic V, Schoof G. Design flow and techniques for fault tolerant ASICs (invited). In: Proc 20th IEEE international symposium on the physical and failure analysis of integrated circuits, Suzhou, China; 2013. p. 97– 102. [32] Schoof G, Methfessel M, Kraemer R. Fault-tolerant ASIC design for high system dependability. In: Proc 13th international forum on advanced microsystems for automotive applications, Berlin, Germany; 2009. p. 369–82. [33] http://www.synopsys.com/tools. [34] http://www.mentor.com/products. [35] http://www.cadence.com/products. [36] Petrovic V, Ilic M, Schoof G, Stamenkovic Z. SEU and SET fault injection models for fault tolerant circuits. In: Proc 13th biennial baltic electronics conference, Tallinn, Estonia; 2012. p. 73–6). [37] GnuWin packages. . [38] Petrovic V, Ilic M, Schoof G, Montenegro S. Implementation of middleware switch ASIC processor. Telfor J 2012;4:83–8.