IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 12, DECEMBER 2006
Sequential Element Design With Built-In Soft Error Resilience Ming Zhang, Member, IEEE, Subhasish Mitra, Senior Member, IEEE, T. M. Mak, Senior Member, IEEE, Norbert Seifert, Senior Member, IEEE, Nicholas J. Wang, Quan Shi, Kee Sup Kim, Member, IEEE, Naresh R. Shanbhag, Fellow, IEEE, and Sanjay J. Patel, Member, IEEE
Abstract: This paper presents a built-in soft error resilience (BISER) technique for correcting radiation-induced soft errors in latches and flip-flops. The presented error-correcting latch and flip-flop designs are power efficient, introduce minimal speed penalty, and reuse on-chip scan design-for-testability and design-for-debug resources to minimize area overheads. Circuit simulations using a sub-90-nm technology show that the presented designs achieve more than a 20-fold reduction in cell-level soft error rate (SER). Fault injection experiments conducted on a microprocessor model further demonstrate that chip-level SER improvement is tunable by selective placement of the presented error-correcting designs. When coupled with error correction code to protect in-pipeline memories, the BISER flip-flop design improves chip-level SER by 10 times over an unprotected pipeline, with the flip-flops contributing an extra 7–10.5% in power. When only soft errors in flip-flops are considered, the BISER technique improves chip-level SER by 10 times with an increased power of 10.3%. The error correction mechanism is configurable (i.e., it can be turned on or off), which enables the use of the presented techniques in designs that target multiple applications with a wide range of reliability requirements.
Index Terms: Circuit simulation, error correction, fault injection, sequential element design, soft error rate (SER).
I. INTRODUCTION
Soft errors are radiation-induced transient errors caused by neutrons from cosmic rays and alpha particles from packaging materials [1]. Soft error protection is very important for enterprise computing and communication applications since the system-level soft error rate (SER) has been rising with technology scaling and increasing system complexity. Several designs today implement extensive error detection and correction (ECC) mainly for on-chip SRAMs and register-files. However, memory protection is not enough for designs manufactured in advanced technologies because soft errors in flip-flops, latches, and combinational logic, also referred to as logic soft errors, are significant contributors to the system-level SER [2], [3]. Logic soft errors pose a major challenge to robust enterprise computing and networking platform designs. For many designs targeting such platforms, not only the memory elements but also the latches and flip-flops must be protected from soft errors in order to satisfy the system data integrity requirements.
This work presents a new family of sequential element designs with built-in soft error resilience (BISER) that demonstrates four major contributions:
1) error correction technique that achieves a more than 20-fold reduction in the cell-level SER of latches and flip-flops;
2) reuse paradigm that helps lower the power and area penalties of the error correction technique;
3) set of power saving techniques including an economy operation mode;
4) set of fault injection results to illustrate the chip-level effectiveness and power penalty of the BISER technique.
The rest of this paper is organized as follows. In Section II, we describe the principle of the new error-correcting designs. Section III describes the reuse paradigm that further reduces the overhead of the presented technique. Section IV presents the key circuit design considerations and the cell-level characterization results. Section V presents the chip-level SER, performance, power, and area impact of the BISER technique based on fault injection results. Other BISER design variations are briefly described in Section VI. Finally, related work and conclusions are described in Sections VII and VIII, respectively.
Manuscript received August 30, 2005. This work was supported in part by MARCO Gigascale Systems Research Center (GSRC). M. Zhang, T. M. Mak, N. Seifert, Q. Shi, and K. S. Kim are with Intel Corporation, Folsom, CA 95630 USA. S. Mitra is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA. N. J. Wang, N. R. Shanbhag, and S. J. Patel are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA. Digital Object Identifier 10.1109/TVLSI.2006.887832
Fig. 1. Circuit schematic of a representative single-port D-latch.
II. ERROR CORRECTION IN SEQUENTIAL ELEMENTS
A master–slave flip-flop, as implemented in several cycle-based microprocessor designs, is composed of a master latch followed by a slave latch. A representative single-port latch is shown in Fig. 1. The inverter IN1 is used to generate complementary clock signals locally; the transmission gate TG1 allows
1063-8210/$20.00 © 2006 IEEE
ZHANG et al.: SEQUENTIAL ELEMENT DESIGN WITH BISER
Fig. 2. Block diagram of an error-correcting flip-flop design.
TABLE I TRUTH TABLE OF THE C-ELEMENT
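The C-element behaviour that Section II relies on (it inverts when its two inputs agree and holds its previous output when they differ) can be sketched behaviourally as follows. This is only an illustrative model, not the authors' circuit: the function names `c_element` and `biser_output` and the `upset` argument are hypothetical, and the keeper is modeled simply as "retain the previous Q".

```python
def c_element(o1, o2, prev_q):
    """Inverting C-element with keeper (behavioural sketch).

    When both inputs agree, the output is their complement; when they
    disagree, neither stack conducts and the keeper retains prev_q.
    """
    if o1 == o2:
        return 1 - o1
    return prev_q


def biser_output(stored_bit, upset=None):
    """Q of the error-correcting flip-flop while the latches hold data.

    O1 is the system latch (PH1) output and O2 is the scan latch (LB)
    output. `upset` is a hypothetical argument that names the single
    latch flipped by a particle strike, or None for no strike.
    """
    # Before any strike both latches agree, so Q is strongly driven.
    q = c_element(stored_bit, stored_bit, prev_q=0)
    o1, o2 = stored_bit, stored_bit
    if upset == "PH1":
        o1 ^= 1
    elif upset == "LB":
        o2 ^= 1
    # After a single upset the inputs disagree and Q is held.
    return c_element(o1, o2, prev_q=q)


for bit in (0, 1):
    for upset in (None, "PH1", "LB"):
        print(bit, upset, biser_output(bit, upset))
```

Under the single event upset assumption only one of the two inputs can change, so the model never reaches the case in which both latches are flipped and the wrong value would be driven.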
data at the D pin to flow through when the clock signal CLK is high; the inverter IN3 and tristate inverter TI1 form a regenerative loop to store the sampled data when CLK is low. The key to BISER is a new flip-flop design, composed of two flip-flops joined with a C-element as shown in Fig. 2. To illustrate how error correction is achieved by the C-element, consider the scenario where a particle strikes one of the four latches when CLK is low. Note that only one latch is affected by a particle strike under the assumption of a single event upset (SEU). When CLK is high, latches LB and PH1 are transparent, and the same data is stored in these two latches (assuming that no soft error has affected PH2 or LA). As shown in Table I, the C-element acts as an inverter when the outputs of LB and PH1 match, and the flip-flop output Q has the correct value. When CLK turns low, latches LB and PH1 hold the stored logic value inside their feedback loops. The contents of latches LB and PH1 are now prone to soft errors, while LA and PH2 are not error-prone because they are transparent and driven by the preceding logic stages. Suppose that a particle strike flips the logic value stored in PH1. The two inputs of the C-element will be different but the error will not propagate to the C-element output. Using the same principle, error correction is also enabled during the scenario where a particle strike occurs when CLK is high. The purpose of the keeper circuit in Fig. 2 is to fight the leakage current in the C-element when both the pull-up and the pull-down paths in the C-element are shut off, which occurs only when the content of one bistable gets flipped by a particle strike. Depending on the process technology and the clock frequency, the keeper structure may not be required. A particle strike in the
keeper circuit will not cause an error at Q, because both O1 and O2 hold the correct logic values under the SEU assumption, and hence, the Q node is strongly driven by the C-element. Simulations have shown that the error-correcting design can achieve an SER reduction of more than 20-fold when compared to an unprotected flip-flop. More details are provided in Section IV. III. REUSE PARADIGM Scan DFT has become a de facto test standard in the industry because it enables an automated solution to high quality production testing at low cost. In addition, scan is extremely valuable for postsilicon debug activities [4] because it provides access to the internal nodes of an integrated circuit. Fig. 3(a) shows the block diagram of a scan flip-flop design of a microprocessor, comprising system and scan portions. Each portion is a master-slave flip-flop composed of two latches. Note that the two-port latches, such as the PH1 block in the system portion or the LA block in the scan portion, can sample from one of the two data lines depending on which clock signal is active. The scan flip-flop in Fig. 3(a) has two operation modes: test and normal. The scan clocks for the test mode are illustrated in Fig. 4. Clocks SCA and SCB are applied along the scan chain [Fig. 3(b)] alternately to shift a test pattern into latches LA and LB. Next, the UPDATE clock is applied to move the contents of LB to PH1. Thus, a test pattern is written into the system portion. Next, functional clock CLK is applied which captures the system’s response to the test pattern. Finally, the CAPTURE clock is applied to move the contents of PH1 to LA. The system response can then be scanned out by alternately applying clocks SCA and SCB. During normal system operation, the scan portion is shut off by assigning zero values to the scan clocks (SCA, SCB, UPDATE, and CAPTURE), while the system portion is being clocked at full speed. 
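The test-mode sequence just described (shift in with SCA/SCB, UPDATE, one functional CLK, CAPTURE, shift out) can be illustrated with a small behavioural model. It is a sketch under stated assumptions rather than the design in Fig. 3: the class `ScanCell`, the helper names, and the `logic` callback standing in for the combinational logic are all hypothetical, and the model assumes that shift-out starts with an SCB pulse so that the captured value in LA is not overwritten.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScanCell:
    la: int = 0   # scan-portion master latch
    lb: int = 0   # scan-portion slave latch (drives the next cell)
    ph1: int = 0  # system-portion slave latch (flip-flop state)


def pulse_sca(chain: List[ScanCell], scan_in: int) -> None:
    # SCA: every LA samples the upstream LB (or the chain input).
    upstream = [scan_in] + [cell.lb for cell in chain[:-1]]
    for cell, bit in zip(chain, upstream):
        cell.la = bit


def pulse_scb(chain: List[ScanCell]) -> None:
    # SCB: every LB samples its own LA.
    for cell in chain:
        cell.lb = cell.la


def run_scan_test(pattern: List[int],
                  logic: Callable[[List[int]], List[int]]) -> List[int]:
    chain = [ScanCell() for _ in pattern]
    # Shift in: the first bit applied ends up in the last cell.
    for bit in reversed(pattern):
        pulse_sca(chain, bit)
        pulse_scb(chain)
    # UPDATE: move the contents of LB to PH1.
    for cell in chain:
        cell.ph1 = cell.lb
    # CLK: the system portion captures the response to the pattern.
    response = logic([cell.ph1 for cell in chain])
    for cell, bit in zip(chain, response):
        cell.ph1 = bit
    # CAPTURE: move the contents of PH1 to LA.
    for cell in chain:
        cell.la = cell.ph1
    # Shift out (assumed to begin with SCB so LA is not overwritten).
    pulse_scb(chain)
    observed = []
    for _ in chain:
        observed.append(chain[-1].lb)
        pulse_sca(chain, 0)
        pulse_scb(chain)
    return observed[::-1]  # reorder so index i matches cell i


print(run_scan_test([1, 0, 1, 1], lambda state: [1 - b for b in state]))
```

With the hypothetical inverting logic used here, the scanned-out response is the bitwise complement of the pattern that was shifted in.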
Fig. 3. (a) Block diagram of a scan flip-flop design. (b) Scan chain.
Fig. 4. Clock waveforms for a scan flip-flop in test mode.
The main reasons for using this style of scan DFT instead of the classical scan flip-flop with a multiplexer are simplified postsilicon debug and at-speed functional testing. The scan portion of the flip-flop design in Fig. 3 is sometimes designed to operate at-speed when scan is used for at-speed functional test [5].
A key observation is that scan resources [e.g., latches LA and LB in Fig. 3(a)] are unused during normal operation, but still occupy chip area and consume leakage power. The concept of reusing scan resources for error correction is illustrated in more detail by the error-correcting scan flip-flop design (denoted EC design) in Fig. 5. The EC design is derived from the scan flip-flop design in Fig. 3 by adding an OR gate to the clock path of LB and an AND gate to the clock path of LA, as well as rerouting the 2D data port of the latch LA. The EC design has three distinct operation modes: normal, test, and economy.
A. Normal and Test Modes
When the EC design operates in normal mode, the scan clocks SCA, SCB, UPDATE, and TEST are forced low, while the CAPTURE signal is high. This effectively converts the scan portion into a master–slave flip-flop that operates in parallel with the system flip-flop. Error correction is then achieved in the same way as described in Section II. The test mode operation of the EC design is activated by forcing the TEST signal high. This ensures that the output of the scan portion, O2, becomes a “don’t care” to the output of the EC design, Q, so that the shifting of a test sequence along the scan chain does not interfere with the operation of the EC design. The clock waveforms for the EC design in test mode remain the same, as shown previously in Fig. 4. The EC design requires an extra control signal TEST at the cell level compared to the scan flip-flop. However, at the chip level, the TEST signal is directly generated from the available test circuitry. Moreover, the TEST signal remains static (either high or low) in any operation mode. Minimal buffering and routing are required since there are no strict timing constraints to meet. As a result, the EC design does not increase the number of timing-critical signals in the system, and hence, does not require major architectural changes.
B. Economy Mode
The economy mode is motivated by the fact that in a modern chip design environment, a single design often targets several application segments at the same time with apparently conflicting requirements. For example, a mobile application (e.g., a laptop) requires very low power consumption but may not have a stringent requirement for low SER. On the other hand, an enterprise application (e.g., data centers) may be less power constrained, but may have to satisfy very stringent data integrity requirements against soft errors. This provides a motivation to introduce an economy mode into the EC design. The system can adaptively switch between the normal and economy modes depending on the criticality of the application. If the application requires high reliability, the system works under normal mode. Otherwise, the system switches into an economy mode, which significantly reduces the power consumption by turning off part of the EC design circuitry. One way to invoke an economy mode for the EC design is to disable their scan portions by assigning proper values to the scan clocks, as illustrated in Fig. 6. The signal CAPTURE is forced low so that the second clock port C2 of latch LA is always low even though the clock signal CLK is still toggling during economy mode. The signal SCB is forced high so that the clock port C1 of latch LB is always high. The combined assignment of CAPTURE and SCB signal values ensures the scan portion, equivalent to a shadow master–slave flip-flop, has an opaque master latch and transparent slave latch. As a result, the scan portion does not consume any dynamic power caused by internal data or clock activities. The signal TEST is also forced high so
Fig. 5. Error-correcting scan flip-flop design.
Fig. 6. Error-correcting scan flip-flop design in economy mode.
that the C-element is disabled and the value of Q depends solely on the operation of the system portion. This added economy mode roughly halves the dynamic power consumption of the EC design when it runs noncritical applications. IV. CIRCUIT DESIGN CONSIDERATIONS AND RESULTS The scan portion in a scan flip-flop [Fig. 3(a)] typically operates at a lower speed than the system clock during test mode [4]. As a result of this relaxed timing requirement, slow transistors (those with smaller channel widths, longer channel lengths, or higher threshold voltages) are sometimes used in the scan portion to lower leakage power during normal mode without affecting its functionality in test mode. The scan portion in an EC
design, on the other hand, needs to meet the same timing constraints (setup time, CLK-to-Q delay, etc.) as the system flip-flop in order to guarantee correct operation. This means that transistors at critical locations within the scan portion of the EC design need to be replaced with the same type as those used in the system portion. Due to the presence of both the nMOS and pMOS stacks in the C-element, it is very difficult to match the delay of a C-element with that of an inverter by only sizing up the transistors. Instead, low-V_T (low-threshold-voltage) transistors are used in the transistor stacks. Furthermore, the keeper circuit needs to be designed with great care. More specifically, the forward inverter is kept at minimal size while the feedback inverter is intentionally weakened by using a larger channel length value. This minimizes the extra
delay caused by contention in the keeper loop when the C-element writes a new value into the keeper circuit. The impact of these circuit modifications is accurately modeled by our simulation methodology, which is described as follows.
Comparisons are conducted between the proposed EC design, a reference scan flip-flop (denoted MSFF), and a dual interlocked storage cell [6] flip-flop with scan circuitry (denoted DICE). The DICE design has been chosen for comparison because it is one of the best known classical circuit hardening techniques. Since the circuit styles and transistor sizes may be significantly different across all designs being compared, we adopt a unified approach previously suggested in [7] to analyze the timing and power of the flip-flops under investigation. Fig. 7 illustrates the test bench we used. We assume that both external capacitive loads are fan-out of 4. We also insert buffers between CLKI (fed by an ideal voltage source) and the CLK pin of the flip-flop, as well as between DI and D. These buffering inverters serve two purposes: 1) provide realistic D and CLK signals to the flip-flop and 2) capture power dissipation differences due to different CLK and D loadings in different flip-flop designs. While optimizing the various flip-flop designs, the objective is to match the timing parameter, D-to-Q delay, with that of the reference design MSFF. We make the following assumptions during the power measurement of all flip-flops: 1) data activity factor (average number of output transitions per clock cycle) is 0.25; 2) low-to-high and high-to-low data transitions are equally likely.
Fig. 7. Simulation test bench for timing and power measurements.
The cell layout areas are estimated by an internal tool at Intel, with a worst case error of 5% compared to real layouts. The SERs are obtained from an internal simulator [8]. Table II shows the simulated cell-level power, area, and SER of all the designs. All the measurements are normalized with respect to those of MSFF.
TABLE II CELL-LEVEL POWER, AREA, AND SER COMPARISONS
The D-to-Q delay parameters are already equalized for all designs and hence not shown. The EC design achieves a 20-fold SER improvement over the reference design (MSFF) with the power and area overhead of 1.43 and 0.17, respectively. This also indicates a 7% power and 26% area savings as compared to the DICE design. As mentioned earlier, these savings are a direct result of reusing existing on-chip resources. Another distinct advantage of the EC design over DICE is the economy mode, which can further lower the overheads when noncritical applications are running. Note that an SER improvement of more than 20-fold is theoretically possible for the EC design but is beyond the resolution of the simulator we used.
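The static control-signal settings that select the normal and economy modes of the EC design (Section III) can be summarized in a short sketch. Several details are assumptions rather than facts from the text: the gate polarities (LA clocked by CAPTURE AND NOT CLK, LB clocked by SCB OR CLK) are inferred from the latch transparency described in Section II, the values of SCA and UPDATE in economy mode are assumed to be low, and the names `MODES`, `scan_latch_clocks`, and `c_element_enabled` are hypothetical. Test mode is omitted because its clocks are pulsed as in Fig. 4 rather than held static.

```python
# Static control-signal values per mode. The economy-mode values of
# SCA and UPDATE are assumptions; the text only fixes CAPTURE, SCB, TEST.
MODES = {
    "normal":  {"SCA": 0, "SCB": 0, "UPDATE": 0, "CAPTURE": 1, "TEST": 0},
    "economy": {"SCA": 0, "SCB": 1, "UPDATE": 0, "CAPTURE": 0, "TEST": 1},
}


def scan_latch_clocks(mode, clk):
    """Return (C2 of LA, C1 of LB) for a given CLK level.

    Assumed gating: the AND gate passes the inverted CLK to LA while
    CAPTURE is high, and the OR gate passes CLK to LB while SCB is low.
    """
    signals = MODES[mode]
    la_c2 = signals["CAPTURE"] & (1 - clk)
    lb_c1 = signals["SCB"] | clk
    return la_c2, lb_c1


def c_element_enabled(mode):
    # TEST high makes O2 a "don't care", so Q follows the system portion.
    return MODES[mode]["TEST"] == 0


for mode in MODES:
    for clk in (0, 1):
        print(mode, clk, scan_latch_clocks(mode, clk), c_element_enabled(mode))
```

In this model the scan latches toggle with CLK in normal mode, whereas in economy mode LA stays opaque and LB stays transparent regardless of CLK, which is consistent with the scan portion consuming no dynamic clock or data power.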
V. CHIP-LEVEL RESULTS In this section, we explore the impact of BISER on a microprocessor by conducting a case study. We focus on the impact that BISER has on SER and power consumption, identifying the tradeoff between these two metrics. The chip-level area penalty caused by BISER is also estimated. To conduct this investigation, we adopt the methodology used in [9], which is summarized below. We use a highly detailed register transfer level model of a deeply pipelined, out-of-order microprocessor similar in complexity to the Alpha 21264. The fault model is a single bit flip of a state element, either a flip-flop or a RAM cell. In order to understand how various logic blocks in the pipeline contribute to the failure rate of the microarchitecture, each flip-flop or RAM cell in the processor is categorized based on the general function provided by that bit of state. The cumulative error coverage as a function of protected states can then be obtained by means of fault injection experiments. Two fault injection campaigns are then performed to characterize the processor under two scenarios: one where all memories are protected and soft errors affect flip-flops only and the other where soft errors affect both flip-flops and in-pipeline memories. A. Scenario I: Soft Errors Affect Flip-Flops Only To emulate this scenario, faults are only injected into flip-flop states within the processor pipeline because memories are assumed to be protected by ECC, and hence, not affected by soft errors. The error coverage as a function of cumulative flip-flop state coverage is shown in Fig. 8. The chip-level power penalty of selectively applying BISER techniques to flip-flops is also estimated based on the cell-level power, total chip power, and the percentage of protected flip-flops. The power and area penalties are listed together with chip-level SER improvement in Table III. The chip-level SER can be improved by ten times with an increased power of 10.3%. B. 
Scenario II: Soft Errors Affect Both Flip-Flops and Memories
To emulate this scenario, faults are injected into both flip-flop and memory states within the processor pipeline. The error coverage plot is shown in Fig. 9. The raw SER of flip-flops differs from that of RAM cells. Since fault injection was conducted into both flip-flop and RAM states, the error coverage curve depends on the relative SER of flip-flops to RAMs. We consider various scenarios based on the comparative SER study in [10]. The first curve (circle) corresponds to the case where the SER of the flip-flop is the same as that of the RAM cell; the second curve corresponds to the case where the SER of the flip-flop is one-tenth of that of the RAM cell; and so on. All three curves exhibit similar behavior. They start to rise abruptly in the beginning because most of the error coverage is achieved by protecting the RAMs in certain logic blocks. The curve corresponding to higher flip-flop SER is lower because flip-flops contribute more errors in that case. A prediction has been made [10] that flip-flop SER will become larger relative to SRAM SER as process technology advances, meaning it will be more important to protect flip-flops. The chip-level power penalty results are shown in Table IV. A ten-fold improvement in chip-level SER can be achieved by paying a 7.0%–10.5% power penalty, depending on the flip-flop SER relative to SRAM SER. The higher the flip-flop SER, the larger the power penalty needed to achieve the same chip-level SER improvement.
Fig. 8. Error coverage versus state coverage when all memories are protected and soft errors affect flip-flops only.
TABLE III CHIP-LEVEL POWER AND AREA PENALTIES OF BISER AS A FUNCTION OF CHIP-LEVEL SER WHEN ALL MEMORIES ARE PROTECTED AND SOFT ERRORS AFFECT FLIP-FLOPS ONLY
Fig. 9. Error coverage versus state coverage when errors affect both flip-flops and in-pipeline memories.
TABLE IV CHIP-LEVEL POWER, AREA, AND PERFORMANCE PENALTIES OF BISER AS A FUNCTION OF CHIP-LEVEL SER WHEN SOFT ERRORS AFFECT BOTH FLIP-FLOPS AND IN-PIPELINE MEMORIES
VI. OTHER DESIGN VARIATIONS
The same principles of BISER have been employed in various sequential element designs. Examples include error-correcting scanout and mux-scan flip-flops [3], and sequential element designs that correct combinational logic soft errors [11]. We present two additional design variations in this section.
A. Low-Power EC Design
The EC design described in Section V inherits the main features of scan clocks in a scan flip-flop, i.e., all scan clocks (SCA, SCB, CAPTURE, and UPDATE) are globally routed and are not timing critical. A low-power error-correcting design named EC-LP is also possible as shown in Fig. 10. The CAPTURE signal in this design is integrated into the clock generation circuit, instead of being globally routed. This configuration reduces clock loading inside the cell and eliminates one pin, which results in lower power consumption. The main difference between EC-LP and EC designs stems from the clock waveforms in test mode, as illustrated in Fig. 11. Note that the EC-LP design does not have an economy mode. B. Error-Correcting Scan Pulse Latch Latches with pulse clocking, also known as pulse latches, have been used in several microprocessors. For example, on the
Fig. 10. Low-power error-correcting design.
Fig. 11. Clock waveforms for a low-power error-correcting design in test mode.
Itanium 2 processor, 95% of all static latches are pulse latches [12]. A pulse latch combines the behavioral merits of both an edge-triggered flip-flop and a level-sensitive latch. On the one hand, it provides a relatively wider transparency window than that of a flip-flop, which allows cycle stealing and skew tolerance. On the other hand, it maintains edge-triggering behavior and avoids the relatively long hold times of a level-sensitive latch. Moreover, a pulse latch has speed, power, and area advantages over its flip-flop counterpart: a pulse latch is faster due to the reduced number of logic gate levels between its input and output, consumes less power due to roughly halved clock loading and reduced data switching activities, and occupies smaller silicon area due to smaller transistor counts. Scan pulse latches have been implemented in a microprocessor for production testing and postsilicon debug purposes [8], [12], [13]. Fig. 12 shows a scan pulse latch with system and scan portions. During test operation, the system and scan portions form a master–slave flip-flop to shift the test sequence in and out. During normal operation, SHIFT is assigned a low value so that the system portion is isolated from the scan-in data at SI. Note that data is written from both directions into the latch cell during normal operation. This design eliminates the need for an interrupted feedback inverter, and hence, saves clock power. The design illustrated in Fig. 13 is an error-correcting scan pulse latch. During test mode operation, SCKA and SCKB signals are pulsed high alternately to shift in the test sequence while SHIFT is high. Once the test sequence reaches the
Fig. 12. Scan pulse latch design.
desired location, the system’s response is captured by one or more PCK pulses while SHIFT is low. To shift out the system response, SCKA and SCKB are again alternately pulsed high while SHIFT is kept high and PCK is kept low. During normal mode operation, SCKA, SCKB, and SHIFT signals are kept low while PCK is used to latch system data. An SEU in either the PL1 or PL2 block will be corrected by the C-element as explained earlier.
VII. RELATED WORK
Prior work for soft error protection in memory and sequential elements can be broadly classified into the following categories: process-level, device-level, circuit-level, and system-level techniques. We briefly review and compare some representative techniques in this section. At the process level, silicon-on-sapphire (SOS) and silicon-on-insulator (SOI) technologies have been developed as soft error mitigation techniques for space and military applications. These technologies are immune to radiation-induced
Fig. 13. (a) Error-correcting scan pulse latch. (b) Circuit schematics of the PL1 or PL2 block in (a). (c) Circuit schematics of the C block in (a).
latch-up as the parasitic p-n-p-n structure does not exist anymore due to full electrical isolation of individual transistors. Moreover, the thin silicon film reduces the charge collection depth, which in turn, lowers the sensitivity to soft errors. Prior work has demonstrated the SER improvement of SRAM devices fabricated on SOI over those on bulk: 2× improvement for the 180-nm node and 5× for the 90-nm node [14]. However, the robustness of an SOI device with a floating body may be compromised by an ionizing particle that triggers the inherent parasitic bipolar transistor, and hence, results in charge amplification. Employing external body contacts could cure this problem but significantly increases the area penalty. At the device level, rad-hard-by-design (RHBD) techniques have been used to lower a device’s sensitivity to radiation. For example, guardbanding around nMOS and/or pMOS transistors greatly reduces the susceptibility of CMOS circuits to radiation-induced latch-up [15]. However, applying such techniques to an existing standard cell library could be time consuming since the layouts of many devices need to be regenerated. At the circuit level, there are two main approaches to reduce the effects of soft errors: 1) increasing the critical charge of a circuit node and 2) adding transistors to enable redundant storage of information. Capacitive hardening [16], resistive hardening [17], and the use of high-drive transistors have been used to increase the critical charge. These techniques tend to increase the power consumption and lower the speed of the circuits. Redundant circuit techniques include the low power Whitaker cell [18], Barry/Dooley design [19], [20], and DICE
design [6]. These redundant transistor-based designs usually require at least twice as many transistors as unprotected circuits, which typically indicates very high area and power penalties. The presented BISER technique overcomes this limitation by reusing existing on-chip DFT resources. At the system level, hardware and time redundancy techniques have been proposed to combat soft errors. Classical hardware redundancy techniques include chip-level duplication (as used in HP-Tandem machines [21]), block-level duplication used in IBM Z-Series machines [21], triple modular redundancy [22], parity prediction [23]–[25], application-specific error detection techniques [26]–[28], DIVA [29], and several others. A major benefit of these techniques is that they do not assume any particular error mechanism, and hence, work for most error sources. However, except for the application-specific error detection techniques, the area and power overheads are generally very high. Major time redundancy techniques include error detection with shifted operands (RESO) [30], redundant execution using spare elements (REESE) [31], multithreading for transient error detection [32]–[36], and software implemented hardware fault tolerance (SIHFT) [37], [38]. A key drawback of these techniques is that the performance penalty can be very significant: around 40% for multithreading, and 40%–200% for SIHFT. Furthermore, the power overheads of these techniques are also significant due to redundant execution, although there is a lack of published power overheads. These techniques are mainly applicable to specific designs such as microprocessors.
VIII. CONCLUSIONS The BISER paradigm presented in this paper is the key to the development of power, performance, and area efficient soft error protection techniques. BISER has the following unique advantages over other major soft error protection techniques: 1) minimal area and power overheads because resources already present for test and debug are reused for soft error resilience; 2) minimal routing overhead; 3) no requirement for major architectural changes; 4) applicability to any digital design (e.g., microprocessors, network processors, ASICs); 5) broad spectrum of design choices suitable for adaptive applications with a wide range of power and performance tradeoffs; 6) additional power saving techniques such as economy mode operation. ACKNOWLEDGMENT The authors would like to thank K. Ganesh, J. Maiz, P. Shipley, A. Vo, S. Walstra, and V. Zia, and from Intel Corporation for their discussions and assistance during the course of this research. REFERENCES [1] R. C. Baumann, “Soft errors in advanced semiconductor devices-Part I: The three radiation sources,” IEEE Trans. Device Mater. Reliab., vol. 1, no. 1, pp. 17–22, Mar. 2001. [2] M. Zhang and N. R. Shanbhag, “A soft error rate analysis (SERA) methodology,” in Proc. IEEE Int. Conf. Comput.-Aided Des., 2004, pp. 111–118. [3] S. Mitra, M. Zhang, T. M. Mak, N. Seifert, V. Zia, and K. S. Kim, “Logic soft errors: A major barrier to robust platform design,” in Proc. IEEE Int. Test Conf., 2005, pp. 687–696. [4] R. Kuppuswamy, P. DesRosier, D. Feltham, R. Sheikh, and P. Thadikaran, “Full hold-scan systems in microprocessors: Cost/benefit analysis,” Intel Technol. J., vol. 8, pp. 63–72, Feb. 2004. [5] A. Carbine and D. Feltham, “Pentium pro processor design for test and debug,” in Proc. IEEE Int. Test Conf., 1999, pp. 294–303. [6] T. Calin, M. Nicolaidis, and R. Velazco, “Upset hardened memory design for submicron CMOS technology,” IEEE Trans. Nucl. Sci., vol. 43, no. 6, pp. 2874–2878, Dec. 1996. [7] V. Stojanovic, V. 
G. Oklobdzija, and R. Bajwa, “A unified approach in the analysis of latches and flip-flops for low power systems,” in Proc. Int. Symp. Low Power Electron. Des., 1998, pp. 227–232. [8] N. Seifert, V. Ambrose, P. Shipley, M. Pant, and B. Gill, “Radiation induced clock jitter and race,” in Proc. Int. Phys. Reliab. Symp., 2005, pp. 215–222. [9] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, “Characterizing the effects of transient faults on a high-performance processor pipeline,” in Proc. Int. Conf. Dependable Syst. Netw., 2004, pp. 61–70. [10] R. C. Baumann, “The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction,” in Dig. IEEE Int. Electron Devices Meeting, 2002, pp. 329–332. [11] S. Mitra, M. Zhang, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, “Combinational logic soft error correction,” in Proc. IEEE Int. Test Conf., 2006, Paper No. 29.2. [12] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. J. Sullivan, and T. Grutkowski, “The implementation of the Itanium 2 microprocessor,” IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1448–1460, Nov. 2002. [13] D. D. Josephson, S. Poehlman, V. Govan, and C. Mumford, “Test methodology for the McKinley processor,” in Proc. Int. Test Conf., 2001, pp. 578–585.
[14] P. Roche and G. Gasiot, “Impacts of front-end and middle-end process modifications on terrestrial soft error rate,” IEEE Trans. Device Mater. Reliab., vol. 5, no. 3, pp. 382–396, Sep. 2005.
[15] J. V. Osborn, D. C. Mayer, R. C. Lacoe, S. C. Moss, and S. D. LaLumondiere, “Single event latchup characteristics of three commercial CMOS processes,” in Proc. 7th NASA Symp. VLSI Des., 1998, Paper No. 4.3.1.
[16] STMicroelectronics Press Release, “New chip technology from STMicroelectronics eliminates soft error threat to electronic systems,” 2003 [Online]. Available: http://www.st.com/stonline/press/news/year2003/t1394h.htm
[17] L. R. Rockett Jr., “Simulated SEU hardened scaled CMOS SRAM cell design using gated resistors,” IEEE Trans. Nucl. Sci., vol. 39, no. 5, pp. 1532–1541, Oct. 1992.
[18] M. N. Liu and S. Whitaker, “Low power SEU immune CMOS memory circuits,” IEEE Trans. Nucl. Sci., vol. 39, no. 6, pp. 1679–1684, Dec. 1992.
[19] M. J. Barry, “Radiation resistant SRAM memory cell,” U.S. Patent 5 157 625, Oct. 20, 1992.
[20] J. G. Dooley, “SER-immune latch for gate array, standard cell, and other ASIC applications,” U.S. Patent 5 311 070, May 10, 1994.
[21] W. Bartlett and L. Spainhower, “Commercial fault tolerance: A tale of two systems,” IEEE Trans. Dependable Secure Comput., vol. 1, no. 1, pp. 87–96, Jan. 2004.
[22] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems Design and Evaluation, 3rd ed. Natick, MA: A. K. Peters, 1998.
[23] K. Mohanram, E. S. Sogomonyan, M. Gossel, and N. A. Touba, “Synthesis of low-cost parity-based partially self-checking circuits,” in Proc. IEEE On-Line Testing Symp., 2003, pp. 35–40.
[24] N. A. Touba and E. J. McCluskey, “Logic synthesis of multilevel circuits with concurrent error detection,” IEEE Trans. Comput.-Aided Des., vol. 16, no. 7, pp. 783–789, Jul. 1997.
[25] C. Zeng, N. R. Saxena, and E. J. McCluskey, “Finite state machine synthesis with concurrent error detection,” in Proc. Int. Test Conf., 1999, pp. 672–680.
[26] K. H. Huang and J. A. Abraham, “Algorithm based fault tolerance for matrix operations,” IEEE Trans. Comput., vol. C-33, no. 6, pp. 518–528, Jun. 1984.
[27] W. J. Huang, N. R. Saxena, and E. J. McCluskey, “A reliable LZ data compressor on reconfigurable coprocessors,” in Proc. IEEE Field Program. Custom Comput. Mach., 2000, pp. 249–258.
[28] J. Y. Jou and J. A. Abraham, “Fault-tolerant FFT networks,” IEEE Trans. Comput., vol. 37, no. 5, pp. 548–561, May 1988.
[29] T. M. Austin, “DIVA: A reliable substrate for deep submicron microarchitecture design,” in Proc. Int. Symp. Microarch., 1999, pp. 196–207.
[30] J. H. Patel and L. Y. Fung, “Concurrent error detection in ALUs by recomputing with shifted operands,” IEEE Trans. Comput., vol. C-31, no. 7, pp. 589–595, Jul. 1982.
[31] J. B. Nickel and A. K. Somani, “REESE: A method of soft error detection in microprocessors,” in Proc. Int. Conf. Dependable Syst. Netw., 2001, pp. 401–410.
[32] E. Rotenberg, “AR-SMT: A microarchitectural approach to fault tolerance in microprocessors,” in Proc. Int. Symp. Fault-Tolerant Comput., 1999, pp. 84–91.
[33] N. R. Saxena, S. Fernandez-Gomez, W. Huang, S. Mitra, S. Yu, and E. J. McCluskey, “Dependable computing and online testing in adaptive and configurable systems,” IEEE Des. Test. Comput., vol. 17, no. 1, pp. 29–41, Jan. 2000.
[34] J. Ray, J. C. Hoe, and B. Falsafi, “Dual use of superscalar datapath for transient-fault detection and recovery,” in Proc. Int. Symp. Microarch., 2001, pp. 214–224.
[35] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed design and evaluation of redundant multi-threading alternatives,” in Proc. Int. Symp. Comput. Arch., 2002, pp. 99–110.
[36] T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient-fault recovery using simultaneous multithreading,” in Proc. Int. Symp. Comput. Arch., 2002, pp. 87–98.
[37] N. Oh, P. P. Shirvani, and E. McCluskey, “Error detection by duplicated instructions in super-scalar processors,” IEEE Trans. Reliab., vol. 51, no. 1, pp. 63–75, Mar. 2002.
[38] N. Oh, S. Mitra, and E. J. McCluskey, “ED4I: Error detection by diverse data and duplicated instructions,” IEEE Trans. Comput., vol. 51, no. 2, pp. 180–199, Feb. 2002.
Ming Zhang (S’06–M’07) received the B.S. degree in physics from Peking University, Beijing, China, in 1999 and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, in 2001 and 2006, respectively. He is currently a Staff Computer-Aided Design (CAD) Engineer with Intel Corporation, Folsom, CA. From 1999 to 2001, he developed microelectromechanical systems for nanolithography applications at the Microelectronics Laboratory of UIUC. From 2004 to 2005, he interned at Intel Corporation and developed soft error resilient circuits and fault-tolerant architectures. His Ph.D. research work included a soft error rate analysis methodology and various classes of soft-error tolerant circuit design techniques. His current research interests include error-resilient and low-power circuits, variation and degradation-tolerant circuits and architectures, and circuit/architecture design for nanotechnology. He has published more than twenty technical papers and holds seven issued or pending U.S. patents. He serves on the program committees of several IEEE conferences and symposia. Dr. Zhang was a recipient of the M. E. Van Valkenburg Research Award for demonstrated excellence in circuit and system research and the University Award for excellence in teaching an undergraduate-level circuit class.
Subhasish Mitra (SM’06) is an Assistant Professor in the Departments of Electrical Engineering and Computer Science, Stanford University, Stanford, CA. His research interests include robust system design, VLSI design and test, computer-aided design (CAD), fault-tolerant computing, and computer architecture. Prior to joining Stanford, he was a Principal Engineer at Intel Corporation, Hillsboro, OR, where he was responsible for developing enabling technologies for robust system design – Design for Excellence (Reliability, Testability, and Debug) – in advanced technologies. He has published more than 90 technical papers in leading conferences and journals, and invented design and test techniques that have been widely adopted in the industry. His X-Compact technique for test compression is being used by more than 40 Intel products, and is supported by major CAD tools. Dr. Mitra was a recipient of the IEEE Circuits and Systems Society Donald O. Pederson Award for the best paper published in the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, a Best Paper Award at the Intel Design and Test Technology Conference for his work on built-in soft error resilience, a Best Paper Award nomination at the Design Automation Conference, a Divisional Recognition Award from the Intel Mobility Group for a breakthrough soft error protection technology, the Terman Fellowship from the Stanford School of Engineering, the Sundaram Seshu Scholar Lectureship at the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign, and the Intel Achievement Award, Intel’s highest honor, for the development and deployment of a breakthrough test compression technology that achieved an order of magnitude reduction in scan test cost. He has held consulting positions at several companies, and serves on the organizing and program committees of several IEEE and ACM sponsored conferences, symposia, and workshops.
T. M. Mak (SM’01) received the B.S.E.E. degree from Hong Kong Polytechnic University, Hong Kong, in 1979. He is a Research Scientist with the Design Technology Solution Group, Intel Corporation, Santa Clara, CA, carrying out test research. He is currently serving a second term assignment to mentor MARCO/Focus Center Research Program (FCRP) research. He has been with Intel for over 22 years and has worked on a variety of areas including test development, product engineering, design automation, and design for test. His current research interests include defect-based testing, fault effects as a result of nanometer technology, circuit level and physical design test issues, I/O interface and analog testing, and fault tolerant and online testing. He currently holds eight patents with seven more pending. He has served on the program committees of various conferences and workshops. Dr. Mak was a recipient of the SRC Outstanding Industrial Mentor Award in both 1997 and 2004, the Best Paper Award at the International Test Conference in 2004, and a Best Panel Award from VTS in 2004.
Norbert Seifert (SM’04) received the M.S. degree in physics from Vanderbilt University, Nashville, TN, in 1994, and the Diplom Ingenieur and Ph.D. degrees in physics from the Technical University of Vienna, Vienna, Austria, in 1990 and 1993, respectively. He is currently a Staff Reliability and Design Engineer with Intel Corporation, Hillsboro, OR, where he is responsible for developing a coherent chip-level SER methodology. He is also studying the impact of NBTI on circuit and system performance. He worked in TCAD and circuit design in the Alpha Development Group (DEC/Compaq/HP) from 1997 to 2003. Prior to joining DEC, he studied charge transfer processes in atomic collisions as a postdoctoral associate at North Carolina State University, Raleigh, and computational fluid dynamics of high-power laser material processing as a postdoctoral associate at the Technical University of Vienna. He has worked extensively on the interaction of radiation with matter in general, and on the response of digital circuits in particular. He has published more than 30 technical papers on this topic and has presented soft error tutorials at several reliability conferences. He actively serves on the organizing and program committees of several IEEE sponsored conferences. He is a frequent reviewer for leading reliability journals and is a co-editor of the September 2005 IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY Special Issue on Soft Errors and Data Integrity in Terrestrial Computer Systems.
Nicholas J. Wang received the B.S. and M.S. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, where he is currently pursuing the Ph.D. degree in electrical and computer engineering. His research interests include fault-tolerant computer architectures.
Quan Shi received the B.S. and M.S. degrees in radio electronics from Beijing Normal University, Beijing, China, in 1986 and 1989, respectively, and the Ph.D. degree in electrical engineering from the University of New Mexico, Albuquerque, in 2000. He is currently a Design Engineer with Intel, Hillsboro, OR. Before joining Intel, he was with the NASA Institute of Advanced MicroElectronics in Albuquerque, NM. His research interests include circuit hardening techniques, circuit reliability, circuit modeling, and asynchronous circuits.
Kee Sup Kim (M’92) received the Ph.D. degree from the University of Wisconsin-Madison, Madison. Currently, he is Director of DFX at the Mobility Group at Intel Corporation, Hillsboro, OR, where he is in charge of developing solutions for design for testability, manufacturability, debug, and reliability for communications products. Previously, he worked on various aspects of testing and DFT for Intel CPU products. He served as an organizing committee member for the International Conference on Computer Design. Dr. Kim was a co-recipient of the IEEE Donald O. Pederson Award for his work in test compression.
Naresh R. Shanbhag (F’06) received the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, in 1993. Since August 1995, he has been with the Department of Electrical and Computer Engineering, and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, where he is presently a Professor. From 1993 to 1995, he worked at AT&T Bell Laboratories, Murray Hill, NJ, where he was the lead chip architect for AT&T’s 51.84-Mb/s transceiver chips over twisted-pair wiring for Asynchronous Transfer Mode (ATM)-LAN and very high-speed digital subscriber line (VDSL) chip-sets. His research interests include the design of integrated circuits and systems for broadband communications including low-power/high-performance VLSI architectures for error-control coding, equalization, as well as digital integrated circuit design. He has published more than 90 journal articles, book chapters, and conference publications in this area and holds three U.S. patents. He is also a co-author of the research monograph Pipelined Adaptive Digital Filters (Kluwer, 1994). From 1997 to 1999, he was a Distinguished Lecturer for the IEEE Circuits and Systems Society. From 1997 to 1999 and from 1999 to 2002, he served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART II: EXPRESS BRIEFS and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, respectively. He has served on the technical program committees of various conferences. He is also a co-founder and the Chief Technology Officer of Intersymbol Communications, Inc. (a wholly owned subsidiary of Kodeos Communications, Inc., since March 2006), Champaign, IL, which was founded in 2000, and where he provides strategic directions in the development of mixed-signal receivers for next generation optical fiber links.
Dr. Shanbhag was a recipient of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS Best Paper Award in 2001, the IEEE Leon K. Kirchmayer Best Paper Award in 1999, the Xerox Faculty Award in 1999, the National Science Foundation CAREER Award in 1996, and the Darlington Best Paper Award from the IEEE Circuits and Systems Society in 1994.
Sanjay J. Patel (M’99) received the B.S., M.S., and Ph.D. degrees in computer science and engineering from the University of Michigan, Ann Arbor, in 1990, 1992, and 1999, respectively. Currently, he is an Associate Professor in the Electrical and Computer Engineering Department and a Willett Faculty Scholar, University of Illinois at Urbana-Champaign, Urbana. He is also serving as Chief Architect at AGEIA Technologies, St. Louis, MO. His research interests include processor microarchitecture, computer architecture, and high performance and reliable computer systems. He has worked with architecture, hardware verification, logic design, and performance modeling at Digital Equipment Corporation, Intel Corporation, and HAL Computer Systems, as well as provided consultation for Transmeta, Jet Propulsion Laboratory, HAL, Intel, and AGEIA Technologies. Dr. Patel is a member of the IEEE Computer Society.