Resilient Circuits – Enabling Energy-Efficient Performance and Reliability
James Tschanz, Keith Bowman, Chris Wilkerson, Shih-Lien Lu, Tanay Karnik
Intel Labs, Intel Corporation
JF2-04, 2111 N.E. 25th Avenue, Hillsboro, OR 97124
(503) 712-4360
[email protected]
ABSTRACT
Voltage and frequency margins necessary to ensure correct processor operation under dynamic voltage, temperature, and aging variations result in performance and power overheads. Resilient circuit techniques, including embedded error-detection sequentials and tunable replica circuits, allow these margins to be reduced or eliminated, resulting in reliable, energy-efficient operation.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and design styles
General Terms
Performance, Design, Reliability
Keywords
Resilient circuits, dynamic variations, adaptation, parameter variations, timing errors, delay faults, error detection, error recovery, variation-tolerant circuits
1. INTRODUCTION
Today’s computing applications are more demanding than ever, requiring ever-increasing levels of performance to support complex search routines, image and speech processing, gaming and enhanced user interfaces, and communications. At the same time, this processing power is often needed in a small, mobile form factor, making an energy-efficient design critical. Even high-end server processors, burdened by energy costs and heat-removal issues, are constrained by both power and performance. While process technology scaling makes gains in performance and energy-efficiency possible, it also comes with the side effects of increased variations (both static and dynamic) and reliability issues such as transistor degradation and early-life failures. Designers have the difficult task of balancing these three issues – performance, power, and reliability – along with cost to arrive at an optimal design.
Many of the variations that impact a design are dynamic in nature and depend on the environment in which the processor is used. Examples of these types of variations include dynamic voltage droop, temperature change across the die, and aging or wearout of transistors or interconnect. Because these variations can be difficult to measure or predict, they are often margined in either voltage or frequency, resulting in a worst-case design methodology that is guaranteed to work under any conditions or in any environment. While this worst-case design delivers the required reliability, it does so at significant cost in either performance or power. We propose circuit-level resiliency as a way of reducing or eliminating these margins, allowing a design to operate at maximum performance or minimum energy while guaranteeing correct, reliable operation.
2. CIRCUIT-LEVEL RESILIENCY
Reducing margins applied for dynamic variations requires a method of sensing the dynamic variation, or the effect of the variation, and then taking action to prevent or correct any circuit failure. There are many different circuit-based approaches that can accomplish the sensing and response: here we divide them into three main categories. All three are shown in the example processor implementation of Figure 1, which includes the variation detection mechanisms, optional error recovery, and dynamic voltage and frequency control for adapting to slow-changing variations.
2.1 Sensor-based error avoidance
The traditional approach for dynamic variation tolerance is based on sensing the variation and adapting the circuit to prevent failure [1]. This typically requires the use of analog sensors for monitoring voltage droop, temperature, or transistor aging, coupled with a method for dynamically changing processor frequency or voltage. These approaches work well for slow-changing variations that do not vary significantly across the die. However, responding to fast-changing variations such as voltage droop requires very fast sensing and response, which is not practical for large designs where even the communication delay between sensor, controller, and adaptive response may take many cycles. An alternative sensor-based approach uses delay sensors on the critical path to detect slow changes in path delay due to temperature change or transistor degradation [2-3]. This technique has the advantage of detecting within-die variations, but still suffers from the difficulty of responding to fast-changing variations, and requires a small additional margin in frequency to allow delay change detection.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD’09, November 2–5, 2009, San Jose, California, USA. Copyright 2009 ACM 978-1-60558-800-1/09/11...$10.00.
[Figure 1 image not reproduced. Recovered block labels: Thermal sensor, Voltage sensor, DVF Control, Clock generator, Digital Core, EDS, TRC, Instruction issue, Pipeline stage (three instances), Instruction retire.]
Figure 1. Example processor pipeline with resiliency: analog sensors with dynamic frequency control, error-detection sequentials (EDS), and tunable replica circuits (TRC).
2.2 Embedded delay-fault detection
As described above, it is not always possible to respond to fast-changing variations in time to prevent circuit failure. Reducing the margins for these variations, then, requires circuit-level error detection and correction. Embedding these error detection circuits within the critical paths themselves allows detection of errors due to fast-changing variations, as well as coverage for within-die variations. Error-detection sequentials (EDS) [4-6] have been used on critical paths to detect path delay change caused by variations. These EDS circuits rely on the principle of time-redundancy: the data is sampled by the clock edge as in a conventional flip-flop, and again at a later point in time, and the two values are compared. If the data value has changed within this window (often equal to the high phase of the clock for simplicity), a delay fault is detected and an error signal is generated. Unlike sensor-based error avoidance, the result of this variation is an actual incorrect value within the datapath circuit which must be corrected. Correction techniques vary based on the type of design which is implemented; for a processor pipeline, the error can be treated in a similar way as a branch misprediction. The pipeline is flushed so that the erroneous value is not committed to any permanent state, and the execution is restarted with the instruction that caused the error. Various replay algorithms are possible: transient errors can be corrected by simply replaying the errant instruction at the original clock frequency, while persistent errors can be corrected by reducing clock frequency during replay, guaranteeing that the instruction will execute correctly.
Embedded error-detection sequentials provide the maximum margin reduction because they detect not only fast-changing but also within-die variations which cause differences in critical path delays. However, these benefits come at the cost of increased power consumption (especially clock power consumption which is increased as a result of the double-sampling operation) as well as the necessity of adding additional min-delay buffers to ensure that critical paths do not transition within the error-detection window. Tunable replica circuits combined with error recovery [7] provide a lower-overhead alternative which results in a simpler, low-power implementation while still allowing response to fast-changing variations.
2.3 Tunable replica circuits
A tunable replica circuit (TRC) consists of a digital delay sensor which can be tuned at test time to match the delay of a critical path in the circuit. This TRC can consist of several types of logic gates and interconnect to allow its delay sensitivity with voltage to match that of the critical path, and multiple TRCs may be included on the die. Unlike “canary circuits” or sensor-based error avoidance, the TRC does not need to fail with a prescribed margin before the critical path fails. However, if the critical path fails, the TRC must be guaranteed to fail. When the TRC reports an error, it is assumed that the critical path contains erroneous data as well, and error recovery is initiated as described above. The TRCs are completely separate from the processor pipeline circuits and thus do not impact performance or require additional min-delay buffer insertion, and their small size and low clocking power make them attractive as compared to an EDS implementation. The drawbacks as compared to EDS are the lack of ability to respond to within-die variations (thus requiring a small within-die margin), and the necessity of tuning the TRC at test time.
2.4 Technique comparison
A qualitative comparison of the three techniques described above is given in Table 1. Error-detection sequentials give the largest benefit, but also incur the largest power overhead due to additional clocking power and min-delay buffers. Both EDS and TRC require error-recovery features, resulting in higher complexity, but both allow significant margin reduction as compared to sensor-based error avoidance.
Table 1. Comparison of variation response techniques.
                                          Sensors   TRC + recovery   EDS + recovery
Response to slow variations               ✓         ✓                ✓
Response to fast variations               −         ✓                ✓
Response to WID (within-die) variations   −         −                ✓
Calibration required at test              ✓         ✓                −
Power overhead                            low       low              high
Complexity                                low       high             high
Margin reduction                          good      better           best
3. MEASUREMENT RESULTS
Both EDS and TRC techniques have been implemented on testchip designs in 65nm and 45nm technology generations. These designs allow measurement of margins required for dynamic variations, and performance improvement which is possible using resiliency techniques.
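As an illustration of the detect-and-replay behavior of Section 2.2 that these testchips exercise, the following Python sketch models a single error-detection sequential and a simple replay policy. It is a behavioral toy model under stated assumptions, not the circuit or recovery logic of [6]: the names `Clock`, `eds_check`, and `execute`, the picosecond units, and the factor-of-two frequency reduction on the final replay are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Clock:
    period_ps: float  # clock period (hypothetical unit: picoseconds)
    window_ps: float  # detection window after the capturing edge,
                      # e.g. the high phase of the clock


def eds_check(arrival_ps: float, clk: Clock) -> str:
    """Classify one data arrival, measured from the launching clock edge.

    "ok":         data settled before the capturing edge.
    "error":      data changed inside the detection window, so the second
                  sample differs from the first and an error is flagged.
    "undetected": data changed after the window closed (not covered).
    """
    if arrival_ps <= clk.period_ps:
        return "ok"
    if arrival_ps <= clk.period_ps + clk.window_ps:
        return "error"
    return "undetected"


def execute(attempt_delays_ps: Sequence[float], clk: Clock,
            slow_factor: float = 2.0) -> Tuple[int, float]:
    """Run one instruction with replay; return (attempts, final period).

    Assumed policy: the first replay keeps the original clock (enough for a
    transient error); the second replay stretches the period by slow_factor
    (an assumed value) to cover a persistent error.
    """
    slow_clk = replace(clk, period_ps=clk.period_ps * slow_factor)
    schedule = [clk, clk, slow_clk]
    for attempt, (delay, c) in enumerate(zip(attempt_delays_ps, schedule), start=1):
        status = eds_check(delay, c)
        if status == "ok":
            return attempt, c.period_ps
        if status == "undetected":
            raise RuntimeError("delay exceeded the detection window")
        # status == "error": flush and replay on the next loop iteration
    raise RuntimeError("replay did not converge")
```

For example, with `Clock(period_ps=1000.0, window_ps=500.0)`, a path that arrives at 1100 ps during a droop and at 950 ps afterwards completes on the first replay at the original period, while a path that stays at 1100 ps completes only on the reduced-frequency replay. The `"undetected"` case marks the limit of the model: an arrival later than the end of the window is not covered by the second sample.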
Tunable replica circuits are implemented on a 45nm testchip [7] and demonstrate accurate detection of dynamic variations, including voltage droop and aging (Figure 2). Because the TRC shares the same power supply and clock network as the critical path circuits that it is monitoring, TRC aging accurately tracks critical path aging, even with arbitrary power up/down cycles and clock gating. Thus the margins in frequency necessary for aging can be greatly reduced, allowing higher performance or lower power.
[Figure 2 plot not reproduced. Vertical axis: Delay change, 0% to 8%. Horizontal axis: Time (a.u.), 0 to 10.]
Figure 2. Delay degradation of TRC for multiple cycles of accelerated stress and recovery.
Measurements of error-detection sequentials on a 65nm testchip design [6], including error recovery, demonstrate a throughput gain of 21% which is possible by eliminating margins for voltage droop, temperature, and critical path activation. Alternatively, this can be traded off to provide 37% power reduction at equal throughput. Thus, resiliency allows a more energy-efficient design at both the low-power and high-performance ends of the spectrum.
[Figure 3 plot not reproduced. Vertical axis: Power (mW), 0 to 200. Horizontal axis: Throughput (BIPS), 1.0 to 3.0. Curves: Conventional Design, Resilient Design.]
Figure 3. Power vs. throughput for conventional design and resilient design using EDS.
4. CONCLUSION
In an era where performance and power are heavily constrained, a worst-case design methodology no longer suffices. Resilient circuit techniques coupled with error recovery allow designs to operate at the most energy-efficient point and adapt to operating conditions and to dynamic variations. Embedded error-detection sequentials offer the largest benefit, due to the capability of responding to fast variations and critical path activation differences. However, tunable replica circuits can capture most dynamic variations at lower area and power overhead than EDS. Either technique results in significant margin reduction over a conventional design, providing higher performance and energy efficiency.
5. REFERENCES
[1] J. Tschanz et al., “Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 292-293.
[2] M. Agarwal et al., “Circuit Failure Prediction and its Application to Transistor Aging,” in IEEE VLSI Test Symposium, 2007, pp. 277-286.
[3] M. Zhang et al., “Design for Resilience to Soft Errors and Variations,” in IEEE On-Line Testing Symposium, 2007, pp. 23-28.
[4] P. Franco and E. J. McCluskey, “Delay Testing of Digital Circuits by Output Waveform Analysis,” in Proc. IEEE Intl. Test Conf., Oct. 1991, pp. 798-807.
[5] S. Das et al., “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance,” IEEE J. Solid-State Circuits, pp. 32-48, Jan. 2009.
[6] K. A. Bowman et al., “Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance,” IEEE J. Solid-State Circuits, pp. 49-63, Jan. 2009.
[7] J. Tschanz et al., “Tunable Replica Circuits and Adaptive Voltage-Frequency Techniques for Dynamic Voltage, Temperature, and Aging Variation Tolerance,” in IEEE Symp. VLSI Circuits, June 2009.
2009 IEEE/ACM International Conference on Computer-Aided Design Digest of Technical Papers