micro IEEE
COMPREHENSIVE CIRCUIT FAILURE PREDICTION FOR LOGIC AND SRAM USING VIRTUAL AGING
A COMPREHENSIVE FAILURE-PREDICTION TECHNIQUE FOR MANY-CORE PROCESSORS ADDRESSES WEAR OUT IN HARSH ENVIRONMENTS FOR LOGIC AND STATIC RAM USING VIRTUAL AGING. THE DESIGN HAS A SIMPLE IMPLEMENTATION AND DELIVERS LOW COMPLEXITY, LOW OVERHEAD, AND HIGH ACCURACY. THE SYSTEM ENSURES NO CORRUPTIONS OR MISSED ERRORS FROM WEAR-OUT FAILURES AND PREDICTS FAILURES WITHIN 0.4 DAYS FOR LOGIC AND WITHIN MILLISECONDS FOR SRAM.
Amir Yazdanbakhsh, Georgia Institute of Technology
Raghuraman Balasubramanian, Tony Nowatzki, and Karthikeyan Sankaralingam, University of Wisconsin–Madison
In the future, especially in harsh environments (such as aerospace, underwater, and military), microprocessors are increasingly likely to fail in the field because of manufacturing test fault escapes and various aging and wear-out phenomena.1,2 Circuit failure prediction techniques employ wear-out device physics principles and empirical measurements3 to predict failures in the field before they occur for logic and static RAM (SRAM). Models of the dominant mechanisms, namely negative bias temperature instability (NBTI), hot carrier injection (HCI), and time-dependent dielectric breakdown (TDDB), show that logic wear out increases the delay of gates because a degraded $V_{th}$ reduces the overdrive $(V_{DD} - V_{th})$. However, wear out of SRAM transistors affects the SRAM arrays’ performance parameters (such as read stability, write stability, and read delay) differently. Previous work has shown that loss of read stability is the dominant wear-out failure in SRAM arrays.3–5 (The effect of aging on transistors’ mobility is not considered.)
Extensive literature has addressed wear-out prediction inspired by these observations (in the interest of space, we provide one representative citation6). However, as far as we know, no prior work simultaneously addresses both logic and SRAM. Furthermore, prior techniques individually suffer from problems of complexity, overhead, accuracy, and generality, and they become particularly ineffective in harsh environments, in which wear-out challenges are exacerbated. These prior techniques are discussed further in the “Related Work in Circuit Failure Prediction” sidebar. Our goal is to develop a unifying yet simple mechanism that covers both logic and SRAM and delivers low complexity, low overhead, and high accuracy. To this end, we developed a comprehensive circuit-prediction technique called the Aged Full-Chip Predictor for both logic and SRAM in many-core systems. The Aged Full-Chip Predictor allows safe execution up to 0.4 days before logic failures and extends the typical lifetime by 14 months over a system with ECC for SRAM.
Published by the IEEE Computer Society
0272-1732/15/$31.00 © 2015 IEEE
Related Work in Circuit Failure Prediction
Figure A shows the various alternatives for handling wear out in logic and SRAM. Dimitris Gizopoulos and colleagues provide a good overview of detection techniques for logic.1 Logic wear-out prediction is based on canaries,2 in-situ flip-flop techniques,3 delay measurement,4 and built-in self-test (BIST).5 SRAM-based detection and prediction techniques are based on sensors or modifications to the SRAM cell,6,7 complex error-correcting codes (ECCs), and hybrid ECC and cell sizing.8 None of these can simultaneously deliver on low complexity, low overheads, and high accuracy because these techniques operate within only a single computing layer. When done at the circuit level, these techniques suffer from complexity and always remain active. On the other hand, an architecture-level-only solution suffers from low accuracy because architecture fault models do not capture most physical effects. (In both logic- and SRAM-based directions, there is a body of work on mitigation and repair, which is complementary and somewhat orthogonal to detection and prediction.)
[Figure A panels: prediction techniques targeting logic (age-detection flip-flops, BIST-based prediction, continuous monitoring of gate delay, and Aged-SDMR, which combines virtual aging with sampled redundancy over all logic cells); prediction techniques targeting memories (SRAM) (ECC alone versus Aged-AsymChk). Horizontal axis: time (years) over the lifetime of a processor; line thickness indicates operational overheads.]
Figure A. The operation of failure-prediction techniques that target logic and static RAM (SRAM). Compared to other logic-detection techniques, Aged-SDMR has low overhead and coverage on all logic cells. Compared to error-correcting code (ECC) alone, Aged-AsymChk can predict the second failure before it occurs.

References
1. D. Gizopoulos et al., “Architectures for Online Error Detection and Recovery in Multicore Processors,” Proc. ACM/IEEE Design, Automation, and Test in Europe Conf., 2011, pp. 1–6.
2. J. Tschanz et al., “Tunable Replica Circuits and Adaptive Voltage-Frequency Techniques for Dynamic Voltage, Temperature, and Aging Variation Tolerance,” Proc. Symp. VLSI Circuits, 2009, pp. 112–113.
3. D. Ernst et al., “Razor: A Low-Power Pipeline based on Circuit-Level Timing Speculation,” Proc. 36th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2003, pp. 7–18.
4. J. Blome et al., “Self-Calibrating Online Wearout Detection,” Proc. 40th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2007, pp. 109–122.
5. J.C. Smolens et al., “Detecting Emerging Wearout Faults,” 3rd IEEE Workshop Silicon Errors in Logic-System Effects, 2007; http://jared.smolens.org/documents/first-smolensselse07.pdf.
6. F. Ahmed and L. Milor, “Reliable Cache Design with On-Chip Monitoring of NBTI Degradation in SRAM Cells using BIST,” Proc. 28th VLSI Test Symp., 2010, pp. 63–68.
7. Z. Qi et al., “SRAM-Based NBTI/PBTI Sensor System Design,” Proc. 47th ACM/IEEE Design Automation Conf., 2010, pp. 849–852.
8. Z. Chishti et al., “Improving Cache Lifetime Reliability at Ultra-Low Voltages,” Proc. 42nd Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2009, pp. 89–99.
Design
The design of the Aged Full-Chip Predictor leverages three primary mechanisms. We discuss the insight for each and outline their design below. Figure 1 provides an overview of the execution of our comprehensive failure-prediction system.
Virtual aging to manifest faults
Our key insight is to virtually wear out the processor and thus manifest a wear-out fault early. We convert the wear-out degradation into a higher-level and easier-to-detect fault; we then expose and detect the fault, which effectively predicts and detects the wear out.
NOVEMBER/DECEMBER 2015
FAILURE PREDICTION
[Figure 1 panels: execution timeline (time in years) divided into L-epochs and S-epochs. Aged-SDMR is active for 1% of the cycles at the start of each L-epoch, with user applications running, sampling DMR active, and virtual aging active. Aged-AsymChk is active at the start of each S-epoch: pause all processes, flush cache, run the BIST check with virtual aging active, resume processes. Memories: virtual aging makes the cells behave as if they are weeks older, causing eventual failures to manifest as stuck-at faults; BIST test vectors expose these faults and BIST checkers detect them, with no modifications to SRAM cells. Logic: virtual aging makes the cells behave as if they are weeks older, causing eventual failures to manifest as delay faults; user applications expose these faults as errors and sampling DMR on a checker core detects them; additional capture logic with a phased clock is inserted to cover fast gates on noncritical paths. In both cases the virtual ager lowers the supply voltage through DVS.]
Figure 1. Two techniques, based on virtual aging, together provide comprehensive failure prediction. Aged-SDMR detects manifested logic errors using sampling and dual-modular redundancy, whereas Aged-AsymChk detects manifested SRAM errors using asymmetric checking.
All device-level wear-out faults eventually must manifest at a higher abstraction level; thus, any detection technique can be repurposed as a prediction technique. We carry out virtual aging by reducing supply voltage using dynamic voltage scaling. We can tune the prediction’s timeliness by changing the amount of voltage reduction. Virtual aging is instantaneously reversible; resetting to nominal voltage restores the processor’s current age.
Sampled redundancy to expose and detect logic failure
We observed that wear out in logic is first exposed as a logic delay fault, and sampled redundancy with execution on a second core can be effective in handling logic transistors. BIST and stuck-at fault models are insufficient for providing full coverage for these delay-driven failures. The key idea of the solution, Aged-SDMR, is to couple cores randomly at randomly chosen periods of time, run one core virtually aged, use the second (redundant) core as a checker core, and couple these using a nonintrusive lightweight mechanism. Because logic faults start as delay faults, a comprehensive redundant core is necessary for full coverage. Shuou Nomura and colleagues introduced the concept of Sampling+DMR,7 which solves the overhead problem that historically has plagued redundancy. Our key advancement over their work is to use virtual aging during DMR execution to ensure that faults always occur first in a DMR window, thus ensuring no missed errors.
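The sampled checking described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each retired instruction is summarized as a (program counter, result) pair and that a CRC-32 serves as the signature; the function names, the record format, and the choice of CRC-32 are hypothetical and are not taken from the article.

```python
import random
import zlib

def signature(retired):
    """Fold a window of retired-instruction records into a 32-bit CRC.

    Each record is a (pc, result) pair; this format is an assumption
    made for the sketch, not the article's actual signature scheme.
    """
    sig = 0
    for pc, result in retired:
        record = pc.to_bytes(8, "little")
        record += (result & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
        sig = zlib.crc32(record, sig)
    return sig

def check_window(aged_trace, checker_trace):
    """Return True when both cores retired the same instruction stream."""
    return signature(aged_trace) == signature(checker_trace)

def sampled_windows(num_windows, sample_rate, rng):
    """Randomly pick which execution windows run in DMR mode."""
    return [w for w in range(num_windows) if rng.random() < sample_rate]

# Usage: the virtually aged core produces one wrong result in its window.
checker_trace = [(0x1000, 7), (0x1004, 42)]
aged_trace = [(0x1000, 7), (0x1004, 43)]
mismatch = not check_window(aged_trace, checker_trace)  # True: flag a fault
dmr_windows = sampled_windows(1000, 0.01, random.Random(2015))
```

Comparing a compact signature rather than full traces keeps the traffic between the two cores small, which is in the spirit of the lightweight coupling the text calls for.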
Asymmetric checkers to expose and detect SRAM failure
Aged-SDMR cannot be used for SRAM because checkpointing the entire SRAM state is infeasible, especially considering today’s megabyte-sized level-2 caches. However, wear out in SRAMs results in read stability problems, and therefore its effect can be captured by a simple stuck-at fault model. The solution, Aged-AsymChk, leverages this insight and uses established asymmetric checker technology such as BIST to check the SRAMs when they are virtually aged. Specifically, we write known vectors to an SRAM, then read out the values; any mismatch between these indicates an impending failure.
Use of existing techniques
The principles of dynamic voltage scaling, sampling, redundancy, and asymmetric checking using BIST are well known. Our work’s implementation and design contribution is a novel use of existing techniques, while avoiding disruptive or intrusive mechanisms and providing comprehensive logic and SRAM wear-out prediction. The implementation requirements are simple or already existent: dynamic voltage scaling capability; separate voltage islands for SRAMs and logic; a reliability manager module added to cores to allow checking of retired instructions; BIST capability in the SRAMs; and a controller (like a cache controller) in the SRAM that allows its contents to be safely evicted prior to being overwritten for BIST.
Implementation
We present the organization of our system and the implementation of virtual aging, fault exposure, and fault detection. Within each, we discuss logic and SRAM. Figure 1 shows the high-level overview and details of each individual approach. We focus on SRAM in this article because our previous work covered the logic.8
Overall organization
Conceptually, we execute the processor in epochs, where at the start of every epoch we have a window where the processor is virtually aged. As Figure 1 shows, we have two types of epochs: logic epochs (L-epochs), in which only the logic is virtually aged, and SRAM-epochs (S-epochs), in which only SRAM is virtually aged. These never overlap and are executed at different rates.
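One way to picture the epoch structure is the following Python sketch. All epoch and window lengths are hypothetical, and giving the S-epoch window priority when two windows would coincide is an assumption made here to keep the windows from overlapping; the article does not specify how the two schedules are interleaved.

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal execution"
    LOGIC_AGED = "Aged-SDMR window (logic virtually aged)"
    SRAM_AGED = "Aged-AsymChk window (SRAM virtually aged)"

def mode_at(cycle, l_epoch, l_window, s_epoch, s_window):
    """Return the operating mode at a given cycle.

    Each epoch starts with a virtually aged window. When an L-epoch
    window and an S-epoch window would coincide, this sketch gives the
    S-epoch priority so that only one kind of aging is ever active.
    """
    if cycle % s_epoch < s_window:
        return Mode.SRAM_AGED
    if cycle % l_epoch < l_window:
        return Mode.LOGIC_AGED
    return Mode.NORMAL

# Usage with made-up lengths: the logic window covers 1% of each L-epoch.
schedule = dict(l_epoch=1000, l_window=10, s_epoch=10000, s_window=50)
first_logic_window = mode_at(1000, **schedule)  # Mode.LOGIC_AGED
```

Because the function returns exactly one mode per cycle, the two kinds of virtual aging cannot be active at the same time in this model.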
Virtual aging
We virtually age a processor by reducing the supply voltage to both logic and SRAM arrays. Although the enabling mechanism is the same, the failure behavior is different. For SRAM, prior to virtual aging, we must ensure any useful SRAM state is written to some other location. For an SRAM that is part of a cache, the cache controller can be enhanced to evict all dirty lines. Otherwise, it can be done completely in software using instructions like WBINVD (write back and invalidate cache) in the AMD64 architecture. SRAMs in speculative structures such as branch predictor tables can simply be overwritten. The precise interrupt that starts an S-epoch ensures that structures such as load queues and the rename table are empty. We can virtually age large memory structures, such as L2 caches with many SRAM blocks, by applying the S-epochs one SRAM array at a time, coordinating with the controller to turn off banks.
Effect on logic. The delay of a gate, $t_d$, is inversely proportional to $(V_{DD} - V_{th})^2$. Wear out causes $V_{th}$, and hence $t_d$, to increase. Reducing $V_{DD}$ has the same effect and can be calibrated to mimic weeks or months of aging.
Effect on SRAM. Consider the basic six-transistor SRAM cell organization. In a newly manufactured cell, the cross-coupled inverters are fairly identical, producing a voltage transfer characteristic as in Figure 2a. The static noise margin (SNM) is the minimum noise or extraneous voltage that can corrupt the stored value. The read failure probability is the likelihood of such a corruption for a given cell. Owing to wear out, the SRAM’s inverters degrade, reducing the static noise margin as shown in Figures 2b and 2c, which consequently increases the read failure probability. Furthermore, SRAM wear out is asymmetric and depends on the stored value in the SRAM cell. For example, when a zero is stored in the SRAM cell, the p-channel MOS (PMOS) transistor in one of the inverters is subjected to stress, whereas the PMOS transistor in the other one goes into recovery mode. With extremely high wear out, cells can become stuck at 0 or 1 permanently (see Figure 2d). Virtual aging’s behavior for SRAM is similar to the logic case. The fundamental source of the SNM change is a decreased $(V_{DD} - V_{th})$ due to increased $V_{th}$, which can be achieved equivalently by decreasing $V_{DD}$ and can be instantaneously reset to the current age by restoring nominal $V_{DD}$. Figure 3 shows an HSpice simulation of virtual aging’s effectiveness.
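Both effects hinge on the overdrive term, the supply voltage minus the threshold voltage. The short Python sketch below works through the logic-delay relation stated above, under which lowering the supply by some amount changes the delay exactly as a threshold shift of the same amount would. The 1.2-V nominal supply matches the article's figures; the 0.3-V fresh threshold and the function name are assumptions chosen only for illustration.

```python
import math

def relative_delay(vdd, vth, vdd_nominal=1.2, vth_fresh=0.3):
    """Gate delay relative to a fresh gate at nominal supply voltage.

    Uses the proportionality t_d ~ 1 / (VDD - Vth)^2 from the text.
    vth_fresh = 0.3 V is an illustrative assumption, not a source value.
    """
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return ((vdd_nominal - vth_fresh) / (vdd - vth)) ** 2

# Real aging: the threshold voltage has drifted up by 50 mV.
aged = relative_delay(vdd=1.20, vth=0.35)

# Virtual aging: a fresh gate with the supply lowered by 50 mV.
virtual = relative_delay(vdd=1.15, vth=0.30)

# Under this model the two cases are indistinguishable.
same_slowdown = math.isclose(aged, virtual)
```

Restoring the nominal supply returns the relative delay of the fresh gate to 1, which mirrors the instant reversibility of virtual aging.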
[Figure 2 panels: (a) age = 0 years, (b) age ≈ 10 years with read SNM > 0, and (c) age ≈ 12 years with read SNM ≈ 0, each plotting V(QB) versus V(Q) from 0 to 1.0 at VDD = 1.2 V; (d) VWL, VQ, and VQB versus time (µsec), showing a bit flip.]
Figure 2. Six-transistor (6T) SRAM cell transfer characteristics and the read failure in the SRAM cell. 6T SRAM transfer characteristics for a (a) new chip, (b) positive read static noise margin (SNM) after wear out, and (c) zero read SNM after wear out. (d) Negative (near-zero) read SNM causes the stored value in the SRAM to flip (initial stored value is zero).

Using MOS reliability analysis (MOSRA) aging models, we ran simulations of the SRAM cell with various amounts of aging. For the technology and the MOSRA parameters that we considered, failure happened at approximately 12 years (626 weeks) for a worst-case stressed cell (that is, one that constantly stores either one or zero for the duration of the aging). The MOSRA parameters are TIT0 = 5e−8; TITFD = 7.5e−10; TITTD = 1.45e−20; TN = 0.5; RelMode = default (both HCI and BTI). At each aging setting, we also ran a simulation with various amounts of voltage reduction. In this case, we first obtained the total amount of stress on the transistors during the whole aging period at the nominal voltage, which manifests as a shift in $V_{th}$. Given the shifted $V_{th}$ values for each transistor, we simulated the SRAM cell with the reduced voltage to observe the aging failure. The dots in Figure 3 indicate the age at which the cell failed for various amounts of voltage reduction. Subtracting this age from 12 years provides the window of advance failure notification. This experiment demonstrates that reducing voltage serves the purpose of virtual aging.
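The subtraction described above is simple enough to state as a few lines of Python. The 626-week end of life comes from the text; the 598-week value is derived here from the Figure 3 annotation (a failure predicted about 28 weeks in advance with the supply reduced by 45 mV) and is used only as an illustrative data point. The function name is hypothetical.

```python
END_OF_LIFE_WEEKS = 626  # worst-case stressed cell, from the text

def advance_notice_weeks(failure_age_reduced_vdd, end_of_life=END_OF_LIFE_WEEKS):
    """Window of advance failure notification, in weeks.

    failure_age_reduced_vdd is the age at which the cell first fails
    when read at the reduced (virtually aged) supply voltage.
    """
    if failure_age_reduced_vdd > end_of_life:
        raise ValueError("failure under reduced voltage cannot follow end of life")
    return end_of_life - failure_age_reduced_vdd

# A 45-mV reduction makes the failure visible at roughly week 598.
notice = advance_notice_weeks(598)
```

A larger voltage reduction moves the manifested failure earlier and so widens the window, which is how the prediction's timeliness can be tuned.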
Fault exposure
The fault exposure mechanism is what makes all errors visible to the detection mechanism.
SRAM. The goal of fault exposure is to condition a failed cell to produce errors. Our main contribution here is based on a simple observation: the read stability problem in failed cells can be abstracted as a stuck-at-zero or a stuck-at-one fault if we can write known values into the SRAM and then read them. We reuse the pattern generators in memory BIST to produce and write these values: a simple “March” algorithm that writes all zeros followed by all ones will suffice for Aged-AsymChk.
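The write-then-read check can be sketched in Python as follows. The `FaultySram` class is a hypothetical software stand-in for a virtually aged SRAM array in which some cells behave as stuck-at faults; in the actual system the patterns would come from the memory BIST hardware, and the array would be drained and running at reduced voltage before the test.

```python
class FaultySram:
    """Toy model of an SRAM array; stuck_at maps address -> forced bit."""

    def __init__(self, num_cells, stuck_at=None):
        self.cells = [0] * num_cells
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, bit):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

def march_test(sram, num_cells):
    """Write all zeros, read back, then write all ones, read back.

    Returns a dict mapping each failing address to the value it
    appears stuck at. An empty dict means no impending failure seen.
    """
    failures = {}
    for pattern in (0, 1):
        for addr in range(num_cells):
            sram.write(addr, pattern)
        for addr in range(num_cells):
            observed = sram.read(addr)
            if observed != pattern:
                failures[addr] = observed
    return failures

# Usage: cell 2 is stuck at 0 and cell 5 is stuck at 1.
aged_array = FaultySram(8, stuck_at={2: 0, 5: 1})
impending = march_test(aged_array, 8)
```

Two passes are enough here because a stuck-at-zero cell fails only the all-ones pass and a stuck-at-one cell fails only the all-zeros pass.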
Logic. Exposing permanent faults in the critical path is straightforward, because a permanent fault keeps producing errors in the circuit. However, depending on the input values, some faults might be masked. Therefore, we need a mechanism that samples more than once to guarantee the detection mechanism’s completeness. Figure 4a shows how degradation affects a critical path, assuming that a guardband is added to accommodate aging. As the chip ages, the delay increases and the guardband slack decreases. When the delay degradation overshoots the guardband (at 3 years in the figure), soft breakdown occurs. Under virtual aging, the additional delay in gates on near-critical paths shows up as faults at the flip-flops they drive. This causes a bit flip (or metastability) at the output of the flip-flops, which can propagate to cause an architectural state corruption. These faults are exposed with no modifications required to the processor. Figure 1 shows an example circuit block highlighting the fact that the critical path is left unmodified. Noncritical paths introduce subtle challenges because gates that are exclusively on noncritical paths (fast gates) can degrade directly to hard breakdown without ever manifesting as a delay fault, thus circumventing the prediction mechanism. Simple clock-phase-shifting logic can be added to gates on noncritical paths to effectively expose their delays (see Figure 1). Because the modifications are made only to paths that have ample slack, they are not a source of complexity.
[Figure 3 axes: voltage (volts, 0 to 1.2) versus time in weeks (100 to 600), with the end of life marked. Annotation: failure predicted about 28 weeks in advance with VDD reduced by 45 mV.]
Figure 3. The timing of failure manifestation using virtual aging versus supply voltage. As the supply voltage is reduced (virtual aging), the time when the failure occurs becomes earlier.
Fault detection
The fault-detection mechanism compares measured (read) values against known (written) values to determine when a fault has occurred.
Logic. For fault detection in logic, we use a separate checker core that is started on the basis of the checked core’s checkpoint. The checker core operates at regular voltage. As we outlined earlier, we need a full-fledged core to address accuracy problems, because BIST and test-vector-based techniques compromise coverage for delay-based fault models. We also add a simple reliability manager module to every core, which monitors retiring instructions, converts them into a signature, and sends the signature to the checker core using the L2-cache communication network. The checker core’s reliability manager checks the signature against its own computed signatures. Shuou Nomura and colleagues describe the firmware or OS to allow the pairing of arbitrary cores together using the idea of virtual CPUs.7 We assume the same to allow the coupling of cores.
SRAM. The detection phase is trivial for Aged-AsymChk, because the BIST controller knows what values to expect; any differences are flagged as impending failures.
Discussion
An important question to consider is, compared to prior works, what do we lose or what assumptions are broken or ignored? We make one judicious cross-layer (circuit to architecture layer) assumption: the state or values in the SRAM can be drained using an architectural mechanism, allowing the SRAM’s contents to be overwritten to allow BIST-based stuck-at-fault testing periodically. In the context of a microprocessor execution, this is a reasonable and easy-to-implement assumption. However, the circuit-based techniques attempt to address wear out in isolation and hence avoid such assumptions.

[Figure 4 panels: (a) a near-critical path at 0 years (guardband intact), 2.5 years (degradation), and 3 years (timing violation, soft breakdown); with virtual aging applied at 2.5 years, the fault is manifested and exposed at the capture edge. (b) a noncritical path with large slack, where degradation leads to hard breakdown with no fault seen at the normal capture edge; a phased clock exposes the fault.]
Figure 4. Signal integrity in circuits as they age. (a) In near-critical paths, the signal integrity will not hold once the guardband is degraded (a delay fault), and virtual aging alone can detect the problem in advance. (b) In noncritical paths, hard breakdown may occur before a delay fault manifests, but a phased clock on these paths can expose the issue earlier.
Evaluation
Our evaluation of wear out and of the Aged Full-Chip Predictor’s effectiveness is organized around eight questions, of which questions 5 through 8 address overhead and accuracy.
• Q1: Are wear out and its effects measurably observable?
• Q2: Can voltage reduction virtually manifest wear-out faults?
• Q3: Are the manifested faults exposed to a higher level?
• Q4: Are the faults exposed to the higher level detected?
• Q5: What are the overheads?
• Q6: What is the delay to predict the wear out?
• Q7: When does this technique provably fail to predict wear out?
• Q8: How does this technique compare to the current state-of-the-art methods?
We examine each question for logic and SRAM. By design, we achieve low complexity, which was our other key goal.
Methodology
Our evaluation of the Aged Full-Chip Predictor uses a prototype system we built on the basis of the OpenRISC processor (see Figure 5). For logic and Aged-SDMR, our general philosophy is as follows:
• Use Spice and MOSRA with the 32-nm silicon-on-insulator library to evaluate any gate-level effects.
• Use gate-level delay-aware simulations to check for timing faults.
• Use full-system emulation on the field-programmable gate array when actual runtime data is required.
For Aged-AsymChk, our evaluation is similar:
• Use Spice and MOSRA to evaluate any gate-level effects, including the noise margin.
• Use the noise-margin results to determine failures in SRAM reads.
• Use analytical models and workload measurements to determine the effect of applications on wear out.
One difference is that we run more benchmarks using larger input sets, totaling 35 and spanning SPEC2K, SPEC2006, MediaBench, and Parboil, to capture cache and SRAM effects more representatively.
Aged-SDMR results
Table 1 summarizes the key results for Aged-SDMR, and Table 2 compares Aged-SDMR to three state-of-the-art techniques.9–11
Aged-AsymChk results
We address the evaluation questions for Aged-AsymChk in detail below.
Understanding degradation (Q1). Degradation in SRAM devices is measurably observable and cannot be statically determined because it depends on the switching activity. Figure 3 previously showed this aging behavior at the cell level. Figure 6a shows the wear out at the application level for every cell in a 64-Kbyte data cache (a two-way set-associative, level-1 cache with 64-byte blocks). Here, we quantify and visualize wear-out intensity using a simple model: we count each cycle in which a cell holds 1 as one unit of wear out, and we assume every transition to 0 contributes −1/100th of one unit (modeling NBTI recovery). For all applications, we consider a 200-million-cycle window, and pixel values are normalized to maximum wear out. Two banks form the cache ways, shown side by side. We also determined the average and standard deviation of wear out across all the
[Figure 5 panels: one block per evaluation question (1 through 8), each split into a logic flow and an SRAM flow. The flows combine HSpice + MOSRA simulation with a 32-nm library, delay-aware simulation and static timing analysis (Synopsys Design Compiler), an OpenRISC processor on a Xilinx Zynq FPGA with a checker, and analytical and probabilistic models; workloads are SPEC2000 for logic and SPEC2000, SPEC2K6, MediaBench, and Parboil for SRAM.]
Figure 5. Evaluation setup. We built a prototype system based on the OpenRISC processor to evaluate the Aged Full-Chip Predictor.
Table 1. Aged-SDMR results.
Evaluation question | Results
Understanding degradation (Q1) | Delay degradation in CMOS logic is measurably observable. Dependent on factors including switching activity (cannot be statically determined).
Manifesting faults (Q2) | Reducing VDD mimics aging. For example, a 50-mV (4.1%) reduction corresponds to predicting up to nine months in advance.
Exposing faults (Q3) | While in Aged-SDMR mode, timing faults indicate impending hard or soft breakdowns. Virtual aging induces timing faults at a rate between 0 and 9.8%.
Detecting faults (Q4) | Faults introduced in Aged-SDMR mode translate to architectural errors and can be caught without escapes. Empirically, errors were seen in at least 0.02% of cycles and were caught within a few samples.
Estimating overheads (Q5) | Aged-SDMR has small area (8.9%), power (2.54%), and energy (0.7%) overheads.
Delay to predict (Q6) | We can guarantee an upper bound on Aged-SDMR’s prediction latency mathematically, based on defect and sampling rates. The longest latency to predict is 0.4 days.
When the technique does not work (Q7) | Aged-SDMR cannot predict faults that do not start as delay faults. For delay-based faults, missed sites are those that have high switching activity but do not affect the architectural trace (integer benchmarks might do this to the floating-point pipeline). If more than 0.4 days of life remain, Aged-SDMR will still predict correctly. The masking scenario is rare in commercial designs because power/value gating avoids unnecessary switching.
Comparison to state-of-the-art methods (Q8) | Aged-SDMR is comparable, if not better, on other metrics and also provides generality. Previous techniques do not provide generality and accuracy, leaving fast gates (30 to 40% of gates) uncovered.
Table 2. A comparison of Aged-SDMR and three state-of-the-art techniques.
Technique | Area overhead (%) | Power overhead (%) | Time to predict | Prediction horizon
Online wear-out prediction (ref. 9) | 4.6† | 8.6† | 4 days | 2 years, 4 days
WearMon (ref. 11) | ~14‡ | Not reported | Varies | Not reported
FIRST (ref. 10) | Not reported | 0 | 1 day | 9 months, 1 day*
Aged-SDMR | 8.94 | 3.2 | 0.4 days | 9 months, 0.4 days
† For every eight signals monitored.
‡ Rough estimates from field-programmable gate array use numbers reported by the authors.
* Assuming a virtual aging mechanism similar to this work.
bits with all 35 applications and computed it to be 0.278 and 0.2895. Even simply looking at distributions of wear out among the bits, we observe they sometimes follow a normal
distribution but with large differences in standard deviation and variance across benchmarks (see Figure 6b). These data measurements demonstrate the diversity and
substantiate two points: that the degradation is highly application dependent, and that degradation within the different cells of an SRAM block can vary significantly.
Manifesting faults (Q2). As we demonstrated earlier, reducing VDD mimics aging (see Figure 3). Empirically, for example, a 45-mV reduction emulated 28 weeks of aging.
Exposing faults (Q3) and detecting faults at a higher level (Q4). Figure 2d showed that the end effect of SRAM cell aging is read stability failure. By design, writing 1s and then reading them exposes the wear-out fault under virtual aging.
Figure 6. Application-level behavior of the wear out in the SRAM cells. (a) Visualization of the SRAM wear out in a 64-Kbyte data cache for four applications (164-gzip, 175-vpr, 429-mcf, and 456-hmmer). Wear out of each SRAM cell depends on the application behavior. (b) Distribution of SRAM cells by wear-out intensity. A point (x, y) indicates that y percent of the bits in the SRAM have a normalized wear-out intensity of x.
Estimating overheads (Q5). In terms of area, there is practically no additional overhead because we simply reuse the existing BIST circuitry. In terms of performance slowdown, Aged-AsymChk can be run quite infrequently. Because it predicts wear out without memory corruption and is 100 percent accurate, the only requirement is to run it at periods shorter than the age mimicked by virtual aging, which is on the order of weeks. On the basis of our empirical data, the overhead of checking is pessimistically on the order of 1 million cycles. Even assuming that S-epochs are activated as often as every 100 context switches, which at a 5-ms OS scheduling quantum would be every half a second, a 1-GHz processor at one instruction per cycle would see negligible overhead (0.2 percent). Therefore, Aged-AsymChk introduces no significant performance, power, or area overhead to the system.
Delay to predict (Q6). Unlike for logic, the delay to predict for SRAM is on the order of milliseconds, because the prediction happens in a single S-epoch and is application independent. The delay guarantees for logic are probabilistic and are for the worst case, because some sampling windows are required to guarantee overlap of the DMR window with a fault occurrence by the application.
When the technique does not work (Q7). Failures in SRAM that do not start as read failures cannot be detected. Although such failures exist (electromigration, for example), there is evidence that NBTI, which we cover, is dominant. Unlike the logic case, for device faults that adhere to the model, Aged-AsymChk is 100 percent correct because it is based on the formal BIST model that can generate vectors with 100 percent coverage.
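The Aged-AsymChk overhead estimate (Q5) is simple arithmetic, and a short Python check makes its inputs explicit. This is only a sketch of the calculation described in the text. The function and constant names are illustrative and do not come from the authors' tooling.

```python
# Back-of-the-envelope check of the Aged-AsymChk performance overhead (Q5).
# All inputs are the figures quoted in the text; the names are illustrative.

CHECK_CYCLES = 1_000_000      # pessimistic cost of one check, in cycles
SWITCHES_PER_S_EPOCH = 100    # an S-epoch every 100 context switches
QUANTUM_SECONDS = 5e-3        # 5-ms OS scheduling quantum
CLOCK_HZ = 1e9                # 1-GHz core at one instruction per cycle


def check_overhead(check_cycles, switches_per_epoch, quantum_s, clock_hz):
    """Return (seconds between checks, fraction of cycles spent checking)."""
    period_s = switches_per_epoch * quantum_s
    period_cycles = period_s * clock_hz
    return period_s, check_cycles / period_cycles


period_s, overhead = check_overhead(
    CHECK_CYCLES, SWITCHES_PER_S_EPOCH, QUANTUM_SECONDS, CLOCK_HZ
)
print(f"period between checks: {period_s:.2f} s")
print(f"checking overhead: {overhead:.2%}")
```

With these inputs the check runs every half second and costs about 0.2 percent of the cycles, in line with the figure quoted in the text. Running the check less often than every 100 context switches lowers the overhead proportionally.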
Table 3. Defect rates (parts per million) of SRAM arrays.
$f_c(t)$ | ECC (16 data bits, 6 ECC bits), single failure | ECC (16 data bits, 6 ECC bits), double failure | ECC (256 data bits, 10 ECC bits), single failure | ECC (256 data bits, 10 ECC bits), double failure
$10^{-7}$ | 4,495 | 0 | 53,018 | 1
$10^{-6}$ | 44,055 | 0 | 419,881 | 72
$10^{-5}$ | 362,700 | 47 | 995,662 | 7,179
$10^{-4}$ | 988,903 | 4,716 | 999,999 | 508,041
$10^{-3}$ | 1,000,000 | 373,043 | 1,000,000 | 1,000,000
Comparison to state-of-the-art methods (Q8). As we mentioned earlier, prior work does not combine low overhead, high accuracy, and low complexity. Quantitatively, Aged-AsymChk either eliminates silent data corruptions for baselines without ECC or it increases the array’s lifetime. We developed an SRAM array defect-rate model to show how we can extend the average proficient lifetime by 14 months, considering common wear-out patterns. We first used a fixed cell-failure model (excluding dynamic sources of wear out such as the application and temperature) and then extended those results, considering time-varying failure rates.
Failure model preliminaries. Using basic probability, we built a simple analytical model for how wear out affects SRAM array failure. The key input was a cell’s read failure probability at a given time, $f_c(t)$. (The read failure probability indicates the probability that a six-transistor SRAM cell has a read failure at a given time. For example, a read failure probability of $10^{-7}$ indicates that one SRAM cell out of $10^{7}$ cells has a read failure.) We considered an SRAM made of $n$ blocks and used cache-block-granularity single-error-correction, double-error-detection ECC. We used two cache block sizes with $k$ data bits and $e$ ECC bits: (16, 6) and (256, 10). We define the defect rate as the defective parts per million. The single-failure defect rate considers one bit failure to be a defect, whereas the double-failure defect rate considers two failures (in a single block) to be a defect. ECC-only arrays are proficient only until the first error, at which point they must be decommissioned to prevent uncorrectable errors. Arrays with prediction capability are proficient until just before the second error, extending their lifetime.
SRAM array model for fixed defect rates. We can build a defect rate model for an SRAM array, based on the binomial probability model, by first calculating the failure probability of a cache block from the bit failure probability $f_c(t)$, and then the failure probability of the array from that of its blocks. We consider both the single-failure (Equation 1) and double-failure (Equation 2) cases:
\[ f_{\mathrm{block},1}(t) = 1 - \left(1 - f_c(t)\right)^{k+e} \tag{1} \]
\[ f_{\mathrm{block},2}(t) = 1 - \left[ \left(1 - f_c(t)\right)^{k+e} + \binom{k+e}{1} f_c(t) \left(1 - f_c(t)\right)^{k+e-1} \right] \tag{2} \]
\[ f_{\mathrm{array}}(t) = 1 - \left(1 - f_{\mathrm{block},i}(t)\right)^{n} \tag{3} \]
Equations 1 and 2 calculate the probability that at least one or at least two bits, respectively, in a given $(k+e)$-bit block are erroneous at a given time. Equation 3 finds the probability that at least one block in a given SRAM array made of $n$ blocks is faulty at a given time. Table 3 shows the single- and double-failure defect rates for various cell failure probabilities $f_c(t)$ and two extreme granularities of ECC. We can draw three implications from Table 3. First, as expected, fine-grained ECC has a lower defect rate. Second, at low cell-failure probabilities, the single-failure defect rate is orders of magnitude higher than the double-failure defect rate that applies when prediction is available. Third, schemes that decommission arrays and cache blocks at the first failure incur wasted lifetime: nearly 100 percent and 36 percent of arrays for coarse- and fine-grained ECC, respectively, at $f_c(t) = 10^{-5}$.
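Equations 1 through 3 can be evaluated directly. The Python sketch below does so under one assumption that is not in the text: the number of blocks, `N_BLOCKS = 2048`, was picked here because it lands close to the low-probability entries of Table 3. The function names are illustrative. The sketch implements the equations as printed and is not guaranteed to match every published table entry.

```python
import math


def block_failure_prob(fc, k, e, min_failures):
    """Probability that at least `min_failures` (1 or 2) of the k+e bits fail."""
    bits = k + e
    if min_failures == 1:
        # Equation 1, written with log1p/expm1 to stay accurate for tiny fc.
        return -math.expm1(bits * math.log1p(-fc))
    if min_failures == 2:
        none_fail = (1.0 - fc) ** bits
        one_fails = bits * fc * (1.0 - fc) ** (bits - 1)
        # Equation 2; clamp guards against tiny negative rounding residue.
        return max(0.0, 1.0 - (none_fail + one_fails))
    raise ValueError("min_failures must be 1 or 2")


def array_defect_ppm(fc, k, e, n_blocks, min_failures):
    """Equation 3, scaled to defective parts per million."""
    f_block = block_failure_prob(fc, k, e, min_failures)
    if f_block >= 1.0:
        return 1e6
    f_array = -math.expm1(n_blocks * math.log1p(-f_block))
    return 1e6 * f_array


N_BLOCKS = 2048  # ASSUMPTION: the article does not state n

for k, e in ((16, 6), (256, 10)):
    for exponent in range(-7, -2):
        fc = 10.0 ** exponent
        single = array_defect_ppm(fc, k, e, N_BLOCKS, 1)
        double = array_defect_ppm(fc, k, e, N_BLOCKS, 2)
        print(f"({k}, {e})  fc=1e{exponent}  "
              f"single={single:,.0f} ppm  double={double:,.0f} ppm")
```

The gap between the `single` and `double` columns at a given cell failure probability is the share of arrays that a decommission-at-first-failure policy retires early, which is the wasted lifetime discussed in the text.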
Figure 7. Wear-out models and added life from effective prediction, for the optimistic, linear, and pessimistic wear-out curves. (a) Normalized $f_c$ versus age in months: the (x, y) point indicates the read failure probability of an SRAM cell normalized to $10^{-6}$ ($f_c(t)$ is y after x months). (b) Months of added life versus percentage of SRAM arrays: the (x, y) point indicates that the lifetime of x percent of the total fabricated SRAM arrays is extended by y months.
Extending results for dynamic wear out. To quantify the wasted lifetime for SRAM arrays, we extend the model to include dynamic SRAM wear out, the primary effect of which is to cause $f_c(t)$ to become time dependent (increasing over time). Our extended model must incorporate several issues. First, the wear out of different bits will vary, implying that a single $f_c(t)$ no longer models the entire array. Second, depending on the SRAM’s usage, $f_c(t)$ changes to some value by the end of the SRAM array’s lifetime. Third, $f_c(t)$ changes at some rate with time to reach this final value. Finally, we must determine when the array is single-failure defective or double-failure defective.
These phenomena are highly application dependent, and we make some simplifying assumptions to capture first-order effects. First, we assume the highest $f_c(t)$ of the bits in a block, thus providing a lower-bound estimate on wasted life. Second, we assume $f_c(t)$ changes by one order of magnitude due to wear out; this has strong empirical evidence from circuit literature.3,12 Finally, to model the rate of change of $f_c(t)$, we consider reciprocal, linear, and exponential change, as in Figure 7a. Linear change is likely the common case. Exponential and reciprocal change represent the worst (pessimistic) case and best (optimistic) case for the benefits of our technique, respectively.
We considered a 36-month period discretized at monthly granularity, and we assumed the second error occurs at the end of this period. We used $f_c(t)$ at each month to calculate the defect rates, which determine how many arrays are wasted due to early decommissioning based on the first failure. Figure 7b shows the dynamic wear-out model’s results in terms of months of added life for a given percentage of the SRAM arrays, which suggests two things. First, the lifetime can be extended significantly, by 17, 14, and 7 months on average for the optimistic, linear, and pessimistic scenarios, respectively. Second, the lifetime is extended for a significant fraction of the SRAM arrays: 95, 87, and 46 percent, respectively.
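The dynamic model can be illustrated with a small Python sketch. It is not the authors' model and is not expected to reproduce the 17-, 14-, and 7-month figures, because the text does not give the exact curve equations or array parameters. Everything below is an assumption made for illustration: the three curve formulas, the baseline of $10^{-6}$ taken from the Figure 7a normalization, the (16, 6) block with 2,048 blocks, and the rule that an array gains the months between its first failure and month 36.

```python
import math

MONTHS = 36       # evaluation period from the text, monthly granularity
BASE_FC = 1e-6    # ASSUMPTION: baseline taken from the Figure 7a normalization
GROWTH = 10.0     # f_c rises by one order of magnitude over the period
SHAPES = ("optimistic", "linear", "pessimistic")


def normalized_fc(month, shape):
    """ASSUMED curves; each goes from 1 at month 0 to GROWTH at MONTHS."""
    u = month / MONTHS
    if shape == "optimistic":    # reciprocal form: fast early rise
        return (GROWTH + 1.0) - GROWTH / (1.0 + (GROWTH - 1.0) * u)
    if shape == "linear":
        return 1.0 + (GROWTH - 1.0) * u
    if shape == "pessimistic":   # exponential form: slow early rise
        return GROWTH ** u
    raise ValueError(f"unknown shape: {shape}")


def single_failure_fraction(fc, bits, n_blocks):
    """Fraction of arrays with at least one failing bit (Equations 1 and 3)."""
    return -math.expm1(bits * n_blocks * math.log1p(-fc))


def mean_added_months(shape, bits=22, n_blocks=2048):
    """Average added life over all arrays, in months.

    An array whose first failure appears in month m gains MONTHS - m months,
    because the second error is assumed to occur at the end of the period.
    """
    added = 0.0
    previous = 0.0
    for month in range(MONTHS + 1):
        fc = BASE_FC * normalized_fc(month, shape)
        defective = single_failure_fraction(fc, bits, n_blocks)
        added += (defective - previous) * (MONTHS - month)
        previous = defective
    return added


for shape in SHAPES:
    print(f"{shape:12s} mean added life: {mean_added_months(shape):.1f} months")
```

Whatever the exact curves, the ordering is the point of the sketch: the earlier $f_c(t)$ rises, the earlier first failures appear and the more lifetime prediction can recover, which is why the reciprocal curve is the optimistic case and the exponential curve the pessimistic one.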
By providing a unified technique for error prediction in both logic and SRAM settings, with low overhead and high fault coverage, the Aged Full-Chip Predictor could serve as an important component for future fault-dominated technologies. The mechanisms behind the concepts of virtual aging and sampling are well understood and easy to implement, making the idea attractive and practical to deploy. One primary implication is that future designs can more aggressively provision the resources for recovering from soft errors (such as ECC in SRAMs), while relying on the Aged Full-Chip Predictor for the prediction and detection of hard errors. Looking forward, understanding the relationship between delay degradation and failure modes in far-out semiconductor technologies will be the key to using virtual aging to address future reliability challenges.
References
1. A. Haggag et al., “Realistic Projections of Product Fails from NBTI and TDDB,” Proc. 44th Ann. IEEE Int’l Reliability Physics Symp., 2006, pp. 541–544.
2. A.W. Strong et al., Reliability Wearout Mechanisms in Advanced CMOS Technologies, vol. 12, Wiley-IEEE Press, 2009.
3. K. Kang et al., “Impact of Negative-Bias Temperature Instability in Nanoscale SRAM Array: Modeling and Analysis,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 10, 2007, pp. 1770–1781.
4. A. Bansal et al., “Impacts of NBTI and PBTI on SRAM Static/Dynamic Noise Margins and Cell Failure Probability,” Microelectronics Reliability, vol. 49, no. 6, 2009, pp. 642–649.
5. T.T.-H. Kim and Z.H. Kong, “Impact Analysis of NBTI/PBTI on SRAM VMIN and Design Techniques for Improved SRAM VMIN,” J. Semiconductor Tech. and Science, vol. 13, no. 2, 2013, pp. 87–97.
6. S. Kothawade et al., “Mitigating NBTI in the Physical Register File through Stress Prediction,” Proc. IEEE 30th Int’l Conf. Computer Design, 2012, pp. 345–351.
7. S. Nomura et al., “Sampling + DMR: Practical and Low-Overhead Permanent Fault Detection,” Proc. 38th Ann. Int’l Symp. Computer Architecture, 2011, pp. 201–212.
8. R. Balasubramanian and K. Sankaralingam, “Virtually-Aged Sampling DMR: Unifying Circuit Failure Prediction and Circuit Failure Detection,” Proc. 46th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2013, pp. 123–135.
9. J. Blome et al., “Self-Calibrating Online Wearout Detection,” Proc. 40th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2007, pp. 109–122.
10. J.C. Smolens et al., “Detecting Emerging Wearout Faults,” 3rd IEEE Workshop Silicon Errors in Logic-System Effects, 2007; http://jared.smolens.org/documents/first________________________ smolens-selse07.pdf.
11. B. Zandian et al., “WearMon: Reliability Monitoring Using Adaptive Critical Path Testing,” Proc. 40th Ann. IEEE/IFIP Int’l Conf. Dependable Systems and Networks, 2010, pp. 151–160.
12. K. Kang et al., “Estimation of Statistical Variation in Temporal NBTI Degradation and Its Impact on Lifetime Circuit Performance,” Proc. IEEE/ACM Int’l Conf. Computer-Aided Design, 2007, pp. 730–734.
Amir Yazdanbakhsh is a PhD student in the School of Computer Science at the Georgia Institute of Technology and a research assistant in the Alternative Computing Technologies (ACT) Lab. His research interests include computer architecture, approximate general-purpose computing, mixed-signal accelerator design, machine learning, and programming languages for hardware design. Yazdanbakhsh has an MS in computer engineering from the University of Wisconsin–Madison and an MS in electrical and computer engineering from the University of Tehran. He is a student member of IEEE. Contact him at [email protected].
Raghuraman Balasubramanian is a digital design engineer at Google. His research interests include microprocessor architecture and circuit design. Balasubramanian has an MS in computer science from the University of Wisconsin–Madison, where he completed the work for this article. Contact him at [email protected].
Tony Nowatzki is a PhD student in the Department of Computer Sciences at the University of Wisconsin–Madison and a member of the Vertical Research Group. His research interests include architecture and compiler codesign and mathematical modeling. Nowatzki has an MS in computer science from the University of Wisconsin–Madison. He is a student member of IEEE. Contact him at [email protected].
Karthikeyan Sankaralingam is an associate professor in the Department of Computer Sciences and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very large-scale integration. Sankaralingam has a PhD in computer science from the University of Texas at Austin. He is a senior member of IEEE. Contact him at [email protected].