Post Silicon Validation
09/29/03 TM Mak
Pre-Si Validation (recap)
• AV (Architectural or Functional)
  – Verify that the feature set matches specifications
  – "Focused" assembly-language tests to verify each architectural feature
• LV (Logic)
  – Usually based on the RTL model
• FV (Formal)
• PV (Performance, design phase)
  – The circuit has to meet the other physical spec items: speed/timing, voltage, temperature
Why Post-Si Validation?
• Partially non-functional (buggy) parts may be shipped to customers
  – Customer replacement (recall)!!
• Customers may have flaky systems or frequent crashes
• Silent data corruption, e.g. bank balances, design simulations
• Customers losing confidence (value of brand)
• Lost sales
• Depressed stock!!
Pre vs. Post Si Validation
• SRTL validation is MUCH s..l..o..w..e..r than real silicon
  – Typical full-chip SRTL simulation with checkers ran at 3-5 Hz on a 1 GHz machine
  – We used a compute farm containing thousands of machines running 24/7 to get ~6 billion (10^9) cycles/week
  – ALL the SRTL simulation cycles we recorded amounted to less than 2 minutes on a single 1 GHz system!
• But pre-silicon validation has some advantages
  – Fine-grained (cycle-by-cycle) checking
  – Visibility of internal state (e.g. caches, registers)
  – APIs to allow event injection
• No amount of dynamic validation is enough to exhaustively test a complex microprocessor (see the sketch below)
  – A single dyadic extended-precision FP instruction has ~10^50 operand combinations
  – A 3 GHz processor will do ~10^17 cycles per year
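To put those two numbers together, here is a quick back-of-the-envelope check (a sketch assuming two independent 80-bit extended-precision operands; rounding modes and precision control would push the count even higher):

    # Rough check of the exhaustiveness argument above.
    operand_bits = 80                          # x87 extended precision
    pairs = 2 ** (2 * operand_bits)            # two operands: ~1.5e48 pairs

    cycles_per_year = 3e9 * 365 * 24 * 3600    # 3 GHz for a year: ~9.5e16

    print(f"{pairs:.1e} pairs vs {cycles_per_year:.1e} cycles/year")
    print(f"~{pairs / cycles_per_year:.1e} years at one test per cycle")

Even testing one operand pair per cycle, exhausting a single instruction would take on the order of 10^31 years.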
Pre-silicon Validation Cycles (Millions) -- not that we don't try
[Chart: cumulative full-chip SRTL simulation cycles for the Pentium 4, by work week from WW40'98 to WW51'99, climbing to ~6,000 million cycles -- roughly 1/4 second of real-time execution.]
Cost of a Processor Bug
[Chart: cost of a bug (log scale, 1M$ to 10B$) vs. when it is found -- Pre-Si (2-4 yrs), Pre-production (0.5-1 yr), Post-Si/Post-production (5 yrs). The later the bug is found, the more expensive it is: first time-to-market slips, then lost sales, then recall.]
Post-Si Validation
• DV (Design)
  – Verify design sensitivity to various environmental conditions (e.g. voltage, temperature, frequency, and process variation [to some degree]) with a given test suite -- full-chip level test (diagnostic)
• SV (System)
  – Verify that the product is functional in a given system (designed to facilitate debugging) with real peripherals, BIOS, OS, and applications
• CV (Commercial/Compatibility)
  – Verify that the product is functional across OEM systems, OSes, and applications
• CMV (Circuit Marginality)
  – Verify that the product is free from voltage/temperature/frequency sensitivities in system-level operation
• MV (Manufacturing)
  – Verify that the product can be manufactured in HVM (high-volume manufacturing): yield is not impacted when the circuit is manufactured in high volume and over time (process variation), plus all of the above
• RV (Reliability)
  – Verify that the product has a low infant mortality rate and achieves a low FIT (failures in time) rate
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
Design Validation
• Performance characterization
  – Start when the product is healthy (no major functional bugs, initial units reach performance target)
  – Usually carried out on ATE with a defined set of tests
    • Bench may still be needed for parametrics (esp. special interfaces)
  – Parametric shmoos or data collection at specific test conditions
  – Shmoos to find the performance range and potential holes (i.e. a failed region in the middle of a pass region); see the sketch below
• Can also be viewed as part of the effort to improve product performance
  – One such parameter is "frequency"!
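Conceptually a shmoo is just a nested sweep of two operating-point axes with a pass/fail measurement at each point. Below is a minimal Python sketch; `run_test` is a hypothetical callback standing in for pattern execution on the tester (real ATEs have built-in shmoo tooling):

    def shmoo(run_test, vdd_points, freq_points_mhz):
        """Sweep VDD x frequency; return a {(vdd, freq): pass?} map."""
        return {(v, f): bool(run_test(v, f))
                for f in freq_points_mhz for v in vdd_points}

    def render(pass_map, vdd_points, freq_points_mhz):
        # '*' = pass, 'F' = fail; a hole is an 'F' inside the pass region
        for f in freq_points_mhz:
            row = "".join("*" if pass_map[(v, f)] else "F" for v in vdd_points)
            print(f"{f:6.1f} MHz |{row}|")

Margin to spec is then read off as the distance from the spec point to the pass/fail boundary, as in the plots on the next two slides.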
Typical Shmoo
[Shmoo plot: pass/fail regions over two operating-point axes, annotated with the spec point and the margin to spec.]
Fail Shmoos

VDD/Freq Fail Shmoo          Pattern List: ALU
         +----+----+----+----+----+----+----+----+
40.0 MHz |***************AAAAAABBCCCCCCCCCCCCCCCCCC| 25 ns
38.5 MHz |*****************AAAAABBCCCCCCCCCCCCCCCCC| 26 ns
37.0 MHz |********************AAAAABBCCCCCCCCCCCCCC| 27 ns
35.7 MHz |**********************AAAABBCCCCCCCCCCCCC| 28 ns
34.5 MHz |************************AAAABBBCCCCCCCCCC| 29 ns
33.3 MHz |*************************AAAABBBBCCCCCCCC| 30 ns
32.3 MHz |***************************AAABBBBCCCCCCC| 31 ns
31.2 MHz |****************************AAABBBBCCCCCC| 32 ns
30.3 MHz |*****************************AABBBBBCCCCC| 33 ns
29.4 MHz |******************************AABBBBBCCCC| 34 ns
27.8 MHz |*******************************ABBBBBCCCC| 36 ns
27.0 MHz |*******************************ABBBBBCCCC| 37 ns
26.3 MHz |********************************ABBBBBCCC| 38 ns
25.0 MHz |********************************ABBBBBCCC| 40 ns
23.8 MHz |*********************************ABBBBCCC| 42 ns
23.3 MHz |*********************************DDEEECCC| 43 ns
22.2 MHz |*********************************DDDEEECC| 45 ns
20.8 MHz |*********************************DDDDEEEC| 48 ns
20.0 MHz |**********************************DDDEEEC| 50 ns
18.9 MHz |**********************************DDDDEEE| 53 ns
17.9 MHz |**********************************DDDDDEE| 56 ns
16.9 MHz |**********************************DDDDDDE| 59 ns
         +----+----+----+----+----+----+----+----+
VDD:  7.0  6.5  6.0  5.5  5.0  4.5  4.0  3.5  3.0

Character  Cycle   Vector  Pattern
A          116392  2347    gecc1000
B           89228  5467    geba1113
C             122    55    ext003
D           83535  1178    geba1113
E           83288   855    geba1113
• Indicates the performance curve of various failure modes
  – Letter order is NOT pattern order
• E.g.:
  – 'A' is the first speed-limiting failure mode at high frequency and low VDD
  – 'D' and 'E' are hard low-voltage failure modes, even at slow speed
DV: Timing Distribution
[Figure: cumulative distribution and normal quantile plots of calculated_meas_rslt across the force_temperature range (0-110). N = 5180 samples; mean 1.36e-9 (std dev 1.058e-10, std err 1.47e-12); median 1.36e-9; minimum 1.06e-9, maximum 1.68e-9.]
Bench measurements
• ATE may not have the needed signal acquisition (or generation) capability
  – Zo (characteristic impedance)
  – Serial (clock-embedded) data
• Data needed for simulation model refinement
  – For future performance enhancement
Design Validation
• Functional test
  – Pattern coverage (often from AV, or uAV)
    • May not even fit into ATE memory
    • How to generate? How to measure?
  – Correlation between System/Bench/ATE
  – Wide enough skews/sample size
    • Process, Vcc, Temperature, ...
• Surprises need to be contained
  – Errata, spec changes, etc.
• All anomalies are investigated by designers as part of the silicon debug effort
  – Root-cause issues
  – Provide inputs for new-stepping design changes
Problems seen in a Shmoo
[VOL/IOL shmoo, pattern CMOS-OUTS: VOL swept 0.35-0.55 V against IOL stepped from 10.00 mA down to 5.00 mA. At 9.00-10.00 mA only the extremes of the VOL range fail; from ~8.50 mA down nearly the entire swept range fails, with scattered single-point holes and a ragged, non-monotonic pass/fail boundary.]
• Test Margin, Test Stability, Holes, Performance Cliffs…
• Debug is needed!
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
System Validation
• Target major CPU attributes:
  – Architecture, such as ISA, floating point unit, data space
  – Micro-architecture, boundary conditions between micro-architectural units
  – Multi-processor, such as memory coherency, consistency, and synchronization
• With these methods:
  – Directed or focused tests
  – Random instruction tests
  – Dataspace tests
  – Opcode coverage tests
• Validation platform to ease debug and validation
Validation Platform
• Scope and logic analyzer probe connections all built in
• Specific hardware to generate bus traffic patterns
• Software-controllable clock board that can step system frequency in 1 MHz steps
• Software-programmable voltage regulators for both CPU and chipsets
• In-Target Probe (ITP) debug port
• PCI link for a large # of synthetic IO agents
• UP, MP configurations
In-Target Probe
• Aka In-Circuit Emulator
  – Software/hardware co-debug
  – Breakpoints, internal status dump, etc.
  – Intel may make the hardware interface; third parties provide tools/GUI, etc.
• Utilizes the JTAG port to gain access to internal DFD control and data
  – Chip information is under NDA
  – Most CPU manufacturers provide similar capabilities to help OEM customers use their products
Focus vs. Random Test
• Focus tests can target specific processor features
  – Local APIC
  – Cache geometry (1M/2M vs 256K)
  – Paging modes (4K/2M/4M, Mode A/B/C/PSE36)
  – Data prefetcher logic
  – New instruction sets added to the architecture (MMX™, SSE™)
  – Multiprocessor / cache-coherency tests (TLB shootdown, MP algorithms, exhaustive dataspace testing)
  – Self-modifying/cross-modifying code tests
• Focus tests are self-checking
  – The test writer decides what results to check; checking is explicit and very specific
  – No simulator required
• Tight control of test execution flow
  – Tests are contrived and programmer-written
  – Algorithmic
• Random tests generate test conditions human test writers don't think of
  – Wild boundary conditions, where it is difficult to determine EXACT test conditions
  – Difficult to foresee all test conditions
• Random tests help when you don't know what you're looking for; you just know that there is (or may be) a problem
  – Fails in CV, where it is difficult to debug
  – Speedpath failures in CMV (look for a slower speed limiter)
• You know most of the preconditions for a particular failure but don't quite have everything
  – Add in the known preconditions and randomize the other elements
  – Use automated test generation to find the failure condition
• You have several sets of test conditions to test and don't want to rewrite tests to do it
Random Instruction Testing
• Randomize sets of processor features to exercise with randomly generated instructions
• Involves more than just random instructions
  – Architectural attributes (GDT, LDTs, paging, MTRRs, memory spanning, and data patterns) are also randomized. Examples include:
    • DPLs in segment descriptors randomized between 0 and 3
    • PTE present bit randomized (present/not present)
    • Fixed/variable MTRRs randomized
  – We're really doing Random Architectural Testing
• Results are compared via the memory-space map between the actual system and the architecturally simulated map
• Random test code "doesn't make any sense"
  – No apparent relationship between test instructions
• RIT tools are a key enabler (see the sketch below)
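A heavily simplified sketch of that flow (all names here are hypothetical; production RIT tools are far more elaborate): randomize the architectural context along with the instruction stream, run the same program on silicon and on the architectural simulator, and compare the resulting memory images.

    import random

    def random_context(rng):
        # Randomize architectural attributes, not just instructions
        return {
            "dpl": rng.randint(0, 3),              # descriptor DPL 0..3
            "pte_present": rng.random() < 0.9,     # PTE present bit
            "mtrr": rng.choice(["UC", "WC", "WT", "WP", "WB"]),
            "page_size": rng.choice(["4K", "2M", "4M"]),
        }

    def rit_run(seed, n_instr, gen_instr, run_silicon, run_simulator):
        rng = random.Random(seed)                  # seeded => reproducible
        ctx = random_context(rng)
        program = [gen_instr(rng, ctx) for _ in range(n_instr)]
        if run_silicon(program, ctx) != run_simulator(program, ctx):
            return ("MISMATCH", seed, ctx)         # keep seed/context for debug
        return ("PASS", seed, ctx)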
More on RIT Tools
• For long-pipeline processors, the tool needs to warm up the pipeline
  – Truly random instructions tend to cause frequent exceptions or other control-flow discontinuities
  – Typical RIT tools would encounter a pipeline hazard within 3 to 20 instructions on average
• Need to avoid false failures
  – e.g. hitting undefined processor states or other benign differences between architectural simulators and real silicon (see the sketch below)
• Need to fully propagate architectural state to the memory image file without affecting the randomness of the instruction stream
• Need high throughput to find those rare or highly intermittent bugs!
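One common tactic for the false-failure problem is to compare silicon against the simulator only through masks that hide architecturally undefined bits. A hedged sketch (the mask table is illustrative; for instance, the x86 manuals leave SF/ZF/AF/PF undefined after MUL):

    # EFLAGS bits left undefined by a given instruction:
    # SF(bit 7), ZF(6), AF(4), PF(2) -> mask 0xD4 for MUL
    UNDEFINED_BITS = {"MUL": 0x0000_00D4}

    def eflags_match(instr, silicon_eflags, model_eflags):
        defined = 0xFFFF_FFFF & ~UNDEFINED_BITS.get(instr, 0)
        return (silicon_eflags & defined) == (model_eflags & defined)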
System Validation -- Dataspace
• Dataspace validation ensures that arithmetic operations generate the correct results. Several checking strategies exist (see the sketch below):
  – To check the result, an inverse operation may be used:
    • If (A*B equals C), then (C/B = (A ± rounding error))
    • Must be able to guarantee that the inverse operation could not have an error that exactly offsets an error in the operation under test (aliasing)
  – Use an algorithm that does not use the instruction/hardware under test:
    • A * 4 = A SHL 2
  – Use a "Golden System" to check your answers:
    • A * B on our SUT = A * B on our "Golden System"
    • Defining what constitutes a "Golden System" is another tutorial in itself
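As a concrete illustration of the first strategy, here is a minimal Python sketch of the inverse-operation check for multiply (the 4-ulp error budget is an assumption, and a real tool would also rule out the aliasing case noted above; math.ulp needs Python 3.9+):

    import math
    import random

    def check_mul(a, b, ulps=4):
        """Check A*B by dividing the product back out: C/B == A +/- rounding."""
        c = a * b                                  # operation under test
        if b == 0.0 or math.isinf(c):
            return True                            # inverse can't check these
        return abs(c / b - a) <= ulps * math.ulp(abs(a))

    rng = random.Random(42)
    for _ in range(100_000):
        a, b = rng.uniform(-1e10, 1e10), rng.uniform(-1e10, 1e10)
        assert check_mul(a, b), (a, b)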
Dataspace – lots of spaces
[Table: map of the single-precision (32-bit) dataspace. Rows span the exponent byte (FF, FE, F0-FD, Ex, Dx, ..., 1x, 02-0F, 01, 00); columns span the mantissa (000000, 000001, 000002-07FFFF, ..., 7FFFFE, 7FFFFF). Exponent FF holds Infinity (mantissa 000000), then the SNaNs and QNaNs; exponent 00 holds Zero (mantissa 000000) and the denormals (DEN).]
• This table represents the dataspace for a single-precision operand (32 bits)
• The yellow areas represent the special portions of the dataspace: Zero, Infinity, Denormals, and NaNs. These rows account for less than 1% of the dataspace
• The green areas represent "interesting" areas because they contain boundary conditions. These also account for less than 1% of the dataspace
• The blue areas represent the remaining 98%+ of the dataspace (a generation sketch follows below)
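For concreteness, here is one way a dataspace generator can aim at those special rows directly, assembling single-precision operands from their sign/exponent/mantissa bit fields (standard-library Python; the chosen values mirror the corners of the table):

    import struct

    def f32(sign, exponent, mantissa):
        """Assemble an IEEE-754 single from its bit fields."""
        bits = (sign << 31) | (exponent << 23) | mantissa
        return struct.unpack("<f", struct.pack("<I", bits))[0]

    zero     = f32(0, 0x00, 0x000000)   # exponent 00, mantissa 000000
    den_min  = f32(0, 0x00, 0x000001)   # smallest denormal (~1.4e-45)
    den_max  = f32(0, 0x00, 0x7FFFFF)   # largest denormal
    norm_min = f32(0, 0x01, 0x000000)   # boundary: smallest normal
    inf      = f32(0, 0xFF, 0x000000)   # exponent FF, mantissa 000000
    qnan     = f32(0, 0xFF, 0x400000)   # quiet NaN
    print(zero, den_min, den_max, norm_min, inf, qnan)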
Why Dataspace Validation?
• Two well-known arithmetic disasters:
  – On February 25, 1991, a Patriot missile failed to intercept a Scud missile; 28 U.S. soldiers were killed
    • Cause: Accumulated rounding errors in the routine that added up the tenths of seconds the system had been up since reboot. After 100 hours of operation, the clock was approximately 1/3 of a second slow (worked out below)
  – On June 4, 1996, an Ariane rocket (valued at about $500 million) exploded 40 seconds after liftoff
    • Cause: The routine that converted a 64-bit floating-point value into a 16-bit integer overflowed, causing the guidance program to crash
• Intel has had an arithmetic "disaster" as well:
  – The disaster was a bug known as "FDIV"
  – More people probably know about FDIV than the two previously mentioned disasters
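The Patriot case is a pure dataspace effect, and the drift can be worked out directly. A quick check, using the published post-mortem detail that the system counted 0.1 s ticks in a 24-bit fixed-point format:

    from fractions import Fraction

    exact  = Fraction(1, 10)                     # one clock tick = 0.1 s
    stored = Fraction(int(0.1 * 2**24), 2**24)   # 0.1 truncated to 24 bits
    err = float(exact - stored)                  # ~9.5e-8 s lost per tick

    ticks = 100 * 3600 * 10                      # 100 hours of 0.1 s ticks
    print(f"error/tick {err:.2e} s -> drift {err * ticks:.2f} s after 100 h")

That is ~0.34 s, matching the "approximately 1/3 of a second" above; at Scud closing speeds it shifts the tracking gate by hundreds of meters.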
The FDIV Bug
• In late 1994, the Intel Pentium® processor had what Intel referred to as a "flaw"
• 5 PLA entries used in divide operations were missing:
  – The entries were used to predict intermediate quotient values
  – The most significant digits of the missing entries were 1.0001, 1.0100, 1.0111, 1.1010, 1.1101
  – Actually affected FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, FPREM1, FPTAN, FPATAN
• Intel's reaction aggravated the situation:
  – Erratum title: "Slight precision loss for floating-point divides on specific operand pairs"
  – "The statistical fraction of the total input number space prone to failure is 1.14x10^-10."
  – "Statistical characterization yields a probability that about one in nine billion randomly fed operand pairs on the divide instruction will produce results with reduced precision."
  – "The occurrence of the anomaly depends upon….the way in which the final result of the application is interpreted."
• The bottom line:
  – This "flaw" cost Intel $490 million! (a litmus test follows below)
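The flaw was also trivially demonstrable. The operand pair below circulated widely at the time as a litmus test; in correct double-precision arithmetic the residual is exactly 0, while the flawed FDIV returned 256:

    x, y = 4195835.0, 3145727.0
    q = x / y            # correct: 1.33382044...; flawed Pentium: 1.33373906...
    print(q, x - q * y)  # correct FPU: 0.0 -- the flawed FDIV gave 256.0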
Why Are Dataspace Bugs So Bad?
• Dataspace bugs are data integrity bugs!
• They can be really, really, really hard to find
• If you accidentally hit one, you probably won't notice it:
  – You're probably not going to get a blue screen or any other sign that an incorrect result has been generated
• What can happen with inaccurate data?
  – A loss of accuracy in the 13th binary digit would probably not show up in your checkbook
  – Consider the Patriot missile and Ariane rocket disasters
• A spec update won't solve a dataspace bug:
  – "Never add those two values" is not an acceptable workaround
  – Replacement would be financially catastrophic!
Analog Validation – Bus Signaling
• Signal integrity checking becomes significant as system bus speeds head toward the Gbps level
  – Esp. with MP
• Component-level testing is insufficient to guarantee that the system will operate with margin
• DFT can be programmed through the BIOS to vary timing strobes and Vt levels in varying system environments
[Diagram: MP front-side bus topology, terminated to VCTERM at both ends, with four iA processors and an iA chipset on the bus; segment lengths 1.89 in (48 mm), 2.12 in (54 mm), and 3.35 in (85 mm), with 0.8 in and 0.1 in stubs. All dimensions are preliminary. Maximum trace length = ~6.7 in.]
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
Silicon Problems Can Show Up in Unique Systems/Apps
• Some cards come with their OWN BIOS => replace system BIOS calls
[Diagram: software stack from Application (written for a specific O/S and computer) through the O/S (written to support ONE type of computer) and BIOS (specific to each computer -- knows the H/W; can be different for each computer) down to the H/W; a SCSI card brings its own SCSI BIOS alongside the system BIOS.]
Commercial/Compatibility Validation
• Open-platform challenges
  – Selling components implies compatibility with a wide diversity of hardware and software (contrary to closed system architectures)
  – E.g. can't ask Microsoft to change the OS when there is a CPU/OS issue
• Validate microprocessors using robust combinations of commercial hardware and software across various platforms
  – Find CPU bugs and related issues
  – Identify and alert other groups when non-CPU issues affecting program health are found (i.e. apps, drivers, chipset, platform)
CV – Issues and Challenges
• Testing covers all common user applications and configurations, from legacy through bleeding-edge technology and custom apps
  – Gbit Ethernet, SATA, Fibre Channel, RAID, PCI, USB, 1394, ACPI, UDMA/66/100, AGP4x/8x, MMX/SSE apps, Xeon debug, Geyserville, soft DVD, video conferencing, video capture/editing, web server, CV RIT, LVDS, all network topologies, video server, gaming applications, industry benchmarks for stress…
  – Unique OSes: Linux, WinXP, Win2K, WinNT, SCO, WinMe/98, Netware, etc.
• Need highly automated, non-intrusive video capture and test management tools
• Debugging is often difficult due to unavailability of source code
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
Circuit Marginality Validation
• Overclocking!!
• Ask:
  – What is preventing us from running at the next bin (frequency target)?
  – Do we hit the right combinations of worst-case instructions or data combinations/permutations?
• Find silicon critical paths
  – Don't be surprised that real paths are VERY different from simulated paths
• System failures are the tip of the iceberg: you happen to stumble on something… but it may not be the worst case!!
  – Crosstalk and power-noise type problems are tricky
CMV Test Suites
• RIT
• Focus tests
  – Move lots of data, lots of cache/memory interaction
  – Create high power draw and large power transients
  – Directed random
• Commercial OSes and applications
• Need close correlation with ATE platforms
  – Critical-path tests needed for speed binning
Sources of Mis-correlation
• The ATE power system has far better control than an ordinary VRM
• Thermal control may also have different capability
• Test codes are different
  – Need to capture system code into manufacturing tests to ensure the worst-case tests are screened
[Figure: S9K (ATE) voltage error envelope vs. CMV (system) voltage error envelope.]
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
Manufacturing Validation
• Yield and manufacturability cannot be verified with samples
  – SIU/TIU (sort/test interface units, probe cards, loadboards, etc.)
  – Equipment (probers, handlers)
  – Process (documentation, training)
  – A wide variety of production material
• Risk reduction with MV data and learning from similar processes, products, packages
• All yield and manufacturability issues have to be addressed before ramping volume
  – Risk of tying up millions of $$ of WIP
Post-Si Validation
• DV (Design)
• SV (System)
• CV (Commercial/Compatibility)
• CMV (Circuit Marginality)
• MV (Manufacturing)
• RV (Reliability)
Reliability Validation
• ESD
• Infant mortality
• Life test
  – Device degradation
  – Electromigration, self-heating
• Package reliability
  – Temp cycle
  – 85°C/85% RH
  – Steam
Electro-Static Discharge
[Diagrams: discharge current IESD paths for the three standard ESD models -- Charged Device Model, Machine Model, and Human Body Model; in each case di/dt is large.]
Infant Mortality and Life-test
• Infant mortality (declining failure rate)
  – Due to latent reliability defects
  – Goals: 500 DPM within 0-30 days & 200 FIT within 0-1 year (see the conversion below)
  – Cumulative fallout vs. time follows a lognormal distribution
  – Scope of burn-in: 0 hr to 48 hr (6, 12, 24, 48 hr points)
• Wearout (increasing failure rate)
  – Due to oxide wearout, EM, hot-e, etc.
  – Goal:
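For scale, those two goals convert as follows under the standard definitions (1 FIT = 1 failure per 10^9 device-hours; DPM = defective parts per million):

    fit_goal = 200                     # failures per 1e9 device-hours (year 1)
    device_hours = 365 * 24            # one device-year = 8760 hours
    dpm_year1 = fit_goal * device_hours / 1e9 * 1e6
    print(f"{fit_goal} FIT ~= {dpm_year1:.0f} DPM over the first year")

So the 200 FIT goal allows on the order of 1,750 DPM across the first year, while the tighter 500 DPM goal applies only to the 0-30 day infant-mortality window that burn-in is meant to screen.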