Advanced/Parallel Computer. Architecture. Fundamentals of Computer. Design.
ECE7660 2005. Instruction Set Architecture. Applications. Operating System.
ECE7660/CSC7100 Advanced/Parallel Computer Architecture
Fundamentals of Computer Design
ECE7660 2005
CA = ISA + Organization Applications Operating System Compiler
Firmware
Instruction Set Architecture I/O System Instruction Set Processor Datapath & Control Digital Design Circuit Design Layout
software
The ISA abstracts the H/W and S/W interface and allows many implementation of varying cost and hardware performance to run the same S/W ECE7660 2005
instruction set
Recap: CA Topics Input/Output and Storage Disks, WORM, Tape
RAID
Memory Hierarchy
Coherence, Bandwidth, Latency
L2 Cache
Network Communication
L1 Cache
VLSI
Addressing, Protection, Exception Handling
Instruction Set Architecture
Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
Other Processors
Emerging Technologies Interleaving Bus protocols
DRAM
Pipelining and Instruction Level Parallelism
ECE7660 2005
Recap: Parallel CA Topics
P
M
P
S
M
° ° °
P
M
P
M
Interconnection Network
Processor-Memory-Switch
Multiprocessors Networks and Interconnections
ECE7660 2005
Shared Memory, Message Passing, Data Parallelism Network Interfaces Topologies, Routing, Bandwidth, Latency, Reliability
Today’s Lecture Performance of a Computer • Performance metrics: time, power, etc • Performance measurement • Principles of performance-oriented computer design
Cost factor for price-performance optimization • Price is what you sell a finished good for • Cost is the amount you spent to produce it, including overhead • What factors determine the cost
ECE7660 2005
Limiting Factors of Technology Moore’s Law tells the number of transistors in a die increases steadily over the year Why can’t we manufacture a “huge” die to accommodate more dies?? • Small is fast • Small is cheap
ECE7660 2005
Costs of Integrated Circuits Die cost =
Cost _ per _ wafter Die _ per _ wafer × Yield
Dies per wafer =
wafer _ area Die _ area 8’’ wafer contains 563 R20K uP (0.18u)
Ex: Find the number of dies per 30cm wafer for a die that is 0.7cm on a side? Die _ per _ wafer =
π ×(Wafer _ diameter / 2) 2 π ×Wafer _ diameter −
Die _ area
2 × Die _ area
=1347 ECE7660 2005
Costs of ICs: Die Yield Die yield: percentage of good dies on a wafer number • Wafer yield ~ 100% • Defects per unit area: 0.4 ~ 0.8 per cm^2 • For today’s multilevel metal CMOS processes, alpha ~ 4.0
Die _ yield = Wafer _ yield × (1 +
Defects _ per _ unit _ area× Die _ area
α
) −α
Die Cost is goes roughly with the cube of the area. ECE7660 2005
Cost of ICs: Example Find the die yield for dies that are 1cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm^2 For large die, the yield = (1+0.6x1/4.0)-4=0.57 For small die, the yield = (1+0.6x0.49/4.0)-4=0.75 A 30cm wafer:
366 good 1cm dies 1014 good 0.7cm dies
Most of todays’ 32 bit and 64bit uP in 0.25u process fall in between these two sizes ECE7660 2005
Real World Examples (Old) Chip
Metal Line layers width
Wafer Defect cost /cm2
Area Dies/ Yield Die Cost mm2 wafer
386DX
2
0.90
$900
1.0
43
360
71%
$4
486DX2
3
0.80
$1200
1.0
81
181
54%
$12
PowerPC 601
4
0.80
$1700
1.3
121
115
28%
$53
HP PA 7100
3
0.80
$1300
1.0
196
66
27%
$73
DEC Alpha
3
0.70
$1500
1.2
234
53
19%
$149
SuperSPARC
3
0.70
$1700
1.6
256
48
13%
$272
Pentium
3
0.80
$1500
1.5
296
40
9%
$417
From "Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
ECE7660 2005
Updated Cost of ICs A 30cm wafer with 4 to 6 metal layers costs $5K~$6K. Assume the wafer costs $5500 • The cost of 0.7cm die costs $5.42 • The cost of 1cm die costs $15.03 Implications: • Die size is the sole factor under the control of computer designers • Since the alpha in die_yield is around 4, the cost of die would grow with the 4th power of the die size • In practice, the number of defects per unit area is small, the cost per die grows as the square of the die area ECE7660 2005
Other Costs IC cost = Die cost + Testing cost + Packaging cost Final test yield Packaging Cost: depends on pins, heat dissipation
Chip 386DX 486DX2 PowerPC 601 HP PA 7100 DEC Alpha SuperSPARC Pentium
ECE7660 2005
Die cost $4 $12 $53 $73 $149 $272 $417
Package pins type 132 QFP 168 PGA 304 QFP 504 PGA 431 PGA 293 PGA 273 PGA
cost $1 $11 $3 $35 $30 $20 $19
Test & Assembly $4 $12 $21 $16 $23 $34 $37
Total $9 $35 $77 $124 $202 $326 $473
Learning Curve in Pricing Structure Understanding cost and pricing structure of the industry and market is key to cost-sensitive design of computers; The Learning Curve: manufacturing costs decrease over time, best measured by change in yield DRAM chip: drops by 40% per year Per MB: $5000 in1977 to $0.08 in 2001
Cost of Pentium III
ECE7660 2005
Price for a Computer System Distribution of cost in a system • Processor/memory board ? • I/O devices? • Cabinet, power, cables, etc
Cost vs Price
ECE7660 2005
Today’s Lecture Performance of a Computer • Performance metrics: time, power, etc • Performance measurement • Performance-Guided Computer Design Cost and Price • Price is what you sell a finished good for • Cost is the amount you spent to produce it, including overhead • What factors determine the cost ECE7660 2005
The bottom line: Performance Airplane
DC to Paris
Range
Speed (m.p.h.)
Passenge rs
Throughput (p x m.p.h.)
Cost
Boeing 747 BAC/Sud Concorde
6.5 hours
4150
610
470
286,700
???
3 hours
4000
1350
132
178200
???
Douglas DC-8
7.3 hours
8720
544
146
79,424
???
°Different measurements lead to different results ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns. .. (Performance) – throughput, bandwidth ° Cost renders the measurement more complex ° The bottom-line performance measurement is CPU execution time.
ECE7660 2005
Execution Time: A Performance Metric Execution time of a program • Wall-clock time, response time, or elapse time --- the latency to complete a task, including disk access, memory accesses, i/o activities, os overhead • CPU time: cpu is computing, not including the time waiting for I/O or running other programs - User cpu time vs system cpu time " X is n times faster than Y" means ExTime(Y) Performance(X) -------------- = ---------------------ExTime(X) Performance(Y)
ECE7660 2005
Benchmark Performance How we evaluate differences • Different systems • Changes to a single system Provide a target • Benchmarks should represent large class of important programs • Improving benchmark performance should help many programs For better or worse, benchmarks shape a field Good ones accelerate progress • good target for development Bad benchmarks hurt progress • help real programs v. sell machines/papers? • Inventions that help real programs don’t help benchmark ECE7660 2005
Programs to Evaluate Performance (Toy)
Benchmarks
• Small but easy to compile an run on simulators, convenient in early designing stage of a new machine • 10-100 line • e.g.,: sieve, puzzle, quicksort
Synthetic Benchmarks • attempt to match average frequencies of real workloads • e.g., Whetstone(Algol60ÆFortran), Dhrystone
Kernels • Time critical excerpts of Real programs • e.g., Livermore loops(21 loops), Linpack(linear algebra) • Popular in scientific computing • Isolating perf. of individual features of a machine
Real programs • e.g., gcc, spice ECE7660 2005
Benchmark Suites 1987 RISC industry mired in “bench marketing”: (“That is 8 MIPS machine, but they claim 10 MIPS!”) SPEC: Systems Performance Evaluation Committee • Standardized benchmark application suites established by computer industry in 1988. http://www.spec.org • Create standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O • SPEC benchmarks for different application classes
ECE7660 2005
SPEC first round First round 1989; 10 programs = 4 for integer + 6 for FP, single number to summarize performance One program (matrix300): 99% of time in single line of code New front-end compiler could improve dramatically 800 700 600 500 400 300 200 100
tomcatv
fpppp
matrix300
eqntott
li
nasa7
doduc
spice
epresso
gcc
0
Benchmark
ECE7660 2005
SPEC Evolution Second round; SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs) Compiler Flags unlimited. March 93 of DEC 4000 Model 610: spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy (b,a,c)” wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200 nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas Add SPECbase: don’t allow program-specific optimization flags. Third round; 1995; new set of programs • “benchmarks useful for 3 years” Forth round: SPEC2000
ECE7660 2005
Benchmark Suites Desktop Benchmark Suites: • CPU-intensive benchmarks: SPEC89 Æ SPEC92 Æ SPEC95 Æ SPEC2000(11 int CINT & 14 fp CFP2000): SPEC2000 CFP2000 real programs modified for portability and highlighting CPU • Graphics-intensive benchmarks: SPECviewperf for systems supporting the OpenGL graphics library, SPECapc for applications with intensive use of graphics
Server Benchmark Suites: • CPU-throughput benchmarks: SPEC CPU2000 Æ SPECrate • I/O-intensive benchmarks: SPECSFS for file server, SPECWeb for web server • Transaction-processing (TP) benchmarks: TPC (Transaction Processing Council, www.tpc.org): TCP-A (85) Æ TCP-C (complex query) Æ TCP-H (ad-hoc decision support)Æ TCP-R (business decision support) Æ TCPW (web-oriented). - E.g. TPC-W is a Web-based transaction benchmark that simulates the activities of a business-oriented transactional Web server.
Embedded Benchmarks: EEMBC (“embassy suites”) ECE7660 2005
How to Summarize Results? Program 1: 1 sec on machine A, 10 sec on B, 20 sec on C Program 2: 1000 sec on A 100 sec, 20 sec on C What are your conclusions?
• A is 10 times faster than B for program1. • B is 10 times faster than A for Program2. • Total execution time: a consistent summary measure B is 1001/110=9.1 times faster than A. C is 25 times faster than A for P1 and P2 C is 2.75 times faster than B for P1 and P2 Workload: need to consider the percentage/frequency of teach program in the total job. ECE7660 2005
How to Summarize Performance
Suppose n programs have execution time
ti
where
n
∑
Suppose the workload is ( w1 , w 2 ,.... w n ) where
i = 1, 2 ,... n .
wi = 1 .
i =1
n
1 Arithmetic Mean AM ( t i ) = n
1 ti = ( t1 + t 2 + Λ + t n ) n
∑
i =1
(ti ) =
Weighted Arithmetic mean WAM
n
∑
t i w i = t1 w 1 + t 2 w 2 + Λ + t n w n
i =1
Geometric Mean GM ( t i ) =
n
∏
n
ti =
1
n
1
i =1
Weighted Geometric Mean Harmonic Mean HM ( t i ) =
WGM
n n
∑
i =1
Weighted Harmonic Mean GHM
( t i ) = t1
=
1 ti
w1
× t2
w2
× Λ × tn
wn
n 1 1 1 + +Λ + t1 t 2 tn
(ti ) =
n n
∑
i =1
wi ti
=
n w1 w2 w + +Λ + n t1 t2 tn
ECE7660 2005
Performance Measure Assume
Program P1 P2 Arith weight:
Computers A B 1.0 10.0 1000.0 100.0
C 20.0 20.0
W(1) 0.50 0.50
500.5 ? ?
20.0 ? ?
if W(1) If W(2) If W(3)
55.0 ? ?
Normalized to Machine A P1 1.0 10.0 20.0 P2: 1.0 0.1 0.02 Arith weight 1.0 5.05 0.63 Geo mean 1.0 1.0 0.63
ECE7660 2005
1
t1 × t 2 × Λ × t n = t1 n × t 2 n × Λ × t n n
Weights W(2) W(3) 0.909 0.999 0.091 0.001
Same for diff. normalization
Example Normalized to A Normalized to B Time on A
Time on B
A
B
A
B
program1
1
10
1
10
0.1
1
program2
1000
100
1
0.1
10
1
Arithmetic mean
500.5
55
1
5.05
5.05
1
Geometric mean
31.6
31.6
1
1
1
1
The difficulty arises from the use of arithmetic mean of normalized time. Geometric mean is independent of which data series we use for normalized time because
GM ( xi ) x = GM ( i ) GM ( yi ) yi ECE7660 2005
How to Summarize Performance 3 means: arithmetic, geometric and harmonic. Which one is correct? Arithmetic mean (or weighted arithmetic mean) tracks execution time. Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time Normalized execution time is handy for scaling performance. But, • do not take the arithmetic mean of normalized execution time, use of the geometric wards all improvements equally: program A going from 2 seconds to 1 second as important as program B going from 2000 seconds to 1000 seconds. ECE7660 2005
Quantitative Principles of Computer Design
Amdahl’s Law --- Make the common case fast CPU time equation
ECE7660 2005
Amdahl's Law Speedup due to enhancement E:
Speedup(E) =
ExTime(without E) Performance(with E) = ExTime(with E) Performance(without E) F
F S Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then,
F ExTime(with E) = ((1 - F) + ) × ExTime(without E) S Speedup(E) =
ECE7660 2005
ExTime(without E) 1 1 = ≤ F F 1- F ((1 - F) + ) × ExTime(without E) (1 - F) + S S
Example: Suppose a person wants to travel from city A to city B by city C. The routes from A to C are in mountains and the routes from C to B are in desert. The distances from A to C , and from C to B are 80 miles and 200 miles, respectively. From A to C, walk at speed of 4 mph From C to B, walk or drive (at speed of 100 mph) Question: How long will it take for the entire trip How much faster from A to B by a car as opposed to walk
ECE7660 2005
Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time. Question: what is the overall speedup? Answer: Fraction_enhance = 0.4 Speedup_enhanced = 10 Speedup_overall = 1/(0.6+0.4/10) = 1.56
ECE7660 2005
Metrics of performance
Answers per month Operations per second
Application Programming Language Compiler ISA
(millions) of Instructions per second – MIPS (millions) of (F.P.) operations per second – MFLOP/s
Datapath Control
Megabytes per second
Function Units Transistors Wires Pins
Cycles per second (clock rate)
ECE7660 2005
Relating CPU Time Metrics CPU execution time = CPU clock cycles/pgm X clock cycle time or CPU execution time = CPU clock cycles/pgm ÷ clock rate CPU clock cycles/pgm = Instructions/pgm X avg. clock cycles per instr. or CPI = CPU clock cycles/pgm ÷ Instructions/pgm CPI tells us something about the Instruction Set Architecture, the Implementation of that architecture, and the program measured
ECE7660 2005
Aspects of CPU Performance CPU time
= Seconds
= Instructions x Cycles
Program
instr. count
Program
Instruction
CPI
clock rate
x Seconds Cycle
Program Compiler Instr. Set Arch. Organization Technology ECE7660 2005
Aspects of CPU Performance CPU time
= Seconds Program
= Instructions x Cycles Program
instr count (IC) CPI Program
X
X
Compiler
X
(x)
Instr. Set.
X
X
Organization Technology ECE7660 2005
X
Instruction
clock rate
X X
x Seconds Cycle
Organizational Trade-offs
Application Programming Language Compiler ISA
Instruction Mix
Datapath Control
CPI
Function Units Cycle Time
Transistors Wires Pins
ECE7660 2005
CPI – How to compute? “Average cycles per instruction”
CPI = (CPU Time × Clock Rate) / Instruction Count = Clock Cycles / Instruction Count “CPU clock cycles”
n
ClockCycleTime × ∑ CPI i × Ci
CPU time =
i =1
"instruction frequency" n
CPI =
∑ CPI
i
× Fi
where
i =1
Invest resources where time is Spent!
ECE7660 2005
Fi =
Ci Instruction _ Count
Example Our favorite program runs in 10 sec on machine A, which has a 400MHz clock. We are trying to design a machine B with faster clock rate so as to reduce the execution time to 6 sec. The increase of clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should be?
ECE7660 2005
Answer: CPU time A = CPU clock cycle A / clock rate A ==> CPU clock cycle A = 10 sec x 400 x 10^6 Clock rate B = CPU clock cycle B / CPU time B = 1.2*400*10^6 / 6 = 800 MHz
ECE7660 2005
Example Base Machine (Reg / Reg) and Instruction frequencies in the execution of a program: Op ALU Load Store Branch
Freq 43% 21% 12% 24%
Cycles 1 2 2 2
Question: What is the average CPI of the machine
ECE7660 2005
Example: Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster? And by how much? Let I denote the number of instructions of the program CPU time A = I * 2.0* 10 = 20 I CPU time B = I * 1.2 *20 = 24 I Machine A is 1.2 faster than B . ECE7660 2005
Example ISA has 3 kinds of instructions: Instruction class A B C
CPI for this instruction class 1 2 3
One program has 2 code sequences: Code Sequence
Instruction counts for instruction class A B C
1 2
2 4
1 1
2 1
Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence? S.1 has 5 instructions; S.2 has 6. S.1 needs 2x1+1x2+2x3=10 cycles; S.2 needs 4x1+1x2+1x3=9 cycles. S.1 has CPI=10/5=2; S.2 has CPI=9/6=1.5 ECE7660 2005
Marketing Metrics MIPS
= Instruction Count / Time * 10^6 = Clock Rate / CPI * 10^6
•Million Instructions Per Seconds •machines with different instruction sets ? •programs with different instruction mixes ? •dynamic frequency of instructions • uncorrelated with performance
MFLOP/S = FP Operations / Time * 10^6
•Million Floating-point Operations Per Second •machine dependent •often not where time is spent ECE7660 2005
Example: Assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions. 1) What is the CPI_opt ? 2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate). What is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the rating of execution time?
ECE7660 2005
Performance Evaluation Summary CPU time
= Seconds Program
= Instructions x Cycles Program
Instruction
x Seconds Cycle
Time is the measure of computer performance! Good products created when have: • Good benchmarks • Good ways to summarize performance If not good benchmarks and summary, then choice between improving product for real programs vs. improving product to get more sales=> sales almost always wins. Remember Amdahl’s Law: Speedup is limited by unimproved part of program.
ECE7660 2005
Summary of IC Costs IC cost = Die cost =
Die cost + Testing cost + Packaging cost Final test yield Wafer cost Dies per Wafer × Die yield
Dies per wafer =
π (Wafer_dia m/2)2 π × Wafer_diam − − Test_Die Die_Area 2 ⋅ Die_Area
⎧⎪ ⎛ Defect_Density × Die_area ⎞ Die Yield = Wafer_yield × ⎨1 + ⎜ ⎟ α ⎝ ⎠ ⎪⎩
Die Cost goes roughly with die area4 ECE7660 2005
−α
⎫⎪ ⎬ ⎪⎭