ECE7660/CSC7100 Advanced/Parallel Computer Architecture ...

ECE7660/CSC7100 Advanced/Parallel Computer Architecture

Fundamentals of Computer Design

ECE7660 2005

CA = ISA + Organization

Layers of a computer system, from top to bottom:
  Applications
  Operating System
  Compiler, Firmware                        (software)
  Instruction Set Architecture              (instruction set)
  Instruction Set Processor, I/O System
  Datapath & Control
  Digital Design
  Circuit Design
  Layout

▪ The ISA abstracts the hardware/software interface and allows many implementations of varying cost and performance to run the same software.

Recap: CA Topics
▪ Input/Output and Storage: Disks, WORM, Tape; RAID
▪ Memory Hierarchy: DRAM; L2 Cache; L1 Cache; Coherence, Bandwidth, Latency; Emerging Technologies, Interleaving, Bus protocols
▪ VLSI
▪ Instruction Set Architecture: Addressing, Protection, Exception Handling
▪ Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
▪ Network Communication, Other Processors

Recap: Parallel CA Topics
[Figure: processor-memory (P-M) nodes, each attached through a switch (S) to an interconnection network]
▪ Multiprocessors: Shared Memory, Message Passing, Data Parallelism
▪ Processor-Memory-Switch: Network Interfaces
▪ Networks and Interconnections: Topologies, Routing, Bandwidth, Latency, Reliability

Today's Lecture
▪ Performance of a Computer
  • Performance metrics: time, power, etc.
  • Performance measurement
  • Principles of performance-oriented computer design
▪ Cost factor for price-performance optimization
  • Price is what you sell a finished good for
  • Cost is the amount you spent to produce it, including overhead
  • What factors determine the cost?

Limiting Factors of Technology
▪ Moore's Law says that the number of transistors on a die increases steadily over the years
▪ Why can't we manufacture a "huge" die to accommodate more transistors?
  • Small is fast
  • Small is cheap

Costs of Integrated Circuits

$\text{Die cost} = \dfrac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$

First approximation: $\text{Dies per wafer} = \dfrac{\text{Wafer area}}{\text{Die area}}$. An 8'' wafer contains 563 R20K uP (0.18u).

Ex: Find the number of dies per 30cm wafer for a die that is 0.7cm on a side.

$\text{Dies per wafer} = \dfrac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \dfrac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}} = 1347$
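The dies-per-wafer formula can be checked numerically. The following is a minimal Python sketch; the function and parameter names are illustrative and do not come from the slides, and truncating to whole dies is an assumption about how the slide arrived at 1347.

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> int:
    """Wafer area over die area, minus an edge-loss term for partial dies."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    # Assumption: only whole dies count, so truncate the result.
    return int(wafer_area / die_area_cm2 - edge_loss)

# 30cm wafer, die 0.7cm on a side (area 0.49 cm^2)
print(dies_per_wafer(30, 0.49))  # 1347
```

The subtracted term approximates the dies lost along the circular edge of the wafer, which is why the count is lower than the simple area ratio.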

Costs of ICs: Die Yield
▪ Die yield: the fraction of the dies on a wafer that are good
  • Wafer yield ~ 100%
  • Defects per unit area: 0.4 ~ 0.8 per cm^2
  • For today's multilevel metal CMOS processes, alpha ~ 4.0

$\text{Die yield} = \text{Wafer yield} \times \left(1 + \dfrac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}$

Die cost goes roughly with the cube of the die area.

Cost of ICs: Example
▪ Find the die yield for dies that are 1cm on a side and 0.7cm on a side, assuming a defect density of 0.6 per cm^2.
  • For the large die, the yield = (1 + 0.6 x 1/4.0)^-4 = 0.57
  • For the small die, the yield = (1 + 0.6 x 0.49/4.0)^-4 = 0.75
  • A 30cm wafer gives 366 good 1cm dies or 1014 good 0.7cm dies

Most of today's 32-bit and 64-bit uP in a 0.25u process fall in between these two sizes.
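The yield figures in this example can be reproduced with a short Python sketch. The function name is illustrative; the dies-per-wafer counts (640 and 1347) are taken as inputs, and the 640 figure for the 1cm die is derived here from the dies-per-wafer formula rather than stated on the slide.

```python
def die_yield(defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 4.0, wafer_yield: float = 1.0) -> float:
    """Die yield = wafer yield x (1 + defect density x die area / alpha)^(-alpha)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

large = die_yield(0.6, 1.0)    # 1cm x 1cm die
small = die_yield(0.6, 0.49)   # 0.7cm x 0.7cm die
print(round(large, 2), round(small, 2))

# Good dies on a 30cm wafer: dies per wafer times die yield.
print(round(640 * large), round(1347 * small))
```

Halving the die area raises the yield and also fits more dies on the wafer, so the number of good dies grows faster than the area ratio alone would suggest.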

Real World Examples (Old)

Chip          Metal    Line width   Wafer    Defects   Area     Dies/    Yield   Die
              layers   (um)         cost     /cm2      (mm2)    wafer            cost
386DX         2        0.90         $900     1.0       43       360      71%     $4
486DX2        3        0.80         $1200    1.0       81       181      54%     $12
PowerPC 601   4        0.80         $1700    1.3       121      115      28%     $53
HP PA 7100    3        0.80         $1300    1.0       196      66       27%     $73
DEC Alpha     3        0.70         $1500    1.2       234      53       19%     $149
SuperSPARC    3        0.70         $1700    1.6       256      48       13%     $272
Pentium       3        0.80         $1500    1.5       296      40       9%      $417

From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

Updated Cost of ICs
▪ A 30cm wafer with 4 to 6 metal layers costs $5K~$6K. Assume the wafer costs $5500:
  • A 0.7cm die costs $5.42
  • A 1cm die costs $15.03
▪ Implications:
  • Die size is the sole factor under the control of computer designers
  • Since alpha in the die yield formula is around 4, the cost of a die would grow with the 4th power of the die size
  • In practice, because the number of defects per unit area is small, the cost per die grows roughly as the square of the die area
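The two die costs follow from dividing the wafer cost by the number of good dies per wafer. A minimal Python sketch, assuming the good-die counts from the earlier 30cm wafer example (1014 good 0.7cm dies, 366 good 1cm dies); the function name is illustrative.

```python
def die_cost(wafer_cost: float, good_dies_per_wafer: int) -> float:
    """Die cost = wafer cost / (dies per wafer x die yield) = wafer cost / good dies."""
    return wafer_cost / good_dies_per_wafer

print(round(die_cost(5500, 1014), 2))  # 0.7cm die
print(round(die_cost(5500, 366), 2))   # 1cm die
```

Doubling the die area (0.49 to 1.0 cm^2) nearly triples the die cost in this example, which illustrates why cost grows faster than linearly with area.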

Other Costs

$\text{IC cost} = \dfrac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$

Packaging cost depends on pins and heat dissipation.

Chip          Die cost   Package pins   Package type   Package cost   Test & Assembly   Total
386DX         $4         132            QFP            $1             $4                $9
486DX2        $12        168            PGA            $11            $12               $35
PowerPC 601   $53        304            QFP            $3             $21               $77
HP PA 7100    $73        504            PGA            $35            $16               $124
DEC Alpha     $149       431            PGA            $30            $23               $202
SuperSPARC    $272       293            PGA            $20            $34               $326
Pentium       $417       273            PGA            $19            $37               $473

Learning Curve in Pricing Structure
▪ Understanding the cost and pricing structure of the industry and market is key to cost-sensitive design of computers.
▪ The learning curve: manufacturing costs decrease over time, best measured by the change in yield.
  • DRAM chip: cost drops by 40% per year
  • Per MB: $5000 in 1977 to $0.08 in 2001

[Figure: Cost of Pentium III]

Price for a Computer System
▪ Distribution of cost in a system
  • Processor/memory board?
  • I/O devices?
  • Cabinet, power, cables, etc.
▪ Cost vs. price

Today's Lecture
▪ Performance of a Computer
  • Performance metrics: time, power, etc.
  • Performance measurement
  • Performance-guided computer design
▪ Cost and Price
  • Price is what you sell a finished good for
  • Cost is the amount you spent to produce it, including overhead
  • What factors determine the cost?

The Bottom Line: Performance

Airplane           DC to Paris   Range   Speed (m.p.h.)   Passengers   Throughput (p x m.p.h.)   Cost
Boeing 747         6.5 hours     4150    610              470          286,700                   ???
BAC/Sud Concorde   3 hours       4000    1350             132          178,200                   ???
Douglas DC-8       7.3 hours     8720    544              146          79,424                    ???

▪ Different measurements lead to different results
▪ Time to do the task (execution time): execution time, response time, latency
▪ Tasks per day, hour, week, sec, ns, ... (performance): throughput, bandwidth
▪ Cost renders the measurement more complex
▪ The bottom-line performance measurement is CPU execution time.

Execution Time: A Performance Metric
▪ Execution time of a program
  • Wall-clock time, response time, or elapsed time: the latency to complete a task, including disk accesses, memory accesses, I/O activities, and OS overhead
  • CPU time: the time the CPU spends computing, not including time waiting for I/O or running other programs
    - User CPU time vs. system CPU time

"X is n times faster than Y" means

$\dfrac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)} = n$

Benchmark Performance
▪ How we evaluate differences
  • Different systems
  • Changes to a single system
▪ Provide a target
  • Benchmarks should represent a large class of important programs
  • Improving benchmark performance should help many programs
▪ For better or worse, benchmarks shape a field
▪ Good ones accelerate progress
  • good target for development
▪ Bad benchmarks hurt progress
  • help real programs vs. sell machines/papers?
  • Inventions that help real programs don't help the benchmark

Programs to Evaluate Performance
▪ (Toy) Benchmarks
  • Small but easy to compile and run on simulators, convenient in the early design stage of a new machine
  • 10-100 lines
  • e.g., sieve, puzzle, quicksort
▪ Synthetic Benchmarks
  • attempt to match average frequencies of real workloads
  • e.g., Whetstone (Algol 60 -> Fortran), Dhrystone
▪ Kernels
  • Time-critical excerpts of real programs
  • e.g., Livermore loops (21 loops), Linpack (linear algebra)
  • Popular in scientific computing
  • Isolate the performance of individual features of a machine
▪ Real programs
  • e.g., gcc, spice

Benchmark Suites
▪ 1987: RISC industry mired in "bench marketing" ("That is an 8 MIPS machine, but they claim 10 MIPS!")
▪ SPEC: System Performance Evaluation Cooperative (now the Standard Performance Evaluation Corporation)
  • Standardized benchmark application suites established by the computer industry in 1988. http://www.spec.org
  • Creates a standard list of programs, inputs, and reporting rules: some real programs, including OS calls and some I/O
  • SPEC benchmarks for different application classes

SPEC First Round
▪ First round 1989; 10 programs = 4 for integer + 6 for FP, single number to summarize performance
▪ One program (matrix300): 99% of time in a single line of code
▪ A new front-end compiler could improve its result dramatically

[Figure: bar chart of SPEC results per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv); vertical axis from 0 to 800]

SPEC Evolution
▪ Second round: SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
  • Compiler flags unlimited. March 93 flags for the DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy (b,a,c)"
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
▪ Add SPECbase: don't allow program-specific optimization flags.
▪ Third round: 1995; new set of programs
  • "benchmarks useful for 3 years"
▪ Fourth round: SPEC2000

Benchmark Suites
▪ Desktop benchmark suites:
  • CPU-intensive benchmarks: SPEC89 -> SPEC92 -> SPEC95 -> SPEC2000 (11 int CINT2000 & 14 fp CFP2000): real programs modified for portability and to highlight CPU performance
  • Graphics-intensive benchmarks: SPECviewperf for systems supporting the OpenGL graphics library, SPECapc for applications with intensive use of graphics
▪ Server benchmark suites:
  • CPU-throughput benchmarks: SPEC CPU2000 -> SPECrate
  • I/O-intensive benchmarks: SPECSFS for file servers, SPECWeb for web servers
  • Transaction-processing (TP) benchmarks: TPC (Transaction Processing Performance Council, www.tpc.org): TPC-A (85) -> TPC-C (complex query) -> TPC-H (ad-hoc decision support) -> TPC-R (business decision support) -> TPC-W (web-oriented)
    - E.g., TPC-W is a Web-based transaction benchmark that simulates the activities of a business-oriented transactional Web server.
▪ Embedded benchmarks: EEMBC (pronounced "embassy") suites

How to Summarize Results?
Program 1: 1 sec on machine A, 10 sec on B, 20 sec on C
Program 2: 1000 sec on A, 100 sec on B, 20 sec on C
What are your conclusions?

▪ A is 10 times faster than B for Program 1.
▪ B is 10 times faster than A for Program 2.
▪ Total execution time: a consistent summary measure
  • B is 1001/110 = 9.1 times faster than A for P1 and P2
  • C is 1001/40 = 25 times faster than A for P1 and P2
  • C is 110/40 = 2.75 times faster than B for P1 and P2
▪ Workload: need to consider the percentage/frequency of each program in the total job.
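The total-execution-time comparison can be written out as a small Python sketch. The dictionary layout and function name are illustrative choices, not part of the slides.

```python
# Execution times in seconds for Program 1 and Program 2 on each machine.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}
totals = {machine: sum(ts) for machine, ts in times.items()}

def times_faster(fast: str, slow: str) -> float:
    """How many times faster `fast` is than `slow`, by total execution time."""
    return totals[slow] / totals[fast]

print(totals)                              # totals: A 1001, B 110, C 40
print(round(times_faster("B", "A"), 1))    # B vs A
print(round(times_faster("C", "B"), 2))    # C vs B
```

Summing the times implicitly assumes each program runs once; a different workload mix would call for weighting the times before summing.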

How to Summarize Performance
▪ Suppose n programs have execution times $t_i$, where $i = 1, 2, \ldots, n$.
▪ Suppose the workload is $(w_1, w_2, \ldots, w_n)$, where $\sum_{i=1}^{n} w_i = 1$.
▪ Arithmetic mean: $AM(t_i) = \frac{1}{n}\sum_{i=1}^{n} t_i = \frac{1}{n}(t_1 + t_2 + \cdots + t_n)$
▪ Weighted arithmetic mean: $WAM(t_i) = \sum_{i=1}^{n} t_i w_i = t_1 w_1 + t_2 w_2 + \cdots + t_n w_n$
▪ Geometric mean: $GM(t_i) = \left(\prod_{i=1}^{n} t_i\right)^{1/n} = \sqrt[n]{t_1 \times t_2 \times \cdots \times t_n} = t_1^{1/n} \times t_2^{1/n} \times \cdots \times t_n^{1/n}$
▪ Weighted geometric mean: $WGM(t_i) = \prod_{i=1}^{n} t_i^{w_i} = t_1^{w_1} \times t_2^{w_2} \times \cdots \times t_n^{w_n}$
▪ Harmonic mean: $HM(t_i) = \dfrac{n}{\sum_{i=1}^{n} \frac{1}{t_i}} = \dfrac{n}{\frac{1}{t_1} + \frac{1}{t_2} + \cdots + \frac{1}{t_n}}$
▪ Weighted harmonic mean: $WHM(t_i) = \dfrac{1}{\sum_{i=1}^{n} \frac{w_i}{t_i}} = \dfrac{1}{\frac{w_1}{t_1} + \frac{w_2}{t_2} + \cdots + \frac{w_n}{t_n}}$

Performance Measure
▪ Assume

                          Computers                  Weights
Program                   A         B        C       W(1)    W(2)    W(3)
P1                        1.0       10.0     20.0    0.50    0.909   0.999
P2                        1000.0    100.0    20.0    0.50    0.091   0.001
Arith. mean with W(1)     500.5     55.0     20.0
Arith. mean with W(2)     ?         ?        ?
Arith. mean with W(3)     ?         ?        ?

Normalized to machine A:
                          A         B        C
P1                        1.0       10.0     20.0
P2                        1.0       0.1      0.02
Arithmetic mean           1.0       5.05     10.01
Geometric mean            1.0       1.0      0.63    (consistent across different normalizations)
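The entries left as "?" in the weighted arithmetic mean rows can be worked out directly from the definition. A minimal Python sketch, with illustrative names; the weights W(2) and W(3) are the ones listed on the slide.

```python
def weighted_arithmetic_mean(times, weights):
    """WAM = sum of t_i * w_i, with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(t * w for t, w in zip(times, weights))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
weightings = {"W(1)": [0.50, 0.50], "W(2)": [0.909, 0.091], "W(3)": [0.999, 0.001]}

for name, w in weightings.items():
    row = {m: round(weighted_arithmetic_mean(t, w), 2) for m, t in times.items()}
    print(name, row)
```

Under W(1), C looks best; under W(2), B is best; under W(3), A is best. The ranking by weighted arithmetic mean therefore depends entirely on the assumed workload.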

Example

                                        Normalized to A     Normalized to B
                   Time on A  Time on B    A        B          A        B
program1           1          10           1        10         0.1      1
program2           1000       100          1        0.1        10       1
Arithmetic mean    500.5      55           1        5.05       5.05     1
Geometric mean     31.6       31.6         1        1          1        1

▪ The difficulty arises from the use of the arithmetic mean of normalized times.
▪ The geometric mean is independent of which data series we use for normalization because

$\dfrac{GM(x_i)}{GM(y_i)} = GM\!\left(\dfrac{x_i}{y_i}\right)$
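The normalization property of the geometric mean can be checked numerically on the two-program example. A minimal Python sketch with illustrative function names; `math.prod` requires Python 3.8 or later.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

time_a = [1, 1000]   # program1, program2 on machine A
time_b = [10, 100]   # program1, program2 on machine B

b_norm_to_a = [b / a for a, b in zip(time_a, time_b)]   # [10, 0.1]
a_norm_to_b = [a / b for a, b in zip(time_a, time_b)]   # [0.1, 10]

# Arithmetic mean of normalized times: each machine looks about 5x slower
# than whichever machine is chosen as the reference.
print(arithmetic_mean(b_norm_to_a), arithmetic_mean(a_norm_to_b))

# Geometric mean: the ratio of means equals the mean of ratios.
print(geometric_mean(time_b) / geometric_mean(time_a), geometric_mean(b_norm_to_a))
```

The arithmetic mean of normalized times contradicts itself depending on the reference machine, while the geometric mean gives 1 in both directions.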

How to Summarize Performance
▪ 3 means: arithmetic, geometric and harmonic. Which one is correct?
▪ Arithmetic mean (or weighted arithmetic mean) tracks execution time.
▪ Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time.
▪ Normalized execution time is handy for scaling performance. But,
  • do not take the arithmetic mean of normalized execution times; use the geometric mean
  • the geometric mean rewards all improvements equally: program A going from 2 seconds to 1 second is as important as program B going from 2000 seconds to 1000 seconds.

Quantitative Principles of Computer Design

▪ Amdahl's Law: make the common case fast
▪ CPU time equation

Amdahl's Law
Speedup due to enhancement E:

$\text{Speedup}(E) = \dfrac{\text{ExTime(without E)}}{\text{ExTime(with E)}} = \dfrac{\text{Performance(with E)}}{\text{Performance(without E)}}$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then

$\text{ExTime(with E)} = \left((1 - F) + \dfrac{F}{S}\right) \times \text{ExTime(without E)}$

$\text{Speedup}(E) = \dfrac{\text{ExTime(without E)}}{\left((1 - F) + \frac{F}{S}\right) \times \text{ExTime(without E)}} = \dfrac{1}{(1 - F) + \frac{F}{S}} \le \dfrac{1}{1 - F}$

Example: Suppose a person wants to travel from city A to city B via city C. The routes from A to C are in the mountains and the routes from C to B are in the desert. The distances from A to C and from C to B are 80 miles and 200 miles, respectively.
  From A to C: walk at a speed of 4 mph
  From C to B: walk, or drive at a speed of 100 mph
Questions:
  How long will the entire trip take?
  How much faster is the trip from A to B with a car than on foot?
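One way to work through the travel example is sketched below in Python. It assumes the walking speed in the desert is also 4 mph, which the example does not state explicitly; all variable names are illustrative.

```python
WALK_MPH = 4      # stated for A -> C; assumed to hold for C -> B as well
DRIVE_MPH = 100   # only usable on the desert leg C -> B

mountain_hours = 80 / WALK_MPH          # A -> C must be walked either way
desert_walk_hours = 200 / WALK_MPH
desert_drive_hours = 200 / DRIVE_MPH

walk_total = mountain_hours + desert_walk_hours     # all on foot
drive_total = mountain_hours + desert_drive_hours   # car used where possible
print(walk_total, drive_total, round(walk_total / drive_total, 2))

# Same result through Amdahl's Law: F is the fraction of the original time
# that the car can speed up, S is the speedup on that fraction.
F = desert_walk_hours / walk_total
S = DRIVE_MPH / WALK_MPH
amdahl_speedup = 1 / ((1 - F) + F / S)
```

Although the car is 25 times faster than walking, the overall trip is only about 3.2 times faster because the 20-hour mountain leg is unaffected.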

Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time.
Question: What is the overall speedup?
Answer:
  Fraction_enhanced = 0.4
  Speedup_enhanced = 10
  Speedup_overall = 1/(0.6 + 0.4/10) = 1.56
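Amdahl's Law is easy to express as a function. The sketch below uses illustrative names and reproduces the 40% / 10x example, along with the upper bound 1/(1 - F) reached as the enhancement becomes infinitely fast.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the time is sped up by a factor S."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
print(round(1 / (1 - 0.4), 2))             # bound for F = 0.4: 1.67
```

Even an arbitrarily large speedup on 40% of the execution time cannot push the overall speedup past about 1.67.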

Metrics of Performance

Level                                    Metric
Application                              Answers per month; operations per second
Programming Language, Compiler, ISA      (Millions) of instructions per second: MIPS
                                         (Millions) of (F.P.) operations per second: MFLOP/s
Datapath, Control                        Megabytes per second
Function Units, Transistors, Wires, Pins Cycles per second (clock rate)

Relating CPU Time Metrics
▪ CPU execution time = CPU clock cycles/pgm x clock cycle time
▪ or CPU execution time = CPU clock cycles/pgm ÷ clock rate
▪ CPU clock cycles/pgm = Instructions/pgm x avg. clock cycles per instr.
▪ or CPI = CPU clock cycles/pgm ÷ Instructions/pgm
▪ CPI tells us something about the instruction set architecture, the implementation of that architecture, and the program measured

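Aspects of CPU Performance

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

Which of the three factors does each level affect?

                    instr. count    CPI    clock rate
Program
Compiler
Instr. Set Arch.
Organization
Technology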

Aspects of CPU Performance

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

                    instr. count (IC)    CPI    clock rate
Program             X                    X
Compiler            X                    (X)
Instr. Set          X                    X
Organization                             X      X
Technology                                      X

Organizational Trade-offs

Level                                           Determines
Application, Programming Language, Compiler, ISA   Instruction Mix
Datapath, Control                                  CPI
Function Units, Transistors, Wires, Pins           Cycle Time

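CPI: How to Compute?
"Average cycles per instruction"

$CPI = \dfrac{\text{CPU time} \times \text{Clock rate}}{\text{Instruction count}} = \dfrac{\text{Clock cycles}}{\text{Instruction count}}$

$\text{CPU time} = \text{Clock cycle time} \times \sum_{i=1}^{n} CPI_i \times C_i$, where the sum is the number of CPU clock cycles, $CPI_i$ is the cycles per instruction of class i, and $C_i$ is the number of executed instructions of class i.

$CPI = \sum_{i=1}^{n} CPI_i \times F_i$, where $F_i = \dfrac{C_i}{\text{Instruction count}}$ is the "instruction frequency".

Invest resources where time is spent!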

Example
Our favorite program runs in 10 sec on machine A, which has a 400MHz clock. We are trying to design a machine B with a faster clock rate so as to reduce the execution time to 6 sec. The increase in clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should machine B have?

Answer:
CPU time A = CPU clock cycles A / clock rate A
==> CPU clock cycles A = 10 sec x 400 x 10^6 cycles/sec = 4000 x 10^6 cycles
Clock rate B = CPU clock cycles B / CPU time B = 1.2 x 4000 x 10^6 / 6 = 800 MHz
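The clock-rate calculation can be sketched in Python as follows; the variable names are illustrative.

```python
clock_rate_a = 400e6          # Hz
cpu_time_a = 10.0             # seconds on machine A
target_time_b = 6.0           # desired seconds on machine B
cycle_ratio = 1.2             # B needs 1.2x as many cycles as A

cycles_a = cpu_time_a * clock_rate_a        # CPU time = cycles / clock rate
cycles_b = cycle_ratio * cycles_a
clock_rate_b = cycles_b / target_time_b     # clock rate = cycles / CPU time

print(clock_rate_b / 1e6, "MHz")            # 800.0 MHz
```

Machine B must double the clock rate to gain a speedup of only 10/6, because 20% of the gain is lost to the extra cycles.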

Example
Base machine (Reg / Reg) and instruction frequencies in the execution of a program:

Op        Freq    Cycles
ALU       43%     1
Load      21%     2
Store     12%     2
Branch    24%     2

Question: What is the average CPI of the machine?
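The average CPI is the frequency-weighted sum of the per-class cycle counts. A minimal Python sketch using the table above; the data layout and function name are illustrative.

```python
# (instruction frequency, cycles) per instruction class, from the table
mix = {
    "ALU":    (0.43, 1),
    "Load":   (0.21, 2),
    "Store":  (0.12, 2),
    "Branch": (0.24, 2),
}

def average_cpi(mix):
    """CPI = sum over classes of CPI_i x F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())

print(round(average_cpi(mix), 2))  # 1.57
```

ALU instructions are 43% of the mix but contribute only 0.43 of the 1.57 cycles, so most of the time goes to the two-cycle classes.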

Example: Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster, and by how much?

Let I denote the number of instructions in the program.
CPU time A = I x 2.0 x 10 ns = 20 I ns
CPU time B = I x 1.2 x 20 ns = 24 I ns
Machine A is 24/20 = 1.2 times faster than B.

Example
An ISA has 3 kinds of instructions:

Instruction class    CPI for this instruction class
A                    1
B                    2
C                    3

One program has 2 code sequences:

                 Instruction counts for instruction class
Code sequence    A    B    C
1                2    1    2
2                4    1    1

Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence?
▪ S.1 has 5 instructions; S.2 has 6.
▪ S.1 needs 2x1 + 1x2 + 2x3 = 10 cycles; S.2 needs 4x1 + 1x2 + 1x3 = 9 cycles, so S.2 is faster.
▪ S.1 has CPI = 10/5 = 2; S.2 has CPI = 9/6 = 1.5.
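The comparison of the two code sequences can be reproduced with a short Python sketch; names are illustrative.

```python
class_cpi = {"A": 1, "B": 2, "C": 3}   # cycles per instruction, by class

sequences = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}

def summarize(counts):
    """Return (instruction count, total cycles, CPI) for one code sequence."""
    instructions = sum(counts.values())
    cycles = sum(class_cpi[c] * n for c, n in counts.items())
    return instructions, cycles, cycles / instructions

for seq, counts in sequences.items():
    print(seq, summarize(counts))
```

Sequence 2 executes more instructions yet takes fewer cycles, which shows that instruction count alone does not predict execution time.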

Marketing Metrics

MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)
  • Million Instructions Per Second
  • machines with different instruction sets?
  • programs with different instruction mixes?
  • dynamic frequency of instructions
  • uncorrelated with performance

MFLOP/S = FP Operations / (Time x 10^6)
  • Million Floating-point Operations Per Second
  • machine dependent
  • often not where time is spent

Example: Assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions.
1) What is CPI_opt?
2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate), what is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the ranking by execution time?
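A possible way to work this example is sketched below in Python. It assumes that "the load/store machine" is the base Reg/Reg machine from the earlier example (ALU 43% at 1 cycle; Load 21%, Store 12%, Branch 24% at 2 cycles each); that link, and all names in the code, are assumptions rather than something the slide states.

```python
CLOCK_MHZ = 50.0  # 20 ns cycle time

# (relative instruction count, cycles per instruction), per 1 original instruction
unoptimized = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}
optimized = dict(unoptimized, ALU=(0.43 / 2, 1))   # compiler removes half the ALU ops

def stats(mix):
    """Return (CPI, MIPS, relative cycle count) for an instruction mix."""
    count = sum(n for n, _ in mix.values())
    cycles = sum(n * c for n, c in mix.values())
    cpi = cycles / count
    return cpi, CLOCK_MHZ / cpi, cycles

cpi_u, mips_u, cycles_u = stats(unoptimized)
cpi_o, mips_o, cycles_o = stats(optimized)
print(round(cpi_u, 2), round(mips_u, 2), cycles_u)
print(round(cpi_o, 2), round(mips_o, 2), cycles_o)
```

Under these assumptions the optimized code needs fewer cycles and so runs faster, yet its CPI rises and its MIPS rating falls, so the MIPS rating disagrees with the ranking by execution time.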

Performance Evaluation Summary

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

▪ Time is the measure of computer performance!
▪ Good products are created when we have:
  • Good benchmarks
  • Good ways to summarize performance
▪ Without good benchmarks and a good summary, the choice is between improving the product for real programs and improving the product to get more sales; sales almost always wins.
▪ Remember Amdahl's Law: speedup is limited by the unimproved part of the program.

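Summary of IC Costs

$\text{IC cost} = \dfrac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$

$\text{Die cost} = \dfrac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$

$\text{Dies per wafer} = \dfrac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_area}} - \dfrac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \cdot \text{Die\_area}}} - \text{Test\_dies}$

$\text{Die yield} = \text{Wafer\_yield} \times \left\{1 + \dfrac{\text{Defect\_density} \times \text{Die\_area}}{\alpha}\right\}^{-\alpha}$

Die cost goes roughly with (die area)^4.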