CS 104: Computer Organization and Design
Memory Hierarchy
[Adapted from A. Roth]

"… each of which has a greater capacity than the preceding but which is less quickly accessible."
Burks, Goldstine, and von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument"
Memory Hierarchy

[Figure: system overview: applications, system software, CPU, Mem, I/O]

• Basic concepts
• Technology background
• Organizing a single memory component
  • ABC
  • Write issues
  • Miss classification and optimization
• Organizing an entire memory hierarchy
• Virtual memory
  • Highly integrated into real hierarchies, but…
  • …won't talk about until later
Admin

• Midterm 2: graded
  • Min: 32%, Avg: 83.8%, Max: 104%
  • As: 37.9%, Bs: 34.5%, Cs: 15.5%, Ds: 3.4%, Fs: 8.6%
• Reading:
  • P+H: Chapter 5
• Homework:
  • Homework 5, coming soon…
  • Due Friday, April 13th
How Do We Build Insn/Data Memory?

[Figure: pipeline datapath: PC, IM, intRF, DM]

• Register file? Just a multi-ported SRAM
  • 32 32-bit registers → 1Kb = 128B
  • Multiple ports make it bigger and slower, but still OK
• Insn/data memory? Just a single-ported SRAM?
  • Uh, umm… it's 2^32 B = 4GB!!!!
    – It would be huge, expensive, and pointlessly slow
    – And we can't build something that big on-chip anyway
  • Good news: most ISAs are now 64-bit → memory is 2^64 B = 16EB
So What Do We Do? Actually…

[Figure: pipeline datapath: PC, IM, intRF, DM, with small (16KB/64KB) primary SRAMs]

• "Primary" insn/data memories are single-ported SRAMs…
  • "primary" = "in the pipeline"
• Key 1: they contain only a dynamic subset of "memory"
  • Subset is small enough to fit in a reasonable SRAM
• Key 2: missing chunks are fetched on demand (transparent to the program)
  • From somewhere else… (next slide)
• Program has the illusion that all 4GB (16EB) of memory is physically there
  • Just like it has the illusion that all insns execute atomically
But…

[Figure: pipeline datapath: PC, IM, intRF, DM (16KB/64KB), backed by 4GB (16EB)?]

• If requested insn/data is not found in primary memory
  • Doesn't the place it comes from have to be a 4GB (16EB) SRAM?
  • And won't it be huge, expensive, and slow? And can we build it?
Memory Overview

[Figure: memory component M with address and data buses]

• Functionality
  • "Like a big array…" (sketch below)
  • N-bit address bus (on an N-bit machine)
  • Data bus: typically read/write on the same bus
  • Can have multiple ports: address/data bus pairs
• Access time:
  • Access latency ~ #bits * #ports^2
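To make the "big array" abstraction concrete, here is a minimal C sketch of a single-ported memory component: an N-bit address selects a word, and reads and writes share one address/data port. The 16-bit address width and 32-bit word size are illustrative choices, not values from the slide.

    #include <stdint.h>
    #include <stdio.h>

    #define ADDR_BITS 16                      /* illustrative N; the slide leaves N abstract */
    #define SIZE      (1u << ADDR_BITS)       /* 2^N words */

    static uint32_t mem[SIZE];                /* "like a big array..." */

    /* One port = one address/data bus pair; read and write share it. */
    uint32_t mem_read(uint16_t addr)              { return mem[addr]; }
    void     mem_write(uint16_t addr, uint32_t d) { mem[addr] = d;    }

    int main(void) {
        mem_write(0x1234, 42);
        printf("mem[0x1234] = %u\n", (unsigned)mem_read(0x1234));
        return 0;
    }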
Memory Performance Equation

[Figure: component M, labeled with thit, %miss, and tmiss]

• For a memory component M:
  • Access: read or write to M
  • Hit: desired data found in M
  • Miss: desired data not found in M
    • Must get it from another component
    • No notion of "miss" in the register file
  • Fill: action of placing data in M
  • %miss (miss rate): #misses / #accesses
  • thit: time to read data from (write data to) M
  • tmiss: time to read data into M
• Performance metric: average access time (worked sketch below)
  tavg = thit + %miss * tmiss
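A quick numeric sketch of the equation. The hit time, miss rate, and miss time below are hypothetical values chosen for illustration, not numbers from the slide.

    #include <stdio.h>

    /* tavg = thit + %miss * tmiss */
    double tavg(double thit_ns, double miss_rate, double tmiss_ns) {
        return thit_ns + miss_rate * tmiss_ns;
    }

    int main(void) {
        /* hypothetical component: 2ns hit time, 5% miss rate, 100ns miss time */
        printf("tavg = %.1f ns\n", tavg(2.0, 0.05, 100.0));  /* 2 + 0.05*100 = 7.0 ns */
        return 0;
    }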
Memory Hierarchy

tavg = thit + %miss * tmiss

• Problem: hard to get low thit and low %miss in one structure
  • Large structures have low %miss but high thit
  • Small structures have low thit but high %miss
• Solution: use a hierarchy of memory structures
  • Known from the very beginning:

  "Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible."
  Burks, Goldstine, and von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument", IAS memo, 1946
Abstract Memory Hierarchy

[Figure: pipeline connected to memory levels M1, M2, M3, M4, …, M]

• Hierarchy of memory components
  • Upper levels: small → low thit, high %miss
  • Going down: larger → higher thit, lower %miss
• Connected by buses
  • Ignore for the moment
• Make the average access time close to M1's (sketch below)
  • How?
  • Most frequently accessed data in M1
  • M1 + next most frequently accessed in M2, etc.
  • Automatically move data up and down the hierarchy
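One way to see why the average access time can stay close to M1's is to apply the tavg equation level by level: each level's tmiss is just the tavg of the level below it. The three-level hit times and miss rates below are made up for illustration.

    #include <stdio.h>

    /* tavg of level i = thit_i + miss_i * (tavg of level i+1); the bottom level always hits. */
    double hierarchy_tavg(const double thit[], const double miss[], int levels) {
        double t = thit[levels - 1];              /* bottom level: no further misses */
        for (int i = levels - 2; i >= 0; i--)
            t = thit[i] + miss[i] * t;
        return t;
    }

    int main(void) {
        /* hypothetical M1/M2/M3: small and fast down to big and slow */
        double thit[] = {2.0, 10.0, 100.0};       /* ns */
        double miss[] = {0.05, 0.10};             /* %miss of M1 and M2 */
        printf("tavg = %.2f ns\n", hierarchy_tavg(thit, miss, 3));
        /* 2 + 0.05*(10 + 0.10*100) = 3.0 ns, close to M1's 2ns hit time */
        return 0;
    }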
Why Memory Hierarchy Works

• 10/90 rule (of thumb)
  • 10% of static insns/data account for 90% of accessed insns/data
  • Insns: inner loops
  • Data: frequently used globals, inner-loop stack variables
• Temporal locality
  • Recently accessed insns/data likely to be accessed again soon
  • Insns: inner loops (next iteration)
  • Data: inner-loop local variables, globals
  • Hierarchy can be "reactive": move things up when accessed
• Spatial locality (both kinds illustrated in the sketch below)
  • Insns/data near recently accessed insns/data likely accessed soon
  • Insns: sequential execution
  • Data: elements in an array, fields in a struct, variables in a stack frame
  • Hierarchy is "proactive": move things up speculatively
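A small, hypothetical loop that shows both kinds of locality at once: the sequential sweep over a[] is spatial locality, while sum and i, re-touched every iteration, are temporal locality. The array size is arbitrary.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];                 /* consecutive elements share cache blocks */
        long sum = 0;

        for (int i = 0; i < N; i++)      /* i and sum: temporal locality (touched every iteration) */
            sum += a[i];                 /* a[0], a[1], ...: spatial locality (sequential accesses) */

        printf("sum = %ld\n", sum);
        return 0;
    }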
Exploiting Heterogeneous Technologies

[Figure: pipeline backed by SRAM, SRAM, SRAM?, DRAM, and disk levels]

• Apparent problem
  – Lower-level components must be huge
  – Huge SRAMs are difficult to build and expensive
• Solution: don't use SRAM for the lower levels
  • Cheaper, denser storage technologies
  • Will be slower than SRAM, but that's OK
    • Won't be accessed very frequently
    • We have no choice anyway
  • Upper levels: SRAM → expensive (/B), fast
  • Going down: DRAM, disk → cheaper (/B), slower
Memory Technology Overview

• Latency
  • SRAM: …

… 2MB cache
• Design the cache hierarchy all at once
  • Intel: small I$/D$ → large L2/L3, preferably on-chip
  • AMD/IBM: large I$/D$ → small (or off-chip, or no) L2/L3
Cache Hierarchy Examples

[Figure: Intel 486 with a unified 8KB I/D$; IBM Power 5 with 64KB I$, 64KB D$, 1.5MB L2, and L3 tags]

• Intel 486: unified 8KB I/D$, no L2
• IBM Power 5: private 64KB I$/D$, shared 1.5MB L2, L3 tags
  • Dual core: L2 and L3 tags shared between both cores
Designing a Complete Memory Hierarchy

• SRAM vs. embedded DRAM…
  • Good segue to main memory
• How is that component designed?
  • What is DRAM?
  • What makes some DRAM "embedded"?
Brief History of DRAM

• DRAM (memory): a major force behind the computer industry
  • Modern DRAM came with the introduction of the IC (1970)
• Preceded by magnetic "core" memory (1950s)
  • More closely resembles today's disks than today's memory
• And by mercury delay lines before that (ENIAC)
  • Re-circulating vibrations in mercury tubes

"the one single development that put computers on their feet was the invention of a reliable form of memory, namely the core memory… Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large"
Maurice Wilkes, Memoirs of a Computer Pioneer, 1985
SRAM

[Figure: 6T SRAM cells: address (wordline), bitline pairs data1/~data1 and data0/~data0, cell nodes B/~B]

• SRAM: "6T" cells
  • 6 transistors per bit
    • 4 for the cross-coupled inverters (CCI)
    • 2 access transistors
• Static
  • CCIs hold state
• To read
  • Equalize, swing, amplify
• To write
  • Overwhelm
DRAM

[Figure: 1T DRAM cells: address (wordline), bitlines data1 and data0]

• DRAM: dynamic RAM
  • Bits as capacitors
  • Transistors as ports
  • "1T" cells: one access transistor per bit
• "Dynamic" means
  • Capacitors not connected to pwr/gnd
  • Stored charge decays over time
  • Must be explicitly refreshed
• Designed for density
  + ~6–8X denser than SRAM
  – But slower too
DRAM Operation I

[Figure: DRAM array with address, write, and data signals]

• Read: similar to a cache read
  • Phase I: pre-charge bitlines to 0.5V
  • Phase II: decode address, enable wordline
    • Capacitor swings bitline voltage up (down)
    • Sense amplifier interprets the swing as 1 (0)
  – Destructive read: the word's bits are now discharged
• Write: similar to a cache write
  • Phase I: decode address, enable wordline
  • Phase II: enable bitlines
    • High bitlines charge the corresponding capacitors
  – What about leakage over time?
DRAM Operation II

[Figure: DRAM array with a row buffer of D-latches (DL) between the array and the data pins]

• Solution: add a set of D-latches (row buffer)
• Read: two steps (behavioral sketch below)
  • Step I: read selected word into row buffer
  • Step IIA: read row buffer out to pins
  • Step IIB: write row buffer back to selected word
  + Solves the "destructive read" problem
• Write: two steps
  • Step IA: read selected word into row buffer
  • Step IB: write data into row buffer
  • Step II: write row buffer back to selected word
  + Also solves the leakage problem
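A behavioral sketch of the two-step protocol on this slide: every read or write first pulls the selected row into a row buffer (handling the destructive read), and finishes by writing the buffer back (recharging the capacitors). The array and row sizes are made up; this models behavior, not timing.

    #include <stdint.h>
    #include <string.h>

    #define ROWS      4096
    #define ROW_BYTES 512

    static uint8_t dram[ROWS][ROW_BYTES];        /* the capacitor array */
    static uint8_t row_buffer[ROW_BYTES];        /* the added D-latches */

    /* First step of every access: destructive read of the row into the buffer. */
    static void open_row(int row)  { memcpy(row_buffer, dram[row], ROW_BYTES); }
    /* Final step: write the buffer back, restoring the row's charge. */
    static void close_row(int row) { memcpy(dram[row], row_buffer, ROW_BYTES); }

    uint8_t dram_read(int row, int col) {
        open_row(row);                           /* Step I   */
        uint8_t d = row_buffer[col];             /* Step IIA: buffer out to pins      */
        close_row(row);                          /* Step IIB: buffer back to the word */
        return d;
    }

    void dram_write(int row, int col, uint8_t d) {
        open_row(row);                           /* Step IA  */
        row_buffer[col] = d;                     /* Step IB  */
        close_row(row);                          /* Step II  */
    }

    int main(void) { dram_write(7, 3, 0xAB); return dram_read(7, 3) == 0xAB ? 0 : 1; }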
DRAM Refresh

[Figure: DRAM array and row buffer, reused for refresh]

• DRAM periodically refreshes all contents (sketch below)
  • Loops through all words
    • Reads each word into the row buffer
    • Writes the row buffer back into the DRAM array
  • 1–2% of DRAM time occupied by refresh
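Refresh is just the same two-step row access applied to every row in turn. A minimal sketch, reusing open_row/close_row and ROWS from the behavioral model above; the refresh-interval bookkeeping (and the 1–2% overhead) is not modeled.

    /* Periodically walk every row: read it into the row buffer and write it back,
       recharging the capacitors before their stored charge decays. */
    void dram_refresh_all(void) {
        for (int row = 0; row < ROWS; row++) {
            open_row(row);      /* read row into row buffer    */
            close_row(row);     /* write row buffer back again */
        }
    }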
Access Time and Cycle Time

• DRAM access is slower than SRAM
  • Not electrically optimized for speed, buffered access
  • SRAM access latency: 1–3ns
  • DRAM access latency: 30–50ns
• DRAM cycle time is also longer than its access time
  • Access time: latency, the time of a single access
  • Cycle time: bandwidth, the time between starts of consecutive accesses
  • SRAM: cycle time ≤ access time
    • Begin the second access as soon as the first access finishes
    • Cycle time < access time? SRAM can be pipelined
  • DRAM: cycle time = 2 * access time
    • Why? Can't begin a new access while DRAM is refreshing the row
  (quick latency-vs-throughput sketch below)
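The distinction matters because latency and throughput come from different numbers: access time bounds how long one access takes, cycle time bounds how many accesses per second. The 2ns and 40ns points below are picked from within the slide's 1–3ns and 30–50ns ranges, with DRAM cycle time = 2 * access time as stated above.

    #include <stdio.h>

    int main(void) {
        double sram_access = 2.0,  sram_cycle = 2.0;     /* ns; cycle <= access for SRAM    */
        double dram_access = 40.0, dram_cycle = 80.0;    /* ns; cycle = 2 * access for DRAM */

        printf("SRAM: %.0f ns latency, %.0f M accesses/s\n",
               sram_access, 1000.0 / sram_cycle);        /* 500 M accesses/s  */
        printf("DRAM: %.0f ns latency, %.1f M accesses/s\n",
               dram_access, 1000.0 / dram_cycle);        /* 12.5 M accesses/s */
        return 0;
    }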
DRAM Organization

[Figure: row-address decoder, DRAM bit array, row buffer, and a wide ("a bunch of data") output]

• Large capacity: e.g., 64–256Mb
• Arranged as a square
  + Minimizes wire length
  + Maximizes refresh efficiency
• Embedded (on-chip) DRAM
  • That's it
  + Huge data bandwidth
Commodity DRAM

[Figure: address bus with RAS/CAS strobes; bits [23:12] feed a 12-to-4K row decoder, the 4K x 4K bit array (16Mbits = 2MBytes) reads into a row buffer, and bits [11:2] drive four 1K-to-1 muxes onto the data pins]

• Commodity (standalone) DRAM
  • Cheap packages → few pins
  • Narrow data interface: e.g., 8 bits
  • Narrow address interface: N/2 bits
• Two-level addressing (address-split sketch below)
  • Level 1: RAS high
    • Upper address bits on the address bus
    • Read row into the row buffer
  • Level 2: CAS high
    • Lower address bits on the address bus
    • Mux the row buffer onto the data bus
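Two-level addressing amounts to splitting one address into a row part (sent while RAS is asserted) and a column part (sent while CAS is asserted). A sketch using the bit fields shown in the figure for this 4K x 4K part ([23:12] for the row, [11:2] for the column); treating them as fields of a flat 24-bit word address is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    /* 4K x 4K bit array: bits [23:12] select the row (RAS phase),
     * bits [11:2] select one of 1K column groups (CAS phase). */
    typedef struct { uint16_t row; uint16_t col; } dram_addr;

    dram_addr split_address(uint32_t addr) {
        dram_addr a;
        a.row = (addr >> 12) & 0xFFF;   /* 12 bits -> one of 4096 rows    */
        a.col = (addr >> 2)  & 0x3FF;   /* 10 bits -> one of 1024 columns */
        return a;
    }

    int main(void) {
        dram_addr a = split_address(0x00ABC123);
        printf("row = 0x%03X, col = 0x%03X\n", a.row, a.col);
        return 0;
    }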
Moore's Law

  Year    Capacity    $/MB      Access time
  1980    64Kb        $1500     250ns
  1988    4Mb         $50       120ns
  1996    64Mb        $10       60ns
  2004    1Gb         $0.5      35ns

• (Commodity) DRAM capacity
  • 16X every 8 years is 2X every 2 years
  • Not quite 2X every 18 months, but still close
Commodity DRAM Components

[Figure: DIMM with a 64-bit data bus built from eight 1Gb x 1B DRAM chips (chips 0–7, 8 bits each)]

• DIMM: dual inline memory module
  • Array of DRAM chips on a printed circuit board with a 64-bit data bus
  • Example: 1GB DDR SDRAM DIMM (8 1Gb DRAM chips)
• SDRAM: synchronous DRAM
  • Read/write row-buffer chunks on the data-bus clock edge
  + No need to send consecutive column addresses
• DDR: double data rate, transfer data on both clock edges
  (peak-bandwidth sketch below)
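A quick sketch of the peak bandwidth such a DIMM implies: 8 bytes per transfer on the 64-bit bus, two transfers per bus clock for DDR. The 100 MHz bus clock is a hypothetical figure, not taken from this slide.

    #include <stdio.h>

    int main(void) {
        double bus_mhz   = 100.0;   /* hypothetical memory-bus clock     */
        double bytes     = 8.0;     /* 64-bit data bus = 8B per transfer */
        double transfers = 2.0;     /* DDR: data on both clock edges     */

        double peak_mb_s = bus_mhz * bytes * transfers;    /* MHz * B * 2 = MB/s */
        printf("peak DIMM bandwidth = %.0f MB/s\n", peak_mb_s);   /* 1600 MB/s */
        return 0;
    }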
Memory Bus

• Memory bus: connects the CPU package with main memory
  • Has its own clock
    • Typically slower than the CPU's internal clock: 100–500MHz vs. 3GHz
    • SDRAM operates on this clock
  • Is often itself internally pipelined
  • Clock implies bandwidth: 100MHz → start a new transfer every 10ns
  • Clock doesn't imply latency: 100MHz does not mean a transfer takes 10ns
  • Bandwidth is more important: determines peak performance
Effective Memory Latency Calculation

• CPU ↔ main memory interface: L2 miss blocks
  • What is tmiss-L2?
• Parameters
  • L2 with 128B blocks
  • DIMM with 20ns-access, 40ns-cycle SDRAMs
  • 200MHz (and 5ns latency) 64-bit data and address buses
• 5ns (address) + 20ns (DRAM access) + 16 * 5ns (bus) = 105ns (spelled out in the sketch below)
  • Roughly 300 cycles on a 3GHz processor
  • Where did we get 16 bus transfers? 128B / (8B / transfer)
  • Calculation assumes memory is striped across the DRAMs in 1B chunks

[Figure: block bytes B0–B15 striped 1B at a time across the eight 1Gb x 1B DRAM chips]
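The same tmiss-L2 calculation, spelled out step by step with the slide's parameters:

    #include <stdio.h>

    int main(void) {
        double bus_latency = 5.0;    /* ns: 200 MHz address/data buses    */
        double dram_access = 20.0;   /* ns: SDRAM access time             */
        double block_bytes = 128.0;  /* L2 block size                     */
        double bus_width   = 8.0;    /* 64-bit data bus = 8B per transfer */

        double transfers = block_bytes / bus_width;            /* 16 */
        double tmiss_l2  = bus_latency + dram_access + transfers * bus_latency;

        printf("tmiss-L2 = %.0f ns (~%.0f cycles at 3 GHz)\n",
               tmiss_l2, tmiss_l2 * 3.0);   /* 105 ns, 315 cycles; the slide rounds to ~300 */
        return 0;
    }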
Memory Latency and Bandwidth

• Nominal clock frequency applies to the CPU and caches
  • Careful when doing calculations
• Clock frequency increases don't reduce memory or bus latency
  • May make misses come out faster
• At some point memory bandwidth may become a bottleneck
  • Further increases in clock speed won't help at all
• You all saw this in hwk4 without knowing…
Memory/Clock Frequency Example

• Parameters
  • 1 GHz CPU, base CPI = 1
  • I$: 1% miss rate, 32B blocks (ignore D$, L2)
  • Data bus: 64-bit, 100MHz (10ns latency); ignore the address bus
  • DRAM: 10ns access, 20ns cycle
• What are CPI and MIPS including memory latency? (coded below)
  • Memory system cycle time = bus latency to transfer 32B = 40ns
  • Memory system access time = 50ns (10ns DRAM access + bus)
  • 1 GHz clock → 50ns = 50 cycles
  • CPI+memory = 1 + (0.01 * 50) = 1 + 0.5 = 1.5
  • MIPS+memory = 1 GHz / 1.5 CPI = 1000 MHz / 1.5 CPI = 667
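The same calculation in code, using the slide's parameters:

    #include <stdio.h>

    int main(void) {
        double clock_ghz = 1.0;
        double base_cpi  = 1.0;
        double miss_rate = 0.01;    /* I$ miss rate                        */
        double miss_ns   = 50.0;    /* 10ns DRAM access + 40ns bus for 32B */

        double miss_cycles = miss_ns * clock_ghz;              /* 50 cycles at 1 GHz */
        double cpi  = base_cpi + miss_rate * miss_cycles;      /* 1 + 0.5 = 1.5      */
        double mips = clock_ghz * 1000.0 / cpi;                /* 667                */

        printf("CPI = %.2f, MIPS = %.0f\n", cpi, mips);
        return 0;
    }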
Memory/Clock Frequency Example

• What are CPI and MIPS if the CPU clock frequency is doubled?
  • Memory parameters are the same: 50ns access, 40ns cycle
  • 2 GHz clock → 50ns = 100 cycles
  • CPI+memory = 1 + (0.01 * 100) = 1 + 1 = 2
  • MIPS+memory = 2 GHz / 2 CPI = 2000 MHz / 2 CPI = 1000
• What is the peak MIPS if we can only change the clock? (sketch below)
  • Available bandwidth: 32B / 40ns = 0.8 B/ns
  • Needed bandwidth: 0.01 * 32B/cycle = 0.32 B/cycle * X cycles/ns
  • Memory is a bottleneck at 0.8 / 0.32 cycles/ns = 2.5 GHz
    • No sustained speedup is possible after that point
  • 2.5 GHz clock → 50ns = 125 cycles
  • CPI+memory = 1 + (0.01 * 125) = 1 + 1.25 = 2.25
  • MIPS+memory = 2.5 GHz / 2.25 CPI = 2500 MHz / 2.25 CPI = 1111
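And the bandwidth-limit part of the example: the bus can supply 32B every 40ns, while each cycle generates 0.01 * 32B of miss traffic on average, so the clock frequency at which memory saturates falls out directly.

    #include <stdio.h>

    int main(void) {
        double avail_b_per_ns = 32.0 / 40.0;              /* 0.8 B/ns from the bus  */
        double need_b_per_cyc = 0.01 * 32.0;              /* 0.32 B/cycle of misses */

        double max_ghz = avail_b_per_ns / need_b_per_cyc; /* 2.5 GHz: memory saturates */
        double miss_cycles = 50.0 * max_ghz;              /* 50ns miss = 125 cycles    */
        double cpi  = 1.0 + 0.01 * miss_cycles;           /* 2.25                      */
        double mips = max_ghz * 1000.0 / cpi;             /* ~1111                     */

        printf("bottleneck clock = %.1f GHz, CPI = %.2f, peak MIPS = %.0f\n",
               max_ghz, cpi, mips);
        return 0;
    }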
Why Is Bandwidth So Important?

• What if you had to do some large calculation?
  • Or some large streaming-media calculation?
  • Basically, the data set does not fit in L2
  • All accesses would miss and chew up memory bandwidth
• You could easily prefetch and avoid the misses…
  • …but prefetching would chew up the same memory bandwidth
  • Prefetching reduces latency, not bandwidth!
• A 100 MHz, 8B bus has a bandwidth of 800MB/s (calculation below)
  • Scientific calculation: 1 8B load + 1 8B store / element
    • Peak processing rate: 50M elements/s (puny)
  • Media stream: 1 4B load + 1 4B store / pixel
    • Peak processing rate: 100M pixels/s
    • Just enough for SVGA at 60 frames/s, not more
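The peak processing rates on this slide follow directly from dividing bus bandwidth by the bytes moved per element or per pixel:

    #include <stdio.h>

    int main(void) {
        double bus_mb_s = 800.0;                       /* 100 MHz x 8B bus               */

        double sci_bytes   = 8.0 + 8.0;                /* 8B load + 8B store per element */
        double media_bytes = 4.0 + 4.0;                /* 4B load + 4B store per pixel   */

        printf("scientific: %.0f M elements/s\n", bus_mb_s / sci_bytes);   /* 50  */
        printf("media:      %.0f M pixels/s\n",   bus_mb_s / media_bytes); /* 100 */
        return 0;
    }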
Increasing Memory Bandwidth

• Memory bandwidth can already bottleneck a single CPU
  • What about multi-core?
• How to increase memory bandwidth?
  • Higher-frequency memory bus, DDR
    • Processors are doing this
  • Wider processor-memory interface
    • Multiple DIMMs
    – Can get expensive: only want bandwidth, pay for capacity too
  • Multiple on-chip memory "channels"
    • Processors are doing this too
Main Memory As A Cache

  Parameter       I$/D$      L2           L3         Main Memory
  thit            2ns        10ns         30ns       100ns
  tmiss           10ns       30ns         100ns      10ms (10M ns)
  Capacity        8–64KB     128KB–2MB    1–9MB      64MB–64GB
  Block size      16–32B     32–256B      256B       4KB+
  Associativity   1–4        4–16         16         full
  Replacement     NMRU       NMRU         NMRU       "working set"
  Prefetching?    Maybe      Probably     Probably   Either

• How would you internally organize main memory?
  • tmiss is outrageously long, reduce %miss at all costs
  • Full associativity: isn't that difficult to implement?
  • Yes… in hardware; main memory is "software-managed"
Summary

[Figure: system overview: applications, system software, CPU, Mem, I/O]

• tavg = thit + %miss * tmiss
  • Low thit and low %miss in one component? Difficult
• Memory hierarchy
  • Capacity: smaller, low thit → bigger, low %miss
  • 10/90 rule, temporal/spatial locality
  • Technology: expensive → cheaper
    • SRAM → DRAM → Disk: reasonable total cost
• Organizing a memory component
  • ABC, write policies
  • 3C miss model: how to eliminate misses?
• What about bandwidth?