CS 104: Computer Organization and Design
Memory Hierarchy
[Adapted from A. Roth]

"… each of which has a greater capacity than the preceding but which is less quickly accessible."
Burks, Goldstine, and von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument"
Memory Hierarchy

[Figure: system overview: applications, system software, CPU, Mem, I/O]

• Basic concepts
• Technology background
• Organizing a single memory component
  • ABC
  • Write issues
  • Miss classification and optimization
• Organizing an entire memory hierarchy
• Virtual memory
  • Highly integrated into real hierarchies, but…
  • …won't talk about until later
Admin

• Midterm 2: graded
  • Min: 32%, Avg: 83.8%, Max: 104%
  • As: 37.9%, Bs: 34.5%, Cs: 15.5%, Ds: 3.4%, Fs: 8.6%
• Reading:
  • P+H: Chapter 5
• Homework:
  • Homework 5, coming soon…
  • Due Friday, April 13th
How Do We Build Insn/Data Memory?

[Figure: pipeline datapath: PC, IM, intRF, DM]

• Register file? Just a multi-ported SRAM
  • 32 32-bit registers → 1Kb = 128B
  • Multiple ports make it bigger and slower, but still OK
• Insn/data memory? Just a single-ported SRAM?
  • Uh, umm… it's 2^32 B = 4GB!!!!
    – It would be huge, expensive, and pointlessly slow
    – And we can't build something that big on-chip anyway
  • Good news: most ISAs are now 64-bit → memory is 2^64 B = 16EB
So What Do We Do? Actually…

[Figure: pipeline datapath: PC, IM, intRF, DM, with small (16KB/64KB) primary SRAMs]

• "Primary" insn/data memories are single-ported SRAMs…
  • "primary" = "in the pipeline"
• Key 1: they contain only a dynamic subset of "memory"
  • Subset is small enough to fit in a reasonable SRAM
• Key 2: missing chunks are fetched on demand (transparent to the program)
  • From somewhere else… (next slide)
• Program has the illusion that all 4GB (16EB) of memory is physically there
  • Just like it has the illusion that all insns execute atomically
But…

[Figure: pipeline datapath: PC, IM, intRF, DM (16KB/64KB), backed by 4GB (16EB)?]

• If requested insn/data is not found in primary memory
  • Doesn't the place it comes from have to be a 4GB (16EB) SRAM?
  • And won't it be huge, expensive, and slow? And can we build it?
Memory Overview

[Figure: memory component M with address and data buses]

• Functionality
  • "Like a big array…" (sketch below)
  • N-bit address bus (on an N-bit machine)
  • Data bus: typically read/write on the same bus
  • Can have multiple ports: address/data bus pairs
• Access time:
  • Access latency ~ #bits * #ports^2
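To make the "big array" abstraction concrete, here is a minimal C sketch of a single-ported memory component: an N-bit address selects a word, and reads and writes share one address/data port. The 16-bit address width and 32-bit word size are illustrative choices, not values from the slide.

    #include <stdint.h>
    #include <stdio.h>

    #define ADDR_BITS 16                      /* illustrative N; the slide leaves N abstract */
    #define SIZE      (1u << ADDR_BITS)       /* 2^N words */

    static uint32_t mem[SIZE];                /* "like a big array..." */

    /* One port = one address/data bus pair; read and write share it. */
    uint32_t mem_read(uint16_t addr)              { return mem[addr]; }
    void     mem_write(uint16_t addr, uint32_t d) { mem[addr] = d;    }

    int main(void) {
        mem_write(0x1234, 42);
        printf("mem[0x1234] = %u\n", (unsigned)mem_read(0x1234));
        return 0;
    }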
Memory Performance Equation

[Figure: component M, labeled with thit, %miss, and tmiss]

• For a memory component M:
  • Access: read or write to M
  • Hit: desired data found in M
  • Miss: desired data not found in M
    • Must get it from another component
    • No notion of "miss" in the register file
  • Fill: action of placing data in M
  • %miss (miss rate): #misses / #accesses
  • thit: time to read data from (write data to) M
  • tmiss: time to read data into M
• Performance metric: average access time (worked sketch below)
  tavg = thit + %miss * tmiss
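A quick numeric sketch of the equation. The hit time, miss rate, and miss time below are hypothetical values chosen for illustration, not numbers from the slide.

    #include <stdio.h>

    /* tavg = thit + %miss * tmiss */
    double tavg(double thit_ns, double miss_rate, double tmiss_ns) {
        return thit_ns + miss_rate * tmiss_ns;
    }

    int main(void) {
        /* hypothetical component: 2ns hit time, 5% miss rate, 100ns miss time */
        printf("tavg = %.1f ns\n", tavg(2.0, 0.05, 100.0));  /* 2 + 0.05*100 = 7.0 ns */
        return 0;
    }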
Memory Hierarchy

tavg = thit + %miss * tmiss

• Problem: hard to get low thit and low %miss in one structure
  • Large structures have low %miss but high thit
  • Small structures have low thit but high %miss
• Solution: use a hierarchy of memory structures
  • Known from the very beginning:

  "Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible."
  Burks, Goldstine, and von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument", IAS memo, 1946
Abstract Memory Hierarchy

[Figure: pipeline connected to memory levels M1, M2, M3, M4, …, M]

• Hierarchy of memory components
  • Upper levels: small → low thit, high %miss
  • Going down: larger → higher thit, lower %miss
• Connected by buses
  • Ignore for the moment
• Make the average access time close to M1's (sketch below)
  • How?
  • Most frequently accessed data in M1
  • M1 + next most frequently accessed in M2, etc.
  • Automatically move data up and down the hierarchy
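One way to see why the average access time can stay close to M1's is to apply the tavg equation level by level: each level's tmiss is just the tavg of the level below it. The three-level hit times and miss rates below are made up for illustration.

    #include <stdio.h>

    /* tavg of level i = thit_i + miss_i * (tavg of level i+1); the bottom level always hits. */
    double hierarchy_tavg(const double thit[], const double miss[], int levels) {
        double t = thit[levels - 1];              /* bottom level: no further misses */
        for (int i = levels - 2; i >= 0; i--)
            t = thit[i] + miss[i] * t;
        return t;
    }

    int main(void) {
        /* hypothetical M1/M2/M3: small and fast down to big and slow */
        double thit[] = {2.0, 10.0, 100.0};       /* ns */
        double miss[] = {0.05, 0.10};             /* %miss of M1 and M2 */
        printf("tavg = %.2f ns\n", hierarchy_tavg(thit, miss, 3));
        /* 2 + 0.05*(10 + 0.10*100) = 3.0 ns, close to M1's 2ns hit time */
        return 0;
    }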
Why Memory Hierarchy Works

• 10/90 rule (of thumb)
  • 10% of static insns/data account for 90% of accessed insns/data
  • Insns: inner loops
  • Data: frequently used globals, inner-loop stack variables
• Temporal locality
  • Recently accessed insns/data likely to be accessed again soon
  • Insns: inner loops (next iteration)
  • Data: inner-loop local variables, globals
  • Hierarchy can be "reactive": move things up when accessed
• Spatial locality (both kinds illustrated in the sketch below)
  • Insns/data near recently accessed insns/data likely accessed soon
  • Insns: sequential execution
  • Data: elements in an array, fields in a struct, variables in a stack frame
  • Hierarchy is "proactive": move things up speculatively
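A small, hypothetical loop that shows both kinds of locality at once: the sequential sweep over a[] is spatial locality, while sum and i, re-touched every iteration, are temporal locality. The array size is arbitrary.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];                 /* consecutive elements share cache blocks */
        long sum = 0;

        for (int i = 0; i < N; i++)      /* i and sum: temporal locality (touched every iteration) */
            sum += a[i];                 /* a[0], a[1], ...: spatial locality (sequential accesses) */

        printf("sum = %ld\n", sum);
        return 0;
    }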
Exploiting Heterogeneous Technologies

[Figure: pipeline backed by SRAM, SRAM, SRAM?, DRAM, and disk levels]

• Apparent problem
  – Lower-level components must be huge
  – Huge SRAMs are difficult to build and expensive
• Solution: don't use SRAM for the lower levels
  • Cheaper, denser storage technologies
  • Will be slower than SRAM, but that's OK
    • Won't be accessed very frequently
    • We have no choice anyway
  • Upper levels: SRAM → expensive (/B), fast
  • Going down: DRAM, disk → cheaper (/B), slower
Memory Technology Overview

• Latency
  • SRAM: …

… 2MB cache
• Design the cache hierarchy all at once
  • Intel: small I$/D$ → large L2/L3, preferably on-chip
  • AMD/IBM: large I$/D$ → small (or off-chip, or no) L2/L3
Cache Hierarchy Examples

[Figure: Intel 486 with a unified 8KB I/D$; IBM Power 5 with 64KB I$, 64KB D$, 1.5MB L2, and L3 tags]

• Intel 486: unified 8KB I/D$, no L2
• IBM Power 5: private 64KB I$/D$, shared 1.5MB L2, L3 tags
  • Dual core: L2 and L3 tags shared between both cores
Designing a Complete Memory Hierarchy

• SRAM vs. embedded DRAM…
  • Good segue to main memory
• How is that component designed?
  • What is DRAM?
  • What makes some DRAM "embedded"?
Brief History of DRAM

• DRAM (memory): a major force behind the computer industry
  • Modern DRAM came with the introduction of the IC (1970)
• Preceded by magnetic "core" memory (1950s)
  • More closely resembles today's disks than today's memory
• And by mercury delay lines before that (ENIAC)
  • Re-circulating vibrations in mercury tubes

"the one single development that put computers on their feet was the invention of a reliable form of memory, namely the core memory… Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large"
Maurice Wilkes, Memoirs of a Computer Pioneer, 1985
SRAM

[Figure: 6T SRAM cells: address (wordline), bitline pairs data1/~data1 and data0/~data0, cell nodes B/~B]

• SRAM: "6T" cells
  • 6 transistors per bit
    • 4 for the cross-coupled inverters (CCI)
    • 2 access transistors
• Static
  • CCIs hold state
• To read
  • Equalize, swing, amplify
• To write
  • Overwhelm
DRAM

[Figure: 1T DRAM cells: address (wordline), bitlines data1 and data0]

• DRAM: dynamic RAM
  • Bits as capacitors
  • Transistors as ports
  • "1T" cells: one access transistor per bit
• "Dynamic" means
  • Capacitors not connected to pwr/gnd
  • Stored charge decays over time
  • Must be explicitly refreshed
• Designed for density
  + ~6–8X denser than SRAM
  – But slower too
DRAM Operation I

[Figure: DRAM array with address, write, and data signals]

• Read: similar to a cache read
  • Phase I: pre-charge bitlines to 0.5V
  • Phase II: decode address, enable wordline
    • Capacitor swings bitline voltage up (down)
    • Sense amplifier interprets the swing as 1 (0)
  – Destructive read: the word's bits are now discharged
• Write: similar to a cache write
  • Phase I: decode address, enable wordline
  • Phase II: enable bitlines
    • High bitlines charge the corresponding capacitors
  – What about leakage over time?
DRAM Operation II

[Figure: DRAM array with a row buffer of D-latches (DL) between the array and the data pins]

• Solution: add a set of D-latches (row buffer)
• Read: two steps (behavioral sketch below)
  • Step I: read selected word into row buffer
  • Step IIA: read row buffer out to pins
  • Step IIB: write row buffer back to selected word
  + Solves the "destructive read" problem
• Write: two steps
  • Step IA: read selected word into row buffer
  • Step IB: write data into row buffer
  • Step II: write row buffer back to selected word
  + Also solves the leakage problem
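A behavioral sketch of the two-step protocol on this slide: every read or write first pulls the selected row into a row buffer (handling the destructive read), and finishes by writing the buffer back (recharging the capacitors). The array and row sizes are made up; this models behavior, not timing.

    #include <stdint.h>
    #include <string.h>

    #define ROWS      4096
    #define ROW_BYTES 512

    static uint8_t dram[ROWS][ROW_BYTES];        /* the capacitor array */
    static uint8_t row_buffer[ROW_BYTES];        /* the added D-latches */

    /* First step of every access: destructive read of the row into the buffer. */
    static void open_row(int row)  { memcpy(row_buffer, dram[row], ROW_BYTES); }
    /* Final step: write the buffer back, restoring the row's charge. */
    static void close_row(int row) { memcpy(dram[row], row_buffer, ROW_BYTES); }

    uint8_t dram_read(int row, int col) {
        open_row(row);                           /* Step I   */
        uint8_t d = row_buffer[col];             /* Step IIA: buffer out to pins      */
        close_row(row);                          /* Step IIB: buffer back to the word */
        return d;
    }

    void dram_write(int row, int col, uint8_t d) {
        open_row(row);                           /* Step IA  */
        row_buffer[col] = d;                     /* Step IB  */
        close_row(row);                          /* Step II  */
    }

    int main(void) { dram_write(7, 3, 0xAB); return dram_read(7, 3) == 0xAB ? 0 : 1; }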
DRAM Refresh

[Figure: DRAM array and row buffer, reused for refresh]

• DRAM periodically refreshes all contents (sketch below)
  • Loops through all words
    • Reads each word into the row buffer
    • Writes the row buffer back into the DRAM array
  • 1–2% of DRAM time occupied by refresh
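Refresh is just the same two-step row access applied to every row in turn. A minimal sketch, reusing open_row/close_row and ROWS from the behavioral model above; the refresh-interval bookkeeping (and the 1–2% overhead) is not modeled.

    /* Periodically walk every row: read it into the row buffer and write it back,
       recharging the capacitors before their stored charge decays. */
    void dram_refresh_all(void) {
        for (int row = 0; row < ROWS; row++) {
            open_row(row);      /* read row into row buffer    */
            close_row(row);     /* write row buffer back again */
        }
    }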
Access Time and Cycle Time

• DRAM access is slower than SRAM
  • Not electrically optimized for speed, buffered access
  • SRAM access latency: 1–3ns
  • DRAM access latency: 30–50ns
• DRAM cycle time is also longer than its access time
  • Access time: latency, the time of a single access
  • Cycle time: bandwidth, the time between starts of consecutive accesses
  • SRAM: cycle time ≤ access time
    • Begin the second access as soon as the first access finishes
    • Cycle time < access time? SRAM can be pipelined
  • DRAM: cycle time = 2 * access time
    • Why? Can't begin a new access while DRAM is refreshing the row
  (quick latency-vs-throughput sketch below)
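The distinction matters because latency and throughput come from different numbers: access time bounds how long one access takes, cycle time bounds how many accesses per second. The 2ns and 40ns points below are picked from within the slide's 1–3ns and 30–50ns ranges, with DRAM cycle time = 2 * access time as stated above.

    #include <stdio.h>

    int main(void) {
        double sram_access = 2.0,  sram_cycle = 2.0;     /* ns; cycle <= access for SRAM    */
        double dram_access = 40.0, dram_cycle = 80.0;    /* ns; cycle = 2 * access for DRAM */

        printf("SRAM: %.0f ns latency, %.0f M accesses/s\n",
               sram_access, 1000.0 / sram_cycle);        /* 500 M accesses/s  */
        printf("DRAM: %.0f ns latency, %.1f M accesses/s\n",
               dram_access, 1000.0 / dram_cycle);        /* 12.5 M accesses/s */
        return 0;
    }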
DRAM Organization

[Figure: row-address decoder, DRAM bit array, row buffer, and a wide ("a bunch of data") output]

• Large capacity: e.g., 64–256Mb
• Arranged as a square
  + Minimizes wire length
  + Maximizes refresh efficiency
• Embedded (on-chip) DRAM
  • That's it
  + Huge data bandwidth
Commodity DRAM

[Figure: address bus with RAS/CAS strobes; bits [23:12] feed a 12-to-4K row decoder, the 4K x 4K bit array (16Mbits = 2MBytes) reads into a row buffer, and bits [11:2] drive four 1K-to-1 muxes onto the data pins]

• Commodity (standalone) DRAM
  • Cheap packages → few pins
  • Narrow data interface: e.g., 8 bits
  • Narrow address interface: N/2 bits
• Two-level addressing (address-split sketch below)
  • Level 1: RAS high
    • Upper address bits on the address bus
    • Read row into the row buffer
  • Level 2: CAS high
    • Lower address bits on the address bus
    • Mux the row buffer onto the data bus
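Two-level addressing amounts to splitting one address into a row part (sent while RAS is asserted) and a column part (sent while CAS is asserted). A sketch using the bit fields shown in the figure for this 4K x 4K part ([23:12] for the row, [11:2] for the column); treating them as fields of a flat 24-bit word address is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    /* 4K x 4K bit array: bits [23:12] select the row (RAS phase),
     * bits [11:2] select one of 1K column groups (CAS phase). */
    typedef struct { uint16_t row; uint16_t col; } dram_addr;

    dram_addr split_address(uint32_t addr) {
        dram_addr a;
        a.row = (addr >> 12) & 0xFFF;   /* 12 bits -> one of 4096 rows    */
        a.col = (addr >> 2)  & 0x3FF;   /* 10 bits -> one of 1024 columns */
        return a;
    }

    int main(void) {
        dram_addr a = split_address(0x00ABC123);
        printf("row = 0x%03X, col = 0x%03X\n", a.row, a.col);
        return 0;
    }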
Moore's Law

  Year    Capacity    $/MB      Access time
  1980    64Kb        $1500     250ns
  1988    4Mb         $50       120ns
  1996    64Mb        $10       60ns
  2004    1Gb         $0.5      35ns

• (Commodity) DRAM capacity
  • 16X every 8 years is 2X every 2 years
  • Not quite 2X every 18 months, but still close
Commodity DRAM Components

[Figure: DIMM with a 64-bit data bus built from eight 1Gb x 1B DRAM chips (chips 0–7, 8 bits each)]

• DIMM: dual inline memory module
  • Array of DRAM chips on a printed circuit board with a 64-bit data bus
  • Example: 1GB DDR SDRAM DIMM (8 1Gb DRAM chips)
• SDRAM: synchronous DRAM
  • Read/write row-buffer chunks on the data-bus clock edge
  + No need to send consecutive column addresses
• DDR: double data rate, transfer data on both clock edges
  (peak-bandwidth sketch below)
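A quick sketch of the peak bandwidth such a DIMM implies: 8 bytes per transfer on the 64-bit bus, two transfers per bus clock for DDR. The 100 MHz bus clock is a hypothetical figure, not taken from this slide.

    #include <stdio.h>

    int main(void) {
        double bus_mhz   = 100.0;   /* hypothetical memory-bus clock     */
        double bytes     = 8.0;     /* 64-bit data bus = 8B per transfer */
        double transfers = 2.0;     /* DDR: data on both clock edges     */

        double peak_mb_s = bus_mhz * bytes * transfers;    /* MHz * B * 2 = MB/s */
        printf("peak DIMM bandwidth = %.0f MB/s\n", peak_mb_s);   /* 1600 MB/s */
        return 0;
    }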
Memory Bus

• Memory bus: connects the CPU package with main memory
  • Has its own clock
    • Typically slower than the CPU's internal clock: 100–500MHz vs. 3GHz
    • SDRAM operates on this clock
  • Is often itself internally pipelined
  • Clock implies bandwidth: 100MHz → start a new transfer every 10ns
  • Clock doesn't imply latency: 100MHz does not mean a transfer takes 10ns
  • Bandwidth is more important: determines peak performance
Effective Memory Latency Calculation

• CPU ↔ main memory interface: L2 miss blocks
  • What is tmiss-L2?
• Parameters
  • L2 with 128B blocks
  • DIMM with 20ns-access, 40ns-cycle SDRAMs
  • 200MHz (and 5ns latency) 64-bit data and address buses
• 5ns (address) + 20ns (DRAM access) + 16 * 5ns (bus) = 105ns (spelled out in the sketch below)
  • Roughly 300 cycles on a 3GHz processor
  • Where did we get 16 bus transfers? 128B / (8B / transfer)
  • Calculation assumes memory is striped across the DRAMs in 1B chunks

[Figure: block bytes B0–B15 striped 1B at a time across the eight 1Gb x 1B DRAM chips]
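The same tmiss-L2 calculation, spelled out step by step with the slide's parameters:

    #include <stdio.h>

    int main(void) {
        double bus_latency = 5.0;    /* ns: 200 MHz address/data buses    */
        double dram_access = 20.0;   /* ns: SDRAM access time             */
        double block_bytes = 128.0;  /* L2 block size                     */
        double bus_width   = 8.0;    /* 64-bit data bus = 8B per transfer */

        double transfers = block_bytes / bus_width;            /* 16 */
        double tmiss_l2  = bus_latency + dram_access + transfers * bus_latency;

        printf("tmiss-L2 = %.0f ns (~%.0f cycles at 3 GHz)\n",
               tmiss_l2, tmiss_l2 * 3.0);   /* 105 ns, 315 cycles; the slide rounds to ~300 */
        return 0;
    }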
Memory Latency and Bandwidth

• Nominal clock frequency applies to the CPU and caches
  • Careful when doing calculations
• Clock frequency increases don't reduce memory or bus latency
  • May make misses come out faster
• At some point memory bandwidth may become a bottleneck
  • Further increases in clock speed won't help at all
• You all saw this in hwk4 without knowing…
Memory/Clock Frequency Example

• Parameters
  • 1 GHz CPU, base CPI = 1
  • I$: 1% miss rate, 32B blocks (ignore D$, L2)
  • Data bus: 64-bit, 100MHz (10ns latency); ignore the address bus
  • DRAM: 10ns access, 20ns cycle
• What are CPI and MIPS including memory latency? (coded below)
  • Memory system cycle time = bus latency to transfer 32B = 40ns
  • Memory system access time = 50ns (10ns DRAM access + bus)
  • 1 GHz clock → 50ns = 50 cycles
  • CPI+memory = 1 + (0.01 * 50) = 1 + 0.5 = 1.5
  • MIPS+memory = 1 GHz / 1.5 CPI = 1000 MHz / 1.5 CPI = 667
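The same calculation in code, using the slide's parameters:

    #include <stdio.h>

    int main(void) {
        double clock_ghz = 1.0;
        double base_cpi  = 1.0;
        double miss_rate = 0.01;    /* I$ miss rate                        */
        double miss_ns   = 50.0;    /* 10ns DRAM access + 40ns bus for 32B */

        double miss_cycles = miss_ns * clock_ghz;              /* 50 cycles at 1 GHz */
        double cpi  = base_cpi + miss_rate * miss_cycles;      /* 1 + 0.5 = 1.5      */
        double mips = clock_ghz * 1000.0 / cpi;                /* 667                */

        printf("CPI = %.2f, MIPS = %.0f\n", cpi, mips);
        return 0;
    }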
Memory/Clock Frequency Example

• What are CPI and MIPS if the CPU clock frequency is doubled?
  • Memory parameters are the same: 50ns access, 40ns cycle
  • 2 GHz clock → 50ns = 100 cycles
  • CPI+memory = 1 + (0.01 * 100) = 1 + 1 = 2
  • MIPS+memory = 2 GHz / 2 CPI = 2000 MHz / 2 CPI = 1000
• What is the peak MIPS if we can only change the clock? (sketch below)
  • Available bandwidth: 32B / 40ns = 0.8 B/ns
  • Needed bandwidth: 0.01 * 32B/cycle = 0.32 B/cycle * X cycles/ns
  • Memory is a bottleneck at 0.8 / 0.32 cycles/ns = 2.5 GHz
    • No sustained speedup is possible after that point
  • 2.5 GHz clock → 50ns = 125 cycles
  • CPI+memory = 1 + (0.01 * 125) = 1 + 1.25 = 2.25
  • MIPS+memory = 2.5 GHz / 2.25 CPI = 2500 MHz / 2.25 CPI = 1111
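And the bandwidth-limit part of the example: the bus can supply 32B every 40ns, while each cycle generates 0.01 * 32B of miss traffic on average, so the clock frequency at which memory saturates falls out directly.

    #include <stdio.h>

    int main(void) {
        double avail_b_per_ns = 32.0 / 40.0;              /* 0.8 B/ns from the bus  */
        double need_b_per_cyc = 0.01 * 32.0;              /* 0.32 B/cycle of misses */

        double max_ghz = avail_b_per_ns / need_b_per_cyc; /* 2.5 GHz: memory saturates */
        double miss_cycles = 50.0 * max_ghz;              /* 50ns miss = 125 cycles    */
        double cpi  = 1.0 + 0.01 * miss_cycles;           /* 2.25                      */
        double mips = max_ghz * 1000.0 / cpi;             /* ~1111                     */

        printf("bottleneck clock = %.1f GHz, CPI = %.2f, peak MIPS = %.0f\n",
               max_ghz, cpi, mips);
        return 0;
    }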
Why Is Bandwidth So Important?

• What if you had to do some large calculation?
  • Or some large streaming-media calculation?
  • Basically, the data set does not fit in L2
  • All accesses would miss and chew up memory bandwidth
• You could easily prefetch and avoid the misses…
  • …but prefetching would chew up the same memory bandwidth
  • Prefetching reduces latency, not bandwidth!
• A 100 MHz, 8B bus has a bandwidth of 800MB/s (calculation below)
  • Scientific calculation: 1 8B load + 1 8B store / element
    • Peak processing rate: 50M elements/s (puny)
  • Media stream: 1 4B load + 1 4B store / pixel
    • Peak processing rate: 100M pixels/s
    • Just enough for SVGA at 60 frames/s, not more
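The peak processing rates on this slide follow directly from dividing bus bandwidth by the bytes moved per element or per pixel:

    #include <stdio.h>

    int main(void) {
        double bus_mb_s = 800.0;                       /* 100 MHz x 8B bus               */

        double sci_bytes   = 8.0 + 8.0;                /* 8B load + 8B store per element */
        double media_bytes = 4.0 + 4.0;                /* 4B load + 4B store per pixel   */

        printf("scientific: %.0f M elements/s\n", bus_mb_s / sci_bytes);   /* 50  */
        printf("media:      %.0f M pixels/s\n",   bus_mb_s / media_bytes); /* 100 */
        return 0;
    }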
Increasing Memory Bandwidth

• Memory bandwidth can already bottleneck a single CPU
  • What about multi-core?
• How to increase memory bandwidth?
  • Higher-frequency memory bus, DDR
    • Processors are doing this
  • Wider processor-memory interface
    • Multiple DIMMs
    – Can get expensive: only want bandwidth, pay for capacity too
  • Multiple on-chip memory "channels"
    • Processors are doing this too
Main Memory As A Cache

  Parameter       I$/D$      L2           L3         Main Memory
  thit            2ns        10ns         30ns       100ns
  tmiss           10ns       30ns         100ns      10ms (10M ns)
  Capacity        8–64KB     128KB–2MB    1–9MB      64MB–64GB
  Block size      16–32B     32–256B      256B       4KB+
  Associativity   1–4        4–16         16         full
  Replacement     NMRU       NMRU         NMRU       "working set"
  Prefetching?    Maybe      Probably     Probably   Either

• How would you internally organize main memory?
  • tmiss is outrageously long, reduce %miss at all costs
  • Full associativity: isn't that difficult to implement?
  • Yes… in hardware; main memory is "software-managed"
Summary

[Figure: system overview: applications, system software, CPU, Mem, I/O]

• tavg = thit + %miss * tmiss
  • Low thit and low %miss in one component? Difficult
• Memory hierarchy
  • Capacity: smaller, low thit → bigger, low %miss
  • 10/90 rule, temporal/spatial locality
  • Technology: expensive → cheaper
    • SRAM → DRAM → Disk: reasonable total cost
• Organizing a memory component
  • ABC, write policies
  • 3C miss model: how to eliminate misses?
• What about bandwidth?