Center for Information Services and High Performance Computing (ZIH)
Detecting Memory-Boundedness with Hardware Performance Counters
ICPE, Apr 24th 2017
Daniel Molka ([email protected])
Robert Schöne ([email protected])
Daniel Hackenberg ([email protected])
Wolfgang E. Nagel ([email protected])
Outline
Motivation
Benchmarks for memory subsystem performance
Identification of meaningful hardware performance counters
Summary
Scaling of Parallel Applications
[Diagram: multicore processor with per-core L1 and L2 caches, a shared Last Level Cache, a memory controller attached to RAM, and a point-to-point interconnect to other processors and I/O]
Multiple levels of cache to bridge the processor-DRAM performance gap
Last Level Cache (LLC) and memory controller usually shared between cores
Scaling of Parallel Applications
[Diagram: NUMA system with multiple multicore processors; each processor has integrated memory controllers with local RAM and is connected to the other processors and I/O via a point-to-point interconnect]
Typically integrated memory controllers in every processor
Performance of memory accesses depends on the distance from the data
Potential Bottlenecks
[Diagram: two multicore processors, each with per-core L1/L2 caches, a shared Last Level Cache, and local memory, connected via an interconnect]
Local memory hierarchy
– Limited data path widths
– Access latencies
Remote memory accesses
– Additional latency
– Interconnect bandwidth
Performance of on-chip transfers and remote cache accesses
Saturation of shared resources
Need to understand which hardware characteristics determine the application performance
Requires knowledge about:
– Peak achievable performance of individual components
– Component utilization at application runtime
Common Memory Benchmarks
[Diagram: two multicore processors with per-core L1/L2 caches, shared Last Level Caches, and local memory, connected via an interconnect]
Local memory hierarchy
– Latency: Lmbench
– Bandwidth: STREAM
Remote memory accesses
– STREAM and Lmbench plus numactl
On-chip transfers and remote cache accesses
– Not covered by common tools
– Not easily extendable
Saturation of shared resources
– Also covered by STREAM
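For reference, a minimal sketch of a STREAM-style triad kernel (not the original STREAM source; the array size is an assumption and must be much larger than the last level cache):

```c
/* Minimal sketch of a STREAM-style triad kernel (illustrative, not STREAM itself).
 * Compile e.g. with: gcc -O2 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* 64 Mi doubles = 512 MiB per array (assumption) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* parallel first-touch initialization so pages end up on the
     * NUMA nodes of the threads that later access them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + 3.0 * b[i];   /* triad: two loads and one store per element */
    double t1 = omp_get_wtime();

    /* three arrays of 8-byte elements are transferred (write allocate ignored) */
    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    printf("%d threads: %.1f GB/s\n", omp_get_max_threads(), gbytes / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```

Running such a kernel with an increasing number of OpenMP threads produces scaling curves of the kind shown later: LLC bandwidth can keep growing with the core count, while DRAM bandwidth typically saturates before all cores are in use.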
ZIH Development: BenchIT
Includes measurement of core-to-core transfers
– Example: Opteron 6176 memory latency
[Chart: memory access latency over data set size with regions for L1, L2, L3, and RAM; annotations mark transfers within one processor, remote cache accesses (not covered by other benchmark suites), and main memory (NUMA) accesses]
Sophisticated data placement enables performance measurements for individual components in the memory hierarchy
Also considers state transitions of coherence protocols
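Latency curves like the one above are typically obtained with pointer chasing: a buffer holds a randomly permuted cyclic linked list that is traversed with dependent loads, so every access has to wait for the previous one. Below is a minimal single-threaded sketch (data set sizes and the shuffle are illustrative; BenchIT additionally pins threads and places the data in the caches of specific local or remote cores, which is omitted here):

```c
/* Minimal pointer-chasing latency sketch (illustrative, not BenchIT code). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t n_elems, size_t accesses)
{
    size_t *buf = malloc(n_elems * sizeof(size_t));
    size_t *idx = malloc(n_elems * sizeof(size_t));
    if (!buf || !idx) exit(1);

    /* build a random cyclic permutation so hardware prefetchers
     * cannot predict the access pattern */
    for (size_t i = 0; i < n_elems; i++) idx[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n_elems; i++)
        buf[idx[i]] = idx[(i + 1) % n_elems];

    struct timespec t0, t1;
    volatile size_t pos = 0;
    size_t p = pos;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++)
        p = buf[p];                 /* dependent loads: latency bound */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pos = p;                        /* keep the chain from being optimized away */

    free(buf); free(idx);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / accesses;           /* average latency per load in ns */
}

int main(void)
{
    /* sweep the data set size from L1-resident to far beyond the LLC */
    for (size_t kib = 16; kib <= 256 * 1024; kib *= 2)
        printf("%8zu KiB: %6.1f ns per load\n",
               kib, chase(kib * 1024 / sizeof(size_t), 10 * 1000 * 1000));
    return 0;
}
```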
Scaling of Shared Resources in Multi-core Processors
[Charts: last level cache bandwidth (up to ~300 GB/s) and main memory bandwidth (up to ~70 GB/s) over the number of cores (1-12) for Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435]
On some processors, the bandwidth of the last level cache scales linearly with the number of cores that access it concurrently
The DRAM bandwidth can typically be saturated without using all cores
Saturation of Shared Resources
[Diagram: multicore processor with per-core L1/L2 caches, a shared Last Level Cache, a memory controller attached to RAM, and a point-to-point interconnect to other processors and I/O]
Multiple levels of cache to bridge the processor-DRAM performance gap
Last Level Cache (LLC) and memory controller usually shared between cores
Hardware Performance Counters
[Diagram: multicore processor with per-core L1/L2 caches, a shared Last Level Cache, a memory controller, and a point-to-point interconnect to other processors and I/O]
Per-core counters
– Record events that occur within the individual cores, e.g., pipeline stalls, misses in the local L1 and L2 caches
Uncore counters
– Monitor shared resources
– Events cannot be attributed to a certain core
Accessible via PAPI
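As a minimal illustration of the PAPI interface mentioned above, the sketch below counts one event around a simple loop. The chosen preset PAPI_L3_TCM and the reduced error handling via assert are assumptions for brevity; which events are actually available depends on the machine.

```c
/* Minimal PAPI usage sketch; compile with -lpapi.
 * PAPI_L3_TCM is only an example preset and may be unavailable. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N (1L << 24)   /* 128 MiB working set, larger than typical LLCs */

int main(void)
{
    int evset = PAPI_NULL;
    long long count;
    double *a = calloc(N, sizeof(double));
    if (!a) return 1;

    assert(PAPI_library_init(PAPI_VER_CURRENT) == PAPI_VER_CURRENT);
    assert(PAPI_create_eventset(&evset) == PAPI_OK);
    /* presets or native (un)core events can be added by name */
    assert(PAPI_add_named_event(evset, "PAPI_L3_TCM") == PAPI_OK);

    assert(PAPI_start(evset) == PAPI_OK);
    for (long i = 0; i < N; i++)        /* streaming accesses as a test load */
        a[i] = 2.0 * a[i] + 1.0;
    assert(PAPI_stop(evset, &count) == PAPI_OK);

    printf("L3 total cache misses: %lld (%.2f per accessed cache line)\n",
           count, (double)count / (N * sizeof(double) / 64.0));
    free(a);
    return 0;
}
```

Uncore events cannot be attributed to a single core, so they are typically measured system-wide rather than per thread.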
Properties of Hardware Performance Counters
Not designed with performance analysis in mind
– Included for verification purposes, not guaranteed to work
– Some events are poorly documented
Some countable events can have different origins
– E.g., stalls in execution can happen because of long latency operations as well as memory accesses
Unclear whether counters are good indicators for capacity utilization
– Are 10 million cache misses per second too much?
Not stable between processor generations
=> A methodology to identify meaningful events is needed
Identification of Meaningful Performance Counter Events
Component utilization:
– Use micro-benchmarks to stress individual components
– Identify performance monitoring events that correlate with the component utilization
– Determine peak event ratios
Estimate performance impact:
– Extended latency benchmark to determine which stall counters best represent delays that are caused by memory accesses
– Search for events that represent stalls caused by limited bandwidth
Measuring Component Utilization
Required are performance counter events that:
– Generate one event for a certain amount of transferred data (e.g., for every load and store or per cache line)
– Clearly separate the levels in the memory hierarchy
Example: Events that count L3 accesses (per cache line)
Good counters are available for all levels in the memory hierarchy, except:
– L1 accesses are only counted per load or store instruction, respectively (different widths)
– Writes to DRAM are only counted per package
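Such a per-cache-line event can be turned into a utilization estimate by multiplying the event rate with the line size and comparing the result against the peak bandwidth obtained with the micro-benchmarks. A small sketch of that arithmetic (all numbers are placeholders):

```c
/* Sketch: estimate L3 utilization from a per-cache-line access event.
 * Counter reading, interval, and peak bandwidth are placeholders. */
#include <stdio.h>

int main(void)
{
    const double    line_size   = 64.0;         /* bytes per cache line                 */
    const double    peak_gbs    = 250.0;        /* peak from micro-benchmark (placeholder) */
    const long long l3_accesses = 500000000LL;  /* events in the interval (placeholder) */
    const double    interval_s  = 1.0;          /* measurement interval in seconds      */

    double gbs = l3_accesses * line_size / interval_s / 1e9;
    printf("L3 bandwidth: %.1f GB/s, utilization: %.0f %%\n",
           gbs, 100.0 * gbs / peak_gbs);
    return 0;
}
```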
Estimating Performance Impact of Memory Accesses
A high component utilization indicates a potential performance problem
But the actual effect on the performance cannot easily be quantified
Modified latency benchmarks to check which stall counters provide good estimates for the delays caused by memory accesses: additional multiplications between the loads
Two versions:
– Independent operations that can overlap with the memory accesses: the reported number of stalls should decrease accordingly
– Multiplications that are part of the dependency chain: an ideal counter reports the same results as for the latency benchmark
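The two kernel variants can be sketched in C as follows (illustrative only; the benchmarks behind the presented results use hand-written kernels). Both operate on a pointer-chase buffer as in the earlier latency sketch; the variable one_v holds 1.0 but is read from a volatile so the compiler can neither remove the multiplications nor change the access pattern.

```c
#include <stddef.h>

/* read at runtime so the compiler cannot fold the multiplications away */
volatile double one_v = 1.0;

/* Variant 1: multiplications independent of the load chain.
 * They can overlap with the memory accesses, so a good stall counter
 * should report fewer stalls than for the pure latency benchmark. */
size_t chase_independent(const size_t *buf, size_t accesses, size_t start,
                         double *result)
{
    double f = one_v, x = 1.0;
    size_t p = start;
    for (size_t i = 0; i < accesses; i++) {
        p = buf[p];          /* dependent load                     */
        x = x * f * f;       /* FP work, independent of the loads  */
    }
    *result = x;             /* keep the multiplications alive     */
    return p;
}

/* Variant 2: multiplications inserted into the dependency chain.
 * Since f == 1.0, the access pattern is unchanged, but the next load
 * address now depends on the multiplications. An ideal stall counter
 * reports the same delays as for the pure latency benchmark. */
size_t chase_dependent(const size_t *buf, size_t accesses, size_t start)
{
    double f = one_v;
    size_t p = start;
    for (size_t i = 0; i < accesses; i++) {
        double t = (double)buf[p] * f * f;  /* load -> mul -> mul    */
        p = (size_t)t;                      /* -> next load address  */
    }
    return p;
}
```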
Haswell – Stall Counters
[Charts: stall cycles reported by different Haswell stall counters for the latency benchmark and its modified variants]
Estimating Performance Impact of Memory Accesses
Also need to consider stall cycles in bandwidth-bound scenarios
– Best reflected by events that indicate full request queues, but far from an optimal correlation
– Events for loads and stores can overlap, but do not have to
On Haswell, the following events can be used to categorize stall cycles, but the accuracy is limited:
  active cycles      = CPU_CLK_UNHALTED
  productive cycles  = CPU_CLK_UNHALTED - CYCLE_ACTIVITY:CYCLES_NO_EXECUTE
  stall cycles       = CYCLE_ACTIVITY:CYCLES_NO_EXECUTE
  memory bound       = max(RESOURCE_STALLS:SB, CYCLE_ACTIVITY:STALLS_L1D_PENDING)
  bandwidth bound    = max(RESOURCE_STALLS:SB, L1D_PEND_MISS:FB_FULL + OFFCORE_REQUESTS_BUFFER:SQ_FULL)
  latency bound      = memory bound - bandwidth bound
  other stall reason = stall cycles - memory bound
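A sketch of the resulting breakdown, computed from raw counter readings (the values are placeholders; collecting them, e.g. via PAPI native events, is not shown):

```c
/* Sketch: categorize Haswell stall cycles from raw counter values.
 * All counter values below are placeholders for one measurement interval. */
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    double clk_unhalted       = 4.0e9;  /* CPU_CLK_UNHALTED                  */
    double cycles_no_execute  = 2.5e9;  /* CYCLE_ACTIVITY:CYCLES_NO_EXECUTE  */
    double stalls_l1d_pending = 2.0e9;  /* CYCLE_ACTIVITY:STALLS_L1D_PENDING */
    double resource_stalls_sb = 0.8e9;  /* RESOURCE_STALLS:SB                */
    double fb_full            = 0.9e9;  /* L1D_PEND_MISS:FB_FULL             */
    double sq_full            = 0.3e9;  /* OFFCORE_REQUESTS_BUFFER:SQ_FULL   */

    double productive  = clk_unhalted - cycles_no_execute;
    double stalls      = cycles_no_execute;
    double mem_bound   = max2(resource_stalls_sb, stalls_l1d_pending);
    double bw_bound    = max2(resource_stalls_sb, fb_full + sq_full);
    double lat_bound   = mem_bound - bw_bound;
    double other_stall = stalls - mem_bound;

    printf("productive: %.0f%%  bandwidth bound: %.0f%%  "
           "latency bound: %.0f%%  other stalls: %.0f%%\n",
           100 * productive / clk_unhalted, 100 * bw_bound / clk_unhalted,
           100 * lat_bound / clk_unhalted, 100 * other_stall / clk_unhalted);
    return 0;
}
```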
Summary
Raw performance counter data is typically difficult to interpret
Selecting the (most) relevant events is not a trivial task
Some events do not show the expected behavior
– E.g., the LDM_PENDING event
– Verification is needed before relying on the reported event rates
The presented micro-benchmark-based approach can be used to tackle these challenges
Acknowledgment: This work has been funded in part by the European Union's Horizon 2020 program in the READEX project and by the Bundesministerium für Bildung und Forschung via the research project Score-E
Thank You For Your Attention