Center for Information Services and High Performance Computing (ZIH)

Detecting Memory-Boundedness with Hardware Performance Counters
ICPE, Apr 24th 2017

Daniel Molka ([email protected])
Robert Schöne ([email protected])
Daniel Hackenberg ([email protected])
Wolfgang E. Nagel ([email protected])

Outline
– Motivation
– Benchmarks for memory subsystem performance
– Identification of meaningful hardware performance counters
– Summary


Scaling of Parallel Applications

[Figure: multicore processor with per-core L1 and L2 caches, a shared Last Level Cache, a memory controller attached to RAM, and a point-to-point interconnect to other processors and I/O]

– Multiple levels of cache bridge the processor-DRAM performance gap
– Last Level Cache (LLC) and memory controller are usually shared between cores


Scaling of Parallel Applications

[Figure: the same multicore processor, now in a NUMA system with multiple memory modules attached to different processors]

– Memory controllers are typically integrated into every processor
– The performance of a memory access depends on the distance to the data



Potential Bottlenecks

[Figure: two multicore processors with per-core L1/L2 caches, shared Last Level Caches, and local memory, connected by an interconnect]

Local memory hierarchy
– Limited data path widths
– Access latencies

Remote memory accesses
– Additional latency
– Interconnect bandwidth

Performance of on-chip transfers and remote cache accesses
Saturation of shared resources
⇒ Need to understand which hardware characteristics determine the application performance

Requires knowledge about:
– Peak achievable performance of individual components
– Component utilization at application runtime


Common Memory Benchmarks

[Figure: local accesses within one multicore processor and remote accesses between two processors]

Local memory hierarchy
– Bandwidth: STREAM
– Latency: lmbench

Remote memory accesses
– STREAM and lmbench plus numactl

On-chip transfers and remote cache accesses
– Not covered by common tools
– Existing tools not easily extendable

Saturation of shared resources
– Also covered by STREAM

ZIH Development: BenchIT

Includes measurement of core-to-core transfers.

– Example: Opteron 6176 memory latency

[Figure: memory read latency on an Opteron 6176 for data in L1, L2, L3, and RAM; the plot distinguishes transfers within one processor, remote cache accesses (not covered by other benchmark suites), and NUMA main memory accesses]

Sophisticated data placement enables performance measurements for individual components in the memory hierarchy.
Also considers state transitions of coherence protocols.


Scaling of Shared Resources in Multi-core Processors

[Figure: last level cache bandwidth (0-300 GB/s) and main memory bandwidth (0-70 GB/s) over 1 to 12 cores for Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435]

On some processors, the bandwidth of the last level cache scales linearly with the number of cores that access it concurrently.
The DRAM bandwidth can typically be saturated without using all cores.


Saturation of Shared Resources

[Figure: multicore processor with per-core L1/L2 caches, shared Last Level Cache, memory controller, and point-to-point interconnect, as before]

– Multiple levels of cache bridge the processor-DRAM performance gap
– Last Level Cache (LLC) and memory controller are usually shared between cores


Hardware Performance Counters

[Figure: per-core counters cover the cores and their L1/L2 caches; uncore counters cover the shared Last Level Cache, interconnect, and memory controller]

Per-core counters
– Record events that occur within the individual cores, e.g., pipeline stalls and misses in the local L1 and L2 caches

Uncore counters
– Monitor shared resources
– Events cannot be attributed to a certain core

Accessible via PAPI

Properties of Hardware Performance Counters

Not designed with performance analysis in mind
– Included for verification purposes, not guaranteed to work
– Some events are poorly documented

Some countable events can have different origins
– E.g., stalls in execution can be caused by long-latency operations as well as by memory accesses

Unclear whether counters are good indicators for capacity utilization
– Are 10 million cache misses per second too much?

Not stable between processor generations

⇒ A methodology to identify meaningful events is needed


Identification of Meaningful Performance Counter Events

Component utilization
– Use micro-benchmarks to stress individual components
– Identify performance monitoring events that correlate with the component utilization
– Determine peak event ratios

Estimate performance impact
– Extend the latency benchmark to determine which stall counters best represent delays that are caused by memory accesses
– Search for events that represent stalls caused by limited bandwidth


Measuring Component Utilization

Required are performance counter events that:
– Generate one event for a certain amount of transferred data (e.g., for every load and store, or per cache line)
– Clearly separate the levels in the memory hierarchy

Measuring Component Utilization

Example: events that count L3 accesses (per cache line)

Good counters are available for all levels in the memory hierarchy, except:
– L1 accesses are only counted per load resp. store instruction (with different access widths)
– Writes to DRAM are only counted per package

Estimating Performance Impact of Memory Accesses

A high component utilization indicates a potential performance problem, but the actual effect on performance cannot easily be quantified.

Modified latency benchmarks check which stall counters provide good estimates for the delays caused by memory accesses: additional multiplications are inserted between the loads, in two versions:
– Independent operations that can overlap with the memory accesses; the reported number of stalls should decrease accordingly
– Multiplications that are part of the dependency chain; an ideal counter reports the same results as for the plain latency benchmark


Haswell – Stall Counters

[Figure: stall cycles reported by different Haswell stall counters for the latency benchmark and its modified variants]

Estimating Performance Impact of Memory Accesses

Stall cycles in bandwidth-bound scenarios also need to be considered
– Best reflected by events that indicate full request queues, but far from an optimal correlation
– Events for loads and stores can overlap, but do not have to

On Haswell the following events can be used to categorize stall cycles, with limited accuracy:
– active cycles = CPU_CLK_UNHALTED
– productive cycles = CPU_CLK_UNHALTED − CYCLE_ACTIVITY:CYCLES_NO_EXECUTE
– stall cycles = CYCLE_ACTIVITY:CYCLES_NO_EXECUTE
– memory bound = max(RESOURCE_STALLS:SB, CYCLE_ACTIVITY:STALLS_L1D_PENDING)
– bandwidth bound = max(RESOURCE_STALLS:SB, L1D_PEND_MISS:FB_FULL + OFFCORE_REQUESTS_BUFFER:SQ_FULL)
– latency bound = memory bound − bandwidth bound
– other stall reasons = stall cycles − memory bound


Summary

Raw performance counter data is typically difficult to interpret
– Selecting the (most) relevant events is not a trivial task

Some events do not show the expected behavior
– E.g., the LDM_PENDING event
– Verification is needed before relying on the reported event rates

The presented micro-benchmark based approach can be used to tackle these challenges.

Acknowledgment: This work has been funded in part by the European Union’s Horizon 2020 program in the READEX project and by the Bundesministerium für Bildung und Forschung via the research project Score-E.


Thank You For Your Attention
