Center for Information Services and High Performance Computing (ZIH)
Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems
ICA3PP, October 26, 2011
Daniel Molka ([email protected])
Robert Schöne ([email protected])
Daniel Hackenberg ([email protected])
Matthias S. Müller ([email protected])
Motivation
Increasing core count in commodity SMP systems because of multicore processors
– Shared resources within packages (LLC, memory controller, etc.)
– Complex NUMA topology in multisocket configurations
Application performance usually does not scale linearly with the number of cores
– Software problem or hardware limitation?
– Potential to conserve energy?
Outline
Test Systems
– Processor configuration
– NUMA Topology
Microbenchmarks
– Latency
– Bandwidth
SPEC OMP2001 Performance
– Single-socket scaling
– Multi-socket scaling
Conclusions
Test Systems

|                  | Quad AMD Opteron 6172                              | Quad Intel Xeon X7560                                                   |
| Cores            | 48                                                 | 32 (SMT disabled)                                                       |
| Clock frequency  | 2100 MHz                                           | 2266 MHz (w/o Turbo)                                                    |
| L1 per core      | 64+64 KiB                                          | 32+32 KiB                                                               |
| L2 per core      | 512 KiB                                            | 256 KiB                                                                 |
| L3 per processor | 12 MiB, mostly exclusive, 2 MiB used for HT Assist | 24 MiB, inclusive of L1 and L2                                          |
| Main memory      | 64 GiB DDR3-1333, 4 DDR3 channels per socket       | 256 GiB DDR3-1066, 4 SMI channels per socket, each with 2 DDR3 channels |
Processor configuration
[Block diagrams: the 12-core Opteron 6172 (Magny-Cours) MCM with two six-core dies, each with per-core L1/L2 caches, a shared non-inclusive L3, a 2-channel DDR3 IMC, and x8/x16 HyperTransport links; the monolithic 8-core Xeon X7560 (Nehalem-EX) die with per-core L1/L2 caches, a shared inclusive L3, two 2-channel SMI memory controllers with SMBs, and QPI links.]

12-core Opteron 6172 (Magny-Cours):
– MCM with two six-core dies, connected via HyperTransport
– Dual DDR3 controller per die
– Two L3 partitions (non-inclusive)
– 4x HyperTransport 3.0 per socket
  • 1x connection to chipset
  • 6 half-wide links to other sockets

8-core Xeon X7560 (Nehalem-EX):
– Monolithic 8-core die
– 4 SMI channels per socket
– Single large L3 (inclusive)
– 4 QPI links per socket
  • 1x connection to chipset
  • 3 links to other sockets
NUMA Topology
[Diagrams: the 4-socket G34 system with 8 NUMA nodes (nodes 0-7, two per socket), local memory per node, and I/O links; the 4-socket LGA1567 system with 4 NUMA nodes (nodes 0-3), four memory channels per node, and I/O links.]

4-socket G34 system (Opteron 6172):
– 8 NUMA nodes
  • 2 hops for some connections
– Probe Filters (HT Assist) in every node reduce coherence traffic
– 2 half-wide HT links connect any two sockets (2x 12.8 GB/s)

4-socket LGA1567 system (Xeon X7560):
– 4 NUMA nodes
– Fully connected system
  • Max. 1 hop distance
  • 25.6 GB/s per link (12.8 GB/s per direction)
Microbenchmarks for 64-bit x86 systems
Memory latency and bandwidth measurements
– Controlled placement of data in any cache or memory location
– Coherency state control
Implementation (illustrated by the sketch below):
– pthreads with affinity control (sched_setaffinity(…))
– Assembler implementation of the measurement routines
– Time measurement using the Time Stamp Counter (rdtsc)
– NUMA-aware memory allocation
– Hugetlbfs support to reduce TLB influence
Available as Open Source http://www.benchit.org/wiki/index.php/X86membench
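The bullets above outline the measurement approach. The following minimal C sketch (not the actual BenchIT/x86membench code) illustrates two of the named ingredients, thread pinning via sched_setaffinity() and timing via the Time Stamp Counter, using a simple pointer-chasing latency loop; buffer size, stride, and the target core are arbitrary example values, and NUMA-aware allocation as well as hugetlbfs support are omitted.

    #define _GNU_SOURCE           /* for CPU_ZERO/CPU_SET and sched_setaffinity */
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* read the Time Stamp Counter */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        /* pin the calling thread to core 0 (example value) */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* pointer-chasing ring: each load depends on the previous one */
        const size_t n = 1 << 20;                 /* 8 MiB of pointers */
        const size_t stride = 4097;               /* odd stride, coprime to n,
                                                     so the chase visits every
                                                     element of the buffer */
        void **buf = malloc(n * sizeof(void *));
        if (!buf)
            return 1;
        for (size_t i = 0; i < n; i++)
            buf[i] = &buf[(i + stride) % n];

        void **p = buf;
        uint64_t start = rdtsc();
        for (size_t i = 0; i < n; i++)
            p = (void **)*p;                      /* serialized loads -> latency */
        uint64_t stop = rdtsc();

        /* print p so the compiler cannot optimize the chase away */
        printf("%p: ~%.1f cycles per access\n", (void *)p,
               (double)(stop - start) / n);
        free(buf);
        return 0;
    }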
Memory Latency
[Latency plots for the Opteron 6172 and the Xeon X7560: access latency vs. data set size for local and remote accesses, with the L1, L2, L3, and RAM regions marked.]
Fast accesses to the L1/L2 caches of the same die
Access to the 2nd die in the MCM is only marginally faster than 1-hop accesses to another socket
High memory latency on the Xeon because of the SMBs
HT Assist enables very low local memory latency on the Opteron (faster than probing caches)
Aggregated memory bandwidth per socket
[Bandwidth plots for the Opteron 6172 and the Xeon X7560: aggregated read bandwidth vs. data set size for increasing thread counts, with the L2, L3, and RAM regions marked.]
Much better L3 bandwidth and scaling on the Intel system
– Xeon L3 scaling: 1 core 19.2 GB/s, 8 cores 152 GB/s (19.0 GB/s per core)
– Opteron L3 scaling: 1 core 7.8 GB/s, 12 cores 63.7 GB/s (5.3 GB/s per core)
Similar memory bandwidth per socket
– AMD: 26.2 GB/s, Intel: 25.7 GB/s
HyperTransport and QPI bandwidths
Same theoretical bandwidth between sockets
– (up to) 25.6 GB/s according to vendor specifications, i.e., 12.8 GB/s per direction
– Not achievable due to protocol overhead and coherency traffic
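As a plausibility check of these vendor numbers (assuming 6.4 GT/s links that transfer 2 bytes per direction and cycle, which matches both HT 3.0 at 3.2 GHz DDR with 16-bit links and QPI at 6.4 GT/s):

  6.4\,\mathrm{GT/s} \times 2\,\mathrm{B/transfer} = 12.8\,\mathrm{GB/s}\ \text{per direction} = 25.6\,\mathrm{GB/s}\ \text{bidirectional per full-width link}

A half-wide (8-bit) link accordingly peaks at half of that, i.e., 6.4 GB/s per direction, matching the specification column in the table below.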
Measurements: HT/QPI transfers

| Transfer path                     | 1 thread | 6/8 threads | Specification    |
| Opteron 6172, between dies in MCM | 3.8 GB/s | 5.5 GB/s    | up to 19.2 GB/s* |
| Opteron 6172, between sockets     | 2.1 GB/s | 2.1 GB/s    | up to 6.4 GB/s** |
| Xeon X7560, between sockets       | 6.3 GB/s | 11.0 GB/s   | up to 12.8 GB/s  |

*: one full-width (16-bit) link and one half-wide (8-bit) link
**: half-wide (8-bit) links between dies in different sockets
– HT bandwidth much lower than expected
– Only one of the two links between the sockets is used for transfers between two dies in the AMD system
Memory performance summary
Memory latency
– Faster cache accesses on the Intel system
– Lower main memory latencies on the AMD system
Memory bandwidth
– Significantly faster L3 cache in the Intel system
– Almost identical main memory bandwidth
Interconnect bandwidth
– Relatively weak connection between the dies of the Opteron's MCM
– Extremely low bandwidth between dies in different sockets in the AMD test system
Outline
Test Systems
– Processor configuration
– NUMA Topology
Microbenchmarks
– Latency
– Bandwidth
SPEC OMP2001 Performance
– Single-socket scaling
– Multi-socket scaling
Conclusions
SPEC OMP2001, version 3.2
Based on real applications
– 11 different codes
– Covers a wide range of applications with different OpenMP constructs for parallelization
Provides a medium data set for small-scale SMP systems
– Does not consume more than 1 GiB per core
– Different memory sizes of the test systems are irrelevant
Compiler: Intel C/C++ and Fortran Compilers, version 11.1
– Same basic optimization flags on both systems: -O3 -ipo -openmp
– Different SSE instruction sets supported by the test systems
  • -msse3 used for the Opteron machine
  • -xSSE4.2 used for the Xeon machine
SPEC OMP2001 scaling: single socket
Shared resources do not scale linearly
– Group 1, hardly bandwidth bound: 324, 330, and 332
– Group 2, significantly bandwidth bound: 310, 314, 326, and 328
– Group 3, strongly bandwidth bound: 312, 316, 318, and 320
316, 318, 320, 324, 326, 328, 330, and 332 do not scale well on the MCM
SPEC OMP2001 scaling: multiple sockets
318 and 320 scale poorly on both systems
324, 326, 328, and 332 do not scale well with multiple sockets on the AMD system
Overall parallel efficiency

| Benchmark   | Opteron: 1 die/1 core | Opteron: 1 socket/1 core | Opteron: 4 sockets/1 socket | Opteron: total (4 sockets/1 core) | Xeon: 1 socket/1 core | Xeon: 4 sockets/1 socket | Xeon: total (4 sockets/1 core) |
| 310.wupwise | 0.66 | 0.64 | 0.84 | 0.54 | 0.73 | 0.71 | 0.52 |
| 312.swim    | 0.32 | 0.32 | 0.85 | 0.27 | 0.36 | 0.83 | 0.30 |
| 314.mgrid   | 0.71 | 0.69 | 0.76 | 0.53 | 0.85 | 0.81 | 0.69 |
| 316.applu   | 0.58 | 0.42 | 0.83 | 0.35 | 0.64 | 1.47 | 0.93 |
| 318.galgel  | 0.50 | 0.40 | 0.29 | 0.12 | 0.43 | 0.40 | 0.17 |
| 320.equake  | 0.49 | 0.39 | 0.42 | 0.16 | 0.54 | 0.42 | 0.22 |
| 324.apsi    | 0.89 | 0.80 | 0.74 | 0.59 | 0.93 | 0.85 | 0.79 |
| 326.gafort  | 0.74 | 0.66 | 0.54 | 0.36 | 0.85 | 0.89 | 0.76 |
| 328.fma3d   | 0.77 | 0.72 | 0.64 | 0.46 | 0.79 | 0.86 | 0.68 |
| 330.art     | 0.88 | 0.83 | 0.72 | 0.62 | 0.99 | 0.87 | 0.86 |
| 332.ammp    | 0.89 | 0.81 | 0.65 | 0.53 | 0.95 | 0.75 | 0.71 |
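For reference: the values above (and the group averages on the next slide) presumably report parallel efficiency in the usual sense, i.e., speedup divided by the resource ratio,

  E(k) = \frac{T_{\mathrm{base}}}{k \cdot T_{\mathrm{scaled}}} = \frac{S(k)}{k}

where T_base is the runtime of the baseline configuration (1 core or 1 socket), T_scaled the runtime with k times as many cores or sockets, and S(k) the resulting speedup. The columns are consistent under this reading; e.g., for 312.swim on the Opteron, 0.32 (1 socket/1 core) x 0.85 (4 sockets/1 socket) ≈ 0.27 (total, 4 sockets/1 core).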
Overall parallel efficiency (group averages)

| Benchmark(s) | Opteron: 1 die/1 core | Opteron: 1 socket/1 core | Opteron: 4 sockets/1 socket | Opteron: total (4 sockets/1 core) | Xeon: 1 socket/1 core | Xeon: 4 sockets/1 socket | Xeon: total (4 sockets/1 core) |
| avg. group 1 | 0.89 | 0.81 | 0.72 | 0.58 | 0.95 | 0.83 | 0.79 |
| avg. group 2 | 0.72 | 0.68 | 0.70 | 0.47 | 0.81 | 0.82 | 0.66 |
| 312.swim     | 0.32 | 0.32 | 0.85 | 0.27 | 0.36 | 0.83 | 0.30 |
Better scaling on the monolithic Xeon die than on the Opteron MCM
– L3 limits scaling on the 6-core AMD die
– Low HT bandwidth between the dies limits scaling on the MCM
Also better scaling with the number of processors in the Intel system
– Scaling in the AMD system is limited by the low bandwidth of the half-wide HT links
– AMD's HT Assist is not enough to compensate for the low HT bandwidth
Similar behavior for memory-bound applications
Performance comparison
The 40% higher peak compute performance of the AMD system is not reflected in the results
Similar performance for a single socket
Better performance on the Intel system if all processors are used
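A rough back-of-the-envelope check of the 40% figure, assuming 4 double-precision floating-point operations per cycle and core (128-bit SSE add + multiply) on both architectures:

  R_{\mathrm{peak}}^{\mathrm{Opteron}} \approx 48 \times 2.1\,\mathrm{GHz} \times 4 \approx 403\,\mathrm{GFLOPS}
  R_{\mathrm{peak}}^{\mathrm{Xeon}} \approx 32 \times 2.266\,\mathrm{GHz} \times 4 \approx 290\,\mathrm{GFLOPS}

i.e., a ratio of roughly 1.4 in favor of the AMD system.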
Conclusions
Scaling of applications is strongly influenced by hardware properties
– Mixture of effects from different components
– Different behavior for scaling with cores and scaling with sockets
Poor scaling on multicore processors is not necessarily a software problem
– Often limited by shared resources
– Potential to conserve energy by using fewer cores
NUMA systems require high interconnect bandwidths
– Low HT bandwidths in the AMD system severely limit scalability
– HT Assist does not reduce coherency traffic enough to compensate for the bandwidth disadvantage
Thank you
This work has been funded by the German Federal Ministry of Education and Research within the eeClust project.
www.eeclust.de