Center for Information Services and High Performance Computing (ZIH)

Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems
ICA3PP, October 26, 2011

Daniel Molka, Robert Schöne, Daniel Hackenberg, Matthias S. Müller

Motivation

Increasing core count in commodity SMP systems because of multicore processors
– Shared resources within packages (LLC, memory controller, etc.)
– Complex NUMA topology in multisocket configurations

Application performance usually does not scale linearly with the number of cores
– Software problem or hardware limitation?
– Potential to conserve energy?


Outline

Test Systems
– Processor configuration
– NUMA Topology

Microbenchmarks
– Latency
– Bandwidth

SPEC OMP2001 Performance
– Single socket scaling
– Multi socket scaling

Conclusions


Test Systems

Quad AMD Opteron 6172
– 48 cores
– 2100 MHz
– 64+64 KiB L1 per core
– 512 KiB L2 per core
– 12 MiB L3 per processor (mostly exclusive, 2 MiB used for HT Assist)
– 64 GiB DDR3-1333 (4 DDR3 channels per socket)

Quad Intel Xeon X7560
– 32 cores (SMT disabled)
– 2266 MHz (w/o Turbo)
– 32+32 KiB L1 per core
– 256 KiB L2 per core
– 24 MiB L3 per processor (inclusive of L1 and L2)
– 256 GiB DDR3-1066 (4 SMI channels per socket, each with 2 DDR3 channels)


Processor configuration

[Figure: block diagrams of the 8-core Xeon X7560 (Nehalem-EX) die and the 12-core Opteron 6172 (Magny-Cours) MCM, showing the cores with private L1/L2 caches, the shared L3 (inclusive on the Xeon, non-inclusive on the Opteron dies), the 2-channel integrated memory controllers with SMI links to SMBs (Xeon) or direct DDR3 channels (Opteron), and the QPI/HyperTransport links]

8-Core Xeon X7560 (Nehalem-EX)
– Monolithic 8-core die
– Single large L3
– 4 SMI channels per socket
– 4 QPI links per socket
  • 1x connection to chipset
  • 3 links to other sockets

12-Core Opteron 6172 (Magny-Cours)
– MCM with two 6-core dies, connected via HyperTransport
– Two L3 partitions
– Dual DDR3 controller per die
– 4x HyperTransport 3.0 per socket
  • 1x connection to chipset
  • 6 half wide links to other sockets


NUMA Topology

[Figure: NUMA topologies of the 4-socket G34 system (8 NUMA nodes, each with local memory and I/O) and the 4-socket LGA1567 system (4 NUMA nodes, each with local memory and I/O)]

4 socket G34 system (Opteron 6172)
– 8 NUMA nodes
– 2 hops for some connections
– Probe Filters (HT Assist) in every node reduce coherence traffic
– 2 half wide HT links connect any two sockets (2x 12.8 GB/s)

4 socket LGA1567 system (Xeon X7560)
– 4 NUMA nodes
– Fully connected system
– Max. 1 hop distance
– 25.6 GB/s per link (12.8 GB/s per direction)
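To make such hop distances measurable, data has to be placed on a specific node while the accessing thread runs on another. A minimal, hypothetical libnuma sketch of this kind of NUMA-aware placement (an illustration with assumed node numbers, not the code used for the measurements in this talk):

/* Hypothetical sketch of NUMA-aware placement (not the benchmark code used in
 * this talk): run on one NUMA node and allocate memory on another, so that
 * accesses cross a defined number of hops. Node ids 0 and 3 are arbitrary
 * example values. Build with: gcc -O2 numa_sketch.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    int local_node  = 0;   /* node the thread runs on (example value)      */
    int remote_node = 3;   /* node that provides the memory (example value) */

    numa_run_on_node(local_node);                 /* bind the thread to a node */
    size_t size = 64UL * 1024 * 1024;             /* 64 MiB test buffer        */
    char *buf = numa_alloc_onnode(size, remote_node);
    if (buf == NULL)
        return 1;

    memset(buf, 0, size);  /* touch the pages so they are actually allocated */
    printf("thread on node %d, buffer on node %d, max node %d\n",
           local_node, remote_node, numa_max_node());

    numa_free(buf, size);
    return 0;
}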


Microbenchmarks for 64 Bit x86 systems

Memory latency and bandwidth measurements
– Well-directed placement of data in any cache or memory location
– Coherency state control

Implementation:
– pthreads with affinity control (sched_setaffinity(…))
– Assembler implementation of measurement routines
– Time measurement using Time Stamp Counter (rdtsc)
– NUMA aware memory allocation
– Hugetlbfs support to reduce TLB influence

Available as Open Source: http://www.benchit.org/wiki/index.php/X86membench
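As an illustration of the implementation bullets above, here is a minimal single-threaded sketch (assuming GCC on Linux, an arbitrary 8 MiB working set, and core 0; this is not the actual x86membench code) that pins itself with sched_setaffinity and times a dependent-load chain with rdtsc:

/* Minimal sketch (not the authors' benchmark code): pin the thread to one core
 * and time a pointer-chasing loop with the Time Stamp Counter, roughly in the
 * spirit of the implementation bullets above. Buffer size and core id are
 * arbitrary example values. Compile e.g. with: gcc -O2 latency_sketch.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ELEMS (1 << 20)  /* example working set: 1 Mi pointers (8 MiB) */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* bind the calling thread to one core so data placement is well defined */
    sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    pin_to_core(0);

    /* build a simple pointer chain; the real benchmark randomizes the chain
     * and uses hugetlbfs-backed buffers to reduce TLB influence */
    void **buf = malloc(ELEMS * sizeof(void *));
    for (size_t i = 0; i < ELEMS; i++)
        buf[i] = &buf[(i + 1) % ELEMS];

    void **p = buf;
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < ELEMS; i++)
        p = (void **)*p;                  /* dependent loads -> latency bound */
    uint64_t cycles = __rdtsc() - start;

    printf("%.1f cycles per access (%p)\n", (double)cycles / ELEMS, (void *)p);
    free(buf);
    return 0;
}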


Memory Latency

[Figure: memory latency of the Opteron 6172 and Xeon X7560 as a function of data set size, covering the L1, L2, L3, and RAM regions]

Fast accesses to L1/L2 caches of the same die

Access to 2nd die in MCM only marginally faster than 1 hop accesses to other socket

High memory latency on the Xeon because of the SMBs

HT Assist enables very low local memory latency on the Opteron (faster than probing caches)


Aggregated memory bandwidth per socket

[Figure: aggregated bandwidth per socket of the Opteron 6172 and Xeon X7560 as a function of data set size, covering the L2, L3, and RAM regions]

Much better L3 bandwidth and scaling on the Intel system
– Xeon L3 scaling: 1 core 19.2 GB/s, 8 cores 152 GB/s (19.0 GB/s per core)
– Opteron L3 scaling: 1 core 7.8 GB/s, 12 cores 63.7 GB/s (5.3 GB/s per core)

Similar memory bandwidth per socket
– AMD: 26.2 GB/s, Intel: 25.7 GB/s
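For orientation, a sketch of how such an aggregated bandwidth number can be obtained: every thread streams through a private buffer and the sum of the transferred bytes is divided by the elapsed time. This is an OpenMP illustration with assumed buffer sizes, not the pthread/assembler benchmark behind the figures above:

/* Illustrative sketch only (not the benchmark used above): every OpenMP thread
 * streams through its own buffer and the aggregate read bandwidth is reported.
 * Thread pinning (e.g. OMP_PROC_BIND=true, OMP_PLACES=cores) is assumed to be
 * set in the environment. Build with: gcc -O2 -fopenmp bandwidth_sketch.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define BYTES_PER_THREAD (256UL * 1024 * 1024)  /* example value: 256 MiB per thread */

int main(void)
{
    int nthreads = 1;
    double t0 = 0.0, t1 = 0.0;
    double total = 0.0;   /* checksum, keeps the read loop from being optimized away */

    #pragma omp parallel reduction(+ : total)
    {
        size_t n = BYTES_PER_THREAD / sizeof(double);
        double *buf = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++)        /* first touch -> pages allocated locally */
            buf[i] = 1.0;

        #pragma omp barrier
        #pragma omp single
        {
            nthreads = omp_get_num_threads();
            t0 = omp_get_wtime();
        }                                     /* implicit barrier after single */

        double sum = 0.0;
        for (size_t i = 0; i < n; i++)        /* read stream over the private buffer */
            sum += buf[i];
        total += sum;

        #pragma omp barrier
        #pragma omp single
        t1 = omp_get_wtime();

        free(buf);
    }

    double gbytes = (double)nthreads * BYTES_PER_THREAD / 1e9;
    printf("%d threads: %.1f GB/s (checksum %.0f)\n",
           nthreads, gbytes / (t1 - t0), total);
    return 0;
}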


HyperTransport and QPI bandwidths

Same theoretical bandwidth between sockets
– (up to) 25.6 GB/s according to vendor specifications
– i.e. 12.8 GB/s per direction
– Not achievable due to protocol overhead and coherency traffic

Measurements: HT/QPI transfers         1 thread    6/8 threads    specification
Opteron 6172, between dies in MCM      3.8 GB/s    5.5 GB/s       up to 19.2 GB/s*
Opteron 6172, between sockets          2.1 GB/s    2.1 GB/s       up to 6.4 GB/s**
Xeon X7560, between sockets            6.3 GB/s    11.0 GB/s      up to 12.8 GB/s

*: one full (16 bit) link and one half wide (8 bit) link
**: half wide (8 bit) links between dies in different sockets

– HT bandwidth much lower than expected
– Only one of the two links between the sockets is used for transfers between two dies in the AMD system
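The specification column follows from the link widths given in the footnotes. Assuming a transfer rate of 6.4 GT/s per link, which is consistent with the 12.8 GB/s per direction quoted above (the rate itself is not stated on the slide), the per-direction peaks work out as:

% Worked example, assuming 6.4 GT/s per link (all figures per direction):
\begin{align*}
\text{16 bit HT link:} \quad & 6.4\,\mathrm{GT/s} \times 2\,\mathrm{B} = 12.8\,\mathrm{GB/s} \\
\text{8 bit HT link:} \quad & 6.4\,\mathrm{GT/s} \times 1\,\mathrm{B} = 6.4\,\mathrm{GB/s} \\
\text{dies in MCM (one 16 bit + one 8 bit link):} \quad & 12.8 + 6.4 = 19.2\,\mathrm{GB/s} \\
\text{dies in different sockets (one 8 bit link):} \quad & 6.4\,\mathrm{GB/s} \\
\text{QPI between sockets:} \quad & 6.4\,\mathrm{GT/s} \times 2\,\mathrm{B} = 12.8\,\mathrm{GB/s}
\end{align*}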


Memory performance summary

Memory latency
– Faster cache accesses on the Intel system
– Lower main memory latencies in the AMD system

Memory bandwidth
– Significantly faster L3 cache in the Intel system
– Almost identical memory bandwidth

Interconnect bandwidth
– Relatively weak connection between the dies of the Opteron's MCM
– Extremely low bandwidth between dies in different sockets in the AMD test system



SPEC OMP2001 version 3.2

Based on real applications
– 11 different codes
– Covers wide range of applications with different OpenMP constructs for parallelization

Provides medium data set for small scale SMP systems
– Does not consume more than 1 GiB per core
– Different memory sizes of the test systems irrelevant

Compiler: Intel C/C++ and Fortran Compilers version 11.1
– Same basic optimization flags on both systems: -O3 -ipo -openmp
– Different SSE instruction sets supported by the test systems
  • -msse3 used for the Opteron machine
  • -xSSE4.2 used for the Xeon machine


SPEC OMP2001 scaling: single socket

Shared resources do not scale linearly
– Group 1 (hardly bandwidth bound): 324, 330, and 332
– Group 2 (significantly bandwidth bound): 310, 314, 326, and 328
– Group 3 (strongly bandwidth bound): 312, 316, 318, and 320

316, 318, 320, 324, 326, 328, 330, and 332 do not scale well with the MCM


SPEC OMP2001 scaling: multiple sockets

318 and 320 scale poorly on both systems

324, 326, 328, and 332 don't scale well with multiple sockets on the AMD system


Overall parallel efficiency

                Opteron 6172                                 Xeon X7560
benchmark       1 die/    1 socket/  4 sockets/  total       1 socket/  4 sockets/  total
                1 core    1 core     1 socket                1 core     1 socket
310.wupwise     0.66      0.64       0.84        0.54        0.73       0.71        0.52
312.swim        0.32      0.32       0.85        0.27        0.36       0.83        0.30
314.mgrid       0.71      0.69       0.76        0.53        0.85       0.81        0.69
316.applu       0.58      0.42       0.83        0.35        0.64       1.47        0.93
318.galgel      0.50      0.40       0.29        0.12        0.43       0.40        0.17
320.equake      0.49      0.39       0.42        0.16        0.54       0.42        0.22
324.apsi        0.89      0.80       0.74        0.59        0.93       0.85        0.79
326.gafort      0.74      0.66       0.54        0.36        0.85       0.89        0.76
328.fma3d       0.77      0.72       0.64        0.46        0.79       0.86        0.68
330.art         0.88      0.83       0.72        0.62        0.99       0.87        0.86
332.ammp        0.89      0.81       0.65        0.53        0.95       0.75        0.71
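Reading the table: the "total" column is, up to rounding, the product of the per-socket and cross-socket efficiencies. This is an observation about the numbers, not a statement from the slide; for 310.wupwise on the Xeon X7560, for example:

\[
E_{\mathrm{total}} \approx E_{\text{1 socket / 1 core}} \times E_{\text{4 sockets / 1 socket}}
\approx 0.73 \times 0.71 \approx 0.52
\]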

Overall parallel efficiency

                Opteron 6172                                 Xeon X7560
                1 die/    1 socket/  4 sockets/  total       1 socket/  4 sockets/  total
                1 core    1 core     1 socket                1 core     1 socket
avg. group 1    0.89      0.81       0.72        0.58        0.95       0.83        0.79
avg. group 2    0.72      0.68       0.70        0.47        0.81       0.82        0.66
312.swim        0.32      0.32       0.85        0.27        0.36       0.83        0.30

Better scaling on the monolithic Xeon die than on the Opteron MCM
– L3 limits scaling on the 6-core AMD die
– Low HT bandwidth between the dies limits scaling on the MCM

Also better scaling with the number of processors in the Intel system
– Scaling in the AMD system limited by the low bandwidth of the half wide HT links
– HT Assist is not enough to compensate for the low HT bandwidth

Similar behavior for memory bound applications


Performance comparison

40% higher peak compute performance of the AMD system is not reflected by the results

Similar performance for a single socket

Better performance on the Intel system if all processors are used
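The 40% figure matches a back-of-the-envelope peak estimate if one assumes 4 double precision floating point operations per core and cycle on both microarchitectures (an assumption, not stated on the slide):

% Peak DP estimate, assuming 4 FLOP per core and cycle on both systems:
\begin{align*}
P_{\text{Opteron 6172}} &\approx 48 \times 2.1\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 403\,\mathrm{GFLOPS} \\
P_{\text{Xeon X7560}}   &\approx 32 \times 2.266\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 290\,\mathrm{GFLOPS} \\
403 / 290 &\approx 1.39
\end{align*}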


Conclusions

Scaling of applications strongly influenced by hardware properties
– Mixture of effects from different components
– Different behavior for scaling with cores and scaling with sockets

Poor scaling on multicore processors is not necessarily a software problem
– Often limited by shared resources
– Potential to conserve energy by using fewer cores

NUMA systems require high interconnect bandwidths
– Low HT bandwidths in the AMD system severely limit scalability
– HT Assist does not reduce coherency traffic enough to compensate for the bandwidth disadvantage


Thank you

This work has been funded by the German Federal Ministry of Education and Research within the eeClust project.

www.eeclust.de

