1: Steinbuch Center for Computing,. Karlsruhe ... HPC Applications on the Amazon Web Services Cloud“ ... Virtual machines ... their activity, e.g. barrier call.
Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines

Jie Tao (1), Karl Fuerlinger (2), Holger Marten (1)


1: Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
2: MNM-Team, Department of Computer Science, LMU München, Germany

Outline

 Introduction
  – Virtualization and the impact on performance
 Experimental Setup
  – NAS parallel benchmarks, SPEC OpenMP, microbenchmarks
 Study of SP (NAS Parallel Benchmarks)
  – Initial performance
  – Analysis using ompP
  – Optimization results and microbenchmark study
 Conclusions


Virtualization

 Running multiple OSs on the same hardware

  [Figure: a host machine running several virtual machines (VM 1-4), each with its own guest OS and applications, on top of a hypervisor that sits directly on the hardware]

 Concepts
  – Hypervisor (xen, KVM, VMware)
  – Full virtualization vs. para-virtualization

 Adopted for
  – Server consolidation
  – Cloud computing: on-demand resource provisioning

 Performance impact?

Performance Impact of Virtualization

 Has been studied before, e.g. Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"

 Here: the performance impact of virtualization on OpenMP applications

Experimental Setup

 Benchmarks
  – NAS OpenMP (size A)
  – SPEC OpenMP (reference dataset)
  – EPCC OpenMP Microbenchmarks

 Host machine
  – AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core
  – Scientific Linux
  – Virtualized with xen

 Virtual machines
  – Hypervisor: xen
  – OS: Debian 2.6.26
  – Compiler: gcc 4.3.2
  – #cores: 1-8
  – Memory: 4 GB

NAS Parallel Benchmarks


NAS Parallel Benchmarks (2)


SPEC OpenMP Benchmarks


SPEC OpenMP Benchmarks (2)


Execution time of NAS SP

What is going on here?

OpenMP Performance Analysis with ompP

 ompP: OpenMP profiling tool
  – Based on source code instrumentation
  – Independent of the compiler and runtime used
  – Supports HW counters through PAPI
  – Uses the source code instrumenter Opari from the KOJAK/Scalasca toolset
  – Available for download (GPL): http://www.ompp-tool.com

 Automatic instrumentation of OpenMP constructs, manual region instrumentation (see the sketch below)

  [Workflow: source code → instrumentation and linking with the ompP library → executable → execution on the parallel machine → profiling report; settings such as HW counters and output format are controlled via environment variables]
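OpenMP constructs are instrumented automatically; in addition, arbitrary user regions can be marked by hand. A minimal sketch, assuming Opari's POMP user-instrumentation directives (the region name init_phase and the loops are illustrative, not taken from the benchmarks):

  #include <stdio.h>

  int main(void)
  {
      double data[1000];

      /* Hypothetical user region "init_phase": Opari rewrites the two
         pomp directives into calls to the measurement library, so ompP
         reports timings for this region as well. A compiler without
         Opari simply ignores the unknown pragmas. */
      #pragma pomp inst begin(init_phase)
      for (int i = 0; i < 1000; i++)
          data[i] = 0.0;
      #pragma pomp inst end(init_phase)

      /* OpenMP constructs are instrumented automatically, no manual
         markers needed. */
      #pragma omp parallel for
      for (int i = 0; i < 1000; i++)
          data[i] = i * 0.5;

      printf("%f\n", data[999]);
      return 0;
  }

Such a file is typically compiled through ompP's compiler wrapper (e.g. kinst-ompp gcc -fopenmp ...), which runs Opari before the actual compiler and links in the ompP library.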


Source to Source Instrumentation with Opari

 Preprocessor instrumentation
  – Example: instrumenting OpenMP constructs with Opari
  – Preprocessor operation: original source code → preprocessor → modified (instrumented) source code

 Example: instrumentation of a parallel region

  Original:
    #pragma omp parallel
    {
      /* user code in parallel region */
    }

  Instrumented:
    POMP_Parallel_fork [master]
    #pragma omp parallel
    {
      POMP_Parallel_begin [team]
      /* user code in parallel region */
      POMP_Barrier_enter [team]
      #pragma omp barrier
      POMP_Barrier_exit [team]
      POMP_Parallel_end [team]
    }
    POMP_Parallel_join [master]

The POMP_* lines are the instrumentation added by Opari.


ompP's Profiling Data

 Example code section and performance profile:

Code:

  #pragma omp parallel
  {
    #pragma omp critical
    {
      sleep(1.0);
    }
  }

Profile:

  R00002 main.c (34-37) (default) CRITICAL
  TID    execT   execC   bodyT   enterT   exitT
  0       3.00   1       1.00    2.00     0.00
  1       1.00   1       1.00    0.00     0.00
  2       2.00   1       1.00    1.00     0.00
  3       4.00   1       1.00    3.00     0.00
  SUM    10.01   4       4.00    6.00     0.00

 Components:
  – Source code location and type of region
  – Timing data and execution counts, depending on the particular construct
  – One line per thread, last line sums over all threads
  – Hardware counter data (if PAPI is available and HW counters are selected)
  – Data is "exact" (measured, not based on sampling)

ompP Overhead Analysis (1)

 Certain timing categories reported by ompP can be classified as overheads:
  – Example: enterT in a critical section: threads wait to enter the critical section (synchronization overhead)

 Four overhead categories are defined in ompP:
  – Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region (see the sketch below)
  – Synchronization: overhead that arises because threads have to synchronize their activity, e.g. a barrier call
  – Limited Parallelism: idle threads due to not enough parallelism being exposed by the program
  – Thread Management: overhead for the creation and destruction of threads, and for signaling critical sections and locks as they become available
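As an illustration of the Imbalance category (a hypothetical example, not taken from the benchmarks): with the default static schedule, a triangular loop assigns far more work to the thread owning the largest iterations, and the time the other threads spend waiting at the loop's implicit exit barrier shows up as imbalance overhead.

  #include <stdio.h>

  int main(void)
  {
      const int n = 4000;
      double sum = 0.0;

      /* Triangular loop: iteration i does O(i) work. With a static
         schedule the thread holding the highest i values works much
         longer than the rest, which idle at the implicit barrier at
         the end of the loop -- ompP attributes that waiting time to
         the Imbalance overhead of this region. */
      #pragma omp parallel for schedule(static) reduction(+:sum)
      for (int i = 0; i < n; i++) {
          for (int j = 0; j < i; j++)
              sum += 1e-9 * i * j;
      }

      printf("sum = %f\n", sum);
      return 0;
  }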

ompP Overhead Analysis (2)

S: Synchronization overhead, I: Imbalance overhead, M: Thread management overhead, L: Limited Parallelism overhead


Overhead Analysis for the NAS Benchmarks

             Total      Overhead (%)       Synch     Imbal    Limpar   Mgmt
  BT-host    1253.71      81.23 (06.48)     0.00     80.87     0.00    0.36
  BT-full    1294.55     148.48 (11.47)     0.00    148.47     0.00    0.01
  BT-para    1400.50     163.66 (11.65)     0.00    163.64     0.00    0.02
  FT-host      72.27      25.62 (35.44)     0.01      1.06    24.43    0.12
  FT-full      75.02      25.97 (34.53)     0.01      1.04    24.85    0.07
  FT-para      88.67      32.22 (36.34)     0.00      6.45    25.73    0.04
  CG-host      14.36       1.55 (08.95)     0.00      0.95     0.19    0.41
  CG-full      17.64       4.87 (23.59)     0.00      3.46     1.37    0.04
  CG-para      24.05       6.37 (26.49)     0.00      5.27     1.08    0.02
  EP-host      92.27       1.08 (01.17)     0.00      0.93     0.00    0.15
  EP-full      89.66       1.24 (01.37)     0.00      0.75     0.00    0.49
  EP-para     133.76      29.60 (22.13)     0.00     29.32     0.00    0.27
  SP-host    4994.76    1652.66 (33.03)     0.11   1651.95     0.00    0.60
  SP-full   16466.47   14315.84 (86.89)     1.45  14314.36     0.00    0.03
  SP-para    6816.17    5302.04 (77.68)     2.74   5299.29     0.00    0.01

(Overhead column: absolute overhead with its percentage of Total in parentheses; -host = native host, -full = fully virtualized, -para = para-virtualized.)


OpenMP Constructs in the NAS Parallel Benchmarks

        Parallel   Loop   Single   Barrier   Critical   Master
  BT       2        54       0        0         0         2
  FT       2         6       5        1         1         1
  CG       2        22      12        0         0         2
  EP       1         1       0        0         1         1
  SP       2        69       0        3         0         2


ompP Profile for SP

ompP Profiling Report for sp.c (lines 898-906) (para-virtualized)

  TID     execT      execC    bodyT   exitBarT   exitBarT (native host)
  0      310.60    1541444    11.24     289.41    38.92
  1      310.50    1541444    11.22     289.35    38.91
  2      310.44    1541444    11.30     289.12    37.11
  3      310.26    1541444    11.22     289.14    38.03
  4      310.85    1541444    11.26     289.68    38.77
  5      310.82    1541444    11.24     289.62    35.47
  6      311.10    1541444    11.17     289.99    38.85
  7      311.14    1541444    10.92     290.48    39.35
  SUM   2485.71   12331552    89.60    2316.76   305.41


exitBarT in Parallel Loops

 Opari transforms the implicit barrier into an explicit barrier (sketch below)
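For a worksharing loop this means, schematically, that the implicit barrier is suppressed with nowait and an instrumented explicit barrier is inserted instead; the time threads wait there is what ompP reports as exitBarT. A minimal sketch with hypothetical stubs standing in for the POMP measurement calls (the real functions live in the measurement library and take region handles):

  #include <stdio.h>

  /* Hypothetical stand-ins for the calls Opari inserts. */
  static void POMP_Barrier_enter(void) { /* record entry timestamp */ }
  static void POMP_Barrier_exit(void)  { /* record exit timestamp  */ }

  static void work(int i) { (void)i; /* placeholder loop body */ }

  int main(void)
  {
      const int n = 1000;

      #pragma omp parallel
      {
          /* Original code: "#pragma omp for" with an implicit barrier.
             Opari rewrites it roughly as below: nowait removes the
             implicit barrier, and an explicit, instrumented barrier
             follows, making the waiting time measurable (exitBarT). */
          #pragma omp for nowait
          for (int i = 0; i < n; i++)
              work(i);

          POMP_Barrier_enter();
          #pragma omp barrier
          POMP_Barrier_exit();
      }

      printf("done\n");
      return 0;
  }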

 Worst case load imbalance scenario:

  [Timeline: thread i enters the loop (Loop_enter) and works for time t while the other threads have already reached Barrier_enter and wait; all threads leave together at Barrier_exit / Loop_exit]

  Worst-case bound: exitBarT ≤ t per other thread, i.e. thread i can induce at most t seconds of exitBarT time in each other thread



 Applying this bound to the SP profile above: bodyT summed over all threads is about 90 seconds, so each thread's exitBarT should be at most roughly 80 seconds. The measured exitBarT is nearly 290 seconds per thread: a barrier that takes a really long time.


Optimization

 Move the parallelization to the outermost loop (sketch below)

for (j = 1; j
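A minimal sketch of the idea with a hypothetical, independent loop nest (not SP's actual code): with the pragma on the inner loop, the worksharing construct and its implicit barrier are entered once per outer iteration; moving it to the outermost loop pays that cost only once. This is only legal when the outer iterations are independent, which has to be verified for the real loop nest.

  #include <stdio.h>

  #define NJ 400
  #define NI 400

  static double a[NJ][NI];

  int main(void)
  {
      /* Before: inner-loop parallelization -- the parallel-for overhead
         (including its implicit barrier) is paid NJ - 1 times. */
      for (int j = 1; j < NJ; j++) {
          #pragma omp parallel for
          for (int i = 0; i < NI; i++)
              a[j][i] = 0.5 * (j + i);
      }

      /* After: outermost-loop parallelization -- one worksharing
         construct, one implicit barrier; each thread handles a block
         of j values. */
      #pragma omp parallel for
      for (int j = 1; j < NJ; j++)
          for (int i = 0; i < NI; i++)
              a[j][i] = 0.5 * (j + i);

      printf("%f\n", a[NJ - 1][NI - 1]);
      return 0;
  }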
