Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines
Jie Tao (1), Karl Fuerlinger (2), Holger Marten (1)
[email protected]  [email protected]  [email protected]
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany
Outline
Introduction – Virtualization and the impact on performance
Experimental Setup – NAS parallel benchmarks, SPEC OpenMP, microbenchmarks
Study of SP (NAS Parallel Benchmarks) – Initial performance – Analysis using ompP – Optimization results and microbenchmark study
Conclusions
Virtualization
Running multiple OSs on the same hardware
[Figure: a host machine running VMs 1-4, each with its own guest OS and applications, on top of a hypervisor layer above the hardware]
Concepts – Hypervisor (Xen, KVM, VMware) – Full virtualization vs. para-virtualization
Adopted for – Server consolidation – Cloud computing: on-demand resource provisioning
Performance impact
Performance Impact of Virtualization
Has been studied before, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
Here: the performance impact of virtualization on OpenMP applications
Experimental Setup
Benchmarks – NAS OpenMP (size A) – SPEC OpenMP (reference dataset) – EPCC OpenMP Microbenchmarks
Host machine – AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core – Scientific Linux – Virtualized with Xen
Virtual machines
– Hypervisor: Xen
– OS: Debian (Linux kernel 2.6.26)
– Compiler: gcc 4.3.2
– #cores: 1-8
– Memory: 4 GB
NAS Parallel Benchmarks
NAS Parallel Benchmarks (2)
SPEC OpenMP Benchmarks
SPEC OpenMP Benchmarks (2)
Execution time of NAS SP
What is going on here?
OpenMP Performance Analysis with ompP
ompP: OpenMP profiling tool – Based on source code instrumentation – Independent of the compiler and runtime used – Supports HW counters through PAPI – Uses source code instrumenter Opari from the KOJAK/Scalasca toolset – Available for download (GPL): http://www.ompp-tool.com
Automatic instrumentation of OpenMP constructs, plus manual instrumentation of user regions (see the sketch below)
[Workflow: source code → instrumentation with Opari and linking with the ompP library → executable → execution on the parallel machine, controlled by settings in environment variables (HW counters, output format, …) → profiling report]
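As a sketch of the manual region instrumentation mentioned above (assuming Opari's POMP user-region pragma syntax, which ompP picks up; the region name "init" and the loop are hypothetical), a user-defined region can be marked like this:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      double sum = 0.0;

      /* POMP runtime initialization for user regions (Opari directive;
         assumed syntax, harmless if the instrumenter ignores it). */
      #pragma pomp inst init

      /* User-defined region; it appears in the ompP report under "init". */
      #pragma pomp inst begin(init)
      for (int i = 0; i < 1000000; i++)
          sum += i * 0.5;
      #pragma pomp inst end(init)

      /* OpenMP constructs are instrumented automatically by Opari. */
      #pragma omp parallel
      printf("thread %d sees sum = %f\n", omp_get_thread_num(), sum);

      return 0;
  }

The program is then typically built by prefixing the usual compile command with ompP's instrumentation wrapper (kinst-ompp), which runs Opari and links the ompP library.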
Source to Source Instrumentation with Opari
Preprocessor instrumentation
– Example: instrumenting OpenMP constructs with Opari
– Preprocessor operation: original source code → preprocessor → modified (instrumented) source code
– Example: instrumentation of a parallel region

  POMP_Parallel_fork [master]
  #pragma omp parallel
  {
    POMP_Parallel_begin [team]
    /* user code in parallel region */
    POMP_Barrier_enter [team]
    #pragma omp barrier
    POMP_Barrier_exit [team]
    POMP_Parallel_end [team]
  }
  POMP_Parallel_join [master]

  (The POMP_* calls are the instrumentation added by Opari.)
ompP’s Profiling Data
Example code section and performance profile:
Code:

  #pragma omp parallel
  {
    #pragma omp critical
    {
      sleep(1.0);
    }
  }

Profile:

  R00002  main.c (34-37)  (default)  CRITICAL
  TID    execT   execC   bodyT   enterT   exitT
    0     3.00       1    1.00     2.00    0.00
    1     1.00       1    1.00     0.00    0.00
    2     2.00       1    1.00     1.00    0.00
    3     4.00       1    1.00     3.00    0.00
  SUM    10.01       4    4.00     6.00    0.00
Components: – Source code location and type of region – Timing data and execution counts, depending on the particular construct – One line per thread, last line sums over all threads – Hardware counter data (if PAPI is available and HW counters are selected) – Data is “exact” (measured, not based on sampling)
ompP Overhead Analysis (1)
Certain timing categories reported by ompP can be classified as overheads: – Example: enterT in a critical section: Threads wait to enter the critical section (synchronization overhead).
Four overhead categories are defined in ompP (a small example follows this list): – Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region – Synchronization: overhead that arises because threads have to synchronize their activity, e.g. a barrier call – Limited parallelism: idle threads due to insufficient parallelism being exposed by the program – Thread management: overhead for the creation and destruction of threads, and for signaling that critical sections or locks have become available
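To make the imbalance category concrete, here is a minimal hypothetical example (not taken from the benchmarks): the work grows with i, so under a static schedule the threads owning the low iteration ranges finish early and wait at the implicit barrier; ompP reports that waiting time as exitBarT and classifies it as imbalance overhead. Compile with -fopenmp -lm.

  #include <math.h>
  #include <stdio.h>

  #define N 20000

  int main(void)
  {
      double sum = 0.0;

      /* Triangular workload: iteration i does i units of work. With
         schedule(static), threads owning small i ranges idle at the barrier. */
      #pragma omp parallel for schedule(static) reduction(+:sum)
      for (int i = 0; i < N; i++) {
          for (int j = 0; j < i; j++)
              sum += sin((double)i) * cos((double)j);
      }

      printf("sum = %f\n", sum);
      return 0;
  }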
ompP Overhead Analysis (2)
S: Synchronization overhead
I: Imbalance overhead
M: Thread management overhead
L: Limited Parallelism overhead
Overhead Analysis for the NAS Benchmarks

             Total      Overhead (%)        Synch     Imbal      Limpar   Mgmt
  BT-host    1253.71      81.23 ( 6.48)      0.00     80.87       0.00    0.36
  BT-full    1294.55     148.48 (11.47)      0.00    148.47       0.00    0.01
  BT-para    1400.50     163.66 (11.65)      0.00    163.64       0.00    0.02
  FT-host      72.27      25.62 (35.44)      0.01      1.06      24.43    0.12
  FT-full      75.02      25.97 (34.53)      0.01      1.04      24.85    0.07
  FT-para      88.67      32.22 (36.34)      0.00      6.45      25.73    0.04
  CG-host      14.36       1.55 ( 8.95)      0.00      0.95       0.19    0.41
  CG-full      17.64       4.87 (23.59)      0.00      3.46       1.37    0.04
  CG-para      24.05       6.37 (26.49)      0.00      5.27       1.08    0.02
  EP-host      92.27       1.08 ( 1.17)      0.00      0.93       0.00    0.15
  EP-full      89.66       1.24 ( 1.37)      0.00      0.75       0.00    0.49
  EP-para     133.76      29.60 (22.13)      0.00     29.32       0.00    0.27
  SP-host    4994.76    1652.66 (33.03)      0.11   1651.95       0.00    0.60
  SP-full   16466.47   14315.84 (86.89)      1.45  14314.36       0.00    0.03
  SP-para    6816.17    5302.04 (77.68)      2.74   5299.29       0.00    0.01
OpenMP Constructs in the NAS Parallel Benchmarks

        Parallel   Loop   Single   Barrier   Critical   Master
  BT        2       54       0        0         0          2
  FT        2        6       5        1         1          1
  CG        2       22      12        0         0          2
  EP        1        1       0        0         1          1
  SP        2       69       0        3         0          2
ompP Profile for SP

ompP profiling report for sp.c (lines 898-906), para-virtualized; the last column shows exitBarT on the native host for comparison:

  TID      execT      execC     bodyT   exitBarT   exitBarT (native host)
    0     310.60    1541444     11.24     289.41     38.92
    1     310.50    1541444     11.22     289.35     38.91
    2     310.44    1541444     11.30     289.12     37.11
    3     310.26    1541444     11.22     289.14     38.03
    4     310.85    1541444     11.26     289.68     38.77
    5     310.82    1541444     11.24     289.62     35.47
    6     311.10    1541444     11.17     289.99     38.85
    7     311.14    1541444     10.92     290.48     39.35
  SUM    2485.71   12331552     89.60    2316.76    305.41
exitBarT in Parallel Loops

Opari transforms the implicit barrier of a worksharing loop into an explicit barrier, so the waiting time can be measured (sketched below)
Worst-case load imbalance scenario (timeline figure: Loop_enter … Barrier_enter / Barrier_exit / Loop_exit): if thread i spends t seconds in the loop body after the other threads have already reached the barrier, each of them accumulates up to t seconds of exitBarT. Thread i can therefore induce at most t seconds of exitBarT in every other thread, so a thread's total exitBarT is bounded by the combined body time of the other threads.
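A sketch of the transformation (illustrative; the real Opari output inserts POMP_* runtime calls with region descriptors, shown here only as comments): the implicit barrier of the worksharing loop is suppressed with nowait and replaced by an explicit, instrumented barrier, so the waiting time becomes measurable as exitBarT.

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      static double a[N];

      #pragma omp parallel
      {
          /* POMP_For_enter(...)    -- added by Opari before the loop   */
          #pragma omp for nowait              /* implicit barrier suppressed  */
          for (int i = 0; i < N; i++)
              a[i] = i * 0.5;

          /* POMP_Barrier_enter(...) -- time from here ...              */
          #pragma omp barrier                 /* explicit, measurable barrier */
          /* POMP_Barrier_exit(...)  -- ... to here is exitBarT         */
          /* POMP_For_exit(...) */
      }

      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
  }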
Looking again at the SP profile above: each thread spends only ~11 s in the loop body (bodyT) but ~289 s in the exit barrier (exitBarT).
exitBarT should be at most about 80 seconds per thread: in the worst case a thread waits for the combined body time of the other seven threads (≈ 89.6 − 11 s).
Instead, almost 290 seconds per thread are spent there: a barrier that takes a really long time
Optimization
Move the parallelization to the outermost loop (sketch below)
for (j = 1; j
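A minimal sketch of the pattern (hypothetical loop nest and array names, not the actual SP source): instead of opening a worksharing construct on an inner loop, the parallelization is applied to the outermost loop, so threads fork, join and synchronize once per sweep rather than once per outer iteration. This also cuts the number of barrier entries, which the SP profile above showed to be in the millions (execC ≈ 1.5 million per thread).

  #include <stdio.h>

  #define NK 64
  #define NJ 64
  #define NI 64

  static double u[NK][NJ][NI];

  int main(void)
  {
      /* Before (costly): for (k ...) { #pragma omp parallel for
       *                                for (j ...) ... }
       * forks/joins and hits a barrier NK times per sweep.               */

      /* After: one worksharing loop (and one implicit barrier) per sweep. */
      #pragma omp parallel for
      for (int k = 0; k < NK; k++)
          for (int j = 0; j < NJ; j++)
              for (int i = 0; i < NI; i++)
                  u[k][j][i] = k + 0.5 * j + 0.25 * i;

      printf("u[1][1][1] = %f\n", u[1][1][1]);
      return 0;
  }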