Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines
Jie Tao (1), Karl Fuerlinger (2), Holger Marten (1)
[email protected]  [email protected]  [email protected]
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany
Outline
Introduction – Virtualization and the impact on performance
Experimental Setup – NAS parallel benchmarks, SPEC OpenMP, microbenchmarks
Study of SP (NAS Parallel Benchmarks) – Initial performance – Analysis using ompP – Optimization results and microbenchmark study
Conclusions
Virtualization
Running multiple OSs on the same hardware
[Figure: a host machine running VMs 1-4, each with its own guest OS and applications, on top of a hypervisor layer above the hardware]
Concepts – Hypervisor (Xen, KVM, VMware) – Full virtualization vs. para-virtualization
Adopted for – Server consolidation – Cloud computing: on-demand resource provisioning
Performance impact
Performance Impact of Virtualization
Has been studied before, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
Here: the performance impact of virtualization on OpenMP applications
Experimental Setup
Benchmarks – NAS OpenMP (size A) – SPEC OpenMP (reference dataset) – EPCC OpenMP Microbenchmarks
Host machine – AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core – Scientific Linux – Virtualized with Xen
Virtual machines
– Hypervisor: Xen
– OS: Debian (Linux kernel 2.6.26)
– Compiler: gcc 4.3.2
– #cores: 1-8
– Memory: 4 GB
NAS Parallel Benchmarks
NAS Parallel Benchmarks (2)
SPEC OpenMP Benchmarks
SPEC OpenMP Benchmarks (2)
Execution time of NAS SP
What is going on here?
OpenMP Performance Analysis with ompP
ompP: OpenMP profiling tool – Based on source code instrumentation – Independent of the compiler and runtime used – Supports HW counters through PAPI – Uses source code instrumenter Opari from the KOJAK/Scalasca toolset – Available for download (GPL): http://www.ompp-tool.com
Automatic instrumentation of OpenMP constructs, plus manual instrumentation of user regions (see the sketch below)
[Workflow: source code → instrumentation with Opari and linking with the ompP library → executable → execution on the parallel machine, controlled by settings in environment variables (HW counters, output format, …) → profiling report]
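As a sketch of the manual region instrumentation mentioned above (assuming Opari's POMP user-region pragma syntax, which ompP picks up; the region name "init" and the loop are hypothetical), a user-defined region can be marked like this:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      double sum = 0.0;

      /* POMP runtime initialization for user regions (Opari directive;
         assumed syntax, harmless if the instrumenter ignores it). */
      #pragma pomp inst init

      /* User-defined region; it appears in the ompP report under "init". */
      #pragma pomp inst begin(init)
      for (int i = 0; i < 1000000; i++)
          sum += i * 0.5;
      #pragma pomp inst end(init)

      /* OpenMP constructs are instrumented automatically by Opari. */
      #pragma omp parallel
      printf("thread %d sees sum = %f\n", omp_get_thread_num(), sum);

      return 0;
  }

The program is then typically built by prefixing the usual compile command with ompP's instrumentation wrapper (kinst-ompp), which runs Opari and links the ompP library.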
Source to Source Instrumentation with Opari
Preprocessor instrumentation
– Example: instrumenting OpenMP constructs with Opari
– Preprocessor operation: original source code → preprocessor → modified (instrumented) source code
– Example: instrumentation of a parallel region

  POMP_Parallel_fork [master]
  #pragma omp parallel
  {
    POMP_Parallel_begin [team]
    /* user code in parallel region */
    POMP_Barrier_enter [team]
    #pragma omp barrier
    POMP_Barrier_exit [team]
    POMP_Parallel_end [team]
  }
  POMP_Parallel_join [master]

  (The POMP_* calls are the instrumentation added by Opari.)
ompP’s Profiling Data
Example code section and performance profile:
Code:

  #pragma omp parallel
  {
    #pragma omp critical
    {
      sleep(1.0);
    }
  }

Profile:

  R00002  main.c (34-37)  (default)  CRITICAL
  TID    execT   execC   bodyT   enterT   exitT
    0     3.00       1    1.00     2.00    0.00
    1     1.00       1    1.00     0.00    0.00
    2     2.00       1    1.00     1.00    0.00
    3     4.00       1    1.00     3.00    0.00
  SUM    10.01       4    4.00     6.00    0.00
Components: – Source code location and type of region – Timing data and execution counts, depending on the particular construct – One line per thread, last line sums over all threads – Hardware counter data (if PAPI is available and HW counters are selected) – Data is “exact” (measured, not based on sampling)
ompP Overhead Analysis (1)
Certain timing categories reported by ompP can be classified as overheads: – Example: enterT in a critical section: Threads wait to enter the critical section (synchronization overhead).
Four overhead categories are defined in ompP (a small example follows this list): – Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region – Synchronization: overhead that arises because threads have to synchronize their activity, e.g. a barrier call – Limited parallelism: idle threads due to insufficient parallelism being exposed by the program – Thread management: overhead for the creation and destruction of threads, and for signaling that critical sections or locks have become available
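To make the imbalance category concrete, here is a minimal hypothetical example (not taken from the benchmarks): the work grows with i, so under a static schedule the threads owning the low iteration ranges finish early and wait at the implicit barrier; ompP reports that waiting time as exitBarT and classifies it as imbalance overhead. Compile with -fopenmp -lm.

  #include <math.h>
  #include <stdio.h>

  #define N 20000

  int main(void)
  {
      double sum = 0.0;

      /* Triangular workload: iteration i does i units of work. With
         schedule(static), threads owning small i ranges idle at the barrier. */
      #pragma omp parallel for schedule(static) reduction(+:sum)
      for (int i = 0; i < N; i++) {
          for (int j = 0; j < i; j++)
              sum += sin((double)i) * cos((double)j);
      }

      printf("sum = %f\n", sum);
      return 0;
  }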
ompP Overhead Analysis (2)
S: Synchronization overhead
I: Imbalance overhead
M: Thread management overhead
L: Limited Parallelism overhead
Overhead Analysis for the NAS Benchmarks

             Total      Overhead (%)        Synch     Imbal      Limpar   Mgmt
  BT-host    1253.71      81.23 ( 6.48)      0.00     80.87       0.00    0.36
  BT-full    1294.55     148.48 (11.47)      0.00    148.47       0.00    0.01
  BT-para    1400.50     163.66 (11.65)      0.00    163.64       0.00    0.02
  FT-host      72.27      25.62 (35.44)      0.01      1.06      24.43    0.12
  FT-full      75.02      25.97 (34.53)      0.01      1.04      24.85    0.07
  FT-para      88.67      32.22 (36.34)      0.00      6.45      25.73    0.04
  CG-host      14.36       1.55 ( 8.95)      0.00      0.95       0.19    0.41
  CG-full      17.64       4.87 (23.59)      0.00      3.46       1.37    0.04
  CG-para      24.05       6.37 (26.49)      0.00      5.27       1.08    0.02
  EP-host      92.27       1.08 ( 1.17)      0.00      0.93       0.00    0.15
  EP-full      89.66       1.24 ( 1.37)      0.00      0.75       0.00    0.49
  EP-para     133.76      29.60 (22.13)      0.00     29.32       0.00    0.27
  SP-host    4994.76    1652.66 (33.03)      0.11   1651.95       0.00    0.60
  SP-full   16466.47   14315.84 (86.89)      1.45  14314.36       0.00    0.03
  SP-para    6816.17    5302.04 (77.68)      2.74   5299.29       0.00    0.01
OpenMP Constructs in the NAS Parallel Benchmarks

        Parallel   Loop   Single   Barrier   Critical   Master
  BT        2       54       0        0         0          2
  FT        2        6       5        1         1          1
  CG        2       22      12        0         0          2
  EP        1        1       0        0         1          1
  SP        2       69       0        3         0          2
ompP Profile for SP

ompP profiling report for sp.c (lines 898-906), para-virtualized; the last column shows exitBarT on the native host for comparison:

  TID      execT      execC     bodyT   exitBarT   exitBarT (native host)
    0     310.60    1541444     11.24     289.41     38.92
    1     310.50    1541444     11.22     289.35     38.91
    2     310.44    1541444     11.30     289.12     37.11
    3     310.26    1541444     11.22     289.14     38.03
    4     310.85    1541444     11.26     289.68     38.77
    5     310.82    1541444     11.24     289.62     35.47
    6     311.10    1541444     11.17     289.99     38.85
    7     311.14    1541444     10.92     290.48     39.35
  SUM    2485.71   12331552     89.60    2316.76    305.41
exitBarT in Parallel Loops

Opari transforms the implicit barrier of a worksharing loop into an explicit barrier, so the waiting time can be measured (sketched below)
Worst-case load imbalance scenario (timeline figure: Loop_enter … Barrier_enter / Barrier_exit / Loop_exit): if thread i spends t seconds in the loop body after the other threads have already reached the barrier, each of them accumulates up to t seconds of exitBarT. Thread i can therefore induce at most t seconds of exitBarT in every other thread, so a thread's total exitBarT is bounded by the combined body time of the other threads.
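A sketch of the transformation (illustrative; the real Opari output inserts POMP_* runtime calls with region descriptors, shown here only as comments): the implicit barrier of the worksharing loop is suppressed with nowait and replaced by an explicit, instrumented barrier, so the waiting time becomes measurable as exitBarT.

  #include <stdio.h>

  #define N 1000

  int main(void)
  {
      static double a[N];

      #pragma omp parallel
      {
          /* POMP_For_enter(...)    -- added by Opari before the loop   */
          #pragma omp for nowait              /* implicit barrier suppressed  */
          for (int i = 0; i < N; i++)
              a[i] = i * 0.5;

          /* POMP_Barrier_enter(...) -- time from here ...              */
          #pragma omp barrier                 /* explicit, measurable barrier */
          /* POMP_Barrier_exit(...)  -- ... to here is exitBarT         */
          /* POMP_For_exit(...) */
      }

      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
  }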
Looking again at the SP profile above: each thread spends only ~11 s in the loop body (bodyT) but ~289 s in the exit barrier (exitBarT).
exitBarT should be at most about 80 seconds per thread: in the worst case a thread waits for the combined body time of the other seven threads (≈ 89.6 − 11 s).
Instead, almost 290 seconds per thread are spent there: a barrier that takes a really long time
Optimization
Move the parallelization to the outermost loop (sketch below)
for (j = 1; j
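A minimal sketch of the pattern (hypothetical loop nest and array names, not the actual SP source): instead of opening a worksharing construct on an inner loop, the parallelization is applied to the outermost loop, so threads fork, join and synchronize once per sweep rather than once per outer iteration. This also cuts the number of barrier entries, which the SP profile above showed to be in the millions (execC ≈ 1.5 million per thread).

  #include <stdio.h>

  #define NK 64
  #define NJ 64
  #define NI 64

  static double u[NK][NJ][NI];

  int main(void)
  {
      /* Before (costly): for (k ...) { #pragma omp parallel for
       *                                for (j ...) ... }
       * forks/joins and hits a barrier NK times per sweep.               */

      /* After: one worksharing loop (and one implicit barrier) per sweep. */
      #pragma omp parallel for
      for (int k = 0; k < NK; k++)
          for (int j = 0; j < NJ; j++)
              for (int i = 0; i < NI; i++)
                  u[k][j][i] = k + 0.5 * j + 0.25 * i;

      printf("u[1][1][1] = %f\n", u[1][1][1]);
      return 0;
  }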