Efficient Multiprogramming for Multicores with SCAF
Timothy Creech, Aparna Kotha, Rajeev Barua
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD

Introduction
! Multiprogramming serial programs on interactive systems is easy: time-sharing works well for serial processes
! However, hardware and programs are becoming increasingly parallel, and fine-grained time-sharing does not work well for multiprogramming parallel processes
! SCAF is a runtime system that implements multiprogramming as space-sharing, which frees the OS from needing to time-share hardware contexts
! SCAF reasons continuously at runtime about processes' parallel efficiencies in order to improve system efficiency

Techniques Used
Serial “experiments” at runtime
! estimate serial performance without serialization or offline profiling
! serial performance is then compared to observed parallel performance to reason about achieved efficiency
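The idea can be sketched as follows (a toy model with invented names, not SCAF's actual implementation): the work rate measured by a serial training clone is compared against the real process's observed parallel rate to compute achieved efficiency.

```python
# Hypothetical sketch of the serial-experiment idea, not SCAF's actual code.
# A lightweight training clone runs one parallel section serially, yielding
# an estimate of the serial work rate without offline profiling and without
# ever serializing the real process.

def parallel_efficiency(serial_rate, parallel_rate, threads):
    """Achieved efficiency: estimated speedup per hardware context.

    serial_rate:   work/sec measured by the serial training clone
    parallel_rate: work/sec observed for the real parallel process
    threads:       hardware contexts currently allocated to the process
    """
    speedup = parallel_rate / serial_rate
    return speedup / threads

# Clone does 100 units/s serially; the process does 600 units/s on 8 threads,
# i.e. a speedup of 6 on 8 contexts.
print(parallel_efficiency(100.0, 600.0, 8))
```

Here the process achieves efficiency 0.75; a metric of this kind plays roughly the role of the per-client performance metric E in the poster's diagrams.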
[Figure: SCAF training feedback loop. The control thread forks a training process and worker threads are spawned; the training clone runs serially while the real threads execute in parallel, and the observed performance metric E_foo = f1(p_foo) closes the feedback loop.]

Why is Multiprogramming Parallel Processes Hard?
! parallel runtimes assume the entire machine is available to each process, resulting in many more threads than hardware contexts
! operating systems simply attempt to schedule all of the threads onto hardware contexts in an uncoordinated fashion
! multiprogramming parallel programs is needlessly complicated for users
Dynamic Allocation
! malleability allows allocations to vary at runtime based on observed efficiency
! more efficient applications are rewarded with more threads
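A minimal sketch of efficiency-driven malleability (an invented proportional policy for illustration; SCAF's actual allocation algorithm is not shown on the poster):

```python
# Toy sketch of malleable, efficiency-driven reallocation (invented policy,
# not SCAF's exact algorithm): each epoch, every process's allocation is
# weighted by its observed parallel efficiency, so efficient processes are
# rewarded with more threads while the total never exceeds the machine.

def reallocate(efficiencies, total_contexts):
    """Split total_contexts among processes in proportion to efficiency."""
    total_eff = sum(efficiencies.values())
    # Give each process at least one context; distribute the rest by weight.
    alloc = {p: 1 for p in efficiencies}
    spare = total_contexts - len(alloc)
    for p, e in efficiencies.items():
        alloc[p] += int(spare * e / total_eff)
    return alloc

# A well-scaling process (efficiency 0.9) vs. a poorly scaling one (0.3)
# sharing 16 hardware contexts:
print(reallocate({"foo": 0.9, "baz": 0.3}, 16))
```

The efficient process "foo" ends up with most of the contexts; rerunning this each epoch with fresh efficiency measurements gives the dynamic behavior described above.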
[Figure: SCAF system architecture. Clients foo and baz have their syscalls intercepted; each reports a performance metric (E_foo, E_baz) to the central daemon scafd, which computes resource allocations p_foo = f2(E_foo) and p_baz from the N available resources. Training processes run serially under the OS's time-sharing scheduler until training ends.]
Results: Minimal Overhead
! SunOS 5.10 on UltraSparc T2
! Linux on Dual Xeon E5410
Results: Selected 3-Way Multiprogramming Example
! Multiprogramming 3 NAS benchmarks: CG, SP, and LU
! Process allocations over time with SCAF
[Figure: thread allocations of cg.C.x, sp.B.x, and lu.B.x over time (seconds) under SCAF.]
! Summary of results:

Configuration     Process  Runtime
Unmodified        CG       435.9s
                  SP       474.6s
                  LU       507.3s
Equipartitioning  CG       374.0s
                  SP       380.8s
                  LU       349.8s
SCAF              CG       172.2s
                  SP       374.0s
                  LU       424.0s
SCAF Runtime Features
Results: NAS Parallel Benchmarks
! 2-way multiprogramming across all NAS benchmarks on 2 platforms
[Figure: speedup of each process in each 2-way benchmark pairing (bt.B+lu.B, cg.C+bt.B, ..., ua.B+mg.C, plus Average and Geo.Mean) on the Dual Xeon E5410 and on the T2, comparing Unmodified (Unmod 1/2), equipartitioned (EQ 1/2), and SCAF (SCAF 1/2) runs.]
! Controls multiprogrammed parallel processes' allocations to effect space-sharing
! Allocations vary at runtime in order to improve system efficiency
! Requires no offline profiling or measurements
! Requires no program porting, modification, or recompilation
! Processes supported via modified runtime libraries (e.g. libgomp, libiomp5, etc.)
! Ports available:
  ! x86 Linux: GNU OpenMP
  ! Sparc SunOS: GNU OpenMP
! Ports in progress:
  ! x86 Linux: Intel OpenMP
  ! Tile-Gx Linux: GNU OpenMP
  ! Xeon Phi Linux: GNU OpenMP, Intel OpenMP
! 70% of pairs see improvements over equipartitioning
! 15% average improvement
[Figure: hardware contexts occupied by threads over time under thread-based time-sharing vs. dynamic space-sharing.]
Time-sharing vs Space-sharing
! To avoid oversubscription and prevent the OS thread scheduler from time-sharing hardware contexts, we can implement space-sharing
! How many hardware contexts does each process get?
! equipartitioning gives each process an equal number of contexts, but this is wasteful if a process scales poorly
! SCAF reasons about per-process parallel efficiency in order to optimize the speedup gained by each hardware context
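To see why equipartitioning can waste contexts, consider a toy model (the p**alpha speedup curve and the exhaustive search below are illustrative assumptions, not SCAF's actual algorithm):

```python
# Toy comparison of equipartitioning vs. efficiency-aware space-sharing
# (invented model, not SCAF's implementation). Speedup curves follow a
# simple diminishing-returns law: speedup(p) = p**alpha, where alpha = 1
# means perfect scaling and small alpha means poor scaling.

def speedup(p, alpha):
    return p ** alpha

def total_speedup(alloc, alphas):
    # System-wide benefit: sum of each process's speedup at its allocation.
    return sum(speedup(p, a) for p, a in zip(alloc, alphas))

alphas = [0.95, 0.4]        # one good scaler, one poor scaler
N = 16                      # hardware contexts to divide between them

equi = [N // 2, N // 2]     # equipartitioning: 8 contexts each

# Efficiency-aware: search every split for the one maximizing total speedup
best = max(((p, N - p) for p in range(1, N)),
           key=lambda alloc: total_speedup(alloc, alphas))

print(equi, round(total_speedup(equi, alphas), 1))
print(best, round(total_speedup(best, alphas), 1))
```

Under this model the search gives nearly all contexts to the good scaler, and the total speedup clearly beats equipartitioning, illustrating why reasoning about per-process efficiency pays off.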
[Figure: standalone speedup of each NAS benchmark (bt.B.x, cg.B.x, ep.B.x, ft.B.x, lu.B.x, mg.B.x, sp.B.x, ua.B.x) when run alone with and without the SCAF runtime, on both platforms.]
! Space-sharing the machine instead:

“lu.B.x” instances  Machine subscription  Threads/process  Run time (s)  Speedup
1                   100%                  64               180           24.2
2                   100%                  32               312           28.6
3                   100%                  21               382           35.1
! Massive oversubscription can lead to terrible system efficiency
! instead of time-sharing 64 × N threads, we would like to space-share
! Time-sharing N instances:

“lu.B.x” instances  Machine subscription  Threads/process  Run time (s)  Speedup
1                   100%                  64               180           24.2
2                   200%                  64               3522          2.5
3                   300%                  64               8770          1.5
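The aggregate speedups in the time-sharing rows can be sanity-checked with simple arithmetic: using the single-instance run (180 s at speedup 24.2) to infer the serial time, k time-shared instances complete k instances' worth of serial work in the measured run time.

```python
# Sanity-checking the time-sharing table's aggregate speedups. The serial
# run time is inferred from the single-instance row: one instance runs in
# 180 s at a reported speedup of 24.2, so t_serial is about 24.2 * 180 s.

t_serial = 24.2 * 180          # ~4356 s of serial work per instance

def aggregate_speedup(instances, run_time):
    # k instances complete k * t_serial of serial work in run_time seconds
    return instances * t_serial / run_time

print(round(aggregate_speedup(2, 3522), 1))   # 2 time-shared instances
print(round(aggregate_speedup(3, 8770), 1))   # 3 time-shared instances
```

This reproduces the reported aggregate speedups of 2.5 and 1.5 for 2 and 3 oversubscribed instances.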
! NPB OpenMP Benchmark “lu.B.x”
! SunOS 5.11, Sparc T2
! 64 threads per CPU
! Say multiple users are running the “lu.B.x” program on this SMP system
! Each instance of “lu.B.x” runs on 64 threads
! How does time-sharing handle this?
! Extra logic required to implement SCAF does not adversely affect performance when processes run alone
! Results below compare executables running with and without the SCAF runtime in place
Motivating Example
! 57% of pairs see improvements over equipartitioning
! 15% average improvement
This work was supported by a NASA Office of the Chief Technologist’s Space Technology Research Fellowship.
MICRO-46 December 7-11, 2013, Davis, CA, USA
Contact:
[email protected] · http://ece.umd.edu/~tcreech