Efficient Multiprogramming for Multicores with SCAF

Timothy Creech, Aparna Kotha, Rajeev Barua
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD

Introduction

- Multiprogramming serial programs on interactive systems is easy: time-sharing works well for serial processes.
- However, hardware and programs are becoming increasingly parallel, and fine-grained time-sharing does not work well for multiprogramming parallel processes.
- SCAF is a runtime system that implements multiprogramming as space-sharing, which frees the OS from needing to time-share hardware contexts.
- SCAF reasons continuously at runtime about processes' parallel efficiencies in order to improve system efficiency.

Techniques Used

- Serial "experiments" at runtime: estimate serial performance without serializing the program or requiring offline profiling.
- The serial estimate is then compared to observed parallel performance to reason about achieved efficiency.
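The comparison step can be sketched in a few lines. This is a minimal illustration, assuming a simple rate-based performance metric; the function name and the "work units per second" are illustrative, not SCAF's actual interface.

```python
# Sketch: comparing a serial-experiment estimate against observed parallel
# performance. Rates are in arbitrary "work units per second"; in SCAF they
# would come from instrumented parallel sections, not from this toy metric.

def parallel_efficiency(serial_rate, parallel_rate, num_contexts):
    """Efficiency = achieved speedup divided by the contexts consumed."""
    speedup = parallel_rate / serial_rate
    return speedup / num_contexts

# Example: the serial experiment measures 1.0 units/s, while the parallel
# run on 32 contexts achieves 24.0 units/s -> speedup 24, efficiency 0.75.
print(parallel_efficiency(1.0, 24.0, 32))  # 0.75
```

An efficiency near 1.0 means each context is pulling its weight; a low value signals that contexts could be more productive elsewhere.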

[Figure: SCAF's runtime feedback loop. A control thread is spawned alongside the worker threads; it forks a training process that briefly runs serially while the parent continues parallel execution. When training ends, the observed performance metric E_foo = f1(p_foo) feeds back into the process's resource allocation.]

Why is Multiprogramming Parallel Processes Hard?

- Parallel runtimes assume the entire machine is available to each process, resulting in many more threads than hardware contexts.
- Operating systems simply attempt to schedule all of these threads onto hardware contexts in an uncoordinated fashion.
- Multiprogramming parallel programs is therefore needlessly complicated for users.

Dynamic Allocation

- Malleability allows allocations to vary at runtime based on observed efficiency.
- More efficient applications are rewarded with more threads.
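The poster does not spell out the allocation policy f2 in closed form. As an illustration of "rewarding efficiency", here is a minimal proportional-share sketch; the proportional rule and all names are assumptions, not SCAF's actual policy.

```python
# Sketch: splitting N hardware contexts among malleable processes in
# proportion to their observed efficiencies. The proportional rule stands
# in for SCAF's real policy, which this sketch does not reproduce.

def allocate(efficiencies, n_contexts):
    total = sum(efficiencies.values())
    # Proportional share, with every process guaranteed one context.
    alloc = {name: max(1, int(n_contexts * e / total))
             for name, e in efficiencies.items()}
    # Contexts lost to integer rounding go to the most efficient processes.
    leftover = n_contexts - sum(alloc.values())
    for name in sorted(efficiencies, key=efficiencies.get, reverse=True):
        if leftover <= 0:
            break
        alloc[name] += 1
        leftover -= 1
    return alloc

# A process running at 75% efficiency gets three times the contexts of one
# running at 25%.
print(allocate({"foo": 0.75, "baz": 0.25}, 64))  # {'foo': 48, 'baz': 16}
```

Because allocations are recomputed as efficiencies change, a process whose scalability degrades gradually cedes contexts to better-scaling peers.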

[Figure: SCAF architecture. Client processes (e.g., foo and baz) have their syscalls intercepted; each client reports a performance metric (E_foo, E_baz) to the central scafd daemon, which computes resource allocations p_foo = f2(E_foo) and p_baz from the N available hardware contexts, layered over the OS's time-sharing scheduler.]

Results: Minimal Overhead

- Platforms tested: SunOS 5.10 on an UltraSPARC T2, and Linux on a dual Xeon E5410.

Results: Selected 3-Way Multiprogramming Example

- Multiprogramming 3 NAS benchmarks: CG, SP, and LU (cg.C.x, sp.B.x, lu.B.x).
- [Figure: per-process hardware-context allocations over time (0-400 seconds) under SCAF for cg.C.x, sp.B.x, and lu.B.x.]
- Summary of results:

  Configuration     Process   Runtime   Speedup   Sum speedup
  Unmodified        CG        435.9s     13.1
                    SP        474.6s      9.6       31.5
                    LU        507.3s      8.8
  Equipartitioning  CG        374.0s     15.5
                    SP        380.8s     12.2       40.7
                    LU        349.8s     13.0
  SCAF              CG        172.2s     35.7
                    SP        374.0s     12.5       59.3
                    LU        424.0s     11.1
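As a sanity check on the summary table, each sum-speedup figure is just the three per-process speedups added, and the sums quantify SCAF's gain over equipartitioning. The values below are copied from the table; only the arithmetic is new.

```python
# Recompute the per-configuration sum speedups from the summary table.
speedups = {
    "Unmodified":       {"CG": 13.1, "SP": 9.6,  "LU": 8.8},
    "Equipartitioning": {"CG": 15.5, "SP": 12.2, "LU": 13.0},
    "SCAF":             {"CG": 35.7, "SP": 12.5, "LU": 11.1},
}
sums = {cfg: round(sum(s.values()), 1) for cfg, s in speedups.items()}
print(sums)  # {'Unmodified': 31.5, 'Equipartitioning': 40.7, 'SCAF': 59.3}

# SCAF's sum speedup relative to equipartitioning:
print(f"{sums['SCAF'] / sums['Equipartitioning'] - 1:.0%}")  # 46%
```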

Results: NAS Parallel Benchmarks

- 2-way multiprogramming across all NAS benchmark pairs on 2 platforms: a dual Xeon E5410 and an UltraSPARC T2.
- [Figure: per-pair speedups (bt.B+lu.B, cg.C+bt.B, cg.C+ft.C, ..., plus Average and Geo.Mean) on each platform, comparing Unmodified (Unmod), equipartitioning (EQ), and SCAF runs of both processes in each pair; benchmark classes differ slightly between the two platforms.]
- On one platform, 70% of pairs see improvements over equipartitioning, with a 15% average improvement.

SCAF Runtime Features

- Controls multiprogrammed parallel processes' allocations to effect space-sharing.
- Allocations vary at runtime in order to improve system efficiency.
- Requires no offline profiling or measurements.
- Requires no program porting, modification, or recompilation.
- Processes are supported via modified runtime libraries (e.g., libgomp, libiomp5).
- Ports available: x86 Linux (GNU OpenMP); SPARC SunOS (GNU OpenMP).
- Ports in progress: x86 Linux (Intel OpenMP); Tile-Gx Linux (GNU OpenMP); Xeon Phi Linux (GNU OpenMP, Intel OpenMP).

[Figure: thread-based time-sharing vs. dynamic space-sharing — threads mapped onto hardware contexts over time under each scheme.]

Time-sharing vs Space-sharing

- To avoid oversubscription and prevent the OS thread scheduler from time-sharing hardware contexts, we can implement space-sharing.
- How many hardware contexts should each process get?
- Equipartitioning gives each process an equal number of contexts, but this is wasteful if a process scales poorly.
- SCAF reasons about per-process parallel efficiency in order to optimize the speedup gained from each hardware context.
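To see why equipartitioning wastes contexts on a poorly-scaling process, consider a toy model: two processes with Amdahl's-law speedup curves. The serial fractions and the 56/8 split below are illustrative numbers, not measurements from the poster.

```python
# Toy model: Amdahl's-law speedup for a process with a given serial fraction.
def speedup(contexts, serial_frac):
    return 1.0 / (serial_frac + (1.0 - serial_frac) / contexts)

N = 64  # hardware contexts

# One process is nearly all-parallel (2% serial); the other is 40% serial.
equi = speedup(N // 2, 0.02) + speedup(N // 2, 0.40)
# An efficiency-aware split gives the poorly-scaling process few contexts.
aware = speedup(56, 0.02) + speedup(8, 0.40)

print(round(equi, 1), round(aware, 1))  # 22.1 28.8
```

Even in this crude model, shifting contexts from the 40%-serial process to the scalable one raises the system-wide sum speedup by roughly 30%.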

Motivating Example

- NPB OpenMP benchmark "lu.B.x" on SunOS 5.11, SPARC T2 (64 hardware threads per CPU).
- Say multiple users are running "lu.B.x" on this SMP system; each instance runs on 64 threads.
- How does time-sharing handle this?

  "lu.B.x" instances   Machine subscription   Threads/process   Run time (s)   Speedup
  1                    100%                   64                  180           24.2
  2                    200%                   64                 3522            2.5
  3                    300%                   64                 8770            1.5

- Massive oversubscription can lead to terrible system efficiency.
- Instead of time-sharing 64 × N threads, we would like to space-share:

  "lu.B.x" instances   Machine subscription   Threads/process   Run time (s)   Speedup
  1                    100%                   64                  180           24.2
  2                    100%                   32                  312           28.6
  3                    100%                   21                  382           35.1

- The extra logic required to implement SCAF does not adversely affect performance when processes run alone. [Figure: Minimal overhead — per-benchmark speedups (bt.B.x, cg.B.x, ep.B.x, ft.B.x, lu.B.x, mg.B.x, sp.B.x, ua.B.x) comparing executables running with (SCAF) and without (Unmod) the SCAF runtime in place, on both platforms.]
- In the NAS pairing experiments on the other platform, 57% of pairs see improvements over equipartitioning, again with a 15% average improvement.
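The arithmetic behind the motivating-example tables above can be reconstructed from the single-instance run: 180 s at speedup 24.2 implies roughly 180 × 24.2 ≈ 4356 s of serial work per instance, and each Speedup column entry is the aggregate speedup summed across all running instances. A quick check, using only numbers from the tables:

```python
# Aggregate speedup for k simultaneous instances that each finish in
# run_time seconds, given the inferred serial runtime per instance.
serial = 180 * 24.2  # ~4356 s, inferred from the 1-instance row

def aggregate_speedup(k, run_time):
    return k * serial / run_time

# Time-sharing three fully-oversubscribed 64-thread instances:
print(round(aggregate_speedup(3, 8770), 1))  # 1.5 -- matches the table
# Space-sharing three instances on ~21 threads each:
print(round(aggregate_speedup(3, 382), 1))   # 34.2 -- close to the table's 35.1
```

The small gap in the last figure (34.2 vs. 35.1) is expected, since the inferred serial runtime is only approximate.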

This work was supported by a NASA Office of the Chief Technologist’s Space Technology Research Fellowship.

MICRO-46 December 7-11, 2013, Davis, CA, USA

Contact: [email protected] · http://ece.umd.edu/~tcreech