Efficient Offloading of Parallel Kernels Using MPI_Comm_spawn
Sebastian Rinke, Suraj Prabhakaran, Felix Wolf
HUCAA'13 | 1.10.13
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 287530
State of the Art
[Diagram: ACs attached inside the CNs, CNs connected via the interconnect]
State of the Art
Pros
- High bandwidth
- Low latency
- User has simple view
Cons
- Oblivious to varying workloads ⇒ Idle/overloaded ACs
- CNs and ACs affect each other's availability
Network-attached Accelerators
[Diagram: CNs and ACs each attached directly to the interconnect]
Pros
- AC allocation based on application needs
- Distributed memory kernel offload to multiple ACs
- MPI within (larger) kernel
- AC and CN have their own network interface
Network-attached Accelerators
Cons
- Greater penalty for data transfers

How to offload MPI kernels to ACs?
Yet another programming model?
⇒ No, use MPI's dynamic process model!
Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion
DEEP Architecture
[Diagram: Cluster of CNs connected via InfiniBand; Booster of BNs reached through BICs]
DEEP Architecture Overview
Cluster: 128 cluster nodes (2× Intel Xeon E5-2680), QDR InfiniBand
Booster: 512 booster nodes (1× Intel Xeon Phi), EXTOLL network (8×8×8 3D torus)
MPI over complete system
Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion
Offloading Approach, Why?
- Main program and kernels are MPI programs
- Start all CN and AC processes at job start
- Distinguish between CN/AC processes during runtime
[Diagram: CN 0-3 and AC 4-6 together in one MPI_COMM_WORLD, connected via the interconnect]
Offloading Approach, Why?
Con
- All processes in one MPI_COMM_WORLD
Workaround
- Split communicator (see the sketch below)
- Replace all occurrences of MPI_COMM_WORLD with new communicator
⇒ Major code changes
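A minimal sketch of this split-communicator workaround, assuming (purely for illustration) that ranks 0-3 act as CN processes and all higher ranks as AC processes; the names is_ac and app_comm are not from the slides.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Illustrative rule only: ranks 0-3 play CN processes, the rest ACs. */
    int is_ac = (world_rank >= 4);

    /* Split MPI_COMM_WORLD into one CN communicator and one AC communicator. */
    MPI_Comm app_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_ac, world_rank, &app_comm);

    /* From here on, every occurrence of MPI_COMM_WORLD in the application
       has to be replaced by app_comm -- hence the "major code changes". */

    MPI_Comm_free(&app_comm);
    MPI_Finalize();
    return 0;
}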
Spawn
- Create AC processes during runtime with MPI_Comm_spawn()
- Provides separate communicators starting with rank 0
- Collectives for convenient intercommunication
[Diagram: CN 0-3 in COMM_WORLD (A) spawn AC 0-2, which form COMM_WORLD (B)]
MPI_Comm_spawn()

MPI_Comm_spawn(
    char *command,
    char *argv[],
    int maxprocs,
    MPI_Info info,
    int root,
    MPI_Comm comm,
    MPI_Comm *intercomm,
    int array_of_errcodes[])
[Diagram: resulting intercommunicator with local group (ranks 0-2) and remote group (ranks 0-5)]
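A minimal usage sketch (not from the slides) of spawning AC processes with this call; the executable name "ac_program" and the process count are illustrative assumptions.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* CN side: spawn three AC processes running the (hypothetical) binary "ac_program". */
    MPI_Comm intercomm;
    int errcodes[3];
    MPI_Comm_spawn("ac_program", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    /* Local group of intercomm: the CN processes (COMM_WORLD (A));
       remote group: the spawned AC processes (COMM_WORLD (B)).
       The spawned program obtains the same intercommunicator via
       MPI_Comm_get_parent(&intercomm). */

    MPI_Finalize();
    return 0;
}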
Spawn Usage Scenario
[Diagram sequence: CN 0-3 in COMM_WORLD (A) spawn AC 0-2 (COMM_WORLD (B)); the CNs send input data to the ACs over the intercommunicator, the spawned kernel computes, and the CNs receive the results back]
Spawn
Con
- One spawn allows for only one kernel execution
Workaround
1. Terminate and re-spawn AC processes
2. Spawn once and use a protocol to trigger kernel executions (see the sketch below)
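A minimal sketch of what such a user-implemented trigger protocol (workaround 2) could look like on the AC side; the integer kernel IDs, the kernel stubs, and the termination value are illustrative assumptions, not the paper's protocol.

/* AC side of a hand-written trigger protocol (illustrative only). */
#include <mpi.h>

/* Kernel stubs; real kernels would use MPI over the parent intercommunicator. */
void kernel0(MPI_Comm parent) { (void)parent; }
void kernel1(MPI_Comm parent) { (void)parent; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);        /* intercommunicator to the CNs */

    for (;;) {
        int kernel_id;
        /* The CN root broadcasts the ID of the next kernel; a negative ID terminates. */
        MPI_Bcast(&kernel_id, 1, MPI_INT, 0, parent);
        if (kernel_id < 0)
            break;
        if (kernel_id == 0)      kernel0(parent);
        else if (kernel_id == 1) kernel1(parent);
    }

    MPI_Finalize();
    return 0;
}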
Spawn + Kernel Call
- Create AC processes during runtime with MPI_Comm_spawn()
- Trigger kernel execution with MPIX_Kernel_call()
⇒ No need for re-spawning or a user-implemented protocol
[Diagram: CN 0-3 in COMM_WORLD (A) trigger "Run kernel" on AC 0-2 in COMM_WORLD (B)]
MPIX_Kernel_call()

MPIX_Kernel_call(
    char *kernelname,
    int argcount,
    void *args[],
    int *argsizes,
    int root,
    MPI_Comm comm,
    MPI_Comm intercomm)
Kernel Call Usage Scenario
[Diagram sequence: CN 0-3 in COMM_WORLD (A) spawn AC 0-2 (COMM_WORLD (B)); a kernel is started on the ACs, input data is sent and results are received over the intercommunicator, and further kernels are started without re-spawning]
Kernel Call Code Example

CN program:

void main(int argc, char **argv) {
    // Spawn AC processes
    MPI_Comm_spawn(..., comm, &intercomm, ...);

    // Start "kernel0" on ACs
    MPIX_Kernel_call("kernel0", ..., comm, intercomm);

    // Send input data to kernel functions
    MPI_Alltoall(..., intercomm);

    // Do some other calculations
    ...

    // Recv results from kernel functions
    MPI_Alltoall(..., intercomm);
}

AC kernel function:

void kernel0(double a, int b, char c) {
    // Get intercommunicator to parents
    MPI_Comm_get_parent(&intercomm);

    // Recv input data from parents
    MPI_Alltoall(..., intercomm);

    // Do calculations and communicate
    ...

    // Send results to parents
    MPI_Alltoall(..., intercomm);
}
MPIX_Kernel_call_multiple()

MPIX_Kernel_call_multiple(
    int count,
    char *array_of_kernelname[],
    int *array_of_argcount,
    void **array_of_args[],
    int *array_of_argsizes[],
    int root,
    MPI_Comm comm,
    MPI_Comm intercomm)
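A hedged usage sketch matching the signature above; the kernel names, argument values, and the comm/intercomm handles (e.g. obtained from an earlier MPI_Comm_spawn) are illustrative, and the exact packing of arguments is an assumption.

/* Request two kernels in a single call (illustrative values only). */
double alpha = 1.5;
int    n     = 1024;

char *names[2]       = { "kernel0", "kernel1" };
int   argcounts[2]   = { 2, 0 };                 /* kernel0: 2 args, kernel1: none */

void *k0_args[2]     = { &alpha, &n };
int   k0_argsizes[2] = { sizeof(double), sizeof(int) };

void **args[2]       = { k0_args, NULL };
int   *argsizes[2]   = { k0_argsizes, NULL };

MPIX_Kernel_call_multiple(2, names, argcounts, args, argsizes,
                          0, comm, intercomm);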
Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion
Spawned Program
- Kernels must be available in program spawned on ACs
- Kernel execution requests handled by main function (provided)
- Programmer implements kernel functions only
⇒ Union of both parts during linking
[Diagram: provided Main linked together with the user's Kernels]
MPIX_Kernel_call*()
1. CNs send kernel requests to ACs
   - Kernel names
   - Kernel arguments
2. ACs start first kernel (see the simplified sketch below)
   - Get address of kernel function through dlsym()
   - Push kernel arguments onto process stack
   - Invoke kernel using function pointer
3. ACs run all remaining kernels and wait for new requests

Termination of AC processes through empty kernel name
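A simplified sketch of step 2 on the AC side. It assumes argument-less kernels instead of the stack-based argument passing described above, and that the executable exports its symbols to dlsym (e.g. linked with -rdynamic and -ldl); only the dlsym lookup and the call through a function pointer follow the slide.

/* Simplified AC-side dispatch: look up the kernel by name and invoke it. */
#include <dlfcn.h>
#include <stdio.h>

typedef void (*kernel_fn)(void);   /* assumption: argument-less kernels */

static void run_kernel(const char *kernelname) {
    /* dlopen(NULL, ...) yields a handle for the running executable,
       into which the kernel functions were linked. */
    void *self = dlopen(NULL, RTLD_NOW);
    if (self == NULL)
        return;

    void *sym = dlsym(self, kernelname);
    if (sym == NULL) {
        fprintf(stderr, "kernel %s not found: %s\n", kernelname, dlerror());
        dlclose(self);
        return;
    }

    kernel_fn kernel = (kernel_fn)sym;
    kernel();                        /* invoke kernel via function pointer */
    dlclose(self);
}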
MPIX_Kernel_call*()
Note: Compiler may change symbol name of kernel function
Avoid name mangling for kernel entry functions:
- C: No issues
- C++: Declare with extern "C"
- Fortran: Define with BIND(C) attribute
Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion
Results
Evaluate kernel startup overhead for:
- Multiple spawns
- Spawn + MPIX_Kernel_call()
- Spawn + MPIX_Kernel_call_multiple()

Benchmark Environment
- Cluster part of DEEP: 120 nodes (2× Intel Xeon E5-2680), 40 CNs and 80 ACs ⇒ CN : AC = 1 : 2
- QDR InfiniBand
- Open MPI 1.6.4, NFS file system
Results
- 2 processes per CN ⇒ 80 CN processes
- 1 process per AC ⇒ 80 AC processes
- 5 doubles as kernel arguments
Kernel Startup Times
[Plot: startup time in seconds vs. number of kernel calls (1-256) for multiple spawns, 1 spawn + Kernel_call, and 1 spawn + Kernel_call_multiple]
Kernel_call vs. Kernel_call_multiple
[Plot: time in milliseconds vs. number of kernel calls (1-256) for Kernel_call and Kernel_call_multiple]
Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion
Conclusion
- Network-attached ACs can help to address varying scaling characteristics within the same application
- Number of CNs and network-attached ACs is independent
- Distributed memory kernel functions with MPI communication can be offloaded
- Offloading mechanism based on MPI_Comm_spawn()
- MPIX_Kernel_call*() complements MPI's spawn ⇒ Reduced kernel startup overhead
- Currently working with application developers to integrate offloading
Thank you.