Efficient Offloading of Parallel Kernels Using MPI_Comm_spawn

Sebastian Rinke, Suraj Prabhakaran, Felix Wolf HUCAA’13 | 1.10.13

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement n◦ 287530

State of the Art

[Figure: accelerators (AC) attached to their host cluster nodes (CN), which communicate over the interconnect]

State of the Art

Pros
- High bandwidth
- Low latency
- User has simple view

Cons
- Oblivious to varying workloads ⇒ idle/overloaded ACs
- CNs and ACs affect each other's availability

Network-attached Accelerators

[Figure: CNs and ACs each attached directly to the interconnect]

Pros
- AC allocation based on application needs
- Distributed memory kernel offload to multiple ACs
- MPI within (larger) kernel
- AC and CN have their own network interface


Network-attached Accelerators

Cons
- Greater penalty for data transfers

How to offload MPI kernels to ACs? Yet another programming model?
⇒ No, use MPI's dynamic process model!

Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion

DEEP Architecture

[Figure: cluster nodes (CN) on InfiniBand, connected through BICs to the Booster's grid of booster nodes (BN)]

DEEP Architecture Overview

Cluster
- 128 cluster nodes (2 Intel Xeon E5-2680)
- QDR InfiniBand

Booster
- 512 booster nodes (1 Intel Xeon Phi)
- EXTOLL network (8×8×8 3D torus)

MPI over complete system


Offloading Approach, Why?
- Main program and kernels are MPI programs
- Start all CN and AC processes at job start
- Distinguish between CN/AC processes during runtime

[Figure: CN 0-3 and AC 4-6 all in one MPI_COMM_WORLD, connected via the interconnect]

Offloading Approach, Why?

Con
- All processes in one MPI_COMM_WORLD

Workaround (see the sketch below)
- Split the communicator
- Replace all occurrences of MPI_COMM_WORLD with the new communicator
⇒ Major code changes
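For illustration, a minimal sketch of this workaround, assuming a user-defined is_accelerator() mapping (names are illustrative, not from the slides):

#include <mpi.h>

/* Hypothetical workaround: every process decides whether it is a CN or an
 * AC process and joins the corresponding sub-communicator, which then has
 * to replace MPI_COMM_WORLD throughout the application. */
int is_accelerator(int world_rank);     /* user-defined mapping, assumed */

MPI_Comm split_world(void)
{
    int rank;
    MPI_Comm app_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, is_accelerator(rank) ? 1 : 0, rank, &app_comm);
    return app_comm;                    /* used instead of MPI_COMM_WORLD */
}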

Spawn
- Create AC processes during runtime with MPI_Comm_spawn()
- Provides separate communicators starting with rank 0
- Collectives for convenient intercommunication

[Figure: CN 0-3 in COMM_WORLD (A), spawned AC 0-2 in COMM_WORLD (B), connected via the interconnect]

MPI_Comm_spawn()

MPI_Comm_spawn(
    char *command,
    char *argv[],
    int maxprocs,
    MPI_Info info,
    int root,
    MPI_Comm comm,
    MPI_Comm *intercomm,
    int array_of_errcodes[])

[Figure: resulting intercommunicator with a local group (CN processes, ranks 0-2) and a remote group (AC processes, ranks 0-5)]
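A minimal usage sketch, assuming MPI is already initialized (the executable name "ac_kernels" and the process count are illustrative assumptions):

/* Hypothetical call: CN processes collectively spawn 3 AC processes running
 * the program "ac_kernels"; intercomm then connects the two groups. */
MPI_Comm intercomm;
int errcodes[3];
MPI_Comm_spawn("ac_kernels", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
               0 /* root */, MPI_COMM_WORLD, &intercomm, errcodes);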

Spawn Usage Scenario

1. CN processes run in their COMM_WORLD (A); no AC processes exist yet
2. MPI_Comm_spawn() creates the AC processes with their own COMM_WORLD (B)
3. CNs send input data to the ACs over the intercommunicator
4. The ACs run the kernel (the spawned program)
5. CNs receive the results from the ACs over the intercommunicator

Spawn

Con
- One spawn allows for only one kernel execution

Workarounds
1. Terminate and re-spawn AC processes
2. Spawn once and use a protocol to trigger kernel executions (see the sketch below)
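A rough sketch of workaround 2, i.e. a user-implemented request loop on the AC side (the request format, buffer size, and kernel names are assumptions):

#include <mpi.h>
#include <string.h>

void kernel0(void);                     /* user kernel, hypothetical */

/* Hypothetical AC-side loop: after being spawned once, the AC processes wait
 * for kernel names broadcast by the parents and dispatch them until an empty
 * name arrives. */
void request_loop(void)
{
    MPI_Comm parent;
    char name[64];

    MPI_Comm_get_parent(&parent);
    for (;;) {
        /* Root of the parent (CN) group broadcasts the next kernel name. */
        MPI_Bcast(name, sizeof(name), MPI_CHAR, 0, parent);
        if (name[0] == '\0')            /* empty name = terminate */
            break;
        if (strcmp(name, "kernel0") == 0)
            kernel0();                  /* communicates with parents itself */
    }
}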

Spawn + Kernel Call
- Create AC processes during runtime with MPI_Comm_spawn()
- Trigger kernel execution with MPIX_Kernel_call()
⇒ No need for re-spawning and user-implemented protocol

[Figure: CN 0 triggers "Run kernel" on the ACs in COMM_WORLD (B)]

MPIX_Kernel_call()

MPIX_Kernel_call(
    char *kernelname,
    int argcount,
    void *args[],
    int *argsizes,
    int root,
    MPI_Comm comm,
    MPI_Comm intercomm)

[Figure: intercommunicator with a local group (CN processes) and a remote group (AC processes)]
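For illustration, a sketch of how the argument arrays might be filled; the layout is inferred from the signature and is an assumption, not the documented usage:

/* Hypothetical call: run "kernel1" on the ACs with two arguments.
 * kernel1's C signature would be void kernel1(double a, int b). */
double a = 1.5;
int b = 42;
void *args[] = { &a, &b };
int argsizes[] = { sizeof(double), sizeof(int) };
MPIX_Kernel_call("kernel1", 2, args, argsizes,
                 0 /* root */, MPI_COMM_WORLD, intercomm);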

Kernel Call Usage Scenario

1. CN processes run in their COMM_WORLD (A); no AC processes exist yet
2. MPI_Comm_spawn() creates the AC processes with their own COMM_WORLD (B)
3. MPIX_Kernel_call() triggers a kernel on the ACs ("Run kernel")
4. CNs send input data to the ACs over the intercommunicator
5. The ACs compute; CNs receive the results over the intercommunicator
6. Further MPIX_Kernel_call() invocations reuse the same AC processes

Kernel Call Code Example

// CN side: main program
int main(int argc, char **argv) {
    // Spawn AC processes
    MPI_Comm_spawn(..., comm, &intercomm, ...);

    // Start "kernel0" on ACs
    MPIX_Kernel_call("kernel0", ..., comm, intercomm);

    // Send input data to kernel functions
    MPI_Alltoall(..., intercomm);

    // Do some other calculations
    ...

    // Recv results from kernel functions
    MPI_Alltoall(..., intercomm);
}

// AC side: kernel function in the spawned program
void kernel0(double a, int b, char c) {
    // Get intercommunicator to parents
    MPI_Comm_get_parent(&intercomm);

    // Recv input data from parents
    MPI_Alltoall(..., intercomm);

    // Do calculations and communicate
    ...

    // Send results to parents
    MPI_Alltoall(..., intercomm);
}

MPIX_Kernel_call_multiple()

MPIX_Kernel_call_multiple(
    int count,
    char *array_of_kernelname[],
    int *array_of_argcount,
    void **array_of_args[],
    int *array_of_argsizes[],
    int root,
    MPI_Comm comm,
    MPI_Comm intercomm)

[Figure: intercommunicator with a local group (CN processes) and a remote group (AC processes)]
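A usage sketch requesting two kernels in one call; the kernel names and the nesting of the per-kernel argument arrays are assumptions inferred from the signature:

/* Hypothetical call: "kernel0" takes one double, "kernel1" takes no arguments. */
double x = 3.14;
void *k0_args[] = { &x };
int k0_sizes[] = { sizeof(double) };

char *names[] = { "kernel0", "kernel1" };
int argcounts[] = { 1, 0 };
void **args[] = { k0_args, NULL };
int *sizes[] = { k0_sizes, NULL };

MPIX_Kernel_call_multiple(2, names, argcounts, args, sizes,
                          0 /* root */, MPI_COMM_WORLD, intercomm);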


Spawned Program
- Kernels must be available in the program spawned on the ACs
- Kernel execution requests handled by main function (provided)
- Programmer implements kernel functions only
⇒ Union of both parts during linking

[Figure: provided Main part linked together with the user's Kernels]

MPIX_Kernel_call*()

1. CNs send kernel requests to ACs
   - Kernel names
   - Kernel arguments
2. ACs start first kernel (see the sketch below)
   - Get address of kernel function through dlsym()
   - Push kernel arguments onto process stack
   - Invoke kernel using function pointer
3. ACs run all remaining kernels and wait for new requests

Termination of AC processes through empty kernel name
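A simplified sketch of step 2 on the AC side; the zero-argument dispatch and the dlopen(NULL, ...) self-handle are assumptions about one possible implementation, and argument passing via the stack is omitted:

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical dispatcher: resolve the kernel entry point by name in the
 * running executable and invoke it through a function pointer. */
static void run_kernel(const char *kernelname)
{
    void *self = dlopen(NULL, RTLD_NOW);   /* handle to the AC program itself */
    void (*kernel)(void) = (void (*)(void)) dlsym(self, kernelname);

    if (kernel != NULL)
        kernel();                          /* arguments omitted in this sketch */
    else
        fprintf(stderr, "kernel %s not found\n", kernelname);

    dlclose(self);
}

For dlsym() to see symbols of the main executable, the spawned program typically has to be linked with -rdynamic; the actual implementation may differ.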

MPIX_Kernel_call*()

Note: the compiler may change the symbol name of a kernel function, so avoid name mangling for kernel entry functions:
- C: no issues
- C++: declare with extern "C"
- Fortran: define with the BIND(C) attribute


Results

Evaluate kernel startup overhead for
- Multiple spawns
- Spawn + MPIX_Kernel_call()
- Spawn + MPIX_Kernel_call_multiple()

Benchmark Environment
- Cluster part of DEEP: 120 nodes (2 Intel Xeon E5-2680)
- 40 CNs, 80 ACs ⇒ CN : AC = 1 : 2
- QDR InfiniBand
- Open MPI 1.6.4, NFS file system

Results
- 2 processes per CN ⇒ 80 CN processes
- 1 process per AC ⇒ 80 AC processes
- 5 doubles as kernel arguments

Kernel Startup Times

[Plot: time [sec] vs. number of kernel calls (1-256) for "Multiple spawns", "1 Spawn + Kernel_call", and "1 Spawn + Kernel_call_multiple"]

Kernel_call vs. Kernel_call_multiple

[Plot: time [msec] vs. number of kernel calls (1-256) for "Kernel_call" and "Kernel_call_multiple"]


Conclusion
- Network-attached ACs can help to address varying scaling characteristics within the same application
- The number of CNs and network-attached ACs is independent
- Distributed memory kernel functions with MPI communication can be offloaded
- Offloading mechanism based on MPI_Comm_spawn()
- MPIX_Kernel_call*() complements MPI's spawn ⇒ reduced kernel startup overhead
- Currently working with application developers to integrate offloading

Thank you.
