Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL

Krishna Kandalla (1), Hari Subramoni (1), Jerome Vienne (1), Siddhesh Raikar (1), Karen Tomko (2), Sayantan Sur (1), and Dhabaleswar K. Panda (1)

(1) Computer Science & Engineering Department, The Ohio State University
(2) The Ohio Supercomputer Center

Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
  - Micro-Benchmark Evaluations
  - Experiments with HPL
  - Impact of System Noise
• Conclusions and Future Work

Introduction
• MPI-2.2 defines only blocking collective operations, which limits the performance and scalability of applications
• Host-based blocking collectives are prone to system noise
• MPI-3 will support non-blocking collectives
• ConnectX-2 adapters from Mellanox can be used to design non-blocking collectives that progress on the network adapter
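As a concrete illustration of this usage model, here is a minimal sketch of a non-blocking broadcast overlapped with computation, written against the MPI_Ibcast signature that MPI-3 later standardized (the prototype described in this talk predates the standard, so its exact entry point may have differed; do_independent_work stands in for any application kernel):

    #include <mpi.h>

    void do_independent_work(void);   /* hypothetical application kernel */

    /* Start the broadcast, compute while it progresses, then complete it.
     * With host-based progression the overlap is limited; with offload
     * the HCA advances the collective while the host computes. */
    void bcast_with_overlap(double *buf, int count, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &req);
        do_independent_work();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }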


Overview of InfiniBand Collective Offload
• Applications can offload task lists (send and wait tasks) to the NIC
• The HCA executes the list; a completion queue entry (CQE) is created on the management completion queue (MCQ) when execution finishes

[Figure: the application posts a task list (Send, Wait, Send, Send, Wait) to the management queue (MQ) on the InfiniBand HCA; the HCA drives the Send/Recv queues and their completion queues (Send CQ, Recv CQ) over the physical link, and reports overall completion on the MCQ.]

Subramoni et al., Hot Interconnects 2010
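To make the task-list idea concrete, here is a conceptual sketch of offloading one broadcast step. Every task_list_* name below is a hypothetical stand-in; the actual CORE-Direct verbs extensions are not reproduced here:

    #include <infiniband/verbs.h>
    #include <stddef.h>

    /* Hypothetical wrappers around the offload interface (sketch only). */
    struct task_list;
    struct task_list *task_list_create(struct ibv_qp *mq);
    void task_list_add_wait(struct task_list *tl, struct ibv_cq *cq, int n);
    void task_list_add_send(struct task_list *tl, struct ibv_qp *qp,
                            void *buf, size_t len);
    void task_list_post(struct task_list *tl);

    /* Offload one broadcast step: wait for the parent's message to
     * arrive, then forward it to the children with no host involvement.
     * A CQE lands on the MCQ once the HCA has executed the whole list. */
    void offload_bcast_step(struct task_list *tl, struct ibv_cq *recv_cq,
                            struct ibv_qp **child_qp, int nchildren,
                            void *buf, size_t len)
    {
        task_list_add_wait(tl, recv_cq, 1);                 /* WAIT task  */
        for (int i = 0; i < nchildren; i++)
            task_list_add_send(tl, child_qp[i], buf, len);  /* SEND tasks */
        task_list_post(tl);
    }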


Design Space of Collective Algorithms

[Figure: blocking collectives, host-based non-blocking collectives, and offload-based non-blocking collectives plotted along three axes: overlap (higher is better), latency (lower is better), and portability (higher is better).]

• Challenge: progress the collective schedules in an asynchronous manner with:
  → performance portability
  → minimal host processor intervention
  → acceptable communication latency

Related Work
• Hemmert et al. studied offload capabilities with Portals
• Hoefler et al. proposed host-based non-blocking solutions (libNBC)
• Graham et al. reported early experiences with CORE-Direct (CAC '09, CCGrid '10)
• Kandalla et al. proposed MPI_Ialltoall and studied its benefits with P3DFFT (ISC '11)
• Venkata et al. proposed network-offload-based MPI_Bcast and MPI_Ibcast with 64 processes (CAC '11)

Problem Statement
• Can we design a network-offload-based MPI_Ibcast?
• Will network-offload solutions offer better communication/computation overlap for collectives?
• Can network offload improve application throughput?
• Are network-based designs more resilient to noise?
• Can we leverage the proposed MPI_Ibcast to improve the efficiency of popular benchmarks like HPL?

MVAPICH/MVAPICH2 Software
• High-performance MPI library for InfiniBand, 10GigE/iWARP and RoCE
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  - Used by more than 1,650 organizations in 63 countries
  - More than 76,000 downloads from the OSU site directly
  - Empowering many TOP500 clusters
    • 7th-ranked 111,104-core cluster (Pleiades) at NASA
    • 17th-ranked 62,976-core cluster (Ranger) at TACC
  - Available with the software stacks of many InfiniBand, HSE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distributions (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu


MPI_Ibcast (small messages)

[Figure: each rank pre-posts a list of internal receive buffers (b1, b2, b3, b4, ...) on the NIC receive queue, mirrored by a buffer list inside MVAPICH2. For rank i, the offloaded task list de-queues a buffer R, waits on its arrival (W(R)), and then issues sends of b1 to its children: S1(b1), S2(b1), S3(b1), ...]
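The slide does not spell out the tree shape used for small messages; the sketch below (my own illustration, not the paper's code) uses a binomial tree, the k = 2 case of the k-nomial family mentioned for the large-message design, and prints the WAIT/SEND schedule each rank would offload:

    #include <stdio.h>

    /* Parent and children of `rank` in a binomial broadcast tree rooted
     * at `root`. The offloaded schedule is one WAIT on the parent's
     * message followed by one SEND per child. */
    static void binomial_schedule(int rank, int size, int root)
    {
        int rel = (rank - root + size) % size;  /* rank relative to root */
        int mask = 1;

        /* The parent corresponds to the lowest set bit of rel. */
        while (mask < size && !(rel & mask))
            mask <<= 1;
        if (rel != 0)
            printf("rank %d: WAIT on parent %d\n",
                   rank, (rank - mask + size) % size);

        /* One child per remaining lower bit position. */
        for (mask >>= 1; mask > 0; mask >>= 1)
            if (rel + mask < size)
                printf("rank %d: SEND to %d\n", rank, (rank + mask) % size);
    }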

MPI_Ibcast (large messages)
• Use a separate QP for large messages
• Dynamically register user buffers
• Use InfiniBand's RNR (Receiver-Not-Ready) mechanism for flow control
• Network-offload-based scatter-allgather algorithms
  - Scatter: binomial tree (special case of k-nomial with k = 2)
  - Allgather: recursive doubling or ring
• Use send/recv/wait tasks to implement these algorithms; flow control is guaranteed by the network
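For the allgather phase, recursive doubling has a particularly regular schedule, which is convenient for building task lists ahead of time. A small sketch (illustration only, assuming a power-of-two process count) of the peer and chunk-count sequence:

    #include <stdio.h>

    /* Recursive-doubling allgather over size = 2^k ranks: in step s,
     * each rank exchanges the 2^s chunks it currently owns with
     * rank XOR (1 << s); after log2(size) steps every rank holds
     * all chunks. */
    static void recursive_doubling_peers(int rank, int size)
    {
        int chunks = 1;
        for (int step = 0; (1 << step) < size; step++, chunks <<= 1)
            printf("rank %d, step %d: exchange %d chunk(s) with rank %d\n",
                   rank, step, chunks, rank ^ (1 << step));
    }

Combined with the log2(p)-step binomial scatter, the whole large-message broadcast needs roughly 2 log2(p) offloaded communication steps.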

Multi-Core Aware Shared-Memory Based Collective Algorithms in MVAPICH2

[Figure: one leader process per node (Node1 ... NodeN) forms a leader communicator for the inter-leader (network) phase; within each node, processes across sockets (Socket A, Socket B) complete the intra-node phase through shared memory.]
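A sketch of how such a two-level structure can be built with plain MPI calls (MVAPICH2's internal mechanism differs; node_id is assumed to be any value that identifies the node a process runs on):

    #include <mpi.h>

    /* Split `comm` into per-node communicators, then gather each node's
     * local rank 0 into a leader communicator for the network phase.
     * Non-leaders receive MPI_COMM_NULL for leader_comm. */
    void build_two_level(MPI_Comm comm, int node_id,
                         MPI_Comm *node_comm, MPI_Comm *leader_comm)
    {
        int rank, node_rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Comm_split(comm, node_id, rank, node_comm);
        MPI_Comm_rank(*node_comm, &node_rank);

        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       rank, leader_comm);
    }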

Design Choices for MPI_Ibcast
• Flat-Offload (FO): offload the entire broadcast operation
• Two-Level-Host (TOH): inter-leader through the offload channel, intra-node through shared memory during MPI_Wait (better for latency-sensitive scenarios)
• Two-Level-Offload (TOO): inter-leader through the offload channel, intra-node through offload loopback (can offer better overlap if latency is not too critical)


Experimental Setup
• Intel Xeon E5640 (2.53 GHz), 12 GB memory per node
• MT26428 QDR ConnectX-2 HCAs with PCI-Express interfaces, 171-port Mellanox QDR switch, OFED 1.5.1
• RHEL 5.4, kernel version 2.6.18-164.el5
• MVAPICH2 (v1.6)

Micro-Benchmark Evaluations: Impact of Noise

Throughput benchmark:
    start_throughput_timer()
    MPI_Ibcast(..)
    CBLAS_DGEMM()
    MPI_Wait(..)
    end_throughput_timer()

• A daemon process on each core injects noise by performing matrix multiplications
• Noise frequency: 20 Hz to 1 kHz
• Noise duration: 50 to 250 usec
• Measure the throughput of CBLAS_DGEMM with:
  - Offload MPI_Ibcast
  - Host-based MPI_Ibcast
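A runnable sketch of this benchmark; the matrix size, message length and the use of the MPI-3 MPI_Ibcast entry point are assumptions, not the paper's exact parameters:

    #include <mpi.h>
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int n       = 2048;        /* assumed DGEMM size     */
        const int msg_len = 1 << 20;     /* assumed 1 MB broadcast */
        char   *msg = malloc(msg_len);
        double *a = calloc((size_t)n * n, sizeof(double));
        double *b = calloc((size_t)n * n, sizeof(double));
        double *c = calloc((size_t)n * n, sizeof(double));

        MPI_Request req;
        double t0 = MPI_Wtime();
        MPI_Ibcast(msg, msg_len, MPI_CHAR, 0, MPI_COMM_WORLD, &req);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        /* DGEMM performs 2*n^3 flops; higher throughput here means the
         * broadcast progressed without stealing host cycles. */
        printf("DGEMM throughput: %.2f GFLOPS\n",
               2.0 * n * n * n / (t1 - t0) / 1e9);

        free(msg); free(a); free(b); free(c);
        MPI_Finalize();
        return 0;
    }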

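The noise daemon can be reconstructed along these lines (my sketch of the described setup, not the original code): a process pinned to each core wakes up freq_hz times per second and burns duration_us microseconds of CPU, here with a trivial arithmetic loop standing in for the matrix multiply the slide mentions:

    #include <time.h>
    #include <unistd.h>

    static double now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
    }

    /* E.g. noise_loop(1000.0, 250.0) for the harshest setting studied
     * (1 kHz frequency, 250 usec duration). */
    static void noise_loop(double freq_hz, double duration_us)
    {
        volatile double x = 1.0;
        for (;;) {
            usleep((useconds_t)(1e6 / freq_hz - duration_us));
            double start = now_us();
            while (now_us() - start < duration_us)   /* burst of work */
                x = x * 1.000001 + 0.5;
        }
    }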

Communication/Computation Overlap

[Figure: DGEMM throughput (GFLOPS, 0 to 3000) vs. DGEMM problem size (N, roughly 60 to 7510); series: Offload-TOH, libNBC-Test-2, libNBC-Test-5, libNBC-Test-10, and the theoretical peak.]

CBLAS-DGEMM overlapped with Offload-Ibcast delivers better throughput than with host-based Ibcast with 256 processes.

Communication/Computation Overlap

[Figure: DGEMM throughput (GFLOPS, 0 to 3000) vs. MPI_Ibcast message length (256 KB to 8 MB); series: Offload-TOO, libNBC-Test-1, libNBC-Test-2, libNBC-Test-4, and the theoretical peak.]

Bcast-Offload delivers near-perfect communication/computation overlap for all message sizes in a portable manner with 256 processes.

Latency Comparison

[Figure: broadcast latency vs. message length with 256 processes, comparing MV2-Bcast-Default, MV2-Bcast-Loop-back, libNBC, Offload-TOO and Offload-TOH; one panel for small messages (16 B to 16 KB, latency in usec, 0 to 45) and one for large messages (latency in msec, up to 140).]

Bcast latency comparison with 256 processes. Bcast-Offload delivers good overlap without sacrificing communication latency!


HPL Performance

[Figure: normalized HPL performance (0 to 1.2) vs. HPL problem size (N) as a percentage of total memory (10% to 70%); series: HPL-Offload, HPL-1ring, HPL-Host; HPL-Offload peaks about 4.5% higher.]

HPL performance comparison with 512 processes. HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host, and improves peak throughput by up to 4.5% for large problem sizes.

HPL Performance and Problem Size

[Figure: peak HPL throughput (GFLOPS, up to roughly 5000) and the memory consumption (%) at which it is reached, for system sizes of 64, 128, 256 and 512 processes; series: HPL-Offload, HPL-1ring, HPL-Host.]

HPL peak performance and corresponding problem sizes. HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times!

HPL Performance Analysis

[Figure: pie charts of the MPI time breakdown. With HPL's 1-ring broadcast algorithm, time splits across HPL-Iprobe, HPL-Bcast-Recv, HPL-Bcast-Send and the rest of MPI (slice values shown: 64.42, 31.42, 4 and 3.14). With the Offload-Ibcast algorithm, about 88.74% is spent in Offload-Ibcast Wait and 11.15% in the rest of MPI.]

HPL run-time analysis with 256 processes and N = 177,000. HPL with Offload-Ibcast has lower MPI-level overheads than HPL's 1-ring broadcast algorithm.


Impact of Noise

[Figure: DGEMM throughput degradation (%) as a function of noise frequency (up to 1 kHz) and noise duration (usec); series: host-based vs. offload-based MPI_Ibcast.]

DGEMM throughput degradation due to system noise: host-based throughput drops by about 7.9%, while offload throughput drops by only about 3.9%.

Conclusions and Future Work
• The proposed MPI_Ibcast shows near-perfect overlap
• MPI_Ibcast based on CORE-Direct improves HPL's peak throughput by up to 4.5% with 512 processes
• HPL-Offload achieves peak throughput with significantly smaller problem sizes and shorter run-times
• MPI_Ibcast based on offload is more resilient to noise

Future work:
• Extend offload-based techniques to other MPI collectives and study their benefits with real applications
• Support for offload-based collectives will be available in future MVAPICH2 releases

Thank you!

http://mvapich.cse.ohio-state.edu

(1) {kandalla, subramon, viennej, pai, surs, panda}@cse.ohio-state.edu
(2) [email protected]

(1) Network-Based Computing Laboratory, The Ohio State University
(2) The Ohio Supercomputer Center
