Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL
Krishna Kandalla (1), Hari Subramoni (1), Jerome Vienne (1), Siddhesh Raikar (1), Karen Tomko (2), Sayantan Sur (1), and Dhabaleswar K. Panda (1)
(1) Computer Science & Engineering Department, The Ohio State University
(2) The Ohio Supercomputer Center
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
  - Micro-Benchmark Evaluations
  - Experiments with HPL
  - Impact of System Noise
• Conclusions and Future Work
Introduction
• MPI-2.2 defines only blocking collective operations, which limits the performance and scalability of applications
• Host-based blocking collectives are prone to system noise
• MPI-3 will support non-blocking collectives
• ConnectX-2 adapters from Mellanox can be used to design non-blocking collectives
Overview of InfiniBand Collective Offload
• Applications can offload task-lists to the NIC (a conceptual sketch appears below)
• A CQE (completion queue entry) is created on the MCQ after the task-list executes
[Figure: the application posts a task list (Send, Wait, Send, Send, Wait) to the MQ on the InfiniBand HCA; the HCA executes the tasks using its Send Queue/Send CQ and Recv Queue/Recv CQ over the physical link, and reports completion on the MCQ.]
Subramoni et al., Hot Interconnects 2010
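To make the task-list model concrete, here is a minimal conceptual sketch in C. The task_t structure and the post_task_list()/poll_mcq() calls are hypothetical placeholders, not the actual CORE-Direct verbs extensions; they only capture the idea that the host enqueues a list of send/wait tasks on the HCA, returns immediately, and later observes a single completion entry on the MCQ.

```c
/* Conceptual sketch of NIC task-list offload.  task_t, post_task_list()
 * and poll_mcq() are hypothetical placeholders, NOT the real CORE-Direct
 * verbs API; they only illustrate the execution model. */
#include <stddef.h>

typedef enum { TASK_SEND, TASK_WAIT } task_type_t;

typedef struct {
    task_type_t type;  /* SEND: transmit buf to peer.
                          WAIT: stall the list until the pending completions
                          (e.g., an expected receive) have arrived.         */
    void  *buf;        /* data buffer for SEND tasks                        */
    size_t len;        /* buffer length in bytes                            */
    int    peer;       /* destination rank for SEND tasks                   */
} task_t;

/* Hypothetical entry points: hand a whole list to the HCA, then poll the
 * management completion queue (MCQ) for the final CQE. */
int post_task_list(const task_t *list, int ntasks);  /* returns a handle */
int poll_mcq(int handle);                            /* 1 once CQE seen  */

void offload_example(void *buf, size_t len)
{
    /* Same shape as the task list in the figure: Send, Wait, Send, Send, Wait */
    task_t list[] = {
        { TASK_SEND, buf,  len,  1 },
        { TASK_WAIT, NULL, 0,   -1 },
        { TASK_SEND, buf,  len,  2 },
        { TASK_SEND, buf,  len,  3 },
        { TASK_WAIT, NULL, 0,   -1 },
    };

    int h = post_task_list(list, 5);  /* host returns immediately          */
    /* ... overlap useful computation here ... */
    while (!poll_mcq(h))              /* HCA posts a CQE on the MCQ when   */
        ;                             /* the entire list has executed      */
}
```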
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
• Conclusions and Future Work
Design Space of Collective Algorithms
[Figure: design space comparing Blocking Collectives, Non-Blocking Collectives (Host), and Non-Blocking Collectives (Offload) along three axes: Overlap (higher is better), Latency (lower is better), and Portability (higher is better).]
• Challenge: progress the collective schedules in an asynchronous manner with:
  → performance portability
  → minimal host processor intervention
  → acceptable communication latency
Related Work
• Hemmert et al. studied offload capabilities with Portals
• Hoefler et al. proposed host-based non-blocking solutions (libNBC)
• Graham et al. reported early experiences with CORE-Direct (CAC '09, CCGrid '10)
• Kandalla et al. proposed MPI_Ialltoall and studied its benefits with P3DFFT (ISC '11)
• Venkata et al. proposed network-offload-based MPI_Bcast and MPI_Ibcast with 64 processes (CAC '11)
Problem Statement
• Can we design a network-offload-based MPI_Ibcast?
• Will network-offload solutions offer better communication/computation overlap for collectives?
• Can network offload improve application throughput?
• Are network-based designs more resilient to noise?
• Can we leverage our proposed MPI_Ibcast to improve the efficiency of popular benchmarks like HPL?
MVAPICH/MVAPICH2 Software
• High-performance MPI library for InfiniBand, 10GigE/iWARP, and RoCE
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  - Used by more than 1,650 organizations in 63 countries
  - More than 76,000 downloads directly from the OSU site
  - Empowering many TOP500 clusters
    • 7th-ranked 111,104-core cluster (Pleiades) at NASA
    • 17th-ranked 62,976-core cluster (Ranger) at TACC
  - Available with the software stacks of many IB, HSE, and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distributions (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
• Conclusions and Future Work
MPI_Ibcast (small messages)
[Figure: on each rank i, the NIC's receive queue holds pre-posted buffers b1, b2, b3, b4, ... that MVAPICH2 tracks in a buffer-list. MPI_Ibcast de-queues an entry R from the list and posts a task chain to the NIC: a wait task W(R) for the incoming message, followed by send tasks S1(b1), S2(b1), S3(b1) that forward the received buffer b1 onward.]
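A minimal sketch of how such a per-rank task chain could be assembled, reusing the hypothetical task_t and post_task_list() declarations from the overview sketch above. The three-way fan-out matches the figure; a real implementation would derive the peers from the broadcast tree and the de-queued buffer-list entry R.

```c
/* Sketch: small-message Ibcast schedule for one rank (hypothetical types
 * from the earlier offload sketch). */
int ibcast_small_schedule(void *b1, size_t len, int is_root,
                          const int *peers, int npeers)
{
    task_t list[16];                 /* one optional wait + forwarding sends */
    int n = 0;

    if (!is_root)                    /* W(R): wait until the pre-posted      */
        list[n++] = (task_t){ TASK_WAIT, NULL, 0, -1 };   /* receive of b1 lands */

    for (int i = 0; i < npeers && n < 16; i++)            /* S1(b1) .. Sk(b1)    */
        list[n++] = (task_t){ TASK_SEND, b1, len, peers[i] };

    return post_task_list(list, n);  /* NIC progresses the chain; the host   */
                                     /* stays free until MPI_Wait            */
}
```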
MPI_Ibcast (large messages)
• Use a separate large-message QP
• Dynamically register user buffers
• Use InfiniBand's RNR mechanism for flow control
• Network-offload-based scatter-allgather algorithms (a host-level sketch of the two phases follows)
  - Scatter: Binomial tree (special case of K-nomial with K=2)
  - Allgather: Recursive-doubling or Ring
• Use send/recv/wait tasks to implement the above algorithms; flow control is guaranteed by the network
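The sketch below illustrates the two phases using ordinary blocking MPI calls purely for readability; in the actual design the equivalent send/recv/wait tasks are posted to the HCA so the host does not drive these steps. It assumes root 0, a power-of-two number of processes, and a message size divisible by the process count.

```c
/* Sketch: large-message broadcast as a binomial-tree scatter followed by a
 * recursive-doubling allgather (host-level illustration only). */
#include <mpi.h>
#include <stddef.h>

void bcast_scatter_allgather(char *buf, size_t total, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    size_t chunk  = total / p;                /* bytes each rank ends up owning */
    size_t my_off = (size_t)rank * chunk;
    size_t have   = (rank == 0) ? total : 0;  /* bytes currently held           */

    /* --- Phase 1: binomial-tree scatter (root = 0) --- */
    int mask;
    for (mask = 1; mask < p; mask <<= 1) {
        if (rank & mask) {                    /* receive our block from parent  */
            have = chunk * (size_t)mask;      /* covers ranks [rank, rank+mask) */
            MPI_Recv(buf + my_off, (int)have, MPI_BYTE, rank - mask, 0,
                     comm, MPI_STATUS_IGNORE);
            break;
        }
    }
    for (mask >>= 1; mask > 0; mask >>= 1) {  /* forward the upper half to each child */
        size_t half = chunk * (size_t)mask;
        MPI_Send(buf + my_off + (have - half), (int)half, MPI_BYTE,
                 rank + mask, 0, comm);
        have -= half;
    }

    /* --- Phase 2: recursive-doubling allgather --- */
    for (int d = 1; d < p; d <<= 1) {
        int partner = rank ^ d;
        size_t mine   = (size_t)(rank    & ~(d - 1)) * chunk;  /* block we hold     */
        size_t theirs = (size_t)(partner & ~(d - 1)) * chunk;  /* block we will get */
        size_t seg    = chunk * (size_t)d;
        MPI_Sendrecv(buf + mine,   (int)seg, MPI_BYTE, partner, 1,
                     buf + theirs, (int)seg, MPI_BYTE, partner, 1,
                     comm, MPI_STATUS_IGNORE);
    }
}
```

Recursive doubling is shown for the allgather phase; the ring variant mentioned on the slide would instead exchange neighbouring blocks over p-1 steps.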
Multi-Core-Aware, Shared-Memory-Based Collective Algorithms in MVAPICH2
[Figure: node leaders from Node1 ... NodeN (each node with Sockets A and B) form a leader communicator for the inter-leader (network) phase; within each node, the intra-node phase is performed over shared memory.]
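The communicator structure in the figure can be set up with two MPI_Comm_split calls, as in the sketch below. The node_color() helper, which hashes the hostname, is an assumption made for illustration; MVAPICH2 derives node membership internally.

```c
/* Sketch: build intra-node and leader communicators for a two-level
 * collective.  node_color() is an illustrative helper, not MVAPICH2 code. */
#include <mpi.h>
#include <unistd.h>

static int node_color(void)
{
    char name[256] = {0};
    gethostname(name, sizeof(name) - 1);
    unsigned h = 5381;                       /* djb2-style hostname hash        */
    for (const char *c = name; *c; c++)      /* (hash collisions across nodes   */
        h = h * 33u + (unsigned char)*c;     /*  are ignored in this sketch)    */
    return (int)(h & 0x7fffffff);            /* non-negative split color        */
}

void build_two_level_comms(MPI_Comm comm, MPI_Comm *intra, MPI_Comm *leaders)
{
    int rank, intra_rank;
    MPI_Comm_rank(comm, &rank);

    /* Ranks on the same node share a color -> intra-node communicator */
    MPI_Comm_split(comm, node_color(), rank, intra);

    /* Rank 0 within each node becomes the leader; leaders form the
     * communicator used for the inter-leader (network) phase, while all
     * other ranks get MPI_COMM_NULL. */
    MPI_Comm_rank(*intra, &intra_rank);
    MPI_Comm_split(comm, intra_rank == 0 ? 0 : MPI_UNDEFINED, rank, leaders);
}
```

For MPI_Ibcast, the leader phase then runs over the leader communicator (offloaded to the NIC) and the intra-node phase over the intra-node communicator, using shared memory or NIC loopback depending on the design choice on the next slide.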
Design Choices for MPI_Ibcast
• Flat-Offload (FO): offload the entire broadcast operation
• Two-Level-Host (TOH): inter-leader phase through the offload channel; intra-node phase through shared memory (during MPI_Wait)
  (better for latency-sensitive scenarios; see the completion sketch below)
• Two-Level-Offload (TOO): inter-leader phase through the offload channel; intra-node phase through offload loopback
  (can offer better overlap if latency is not too critical)
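A conceptual sketch of the TOH completion path: the intra-node shared-memory step is deferred to MPI_Wait, while the inter-leader part has been progressing on the NIC in the meantime. The request structure and the shmem_bcast_intra_node() helper are hypothetical, and poll_mcq() is the placeholder from the earlier offload sketch, not MVAPICH2 internals.

```c
/* Sketch: completing a Two-Level-Host (TOH) MPI_Ibcast inside MPI_Wait.
 * All names below are hypothetical illustrations. */
typedef struct {
    int    offload_handle;   /* inter-leader task list posted at Ibcast time */
    int    is_leader;        /* is this rank the leader (rank 0) of its node? */
    void  *buf;
    size_t len;
} ibcast_toh_req_t;

void shmem_bcast_intra_node(void *buf, size_t len);   /* hypothetical helper */

void ibcast_toh_wait(ibcast_toh_req_t *req)
{
    if (req->is_leader)
        while (!poll_mcq(req->offload_handle))  /* inter-leader phase done?   */
            ;                                   /* (the NIC progressed it)    */

    /* Intra-node fan-out over shared memory, done lazily here so the host
     * stays free between MPI_Ibcast and MPI_Wait. */
    shmem_bcast_intra_node(req->buf, req->len);
}
```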
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
  - Micro-Benchmark Evaluations
  - Experiments with HPL
  - Impact of System Noise
• Conclusions and Future Work
Experimental Setup
• Intel Xeon E5640 (2.53 GHz), 12 GB memory per node
• MT26428 QDR ConnectX-2 HCAs with PCI-Express interfaces, 171-port Mellanox QDR switch, OFED 1.5.1
• RHEL 5.4, kernel version 2.6.18-164.el5
• MVAPICH2 (v1.6)
Micro-Benchmark Evaluations
Throughput Benchmark:
  start_throughput_timer()
  MPI_Ibcast(..)
  CBLAS_DGEMM()
  MPI_Wait(..)
  end_throughput_timer()
• Measure throughput of CBLAS_DGEMM with
  - Offload MPI_Ibcast
  - Host-based MPI_Ibcast
(an expanded, runnable version of this benchmark is sketched below)
Impact of Noise:
• Daemon process on each core running mat_mul
• Noise frequency: 20 Hz to 1 kHz
• Noise duration: 50 to 250 usec
(a sketch of such a noise daemon follows the benchmark sketch)
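An expanded, runnable sketch of the throughput benchmark above. It is written against the MPI-3 MPI_Ibcast signature (the prototype described in this talk may expose the call under a different name) and a generic CBLAS; the DGEMM size N and the broadcast message size are example values, not the values used in the study.

```c
#include <mpi.h>
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int    N   = 1024;               /* DGEMM problem size (example)     */
    const size_t MSG = 1 << 20;            /* 1 MB broadcast payload (example) */

    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = malloc((size_t)N * N * sizeof *C);
    char   *msg = calloc(MSG, 1);
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    MPI_Request req;
    double t0 = MPI_Wtime();                          /* start_throughput_timer() */

    MPI_Ibcast(msg, (int)MSG, MPI_BYTE, 0, MPI_COMM_WORLD, &req);

    /* Computation overlapped with the broadcast: C = A * B */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();                          /* end_throughput_timer()   */

    /* DGEMM performs 2*N^3 floating-point operations */
    printf("Rank-local DGEMM throughput: %.2f GFLOPS\n",
           2.0 * N * N * N / (t1 - t0) / 1e9);

    free(A); free(B); free(C); free(msg);
    MPI_Finalize();
    return 0;
}
```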
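The noise injection can be reproduced with a small per-core daemon like the sketch below. The busy-loop body stands in for the mat_mul kernel mentioned on the slide, and the particular frequency and duration values are examples taken from the quoted ranges, not the authors' exact daemon.

```c
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const double freq_hz = 1000.0;    /* slide quotes 20 Hz to 1 kHz       */
    const double dur_sec = 250e-6;    /* slide quotes 50 to 250 usec       */
    const useconds_t period_us = (useconds_t)(1e6 / freq_hz);

    volatile double sink = 0.0;
    for (;;) {
        double start = now_sec();
        while (now_sec() - start < dur_sec)       /* burn CPU for dur_sec  */
            sink = sink * 1.0000001 + 1.0;        /* stand-in for mat_mul  */
        /* sleep out the rest of the period (assumes duration < period) */
        usleep(period_us - (useconds_t)(dur_sec * 1e6));
    }
    return 0;
}
```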
Communication/Computation Overlap
[Figure: DGEMM throughput (GFLOPS) vs. DGEMM problem size (N), comparing Offload-TOH against libNBC-Test-2/5/10 and the theoretical peak.]
CBLAS-DGEMM overlapped with Offload-Ibcast delivers better throughput than with host-based Ibcast, with 256 processes
Communication/Computation Overlap
[Figure: DGEMM throughput (GFLOPS) vs. MPI_Ibcast message length (256 KB to 8 MB), comparing Offload-TOO against libNBC-Test-1/2/4 and the theoretical peak.]
Bcast-Offload delivers near-perfect communication/computation overlap for all message sizes, in a portable manner, with 256 processes
Latency Comparison
[Figure: broadcast latency vs. message length, comparing MV2-Bcast-Default, MV2-Bcast-Loop-back, libNBC, Offload-TOO, and Offload-TOH; small messages (16 B to 16 KB) in usec, larger messages in msec.]
Bcast latency comparison with 256 processes: Bcast-Offload delivers good overlap without sacrificing communication latency
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
  - Micro-Benchmark Evaluations
  - Experiments with HPL
  - Impact of System Noise
• Conclusions and Future Work
HPL Performance
[Figure: normalized HPL performance vs. HPL problem size (N) as a percentage of total memory (10% to 70%), comparing HPL-Offload, HPL-1ring, and HPL-Host with 512 processes.]
HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host, improving peak throughput by up to 4.5% for large problem sizes
HPL Performance and Problem Size
[Figure: HPL peak throughput (GFlops) and the corresponding problem size (as % of memory consumed) vs. system size (64 to 512 processes), for HPL-Offload, HPL-1ring, and HPL-Host.]
HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times
HPL Performance Analysis
[Figure: breakdown of MPI time with HPL's 1-ring broadcast algorithm (HPL-Bcast-Send, HPL-Bcast-Recv, HPL-Iprobe, rest of MPI) and with the HPL-Offload-Ibcast algorithm (Offload-Ibcast Wait, rest of MPI), for 256 processes and N = 177,000.]
HPL with Offload-Ibcast has lower MPI-level overheads than HPL's 1-ring broadcast algorithm
Outline
• Introduction
• Motivation and Problem Statement
• Designing MPI_Ibcast with Collective Offload
• Experimental Evaluation
  - Micro-Benchmark Evaluations
  - Experiments with HPL
  - Impact of System Noise
• Conclusions and Future Work
Impact of Noise
[Figure: DGEMM throughput degradation (%) due to system noise, as a function of noise duration (usec) and noise frequency (Hz), for host-based and offload-based MPI_Ibcast.]
Host-based throughput drops by about 7.9%, while Offload throughput drops by only about 3.9%
Conclusions and Future Work
• The proposed MPI_Ibcast shows near-perfect overlap
• MPI_Ibcast based on CORE-Direct improves HPL's peak throughput by up to 4.5% with 512 processes
• HPL-Offload achieves peak throughput with significantly smaller problem sizes and shorter run-times
• Offload-based MPI_Ibcast is more resilient to noise
Future work
• Extend offload-based techniques to other MPI collectives and study their benefits with real applications
• Support for offload-based collectives will be available in future MVAPICH2 releases
Thank you!
http://mvapich.cse.ohio-state.edu
(1) {kandalla, subramon, viennej, pai, surs, panda}@cse.ohio-state.edu
(2) [email protected]
(1) Network-Based Computing Laboratory, The Ohio State University
(2) The Ohio Supercomputer Center