ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications
Manjunath Gorentla Venkata1, Richard L. Graham1, Joshua S. Ladd1, Pavel Shamis1, Ishai Rabinovitz2, Vasily Filipov2, and Gilad Shainer3
Acknowledgements
• US Department of Energy ASCR FASTOS program
• HPC Advisory Council (computer resources) – www.hpcadvisorycouncil.com
Outline
• Problems being addressed
• CORE-Direct capabilities
• Broadcast algorithm
• Results
• Next steps
Problems Being Addressed – Collective Operations
• Communication characteristics at scale
• Overlapping computation with communication
  – True asynchronous communications
  – Goal: avoid using the CPU for communication processing
• System noise
• Application skew → Scalability
• Collective communication performance
Scalability of Collective Operations
[Figure: timeline of an ideal collective algorithm across processes 1–4.]
Impact of System Noise
Scalability of Collective Operations - II
[Figure: timelines for the offloaded and nonblocking algorithms, with communication processing marked on each process.]
InfiniBand Collective Offload – Key idea
• Create local description of the communication patterns
• Hand the description to the HCA
• Manage collective communications at the network level
• Poll for collective completion
• Add new support for:
  – Synchronization primitives (hardware)
    • Send Enable task
    • Receive Enable task
    • Wait task
  – Multiple Work Request
    • A sequence of network tasks
  – Management Queue
InfiniBand Hardware Changes
• Tasks defined in the current standard
  – Send
  – Receive
  – Read
  – Write
  – Atomic
• New support (see the conceptual sketch after this list)
  – Synchronization primitives (hardware)
    • Send Enable task
    • Receive Enable task
    • Wait task
  – Multiple Work Request
    • A sequence of network tasks
  – Management Queue
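To make these task types concrete, here is a small, purely illustrative sketch of how one interior tree rank's share of an offloaded broadcast could be written down as a list of receive, wait, and send tasks, with the whole list handed to the HCA as a single multiple work request on a management queue. The task_t descriptor and every name in it are hypothetical stand-ins, not the actual CORE-Direct verbs extensions.

```c
/* Hypothetical task descriptors: an illustration of the offload idea,
 * not the real CORE-Direct verbs interface. */
#include <stdio.h>

typedef enum {
    TASK_RECV,   /* standard receive work request                    */
    TASK_WAIT,   /* hold the queue until a given completion arrives  */
    TASK_SEND    /* standard send work request                       */
} task_type_t;

typedef struct {
    task_type_t type;
    int         peer;   /* which peer the task talks to (or waits on) */
} task_t;

/* Describe one interior rank's share of a broadcast: receive from the
 * parent, wait for that receive to complete, then send to each child.
 * The resulting list would be posted to the HCA once and executed
 * without further CPU involvement. */
static int describe_bcast(int parent, const int *children, int nchildren,
                          task_t *list)
{
    int n = 0;
    list[n++] = (task_t){ TASK_RECV, parent };
    list[n++] = (task_t){ TASK_WAIT, parent };   /* gates the sends */
    for (int i = 0; i < nchildren; i++)
        list[n++] = (task_t){ TASK_SEND, children[i] };
    return n;
}

int main(void)
{
    static const char *name[] = { "recv", "wait", "send" };
    int children[] = { 5, 6 };
    task_t list[8];

    int n = describe_bcast(4, children, 2, list);
    for (int i = 0; i < n; i++)
        printf("task %d: %s peer %d\n", i, name[list[i].type], list[i].peer);
    return 0;
}
```

The send-enable and receive-enable primitives play the same gating role as the wait task here: they let pre-posted work requests fire only after an earlier task completes.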
Enhanced MPI Queue Design
[Figure: diagram of the enhanced MPI queue design; legible labels include per-communicator and per-peer send/receive resources, send and receive CQs, send/recv queues for small and large messages, a credit QP, service and collective MQs, and send-queue resource recycling. The remaining diagram text is not recoverable.]
Broadcast Algorithms
• Single level hierarchy
  – Short messages: broadcast over K-nomial tree, K=2 (see the sketch below)
  – Large messages: Scatter/AllGather with K-nomial tree, K=2
• Multiple levels of hierarchy
  – Broadcast within each level
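For reference, the K=2 (binomial) tree used by the short-message algorithm can be computed locally by every rank. The sketch below is a self-contained illustration of the standard binomial-tree construction rooted at an arbitrary rank; it is not taken from the Cheetah sources.

```c
/* Compute the parent and children of `rank` in a binomial (K-nomial,
 * K = 2) broadcast tree rooted at `root`.  Returns the number of
 * children written to children[]. */
#include <stdio.h>

static int knomial_k2_tree(int rank, int size, int root,
                           int *parent, int children[])
{
    int rel = (rank - root + size) % size;   /* rank relative to the root */
    int mask = 1;
    int nchildren = 0;

    *parent = -1;                            /* the root has no parent */
    while (mask < size) {
        if (rel & mask) {                    /* lowest set bit -> parent */
            *parent = (rel - mask + root) % size;
            break;
        }
        mask <<= 1;
    }

    mask >>= 1;                              /* children sit below that bit */
    while (mask > 0) {
        if (rel + mask < size)
            children[nchildren++] = (rel + mask + root) % size;
        mask >>= 1;
    }
    return nchildren;
}

int main(void)
{
    int children[32];
    for (int rank = 0; rank < 8; rank++) {
        int parent, n = knomial_k2_tree(rank, 8, 0, &parent, children);
        printf("rank %d: parent %d, children:", rank, parent);
        for (int i = 0; i < n; i++)
            printf(" %d", children[i]);
        printf("\n");
    }
    return 0;
}
```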
IB – Large Message Algorithm
[Figure: credit-based large-message exchange between Process 0 and Process 1, built from send, receive, and wait tasks and a credit QP: 1. register receive memory; 2. send a credit to the sender; 3. wait on the credit message; 4. send user data.]
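At the MPI level, the large-message path amounts to scattering the payload and then allgathering the pieces, so each rank forwards only about 1/N of the data instead of the whole buffer. The following sketch shows that decomposition with plain blocking MPI calls over an arbitrary communicator; it only illustrates the data movement, not the offloaded, credit-based implementation in the figure above.

```c
/* Broadcast `len` bytes from `root` by scattering chunks and then
 * allgathering them back.  Host-side illustration only; the offloaded
 * version expresses the same pattern as HCA task lists. */
#include <mpi.h>

void bcast_scatter_allgather(void *buf, int len, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int counts[size], displs[size];
    int chunk = len / size, rem = len % size, off = 0;
    for (int i = 0; i < size; i++) {
        counts[i] = chunk + (i < rem ? 1 : 0);  /* spread the remainder */
        displs[i] = off;
        off += counts[i];
    }

    /* Root scatters its buffer; everyone then gathers all the pieces. */
    char *mine = (char *)buf + displs[rank];
    MPI_Scatterv(buf, counts, displs, MPI_BYTE,
                 rank == root ? MPI_IN_PLACE : mine, counts[rank], MPI_BYTE,
                 root, comm);
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_BYTE,
                   buf, counts, displs, MPI_BYTE, comm);
}
```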
System setup
• 8 node cluster
• Node architecture
  – 3 GHz Intel Xeon
  – Dual socket
  – Quad core
• Network
  – ConnectX-2 HCA
  – 36 port QDR switch running pre-release firmware
Broadcast Latency – usec per call

Msg size | IBOff + SM | IBOff | P2P + SM | Open MPI – default | MVAPICH
16B      | 3.48       | 16.11 | 2.55     | 5.58               | 5.81
1KB      | 4.87       | 23.96 | 5.66     | 12.20              | 10.46
8MB      | 25244      | 40735 | 28288    | 37343              | 41439
Nonblocking Broadcast Latency – usec per call

Msg size | IBOff + SM | IBOff | P2P + SM
16B      | 3.58       | 19.79 | 2.57
1KB      | 4.96       | 27.44 | 5.70
8MB      | 26100      | 37855 | 28781
Broadcast – small data - hierarchical
[Figure: broadcast latency (usec, 0–180) vs. message size (0–35,000 bytes) for OMPI Cheetah (uma,iboffload), (socket,uma,iboffload), (p2p), (uma,p2p), (socket,uma,p2p), OMPI Default, and MVAPICH-1 1.2rc1.]
Broadcast – large data - hierarchical
[Figure: broadcast latency (usec, 0–80,000) vs. message size (0–1.8e7 bytes) for the same OMPI Cheetah, OMPI Default, and MVAPICH-1 1.2rc1 configurations.]
Overlap Measurement
Benchmark steps:

Polling Method
1. Post broadcast
2. Do work and poll for completion
3. Continue until broadcast completion

Post-work-wait Method (a sketch of this method follows below)
1. Post broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1–3 with post-wait
5. Increase the work and repeat steps 1–4 until the time for post-work-wait is greater than post-wait
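A minimal sketch of the post-work-wait method is shown below. It is written against MPI-3's MPI_Ibcast purely for illustration; the work presented in these slides predates MPI-3 and used its own nonblocking broadcast interface. The message size, work kernel, and stopping rule are assumptions for the sketch.

```c
/* Post-work-wait overlap probe: grow the work placed between MPI_Ibcast
 * and MPI_Wait until post+work+wait takes longer than post+wait alone. */
#include <mpi.h>
#include <stdio.h>

static double do_work(long iters)            /* synthetic CPU-only work */
{
    volatile double x = 0.0;
    for (long i = 0; i < iters; i++)
        x += i * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    enum { LEN = 1 << 20 };                  /* 1 MB message, for example */
    static char buf[LEN];
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Baseline: post the broadcast and immediately wait on it. */
    double t0 = MPI_Wtime();
    MPI_Ibcast(buf, LEN, MPI_BYTE, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double post_wait = MPI_Wtime() - t0;

    /* Keep doubling the overlapped work until it is no longer hidden. */
    for (long iters = 1000; ; iters *= 2) {
        t0 = MPI_Wtime();
        MPI_Ibcast(buf, LEN, MPI_BYTE, 0, MPI_COMM_WORLD, &req);
        do_work(iters);                      /* computation overlapped here */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double post_work_wait = MPI_Wtime() - t0;

        /* All ranks must agree on when to stop so the Ibcast counts match. */
        int done = post_work_wait > post_wait, all_done;
        MPI_Allreduce(&done, &all_done, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
        if (all_done) {
            if (rank == 0)
                printf("work no longer hidden after %ld iterations\n", iters);
            break;
        }
    }

    MPI_Finalize();
    return 0;
}
```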
Nonblocking Broadcast – Overlap - Poll
[Figure: overlap (80–105%) vs. message size (0–1.8e7 bytes) for OMPI Cheetah Offloaded Bcast (iboffload) and OMPI Cheetah Host Bcast (p2p).]
Nonblocking Broadcast – Overlap - Wait
[Figure: overlap (0–100%) vs. message size (0–1.8e7 bytes), showing min/max/avg for OMPI Cheetah Offloaded Bcast (iboffload) and OMPI Cheetah Host Bcast (p2p).]
Summary
• Added hardware support for offloading broadcast operations
• Developed MPI-level support for one-copy, asynchronous transfer of large contiguous data
• Good broadcast performance
• Good overlap capabilities