ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications

Manjunath Gorentla Venkata1, Richard L. Graham1, Joshua S. Ladd1, Pavel Shamis1, Ishai Rabinovitz2, Vasily Filipov2, and Gilad Shainer3

Managed by UT-Battelle for the Department of Energy

Acknowledgements
•  US Department of Energy ASCR FASTOS program
•  HPC Advisory Council (computer resources)
   –  www.hpcadvisorycouncil.com

Outline
•  Problems being addressed
•  Core-Direct capabilities
•  Broadcast algorithm
•  Results
•  Next steps

Problems Being Addressed – Collective Operations
•  Communication characteristics at scale
•  Overlapping computation with communication – true asynchronous communications
   –  Goal: Avoid using the CPU for communication processing
•  System noise
•  Application skew → Scalability
•  Collective communication performance

Scalability of Collective Operations

[Figure: two panels, "Ideal Algorithm" and "Impact of System Noise".]

Scalability of Collective Operations - II

[Figure: two panels, "Offloaded Algorithm" and "Nonblocking Algorithm". Legend: communication processing.]

InfiniBand Collective Offload – Key idea
•  Create local description of the communication patterns
•  Hand the description to the HCA
•  Manage collective communications at the network level
•  Poll for collective completion
•  Add new support for:
   –  Synchronization primitives (hardware)
      •  Send Enable task
      •  Receive Enable task
      •  Wait task
   –  Multiple Work Request
      •  A sequence of network tasks
   –  Management Queue
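The key idea above (describe the pattern locally, hand it to the HCA, then only poll) can be sketched with a toy model. This is not the CORE-Direct verbs interface: `SimulatedHCA`, the `WAIT`/`SEND` task names, and the chain-shaped pattern are all hypothetical, chosen only to show a posted task list running to completion without the host driving each step.

```python
from collections import deque

WAIT, SEND = "wait", "send"  # hypothetical task kinds, not real verbs opcodes


class SimulatedHCA:
    """Toy stand-in for an HCA that executes a posted task list on its own."""

    def __init__(self, rank, data=None):
        self.rank = rank
        self.data = data
        self.tasks = deque()
        self.arrived = {}  # source rank -> payload delivered by the "network"

    def post(self, tasks):
        """Host side: hand the whole communication description over at once."""
        self.tasks.extend(tasks)

    def done(self):
        """Host side: the only thing the host does afterwards is poll this."""
        return not self.tasks

    def progress(self, fabric):
        """Run tasks in order until the list is empty or a wait task blocks."""
        while self.tasks:
            kind, peer = self.tasks[0]
            if kind == WAIT:
                if peer not in self.arrived:
                    return  # blocked until the peer's message arrives
                self.data = self.arrived.pop(peer)
            else:  # SEND
                fabric[peer].arrived[self.rank] = self.data
            self.tasks.popleft()


def chain_broadcast(payload, nranks):
    """Broadcast along a chain 0 -> 1 -> ... (assumed pattern, for brevity)."""
    fabric = [SimulatedHCA(r, payload if r == 0 else None) for r in range(nranks)]
    for r, hca in enumerate(fabric):
        tasks = []
        if r > 0:
            tasks.append((WAIT, r - 1))  # wait for data from the previous rank
        if r < nranks - 1:
            tasks.append((SEND, r + 1))  # then forward it to the next rank
        hca.post(tasks)
    while not all(h.done() for h in fabric):  # host polls for completion
        for h in fabric:
            h.progress(fabric)
    return [h.data for h in fabric]
```

Calling `chain_broadcast("hello", 4)` returns the payload on every rank. The point of the sketch is the division of labour: the ordering between receive and forward lives in the task list, not in host code.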

InfiniBand Hardware Changes
•  Tasks defined in the current standard
   •  Send
   •  Receive
   •  Read
   •  Write
   •  Atomic
•  New support
   •  Synchronization primitives (hardware)
      –  Send Enable task
      –  Receive Enable task
      –  Wait task
   •  Multiple Work Request
      –  A sequence of network tasks
   •  Management Queue

Enhanced MPI Queue Design

[Figure: queue design diagram. Labels: Per Communicator Resources; Per Peer Resources; Collective MQ; Service MQ; MQ CQ; Send CQ; All send queues; Resource recycling; Small data; Large data; Credit QP; Send; Recv; Recv CQ.]

Broadcast Algorithms
•  Single level hierarchy
   –  Short messages: broadcast over K-nomial tree, K=2
   –  Large messages: Scatter/AllGather with K-nomial tree, K=2
•  Multiple levels of hierarchy
   –  Broadcast within each level
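A K-nomial tree with K=2 is a binomial tree. The sketch below shows one common way to derive each rank's parent and children from bit operations on the rank number. The function name `binomial_tree` and the virtual-rank rotation for non-zero roots are illustrative assumptions, not taken from the Cheetah implementation.

```python
def binomial_tree(rank, size, root=0):
    """Return (parent, children) of `rank` in a binomial broadcast tree.

    Ranks are rotated so the root becomes virtual rank 0. A rank's parent
    is found by clearing its lowest set bit; its children are obtained by
    adding each smaller power of two.
    """
    vrank = (rank - root) % size

    parent = None
    mask = 1
    while mask < size:
        if vrank & mask:
            parent = (vrank - mask + root) % size
            break
        mask <<= 1

    children = []
    mask >>= 1
    while mask > 0:
        if vrank + mask < size:
            children.append((vrank + mask + root) % size)
        mask >>= 1
    return parent, children


if __name__ == "__main__":
    for r in range(8):
        print(r, binomial_tree(r, 8))
```

With 8 ranks and root 0, the root sends to ranks 4, 2, and 1, and every rank holds the data after three rounds, which is why tree depth grows with the logarithm of the process count.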

IB – Large Message Algorithm

[Figure: two processes, each with a Credit QP and Send/Recv queues; Wait tasks are queued on the sending side. Steps: 1. Register receive memory; 2. Notify sender; 3. Wait on credit message; 4. Send user data.]

System setup
•  8 node cluster
•  Node Architecture
   –  3 GHz Intel Xeon
   –  Dual socket
   –  Quad core
•  Network
   –  ConnectX-2 HCA
   –  36 port QDR switch running pre-release firmware

Broadcast Latency – usec per call

Msg size   IBOff + SM   IBOff    P2P + SM   Open MPI – default   MVAPICH
16B        3.48         16.11    2.55       5.58                 5.81
1KB        4.87         23.96    5.66       12.20                10.46
8MB        25244        40735    28288      37343                41439
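To put the latency figures above in relative terms, the following sketch divides each configuration's latency by that of IBOff + SM. The numbers are copied from the slide; the dictionary layout and the helper name `ratio_to_iboff_sm` are illustrative assumptions.

```python
# Blocking broadcast latency in usec per call, as reported on the slide.
LATENCY_USEC = {
    "16B": {"IBOff + SM": 3.48, "IBOff": 16.11, "P2P + SM": 2.55,
            "Open MPI default": 5.58, "MVAPICH": 5.81},
    "1KB": {"IBOff + SM": 4.87, "IBOff": 23.96, "P2P + SM": 5.66,
            "Open MPI default": 12.20, "MVAPICH": 10.46},
    "8MB": {"IBOff + SM": 25244, "IBOff": 40735, "P2P + SM": 28288,
            "Open MPI default": 37343, "MVAPICH": 41439},
}


def ratio_to_iboff_sm(msg_size, config):
    """Latency of `config` divided by IBOff + SM latency at `msg_size`.

    A value above 1 means IBOff + SM is faster for that message size.
    """
    row = LATENCY_USEC[msg_size]
    return row[config] / row["IBOff + SM"]


if __name__ == "__main__":
    for size, row in LATENCY_USEC.items():
        for config in row:
            print(f"{size:>4}  {config:<18} {ratio_to_iboff_sm(size, config):.2f}x")
```

Read this way, IBOff + SM is ahead of the Open MPI default at all three sizes, while P2P + SM is the fastest configuration at 16B.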

Nonblocking Broadcast Latency – usec per call

Msg size   IBOff + SM   IBOff    P2P + SM
16B        3.58         19.79    2.57
1KB        4.96         27.44    5.70
8MB        26100        37855    28781

Broadcast – small data - hierarchical

[Figure: latency (usec, 0 to 180) versus message size (bytes, 0 to 35000). Series: OMPI Cheetah (uma,iboffload); OMPI Cheetah (socket,uma,iboffload); OMPI Cheetah (p2p); OMPI Cheetah (uma,p2p); OMPI Cheetah (socket,uma,p2p); OMPI Default; Mvapich-1 1.2rc1.]

Broadcast – large data - hierarchical

[Figure: latency (usec, 0 to 80000) versus message size (bytes, 0 to 1.8e+07). Series: OMPI Cheetah (uma,iboffload); OMPI Cheetah (socket,uma,iboffload); OMPI Cheetah (p2p); OMPI Cheetah (uma,p2p); OMPI Cheetah (socket,uma,p2p); OMPI Default; Mvapich-1 1.2rc1.]

Overlap Measurement
Benchmark steps:

Polling Method
1.  Post broadcast
2.  Do work and poll for completion
3.  Continue until broadcast completion

Post-work-wait Method
1.  Post broadcast
2.  Do work
3.  Wait for broadcast completion
4.  Compare the time of steps 1 – 3 with post-wait
5.  Increase the work and repeat steps 1-4 until the time for post-work-wait is greater than post-wait
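The post-work-wait steps can be sketched on a virtual clock so the loop structure is visible without real hardware. Everything here is a toy model: `VirtualBcast`, its `offloaded` flag, and the definition of overlap as the largest hidden work time divided by the post-wait time are assumptions made for illustration, since the slides do not state the formula.

```python
class VirtualBcast:
    """Simulated nonblocking broadcast on a virtual clock (toy model).

    If `offloaded` is True the transfer advances while the host works;
    otherwise it only advances inside wait(), as if the host had to
    drive the communication itself.
    """

    def __init__(self, comm_time, offloaded):
        self.comm_time = comm_time
        self.offloaded = offloaded
        self.clock = 0.0
        self.remaining = 0.0

    def post(self):
        self.remaining = self.comm_time

    def work(self, duration):
        self.clock += duration
        if self.offloaded:
            self.remaining = max(0.0, self.remaining - duration)

    def wait(self):
        self.clock += self.remaining
        self.remaining = 0.0


def post_work_wait(bcast, work_time):
    """Steps 1-3: post the broadcast, do work, wait; return elapsed time."""
    start = bcast.clock
    bcast.post()
    bcast.work(work_time)
    bcast.wait()
    return bcast.clock - start


def measure_overlap(make_bcast, step):
    """Steps 4-5: grow the work until post-work-wait exceeds post-wait."""
    post_wait = post_work_wait(make_bcast(), 0.0)
    hidden = 0.0
    while post_work_wait(make_bcast(), hidden + step) <= post_wait:
        hidden += step
    return hidden / post_wait  # assumed overlap definition


if __name__ == "__main__":
    print(measure_overlap(lambda: VirtualBcast(100.0, offloaded=True), 10.0))
    print(measure_overlap(lambda: VirtualBcast(100.0, offloaded=False), 10.0))
```

In this model a fully offloaded broadcast hides all of the work and a host-driven one hides none. Real measurements fall between those extremes and depend on the step size used to grow the work.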

Nonblocking Broadcast – Overlap - Poll

[Figure: overlap (80% to 105%) versus message size (bytes, 0 to 1.8e+07). Series: OMPI Cheetah Offloaded Bcast (iboffload); OMPI Cheetah Host Bcast (p2p).]

Nonblocking Broadcast – Overlap - Wait

[Figure: overlap (0% to 100%) versus message size (bytes, 0 to 1.8e+07). Series: OMPI Cheetah Offloaded Bcast Min, Max, and Avg (iboffload); OMPI Cheetah Host Bcast Min, Max, and Avg (p2p).]

Summary
•  Added hardware support for offloading broadcast operations
•  Developed MPI-level one-copy support for asynchronous, contiguous large-data transfers
•  Good broadcast performance
•  Good overlap capabilities
