Design and Implementation of Control Sequence Generator for SDN-enhanced MPI
Baatarsuren Munkhdorj, Keichi Takahashi, Dashdavaa Khureltulga, Yasuhiro Watashiba, Yoshiyuki Kido, Susumu Date and Shinji Shimojo
Our vision
HPC + Software Defined Networking ➡ Network-aware High Performance Computing
Background
• Computer clusters are a recent trend in High Performance Computing (HPC) architecture, and the scale of cluster systems increases year by year.
[Chart: architecture share of HPC systems (top500.org) — Cluster 87%, MPP 13%]
➡ The interconnect has become a key factor for achieving high throughput on computer clusters.
• Message Passing Interface (MPI) has become the de facto standard for programming computer clusters.
➡ Optimization and tuning of MPI programs is an important mission to improve the performance of cluster-based supercomputers.
Message Passing Interface
• MPI provides a suite of APIs for inter-process communication.
• Point-to-point communication APIs: one-to-one communication
• Collective communication APIs: multi-process communication
• MPI collective communication has to be performed as quickly as possible to gain high performance.
➡ Acceleration of MPI collective operations is the key to success.
[Figure: example collective communication patterns among Rank 0–3 — MPI_Allreduce, MPI_Bcast, MPI_Gather]
MPI and the underlying network
• Although the performance of MPI collective communication heavily depends on the underlying network, current MPI implementations are not network-aware.
➡ As the scale of cluster systems increases, the interconnect also grows in scale and complexity.
Software Defined Networking
• Software Defined Networking (SDN) is a concept in which the control plane and forwarding plane of the network are completely separated.
➡ Network programmability: SDN offers a software programming interface between the network administrator and the network controller, which realizes centralized flow control over SDN switches.
[Figure: a program on the SDN controller installs forwarding rules onto the SDN switches.]
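To make the centralized flow-control idea concrete, the sketch below shows how a controller program could install a forwarding rule on a connected switch. The slides do not name a controller framework; this is a minimal illustration assuming OpenFlow 1.3 with the Ryu framework, and the destination address and output port are purely illustrative.

# Minimal sketch (assumption: OpenFlow 1.3 + the Ryu controller framework).
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class CentralizedFlowControl(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_connect(self, ev):
        dp = ev.msg.datapath
        parser = dp.ofproto_parser
        ofp = dp.ofproto
        # Illustrative rule: send IPv4 traffic destined to 10.0.0.3 out of port 2.
        match = parser.OFPMatch(eth_type=0x0800, ipv4_dst="10.0.0.3")
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))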
Our vision
HPC + Software Defined Networking ➡ Network-aware High Performance Computing
Our previous works
• We have successfully accelerated MPI collective operations by integrating SDN's network control features into MPI.
• SDN_MPI_Bcast: applies multicast
• SDN_MPI_Allreduce: avoids link congestion
[Figure: MPI_Bcast and MPI_Allreduce among Process 1–4]
SDN-enhanced MPI Collectives
[Figure: SDN_MPI_Bcast — the SDN controller programs the switches to duplicate the broadcast packet toward Processes 1–4; SDN_MPI_Allreduce — the controller routes traffic around links with congestion.]
Dashdavaa, Khureltulga, et al. "Architecture of a high-speed MPI_Bcast leveraging software-defined network." Euro-Par 2013: Parallel Processing Workshops. Springer Berlin Heidelberg, 2014.
Takahashi, Keichi, et al. "Performance evaluation of SDN-enhanced MPI Allreduce on a cluster system with fat-tree interconnect." High Performance Computing & Simulation (HPCS), 2014 International Conference on. IEEE, 2014.
Current situation
• Our previous works focused only on the acceleration of individual MPI collective communications.
➡ Each SDN-enhanced MPI collective operation has been developed individually. As a result, the SDN-enhanced MPI collective operations cannot be used sequentially in a single MPI application.
• A general framework that allows a whole MPI program to run with the accelerated operations is required.
[Figure: SDN_MPI_Bcast library with its own controller for SDN_MPI_Bcast; SDN_MPI_Allreduce library with its own controller for SDN_MPI_Allreduce]
Research goal
Our final goal is to realize a network-aware HPC environment (an SDN-enhanced MPI framework) that puts the accelerated MPI collective operations into practice.
Overview of SDN-enhanced MPI Framework in action
[Figure (timeline animation): when the MPI application calls MPI_Bcast, the SDN controller activates the control module for MPI_Bcast and reconfigures the switches so that the broadcast packet is duplicated toward Processes 1–4. When the application subsequently calls MPI_Allreduce, the controller activates the control module for MPI_Allreduce and reconfigures the switches to avoid links with congestion.]
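The module switching sketched above can be viewed as a simple dispatch table inside the controller. The following framework-agnostic Python sketch illustrates the idea; the handler names and their interfaces are assumptions, while the entry format (type, group, src) mirrors the event list shown later.

# Sketch of per-collective control-module dispatch inside the SDN controller
# (assumption: handler names and interfaces are illustrative only).
def configure_for_bcast(entry, groups):
    # Would install multicast-style duplication rules for the group members.
    print("bcast: duplicate from", entry["src"], "to", groups[entry["group"][0]])

def configure_for_allreduce(entry, groups):
    # Would install congestion-avoiding routes among the group members.
    print("allreduce: balance links for", [groups[g] for g in entry["group"]])

CONTROL_MODULES = {
    "Bcast": configure_for_bcast,
    "Allreduce": configure_for_allreduce,
}

def on_notification(entry, groups):
    """Called when the next collective in the control sequence is announced."""
    CONTROL_MODULES[entry["type"]](entry, groups)

# Example: two control sequence entries and a group-to-IP mapping.
groups = {0: ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]}
on_notification({"type": "Bcast", "group": [0], "src": 0}, groups)
on_notification({"type": "Allreduce", "group": [0], "src": None}, groups)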
Technical issues
1. Control sequence generator
• How does the system generate an actual control sequence corresponding to the MPI collective operations, so that the SDN controller can switch control modules?
• The runtime information of the MPI processes is needed for the SDN controller to switch the network configuration, e.g.:
• process distribution over the computing nodes
• communication group attributes
• source and destination relationships
2. Synchronization mechanism of the "network program" on SDN switches
• How does the SDN controller detect which MPI collective routine is being performed?
Focus of this paper
Control sequence (example):
1. MPI_Bcast(group #, IP addresses, rank (IP address) of source, …)
2. MPI_Allreduce(group #, IP addresses, group scope, …)
3. …
[Figure: the control sequence derived from the MPI application is fed to the SDN controller (SDN_MPI_Bcast, SDN_MPI_Allreduce, …), which, together with the synchronization mechanism, controls the SDN switches.]
Control Sequence Generator (1/2)
• The core idea is to leverage the execution log of the MPI application as an information source and generate the control sequence for the corresponding MPI collective communications.
• The following information has to be extracted from the execution log:
• event list: the execution order of MPI collective operations
• process distribution: IP addresses of the processes
• communication group attributes: which processes belong to which communication group
• source and destination relationship: which processes send a message, and which processes receive it
Control Sequence Generator (2/2)
[Figure: the MPI application writes all the necessary information, such as IP addresses and group numbers, into a log file (e.g. "ts=0.230285 icomm=0 rank=3 thd=0 type=comm et=IntraCommCreate icomm=2 rank=3"). The log analyzer reads the log file and generates the control sequence (e.g. MPI_Bcast(group #, IP addresses of processes, rank (IP address) of source, …)), from which the SDN controller derives flow control rules for the SDN switches.]
Control sequence
Log file (with additional fields):
ts=0.230285 icomm=0 rank=3 thd=0 type=comm et=IntraCommCreate icomm=2 rank=3
ts=0.273049 icomm=0 rank=1 thd=0 type=comm et=IntraCommCreate icomm=2 rank=1
ts=0.538578 icomm=0 rank=2 thd=0 type=bare et=11
ts=0.538581 icomm=0 rank=2 thd=0 type=cago et=601 bytes=10.0.0.3
ts=0.538586 icomm=0 rank=2 thd=0 type=bare et=12
ts=0.543090 icomm=0 rank=6 thd=0 type=cago et=601 bytes=10.0.0.7
ts=0.569542 icomm=0 rank=0 thd=0 type=msg et=recv icomm=0 rank=0 tag=9999 sz=1
ts=0.584932 icomm=0 rank=2 thd=0 type=msg et=recv icomm=0 rank=0 tag=9999 sz=1

Control sequence generation:
Event list:
{:type=>"Bcast", :group=>[0], :src=>0}
{:type=>"Allreduce", :group=>[7, 2], :src=>nil}
{:type=>"Bcast", :group=>[2, 7], :src=>0}
{:type=>"Allreduce", :group=>[0], :src=>nil}

Process/group attributes:
{0=>[0, 1, 2, 3, 4, 5, 6, 7], 7=>[4, 5, 6, 7], 2=>[0, 1, 2, 3]}
{0=>["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5", "10.0.0.6", "10.0.0.7", "10.0.0.8"], 7=>["10.0.0.5", "10.0.0.6", "10.0.0.7", "10.0.0.8"], 2=>["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]}
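The event list and group attributes above (printed in Ruby hash notation by the actual tool) could be extracted with a straightforward line-oriented parser. The Python sketch below is only an illustration of the idea: it handles just the IntraCommCreate records visible in the excerpt, and the interpretation that the second icomm/rank pair describes the newly created communicator is an assumption based on the sample log.

# Minimal log-analyzer sketch (assumption: only IntraCommCreate records are
# parsed; the other record types shown in the excerpt are skipped).
from collections import defaultdict

def parse_fields(line):
    """Split 'key=value' tokens, keeping repeated keys in order of appearance."""
    pairs = []
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            pairs.append((key, value))
    return pairs

def extract_groups(log_lines):
    """Return {communicator id: sorted list of member ranks}."""
    groups = defaultdict(set)
    for line in log_lines:
        pairs = parse_fields(line)
        if dict(pairs).get("et") == "IntraCommCreate":
            # Assumption: the second icomm/rank pair refers to the new communicator.
            new_comm = [v for k, v in pairs if k == "icomm"][-1]
            member = [v for k, v in pairs if k == "rank"][-1]
            groups[int(new_comm)].add(int(member))
    return {comm: sorted(ranks) for comm, ranks in groups.items()}

log = [
    "ts=0.230285 icomm=0 rank=3 thd=0 type=comm et=IntraCommCreate icomm=2 rank=3",
    "ts=0.273049 icomm=0 rank=1 thd=0 type=comm et=IntraCommCreate icomm=2 rank=1",
]
print(extract_groups(log))   # {2: [1, 3]}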
Synchronization mechanism: temporary solution
• A notification packet is sent right before an MPI collective is performed.
• The notification packet informs the SDN controller of the upcoming MPI collective communication's type and of its execution order in the event list.
• We are working on the development of a more efficient method.
[Figure (MPI application timeline): MPI_Barrier → Send_notification → SDN_MPI_Bcast (hardware-offloaded multicast) → MPI_Barrier → Send_notification → SDN_MPI_Allreduce (dynamic load balancing, link congestion avoidance) → MPI_Barrier → Send_notification; each notification packet is delivered to the SDN controller.]
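A minimal sketch of this barrier-plus-notification pattern is shown below, using mpi4py and a UDP notification packet; the controller address, port, and "<order>:<collective>" payload format are assumptions for illustration, not the implementation described in the paper.

# Sketch of the temporary synchronization scheme (assumptions: mpi4py,
# a UDP notification packet, and a hypothetical controller endpoint).
import socket
from mpi4py import MPI

CONTROLLER_ADDR = ("10.0.0.254", 9999)    # hypothetical controller endpoint

def notify(order, collective, comm):
    """Barrier first, then announce the upcoming collective to the controller."""
    comm.Barrier()                         # separate this collective from the previous one
    if comm.Get_rank() == 0:               # one notification per collective is enough
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(f"{order}:{collective}".encode(), CONTROLLER_ADDR)
        sock.close()

comm = MPI.COMM_WORLD
data = comm.Get_rank()

notify(0, "Bcast", comm)
data = comm.bcast(data, root=0)            # would be SDN_MPI_Bcast in the framework

notify(1, "Allreduce", comm)
total = comm.allreduce(data, op=MPI.SUM)   # would be SDN_MPI_Allreduce in the framework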
Experiment environment
• Experiments were conducted on both a physical and a virtual cluster.
• Physical cluster: a master node and 24 slave nodes
• CPU (×2): Intel Xeon 2.00 GHz, 6 cores
• Memory: 64 GB (DDR3)
• Virtual cluster: emulated with Mininet
• MPI implementation: Open MPI 1.6.5
Measurement result
[Charts: log file size (KByte) and time consumption of control sequence generation (sec), plotted against the number of processes (2 to 512).]
• The overhead of generating the control sequence was evaluated.
• The control sequence was generated on a single node.
• The time consumption was less than 0.1 second even when 512 processes were used.
• The result is acceptable for the number of processes we used.
Conclusion
• We aim to realize a network-aware, SDN-enhanced MPI framework in which MPI collective operations are accelerated.
• As the first stage towards this goal, we have proposed a control sequence generator for the framework.
➡ The main functionality of the control sequence generator is to provide the SDN controller with the information needed to configure the underlying network.
• We have succeeded in running distinct SDN-enhanced MPI collective operations sequentially on a prototype framework applying the proposed method.
Future plan
• Development of a module that tags MPI packets, so that MPI collective communication flows can be distinguished from other traffic, is in progress.
• For now, we adopted a combination of notification packets and the barrier function to verify the feasibility of the proposed method.
• The barrier function completely separates MPI collective communication routines, and the notification packet signals the timing at which to reconfigure the underlying network.
Thank you
Tagging module
• Tags MPI packets in the kernel
• No need to call the barrier function
• No need to send notification packets
[Figure: the MPI application uses the socket API; its system calls pass through a tagging kernel module before reaching the kernel network stack.]
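The planned module lives inside the kernel, so no code for it appears in the slides. As a rough user-space illustration of the tagging idea only, the snippet below marks a socket's traffic with a DSCP/ToS value that switches could match on; this is an assumption for illustration, not the kernel-module design.

# User-space illustration of flow tagging via the IP ToS/DSCP field
# (assumption: the real design is a kernel module, not this snippet).
import socket

MPI_COLLECTIVE_TOS = 0x20   # hypothetical tag value for collective traffic

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, MPI_COLLECTIVE_TOS)
# From here on, packets sent through this socket carry the tag, so SDN
# switches can match the tag and treat collective traffic differently.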