Transitively Deadlock-Free Routing Algorithms ¨ Quintin:
[email protected] Jean-Noel ´ Pierre Vigneras:
[email protected]
2016-03-12
HPC platforms
Curie 2011 >5 000 nodes Tera1000 2017 >8 000 nodes
I Examples from the Top500 list:
– K Computer 2011: 82 944 nodes. – Sequoia 2013: 98 304 nodes.
2 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
OpenSM Solution I Behavior: – – – –
discover the topology, select the routing algorithm, compute new routing table, distribute the routing tables.
I Routing computation time above one minute [1]:
[1] Zahavi et. al. ”Quasi Fat Trees for HPC Clouds and Their Fault-Resilient Closed-Form Routing.” HOTI 2014
3 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
BXI Routing Solution I Behavior:
– compute and distribute routing table patches.
Computes patches under 2 seconds for 64 800 nodes[2]:
´ [2] Quintin, Vigneras; Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture. HiPINEB’15 4 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
BXI Hardware Features I Scales up to 64k destination I Network Interface Controler:
– hardware implementation of Portals4, – offload and os bypass (MPI, SHMEM and PGAS), – 100Gb/s BXI port to the switch.
I Switch:
– low latency, non-blocking 48 ports crossbar, – out-of-band management.
I Wormhole distributed table-based adaptive routing: – 16 virtual channels, – one routing table per port, – up to 48 adaptive routes.
5 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Topology and Routing table Example I Topology: 4-level rearrangeable non-blocking fat-tree
– 64 800 nodes, 11 160 switches, 194 400 inter-switches links – ⇡50 GB of routing tables – In production, faults may happen on a daily basis (link failures, human mistakes, ...)
6 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Outline 1. Introduction
Context BXI Features Reconsidering Faults
2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology
New Deadlock-Free Routing Algorithms
5. Conclusion
7 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Handling Faults I Recomputing all routing tables each time is not an option: – topology structure is considered immutable (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
8 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Handling Faults I Recomputing all routing tables each time is not an option: – topology structure is considered immutable – a fault is only a missing equipment. (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
8 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
BXI Routing Architecture I Defines two distinct modes of operation: – offline mode:
• computes routing tables from scratch, • archives routing tables on storage, • analyses quality of routing tables,
– online mode: • • • •
9 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
computes routing table patches. uploads routing tables on switches, archives patches on disk, analyses quality of online routing tables.
C
Atos, 2016
BXI Routing Architecture I Defines two distinct modes of operation: – offline mode:
• computes routing tables from scratch, • archives routing tables on storage, • analyses quality of routing tables,
– online mode: • • • •
computes routing table patches. uploads routing tables on switches, archives patches on disk, analyses quality of online routing tables.
I How to ensure routing tables updates are deadlock-free?
9 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Outline 1. Introduction
Context BXI Features Reconsidering Faults
2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology
New Deadlock-Free Routing Algorithms
5. Conclusion
10 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Transitions Between Two Routing Functions I Transition from Rold and Rnew :
– not atomic, even on a switch, – each switch updates routing tables separately,
Rold Rnew
11 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Transitions Between Two Routing Functions I Transition from Rold and Rnew :
– not atomic, even on a switch, – each switch updates routing tables separately, – ghost dependencies remain.
Rold Rnew
11 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Deadlock-free Transitions I Transition between two routing functions is deadlock-free if both are included within a deadlock-free routing function. I Each routing table entry can be applied in any order: – new routes are included within enclosing routing function, – removing routes is safe.
Rall Rold Rnew
12 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Transitively Deadlock-Free Property I A Transitively Deadlock-Free routing algorithm:
– computes patches to switch between routing functions, – selects only routes under an enclosing routing function, – the enclosing function must be deadlock-free.
13 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Transitively Deadlock-Free Property I A Transitively Deadlock-Free routing algorithm:
– computes patches to switch between routing functions, – selects only routes under an enclosing routing function, – the enclosing function must be deadlock-free.
I Since the initial routing function is deadlock-free and is within the same deadlock-free routing function.
13 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Outline 1. Introduction
Context BXI Features Reconsidering Faults
2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology
New Deadlock-Free Routing Algorithms
5. Conclusion
14 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Notation on Fat-Tree I Twins are all switches at same relative location in other sub-topology. (2.0.0)
(2.0.1)
(2.0.2)
(2.0.3)
(2.1.0)
(2.1.1)
(2.1.2)
(2.1.3)
(1.0.0)
(1.0.1)
(1.0.2)
(1.0.3)
(1.1.0)
(1.1.1)
(1.1.2)
(1.1.3)
(0.0.0)
(0.0.1)
(0.0.2)
(0.0.3)
(0.1.0)
(0.1.1)
(0.1.2)
(0.1.3)
0 3 1 2
4 7 5 6
8 11 9 10
12 15 13 14
16 19 17 18
20 23 21 22
24 27 25 26
28 31 29 30
sub-topology
(0.0)
sub-topology
(0.1)
sub-topology
(0.2)
sub-topology
(0.3)
sub-topology
(1.0)
sub-topology (0)
15 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
sub-topology
(1.1)
sub-topology
(1.2)
sub-topology (1)
C
Atos, 2016
sub-topology
(1.3)
Notation on Fat-Tree I Twins are all switches at same relative location in other sub-topology. (2.0.1)
(2.1.1)
(1.0.0)
(1.0.1)
(1.0.2)
(1.0.3)
(1.1.0)
(1.1.1)
(1.1.2)
(1.1.3)
(0.0.0)
(0.0.1)
(0.0.2)
(0.0.3)
(0.1.0)
(0.1.1)
(0.1.2)
(0.1.3)
0 3 1 2
4 7 5 6
8 11 9 10
12 15 13 14
16 19 17 18
20 23 21 22
24 27 25 26
28 31 29 30
sub-topology
(0.0)
sub-topology
(0.1)
sub-topology
(0.2)
sub-topology
(0.3)
sub-topology
(1.0)
sub-topology (0)
15 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
sub-topology
(1.1)
sub-topology
(1.2)
sub-topology (1)
C
Atos, 2016
sub-topology
(1.3)
Deadlock-Free Routing Algorithm for Fat-Tree I Set Restrictions between up-ports.
(2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
16 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree
(2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree
(2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree I On switch (2.0.0):
– routing tables to reach [0-3] are patched. (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree I On switch (1.0.0):
– down-port routing table entries for nodes [4-7] are patched, – up-port routing tables are not patched. (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
up-ports (1.0.0)
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
down-ports
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree
(2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree I (1.0.0) and its twin(s) (1.1.0), are bypassed to reach [2-3].
I On (0.0.0) (0.1.0) and (0.1.1), down-port routing tables are patched to reach [2-3]. (2.0.0)
(1.0.0)
by p
(2.1.0)
(1.0.1)
(1.1.0) bypass
bypass
as s
(2.0.1)
(0.0.0)
(0.0.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
(0.1.0)
3
4
C
Atos, 2016
5
(2.1.1)
(1.1.1)
by pa
ss
(0.1.1) 6
7
Online Routing Algorithm for Fat-Tree
(2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Online Routing Algorithm for Fat-Tree I (1.1.1) and its twin(s) (1.0.1), are bypassed to reach [4-5]:
– nodes [4,5] and [2-3] cannot exchange messages, – for the routing function, the topology is no more connected. (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Transitively Deadlock-Free Algorithm for Fat-Tree I Enclosing routing algorithm Rall provides:
– down-port(s) for each destination within the sub-topology, – all up-ports for each destination outside the sub-topology. (2.0.0)
(2.0.1)
(2.1.0)
(2.1.1)
(1.0.0)
(1.0.1)
(1.1.0)
(1.1.1)
(0.0.0)
(0.0.1)
(0.1.0)
(0.1.1)
0
2
18 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
1
3
4
C
Atos, 2016
5
6
7
Outline 1. Introduction
Context BXI Features Reconsidering Faults
2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology
New Deadlock-Free Routing Algorithms
5. Conclusion
19 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Transitively Deadlock-Free Agnostic Routing Algorithm I Enclosing Routing function:
– computes all routes following the up-down rule: • Messages cannot go upward after going downward,
– returns for any port a set with all ports leading to the destination.
N1
20 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
N2
N3
N4
N5
C
Atos, 2016
N6
Methodology for New Online Routing Algorithms I Creates an enclosing routing algorithm. I Computes restrictions to remove deadlocks. I All unrestricted routes mustn’t introduce deadlock. I Computes offline routing tables. I Provides restrictions to online agnostic routing algorithm.
21 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Conclusion I New architecture based on two modes: Offline/Online. I Two new online algorithms: – – – – –
handle link faults, up to 25 percent of faulty links, handle link recovery, scalable, formally described, transitively deadlock-free,
I Methodology to create new transitively deadlock-free algorithms. I Future steps: – study routing quality, – adapt on events the routing tables: • such as computed statistics. . . 22 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Dependency channel graph. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4 Message S1 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph is dynamic. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4 Message S1 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph is dynamic. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4 Message S1 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph is dynamic. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4 Message S1 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph is dynamic. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Dynamic Routing and Deadlock/Livelock I Network Topology
– Minimal routing exception: dashed linked is only usable to reach S4 .
Message S2 ! S4 Message S1 ! S4
0 C3,2
C2,1 S1
C1,2
C3,2 S2
C2,3 0 C2,3
C4,3 S3
C3,4
I Waiting buffer graph is dynamic. 0 C3,2
C2,1
C3,2
C4,3
C1,2
C2,3
C3,4
0 C2,3
23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
C
Atos, 2016
S4
Transitively Deadlock-Free I Enclosing Routing function:
– computes all routes following the up-down rule: • Messages cannot go upward after going downward,
– returns for any port a set with all ports leading to the destination.
I Channels can be ordered. N1
24 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
N2
N3
N4
N5
C
Atos, 2016
N6
Online Routing Algorithm for Agnostic Topology I Up*/Down* Algorithm on the network:
N1
25 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
N2
N3
N4
N5
C
Atos, 2016
N6
Online Routing Algorithm for Agnostic Topology I Up*/Down* Algorithm on the network: – N1 is the selected root, – directions are added to links.
N1
25 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI
N2
N3
N4
N5
C
Atos, 2016
N6