Transitively Deadlock-Free Routing Algorithms

0 downloads 0 Views 2MB Size Report
Transitively Deadlock-Free. Routing Algorithms. Jean-Noël Quintin: jean[email protected]. Pierre Vignéras: [email protected]. 2016-03-12 ...
Transitively Deadlock-Free Routing Algorithms ¨ Quintin: [email protected] Jean-Noel ´ Pierre Vigneras: [email protected]

2016-03-12

HPC platforms

Curie 2011 >5 000 nodes Tera1000 2017 >8 000 nodes

I Examples from the Top500 list:

– K Computer 2011: 82 944 nodes. – Sequoia 2013: 98 304 nodes.

2 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

OpenSM Solution I Behavior: – – – –

discover the topology, select the routing algorithm, compute new routing table, distribute the routing tables.

I Routing computation time above one minute [1]:

[1] Zahavi et. al. ”Quasi Fat Trees for HPC Clouds and Their Fault-Resilient Closed-Form Routing.” HOTI 2014

3 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

BXI Routing Solution I Behavior:

– compute and distribute routing table patches.

Computes patches under 2 seconds for 64 800 nodes[2]:

´ [2] Quintin, Vigneras; Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture. HiPINEB’15 4 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

BXI Hardware Features I Scales up to 64k destination I Network Interface Controler:

– hardware implementation of Portals4, – offload and os bypass (MPI, SHMEM and PGAS), – 100Gb/s BXI port to the switch.

I Switch:

– low latency, non-blocking 48 ports crossbar, – out-of-band management.

I Wormhole distributed table-based adaptive routing: – 16 virtual channels, – one routing table per port, – up to 48 adaptive routes.

5 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Topology and Routing table Example I Topology: 4-level rearrangeable non-blocking fat-tree

– 64 800 nodes, 11 160 switches, 194 400 inter-switches links – ⇡50 GB of routing tables – In production, faults may happen on a daily basis (link failures, human mistakes, ...)

6 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Outline 1. Introduction

Context BXI Features Reconsidering Faults

2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology

New Deadlock-Free Routing Algorithms

5. Conclusion

7 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Handling Faults I Recomputing all routing tables each time is not an option: – topology structure is considered immutable (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

8 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Handling Faults I Recomputing all routing tables each time is not an option: – topology structure is considered immutable – a fault is only a missing equipment. (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

8 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

BXI Routing Architecture I Defines two distinct modes of operation: – offline mode:

• computes routing tables from scratch, • archives routing tables on storage, • analyses quality of routing tables,

– online mode: • • • •

9 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

computes routing table patches. uploads routing tables on switches, archives patches on disk, analyses quality of online routing tables.

C

Atos, 2016

BXI Routing Architecture I Defines two distinct modes of operation: – offline mode:

• computes routing tables from scratch, • archives routing tables on storage, • analyses quality of routing tables,

– online mode: • • • •

computes routing table patches. uploads routing tables on switches, archives patches on disk, analyses quality of online routing tables.

I How to ensure routing tables updates are deadlock-free?

9 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Outline 1. Introduction

Context BXI Features Reconsidering Faults

2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology

New Deadlock-Free Routing Algorithms

5. Conclusion

10 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Transitions Between Two Routing Functions I Transition from Rold and Rnew :

– not atomic, even on a switch, – each switch updates routing tables separately,

Rold Rnew

11 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Transitions Between Two Routing Functions I Transition from Rold and Rnew :

– not atomic, even on a switch, – each switch updates routing tables separately, – ghost dependencies remain.

Rold Rnew

11 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Deadlock-free Transitions I Transition between two routing functions is deadlock-free if both are included within a deadlock-free routing function. I Each routing table entry can be applied in any order: – new routes are included within enclosing routing function, – removing routes is safe.

Rall Rold Rnew

12 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Transitively Deadlock-Free Property I A Transitively Deadlock-Free routing algorithm:

– computes patches to switch between routing functions, – selects only routes under an enclosing routing function, – the enclosing function must be deadlock-free.

13 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Transitively Deadlock-Free Property I A Transitively Deadlock-Free routing algorithm:

– computes patches to switch between routing functions, – selects only routes under an enclosing routing function, – the enclosing function must be deadlock-free.

I Since the initial routing function is deadlock-free and is within the same deadlock-free routing function.

13 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Outline 1. Introduction

Context BXI Features Reconsidering Faults

2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology

New Deadlock-Free Routing Algorithms

5. Conclusion

14 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Notation on Fat-Tree I Twins are all switches at same relative location in other sub-topology. (2.0.0)

(2.0.1)

(2.0.2)

(2.0.3)

(2.1.0)

(2.1.1)

(2.1.2)

(2.1.3)

(1.0.0)

(1.0.1)

(1.0.2)

(1.0.3)

(1.1.0)

(1.1.1)

(1.1.2)

(1.1.3)

(0.0.0)

(0.0.1)

(0.0.2)

(0.0.3)

(0.1.0)

(0.1.1)

(0.1.2)

(0.1.3)

0 3 1 2

4 7 5 6

8 11 9 10

12 15 13 14

16 19 17 18

20 23 21 22

24 27 25 26

28 31 29 30

sub-topology

(0.0)

sub-topology

(0.1)

sub-topology

(0.2)

sub-topology

(0.3)

sub-topology

(1.0)

sub-topology (0)

15 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

sub-topology

(1.1)

sub-topology

(1.2)

sub-topology (1)

C

Atos, 2016

sub-topology

(1.3)

Notation on Fat-Tree I Twins are all switches at same relative location in other sub-topology. (2.0.1)

(2.1.1)

(1.0.0)

(1.0.1)

(1.0.2)

(1.0.3)

(1.1.0)

(1.1.1)

(1.1.2)

(1.1.3)

(0.0.0)

(0.0.1)

(0.0.2)

(0.0.3)

(0.1.0)

(0.1.1)

(0.1.2)

(0.1.3)

0 3 1 2

4 7 5 6

8 11 9 10

12 15 13 14

16 19 17 18

20 23 21 22

24 27 25 26

28 31 29 30

sub-topology

(0.0)

sub-topology

(0.1)

sub-topology

(0.2)

sub-topology

(0.3)

sub-topology

(1.0)

sub-topology (0)

15 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

sub-topology

(1.1)

sub-topology

(1.2)

sub-topology (1)

C

Atos, 2016

sub-topology

(1.3)

Deadlock-Free Routing Algorithm for Fat-Tree I Set Restrictions between up-ports.

(2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

16 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree

(2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree

(2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree I On switch (2.0.0):

– routing tables to reach [0-3] are patched. (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree I On switch (1.0.0):

– down-port routing table entries for nodes [4-7] are patched, – up-port routing tables are not patched. (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

up-ports (1.0.0)

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

down-ports

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree

(2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree I (1.0.0) and its twin(s) (1.1.0), are bypassed to reach [2-3].

I On (0.0.0) (0.1.0) and (0.1.1), down-port routing tables are patched to reach [2-3]. (2.0.0)

(1.0.0)

by p

(2.1.0)

(1.0.1)

(1.1.0) bypass

bypass

as s

(2.0.1)

(0.0.0)

(0.0.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

(0.1.0)

3

4

C

Atos, 2016

5

(2.1.1)

(1.1.1)

by pa

ss

(0.1.1) 6

7

Online Routing Algorithm for Fat-Tree

(2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Online Routing Algorithm for Fat-Tree I (1.1.1) and its twin(s) (1.0.1), are bypassed to reach [4-5]:

– nodes [4,5] and [2-3] cannot exchange messages, – for the routing function, the topology is no more connected. (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

17 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Transitively Deadlock-Free Algorithm for Fat-Tree I Enclosing routing algorithm Rall provides:

– down-port(s) for each destination within the sub-topology, – all up-ports for each destination outside the sub-topology. (2.0.0)

(2.0.1)

(2.1.0)

(2.1.1)

(1.0.0)

(1.0.1)

(1.1.0)

(1.1.1)

(0.0.0)

(0.0.1)

(0.1.0)

(0.1.1)

0

2

18 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

1

3

4

C

Atos, 2016

5

6

7

Outline 1. Introduction

Context BXI Features Reconsidering Faults

2. Transitively Deadlock-free Property 3. Algorithm for Fat-Tree 4. Algorithm for Agnostic Topology

New Deadlock-Free Routing Algorithms

5. Conclusion

19 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Transitively Deadlock-Free Agnostic Routing Algorithm I Enclosing Routing function:

– computes all routes following the up-down rule: • Messages cannot go upward after going downward,

– returns for any port a set with all ports leading to the destination.

N1

20 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

N2

N3

N4

N5

C

Atos, 2016

N6

Methodology for New Online Routing Algorithms I Creates an enclosing routing algorithm. I Computes restrictions to remove deadlocks. I All unrestricted routes mustn’t introduce deadlock. I Computes offline routing tables. I Provides restrictions to online agnostic routing algorithm.

21 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Conclusion I New architecture based on two modes: Offline/Online. I Two new online algorithms: – – – – –

handle link faults, up to 25 percent of faulty links, handle link recovery, scalable, formally described, transitively deadlock-free,

I Methodology to create new transitively deadlock-free algorithms. I Future steps: – study routing quality, – adapt on events the routing tables: • such as computed statistics. . . 22 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Dependency channel graph. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4 Message S1 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph is dynamic. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4 Message S1 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph is dynamic. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4 Message S1 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph is dynamic. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4 Message S1 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph is dynamic. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Dynamic Routing and Deadlock/Livelock I Network Topology

– Minimal routing exception: dashed linked is only usable to reach S4 .

Message S2 ! S4 Message S1 ! S4

0 C3,2

C2,1 S1

C1,2

C3,2 S2

C2,3 0 C2,3

C4,3 S3

C3,4

I Waiting buffer graph is dynamic. 0 C3,2

C2,1

C3,2

C4,3

C1,2

C2,3

C3,4

0 C2,3

23 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

C

Atos, 2016

S4

Transitively Deadlock-Free I Enclosing Routing function:

– computes all routes following the up-down rule: • Messages cannot go upward after going downward,

– returns for any port a set with all ports leading to the destination.

I Channels can be ordered. N1

24 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

N2

N3

N4

N5

C

Atos, 2016

N6

Online Routing Algorithm for Agnostic Topology I Up*/Down* Algorithm on the network:

N1

25 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

N2

N3

N4

N5

C

Atos, 2016

N6

Online Routing Algorithm for Agnostic Topology I Up*/Down* Algorithm on the network: – N1 is the selected root, – directions are added to links.

N1

25 / 25 | 2016-03-12 ¨ Quintin Jean-Noel France | BDS | R&D/BXI

N2

N3

N4

N5

C

Atos, 2016

N6

Suggest Documents