A Fault Tolerant Architecture with Flexible Dynamic Redundancy
By Guna Santos

Key Points
• An increasing number of applications demand high performance and high availability (performability).
• Fault tolerance plays a major role in keeping availability high.
• A fault tolerant solution must generate minimal overhead in its activities, so that it does not affect the application performance.
• After a fault occurs, some fault tolerant systems leave the system degraded, affecting performance.

caos


Challenges
• To provide a fault tolerant solution able to mitigate or avoid the system degradation by:
  • Restoring the system configuration
  • Avoiding changes in the number of active nodes
  • Allowing maintenance tasks to be performed, preventing faults


Contents
• Goals
• Fault Tolerance and the RADIC Architecture
• Recovery Side-Effect
• Protecting the System
• Implementation and Experimental Results
• Conclusions
• Future Work and Open Lines


Goals
• To incorporate the ability to control system degradation into a fault tolerant architecture with the following characteristics: flexible, decentralized, transparent, scalable
• To extend the functionality of this architecture's implementation


A Fault Tolerance Approach
• A fault tolerant architecture for message-passing systems
• It is based on a rollback-recovery protocol.
• It creates a layer isolating the application from the cluster failures.

[Figure: different parallel paradigms (Master/Worker, SPMD) running a parallel application over a message-passing standard on a fault-probable parallel computer structure]

[Figure: the fault tolerance layers in the software stack: Parallel Application / Message-passing Standard / Fault Masking Functions / Fault Tolerance Functions / Parallel Computer Structure (Fault-probable)]

The RADIC Architecture
• RADIC is a fault tolerant architecture for message-passing systems.
• Two kinds of processes perform the fault tolerance activities:
  • Observers, attached to the application processes. An observer manages the communications between processes, takes checkpoints and event logs and sends them to storage, masks the faults using a table of the processes (the radictable), and detects communication failures.
  • Protectors, one per node. A protector performs fault detection tasks using a heartbeat/watchdog scheme, stores the redundancy data (checkpoints and logs), and starts the recovery process.

[Figure: a node running an application process together with its observer, beside a protector]
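The protectors' heartbeat/watchdog detection can be sketched as follows. This is a minimal Python simulation using logical timestamps instead of real network traffic; the `Watchdog` class and its method names are illustrative assumptions, not RADIC's actual interface.

```python
class Watchdog:
    """Protector-side watchdog: declares the observed node failed
    when no heartbeat arrives within `timeout` time units."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = 0.0

    def beat(self, now):
        # Called each time a heartbeat from the observed node arrives.
        self.last_beat = now

    def failed(self, now):
        # True once the watchdog has expired with no heartbeat.
        return now - self.last_beat > self.timeout

wd = Watchdog(timeout=3.0)
wd.beat(1.0)               # heartbeat received at t = 1
assert not wd.failed(2.0)  # still within the timeout window
assert wd.failed(5.0)      # no beat since t = 1: node presumed failed
```

In RADIC the heartbeat sender and the watchdog run on different nodes, so the expiry of the watchdog is what triggers the recovery steps shown in the following slides.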


The RADIC Architecture
• RADIC in practice

[Figure: a node running an application process together with its observer and protector]

The RADIC Architecture
• RADIC functioning:
  • Protectors establish the heartbeat/watchdog.
  • Each observer connects with a protector, sends it checkpoints and event logs, and manages the message delivery using the radictable.

[Figure: three nodes N0, N1, N2; each node Ni runs process Pi with its observer Oi and protector Ti, linked by the communication channels]
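The observer's use of the radictable for fault masking can be sketched as follows. This is a hedged Python sketch: the table here simply maps each process rank to its current node, and the `protector_of` ring and helper names are assumptions for illustration, not RADIC's real data structures.

```python
# Per-observer routing table: process rank -> current node.
radictable = {0: "N0", 1: "N1", 2: "N2"}
# Each node's protector runs on the next node in a ring.
protector_of = {"N0": "N1", "N1": "N2", "N2": "N0"}

def resolve(dest):
    """Return the node where messages for process `dest` go."""
    return radictable[dest]

def on_failure(failed_node):
    """Mask the failure: processes of the failed node restart on
    their protector's node, and the table is updated locally."""
    for rank, node in radictable.items():
        if node == failed_node:
            radictable[rank] = protector_of[failed_node]

assert resolve(1) == "N1"
on_failure("N1")           # N1 crashes; its protector on N2 recovers P1
assert resolve(1) == "N2"  # later sends are transparently redirected
```

Because every observer keeps its own table, no central directory is consulted during message delivery, which is what keeps the masking transparent to the application.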


The RADIC Architecture
• The recovery process: N3 fails.

[Figure: a nine-node cluster N0 to N8; each node Ni runs process Pi with observer Oi and protector Ti; node N3 fails]

The RADIC Architecture
• The recovery process:
  • N3 fails.
  • T8 detects the fault and waits for the T4 and O4 connections.
  • T8 relaunches P3 from its checkpoint.
  • O3 starts the recovery process of P3.
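The recovery steps above can be sketched as an ordered sequence. The helper names below are assumptions for illustration only, not RADICMPI functions; the point is the fixed order: detection, restart from checkpoint, then log replay.

```python
events = []  # trace of the recovery sequence, for inspection

def detect_failure(node):
    # The protector's watchdog expires for `node`.
    events.append(f"detect {node}")

def restart_from_checkpoint(proc):
    # The protector relaunches the process from its stored checkpoint.
    events.append(f"restart {proc}")

def replay_log(proc):
    # The recovered observer replays logged events up to the fault.
    events.append(f"replay {proc}")

detect_failure("N3")           # T8's watchdog expires for N3
restart_from_checkpoint("P3")  # T8 relaunches P3 on its own node
replay_log("P3")               # O3 re-executes P3 up to the fault point
assert events == ["detect N3", "restart P3", "replay P3"]
```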

[Figure: after recovery, node N8 hosts P8/O8 together with the recovered P3/O3; node N3 is gone]

Recovery Side-Effect
• The system configuration has changed:
  • Processes P3 and P8 slow down their activities.
  • Protector T7 now protects two processes.
  • The memory usage in N8 rises.

[Figure: the cluster after recovery: N8 hosts two processes (P8/O8 and P3/O3) and N3 is missing]

Recovery Side-Effect
• In order to quantify these side-effects in different scenarios, we performed experiments with three variations of a matrix product program:
  • Dynamic M/W: loosely coupled
  • SPMD: high coupling between all processes
  • Static M/W: high dependency


Recovery Side-Effect

[Figure: Master/Worker diagram and results chart: Dynamic M/W with different numbers of nodes]

Recovery Side-Effect

[Figure: SPMD process grid and results chart: one fault injected at different moments]

Recovery Side-Effect
• Static M/W

[Figure: execution time (s) per process P0 to P7 for a 1000x1000 matrix product on 8 nodes with static distribution, 100 loops, checkpoint every 45 s, 1 fault at 25%; bars compare runs without faults and with faults]

Protecting the System
• We incorporate a new protection level in RADIC using a dynamic redundancy functionality, allowing us to mitigate or avoid the recovery side-effects by:
  • Restoring the system configuration
  • Avoiding system configuration changes
  • Allowing faults to be prevented
• It incorporates transparent management of spare nodes.
• It does not maintain any central information about the spare nodes.
• The use of spare nodes does not affect the scalability.
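Because no node keeps central information about the spares, a protector can locate one by asking its neighbours. The following Python sketch shows such a decentralized lookup over the protector ring; the node names, the `spare` flag and the `find_spare` helper are illustrative assumptions, not RADIC's real protocol.

```python
# Each node knows only its own state and its successor in the ring.
nodes = {
    "N4": {"spare": False, "next": "N5"},
    "N5": {"spare": False, "next": "N9"},
    "N9": {"spare": True,  "next": "N4"},  # N9 is an idle spare
}

def find_spare(start, limit=10):
    """Walk the protector ring from `start` until a spare answers.

    Returns None when no spare is found within `limit` hops, in
    which case recovery falls back to the protector's own node."""
    cur = start
    for _ in range(limit):
        if nodes[cur]["spare"]:
            return cur
        cur = nodes[cur]["next"]
    return None

assert find_spare("N4") == "N9"
nodes["N9"]["spare"] = False   # the spare is consumed by a recovery
assert find_spare("N4") is None
```

Since the lookup only ever touches a bounded neighbourhood of the ring, adding spares does not introduce any global state, which is consistent with the scalability claim above.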


Protecting the System
• Restoring the configuration

[Figure: a spare node N9 with protector T9 joins the cluster; P3 and O3 move from N8 to N9, restoring the one-process-per-node configuration]

Protecting the System
• Avoiding changes in the number of active nodes

[Figure: the failed node's process P3 is recovered on spare node N9, which takes the failed node's place, so the number of active nodes does not change]

Protecting the System
• Recovering with a spare node

[Figure: the recovery of P3 is redirected to spare node N9 (protector T9) instead of its protector's node]

Protecting the System
• Restoring the configuration

[Figure: after P3 recovers on N9, a new spare node N10 (with protector T10) is inserted, restoring the protection level]

Protecting the System
• Preventing faults

[Figure: node N3 is flagged as fault-probable for maintenance; spare node N9 is available]

Protecting the System
• Preventing faults

[Figure: P3 is migrated preventively to spare node N9, so N3 can be removed for maintenance before it fails]

Implementation
• We adapted the RADIC prototype (RADICMPI) in order to incorporate the Flexible Dynamic Redundancy:
  • Creation of spare management functions
  • Definition of a new protocol to communicate with spares
  • Increased communication between observers and the local protector
  • Changes in the RADIC fault masking procedure, including a search in the spare nodes


Implementation
• We also implemented a set of MPI non-blocking functions, allowing more kinds of applications to run:
  • We take care of the fault tolerance issues.
  • We had to change the message management kernel of RADICMPI to deal with asynchronous communications.
• Other issues were solved too:
  • We changed the original message log approach of RADICMPI to an event log approach in order to assure the determinism of the recovery process.
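The move from a message log to an event log can be illustrated as follows: every nondeterministic event (a message reception, the outcome of a non-blocking completion test) is recorded in order, so that replaying the log reproduces the same execution path during recovery. This is a minimal Python sketch; the event tuples and helper names are assumptions for illustration, not RADICMPI's log format.

```python
log = []  # ordered event log kept by the observer

def record(event):
    """Append one nondeterministic event to the log as it happens."""
    log.append(event)
    return event

def replay():
    """During recovery, feed the events back in recorded order so
    the restarted process takes the same execution path."""
    return list(log)

record(("recv", "P1", b"a"))
record(("test", "req0", True))   # outcome of a non-blocking test
record(("recv", "P2", b"b"))
assert replay() == [("recv", "P1", b"a"),
                    ("test", "req0", True),
                    ("recv", "P2", b"b")]
```

Logging the test outcomes, and not just the message payloads, is what makes the replay deterministic once non-blocking operations are allowed: the recovered process must observe the same "completed / not yet" answers it saw before the fault.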


Experiments Methodology
• We conducted two kinds of experiments:
  • Validation, using the RADICMPI debug log:
    • Spare adding task
    • Recovery task using a spare
    • Running a ping-pong program
  • Evaluation:
    • According to the fault moment
    • According to the number of nodes
    • Throughput behavior in continuously running applications
• Test applications: Matrix Product (static, dynamic and SPMD versions) and an N-Body simulation

Experiments Methodology
• Validation methodology


Experiment Design
• Evaluating according to the fault moment:
  • The experiments were conducted on a twelve-node cluster.
  • We executed a product of two 1500x1500 matrices using the SPMD paradigm over 9 nodes, and a product of two 1000x1000 matrices using a static distribution over 11 nodes.
  • We injected one fault at 25%, 50% and 75% of the execution.
  • We measured the execution times in three situations: fault-free, without spare, and with spare.


Experimental Results
• Evaluating according to the fault moment

[Figure: execution time (s) of a 1500x1500 matrix product on 9 nodes, SPMD Cannon algorithm, 160 loops, checkpoint every 60 s, 1 fault. The fault-free time is 434.57 s; the overhead without spare is 72.61%, 49.37% and 27.05% for faults at 25%, 50% and 75%, while with a spare it stays between 13.93% and 14.82%]

Experimental Results
• Evaluating according to the fault moment

[Figure: additional results chart]

Experiment Design
• Evaluating according to the number of nodes:
  • The experiments were conducted on a twelve-node cluster.
  • We executed a product of two 1000x1000 matrices using a dynamic distribution, and a product of two 1000x1000 matrices using a static distribution.
  • We injected one fault at 25% of the execution.
  • We ran the programs over three cluster sizes: 4 nodes + spare, 8 nodes + spare, and 11 nodes + spare.
  • We measured the execution times in three situations: fault-free, without spare, and with spare.


Experimental Results
• Evaluating according to the number of nodes

[Figure: results charts for the dynamic and static distributions over 4, 8 and 11 nodes]

Experimental Results
• We implemented an N-Body simulation in order to study the behavior of our solution over continuously running applications.

[Figure: pipeline of processes exchanging particle data]

Experiment Design
• Throughput of 24x7 applications:
  • We executed the N-Body application, simulating 2000 particles in a ten-node pipeline.
  • We measured the throughput (in simulation steps per minute) in four scenarios:
    • Fault-free
    • Injecting three faults without spare, with the processes recovering in different nodes
    • Injecting three faults without spare, with the processes recovering in the same node
    • Injecting three faults using two initial spares, re-inserting a "repaired" node after the first fault


Experimental Results
• Throughput of 24x7 applications with three faults

[Figure: simulation steps per minute over 53 minutes of elapsed time, for the N-Body simulation of 2000 particles in a 10-node pipeline with 3 faults, comparing fault-free execution, recovery in different nodes, recovery in the same node, and recovery using spares. A spare node is inserted without affecting the performance; the recovery overhead is caused by the moment of the fault (a large log to process) and by non-optimized code]

Conclusions
• We implemented a dynamic redundancy functionality that avoids or mitigates the recovery side-effects.
• This functionality is flexible because:
  • It allows the system configuration to be re-established by dynamic insertion of spare nodes (RESTORING).
  • It allows transparent management of the spare nodes' use (AVOIDING).
  • It incorporates a maintenance feature that also allows failures to be prevented by replacing fault-probable nodes (PREVENTING).
  • It is configurable, allowing from 0 to N spares.
• We extended the RADICMPI functionality, implementing a set of MPI non-blocking functions.


Conclusions
• Our results show the benefits of the dynamic redundancy solution in different scenarios.
• The results also show a strong dependency between the recovery side-effects and the application characteristics, and how we can adapt to each one.


Future Work
• To study spare node allocation, considering factors like the acceptable degradation level or the memory limits of a node
• To integrate a fault prediction mechanism into the maintenance feature
• To continue expanding the RADICMPI functionality


Open Lines
• To investigate how to adapt and use RADIC II with new HPC trends, like clusters of multicore computers
• To achieve a RADIC II analytical model, allowing better parameter values to be determined
• To develop a RADIC II simulator, allowing its behavior to be assessed in large clusters
• To incorporate other features towards an autonomic fault tolerant system

