A Fault Tolerant Architecture with Flexible Dynamic Redundancy By Guna Santos
Key Points
• The increasing number of applications demanding high performance and high availability (performability)
• Fault tolerance plays a major role in keeping availability high
• A fault tolerant solution must generate minimal overhead in its activities, so as not to affect the application performance
• After a fault occurrence, some fault tolerant systems leave the system degraded, affecting the performance
caos
Challenges
• To provide a fault tolerant solution able to mitigate or avoid system degradation by:
• Restoring the system configuration
• Avoiding changes in the number of active nodes
• Allowing maintenance tasks to be performed, preventing faults
Contents
• Goals
• Fault Tolerance and the RADIC Architecture
• Recovery Side-Effect
• Protecting the System
• Implementation and Experimental Results
• Conclusions
• Future Work and Open Lines
Goals
• To incorporate the ability to control system degradation into a fault tolerant architecture with the following characteristics:
• Flexible, Decentralized, Transparent, Scalable
• To extend the functionality of this architecture's implementation
A Fault Tolerance Approach
• A fault tolerant architecture for message-passing systems
• It is based on a rollback-recovery protocol
• It creates a layer isolating the application from the cluster failures

[Figure: different parallel programming paradigms – a Master/Worker tree (M, W) and an SPMD mesh (P) – running a parallel application through a message-passing standard on a fault-probable parallel computer structure]
A Fault Tolerance Approach
• A fault tolerant architecture for message-passing systems
• It is based on a rollback-recovery protocol
• It creates a layer isolating the application from the cluster failures
[Figure: layered view – Parallel Application / Message-passing Standard / Fault Masking Functions / Fault Tolerance Functions / Parallel Computer Structure (Fault-probable)]
The RADIC Architecture
• RADIC is a fault tolerant architecture for message-passing systems
• Two kinds of processes perform the fault tolerance activities:
• Observers – manage the communications between processes; take checkpoints and event logs and send them to storage; mask the faults using a table of the processes; detect communication failures
• Protectors – perform fault detection tasks using a heartbeat/watchdog scheme; store the redundancy data (checkpoints and logs); start the recovery process

[Figure: a node running an Application Process together with its Observer and a Protector]
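The heartbeat/watchdog scheme used for fault detection can be sketched as follows (a minimal illustration with assumed names and timeout value, not the actual RADICMPI code):

```python
# Minimal sketch of a heartbeat/watchdog fault detector (illustrative only;
# the class and method names are assumptions, not RADICMPI's actual API).

class Watchdog:
    """A protector-side watchdog: declares the watched node failed when no
    heartbeat arrives within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = 0.0

    def heartbeat(self, now):
        # Called whenever a heartbeat message arrives from the watched node.
        self.last_beat = now

    def check(self, now):
        # Returns True if the watched node is considered failed.
        return (now - self.last_beat) > self.timeout


wd = Watchdog(timeout=3.0)
wd.heartbeat(now=10.0)
assert not wd.check(now=12.0)   # heartbeat seen recently: node alive
assert wd.check(now=14.5)       # no heartbeat for > 3 units: fault detected
```

In the real architecture each protector both sends heartbeats to and watches a neighbouring node, so detection is fully decentralized.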
The RADIC Architecture
• RADIC in practice

[Figure: a node running an Application Process with its Observer and a Protector]
The RADIC Architecture
• RADIC functioning:
• Protectors establish the heartbeat/watchdog
• Observers connect with Protectors
• The observer sends checkpoints and event logs
• It manages the message delivery using the radictable
[Figure: three nodes N0–N2, each running a process (P0–P2) with its observer (O0–O2) and protector (T0–T2), and the communications among them]
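The radictable-based message delivery can be sketched as below; the table fields and function names are assumptions made for illustration, not RADICMPI's actual data structures:

```python
# Sketch of message delivery via a per-observer "radictable" (illustrative;
# the field names are assumptions based on the slide's description).
# Each entry maps a process id to the node currently hosting it and to the
# protector storing its redundancy data.

radictable = {
    0: {"node": "N0", "protector": "T1"},
    1: {"node": "N1", "protector": "T2"},
    2: {"node": "N2", "protector": "T0"},
}

def deliver(dst, payload):
    """Route a message to the node currently hosting process `dst`."""
    entry = radictable[dst]
    return (entry["node"], payload)

node, msg = deliver(2, "data")
assert node == "N2"

# After P2 recovers on another node, the table is updated and later
# messages are transparently redirected (fault masking):
radictable[2]["node"] = "N1"
node, _ = deliver(2, "data")
assert node == "N1"
```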
The RADIC Architecture
• The recovery process: N3 fails

[Figure: nine nodes N0–N8 running processes P0–P8 with observers O0–O8 and protectors T0–T8; node N3 fails]
The RADIC Architecture
• The recovery process: N3 fails
• T8 detects the fault and waits for the T4 and O4 connections
• T8 launches P3 using its checkpoint
• O3 starts the recovery process of P3

[Figure: P3 and O3 restart on node N8, alongside P8 and O8]
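The rollback-recovery idea behind these steps can be sketched as follows (an illustrative toy with assumed state and event shapes, not the RADICMPI implementation):

```python
# Sketch of rollback-recovery (illustrative; the function name and data
# shapes are assumptions). The protector restarts the failed process from
# its last checkpoint, then the event log is replayed so the process
# reaches its pre-fault state.

def recover(checkpoint, event_log):
    state = dict(checkpoint)           # 1. restart from the stored checkpoint
    for event in event_log:            # 2. replay logged events in order
        state[event["var"]] = event["value"]
    return state                       # 3. the process resumes from here

ckpt = {"step": 40, "x": 1}
log = [{"var": "step", "value": 41}, {"var": "x", "value": 7}]
assert recover(ckpt, log) == {"step": 41, "x": 7}
```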
Recovery Side-Effect
• The system configuration was changed:
• Processes P3 and P8 slow down their activities
• Protector T7 now protects two processes
• The memory usage in N8 rises

[Figure: P3/O3 now share node N8 with P8/O8; protector T7 protects both]
Recovery Side-Effect
• In order to quantify these side-effects in different scenarios, we performed some experiments:
• We applied three variations of a matrix product program:
• Dynamic M/W – loosely coupled
• SPMD – high coupling in all processes
• Static M/W – high dependency
Recovery Side-Effect

[Figure: Dynamic M/W (one master M, four workers W) – results with different numbers of nodes]
Recovery Side-Effect

[Figure: SPMD (nine processes P) – one fault injected at different moments]
Recovery Side-Effect
• Static M/W

[Figure: 1000x1000 matrix product with 8 nodes, static distribution, 1 fault at 25%, 100 loops, checkpoint every 45 s; per-process execution time in seconds (P0–P7), with and without faults]
Protecting the System
• We incorporate a new protection level in RADIC using a dynamic redundancy functionality, allowing the recovery side-effects to be mitigated or avoided by:
• Restoring the system configuration
• Avoiding system configuration changes
• Allowing faults to be prevented
• It incorporates a transparent management of spare nodes
• It does not maintain any central information about the spare nodes
• The use of spare nodes does not affect the scalability
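The decentralized spare lookup can be sketched as below: since no central information about spares is kept, a searcher simply asks protectors in turn whether they are idle (the names and tuple layout are assumptions for illustration):

```python
# Sketch of decentralized spare-node discovery (illustrative only; the
# slide states no central spare registry exists, so each protector only
# knows its own status and is queried individually).

def find_spare(protectors):
    """Return the id of the first idle protector (a spare node), or None."""
    for name, busy in protectors:
        if not busy:
            return name
    return None

# Each tuple is (protector id, currently protecting a process?):
ring = [("T0", True), ("T4", True), ("T9", False)]
assert find_spare(ring) == "T9"             # N9/T9 is an idle spare
assert find_spare([("T0", True)]) is None   # no spare available
```

Because the search only touches protectors one at a time and carries no global state, adding spares does not introduce a central bottleneck, which is consistent with the scalability claim above.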
Protecting the System
• Restoring the configuration

[Figure: P3 and O3 migrate from node N8 to spare node N9 (protector T9), restoring the one-process-per-node configuration]
Protecting the System
• Avoiding changes in the number of active nodes

[Figure: spare node N9 (protector T9) stands by while nodes N0–N8 run processes P0–P8]
Protecting the System
• Recovering with a spare node

[Figure: after N3 fails, P3 and O3 recover directly on spare node N9 (protector T9)]
Protecting the System
• Restoring the configuration

[Figure: a new spare node N10 (protector T10) is inserted to replace the consumed spare]
Protecting the System
• Preventing faults

[Figure: fault-probable node N3 is still active while spare node N9 (protector T9) stands by]
Protecting the System
• Preventing faults

[Figure: the process of the fault-probable node is moved to spare node N9 (protector T9) before a failure occurs]
Implementation
• We adapted the RADIC prototype (RADICMPI) in order to incorporate the Flexible Dynamic Redundancy:
• Creation of management functions
• Definition of a new protocol to communicate with spares
• Increased communication between observers and the local protector
• Changes in the RADIC fault masking procedure, including a search in the spare nodes
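The extended fault-masking lookup can be sketched as follows (illustrative only; the function and table names are assumptions): when a destination process is not found among the regular nodes, the procedure now also searches the spare nodes, where the process may have been recovered.

```python
# Sketch of the extended fault-masking lookup (illustrative; names are
# assumptions, not RADICMPI's actual structures).

def locate(pid, radictable, spares):
    """Find the node hosting process `pid`, searching spares as a new step."""
    if pid in radictable:
        return radictable[pid]
    for spare_table in spares:      # new step: search the spare nodes
        if pid in spare_table:
            return spare_table[pid]
    return None

regular = {0: "N0", 1: "N1"}
spare_n9 = {3: "N9"}                # P3 was recovered on spare node N9
assert locate(3, regular, [spare_n9]) == "N9"
assert locate(5, regular, [spare_n9]) is None
```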
Implementation
• We also implemented a set of MPI non-blocking functions, allowing more kinds of applications to run:
• We took care of the fault tolerance issues
• We had to change the message management kernel of RADICMPI to deal with asynchronous communications
• Other issues were solved too:
• We changed the original message log approach of RADICMPI to an event log approach in order to assure the determinism of the recovery process
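The motivation for the event log can be sketched as below: with non-blocking communication the order in which operations complete is nondeterministic, so a live run records the order it observed and recovery replays exactly that order (an illustrative toy, not RADICMPI code):

```python
# Sketch of event logging for deterministic replay (illustrative only).
# In a live run, completion order is nondeterministic (modelled here with
# random.choice) and is recorded; on recovery, the log forces the same order.

import random

def run(messages, event_log=None, record=None):
    """Consume messages; completion order is random unless replayed."""
    pending = list(messages)
    result = []
    while pending:
        if event_log:                  # replay: follow the logged order
            choice = event_log.pop(0)
        else:                          # live run: nondeterministic choice
            choice = random.choice(pending)
            record.append(choice)      # log the observed order
        pending.remove(choice)
        result.append(choice)
    return result

log = []
first = run(["a", "b", "c"], record=log)            # live, random order
replay = run(["a", "b", "c"], event_log=list(log))  # deterministic replay
assert replay == first
```

Logging only these nondeterministic events is cheaper than logging full message contents, which is one plausible reason for the switch described above.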
Experiments Methodology
• We conducted two kinds of experiments:
• Validation:
• Spare adding task
• Recovery task using a spare
• Using the RADICMPI debug log
• Running a ping-pong program
• Evaluation:
• According to the fault moment
• According to the number of nodes
• Throughput behavior in continuously running applications
• Matrix Product – Static
• Matrix Product – Dynamic
• Matrix Product – SPMD
• N-Body simulation
Experiments Methodology
• Validation methodology

[Figure: validation methodology overview]
Experiment Design
• Evaluating according to the fault moment
• The experiments were conducted in a twelve-node cluster
• We executed a product of two 1500x1500 matrices using the SPMD paradigm over 9 nodes, and a product of two 1000x1000 matrices using a static distribution over 11 nodes
• We injected one fault at 25%, 50% and 75%
• We measured the execution times in three situations:
• Fault-free
• Without spare
• With spare
Experimental Results
• Evaluating according to the fault moment

[Figure: 1500x1500 matrix product using 9 nodes, SPMD Cannon algorithm, 1 fault, 160 loops, checkpoint every 60 s; execution time (s) vs. fault moment (25%, 50%, 75%). Fault-free baseline: 434.57 s. Overheads with a fault and no spare: 27.05%, 49.37%, 72.61%; with spare: 13.93%, 14.82%, 14.19%]
Experimental Results
• Evaluating according to the fault moment

[Figure: execution time results by fault moment]
Experiment Design
• Evaluating according to the number of nodes
• The experiments were conducted in a twelve-node cluster
• We executed a product of two 1000x1000 matrices using a dynamic distribution, and a product of two 1000x1000 matrices using a static distribution
• We injected one fault at 25%
• We ran the programs over three cluster sizes:
• 4 nodes + spare
• 8 nodes + spare
• 11 nodes + spare
• We measured the execution times in three situations:
• Fault-free
• Without spare
• With spare
Experimental Results
• Evaluating according to the number of nodes

[Figure: execution time results for the dynamic and static distributions over the three cluster sizes]
Experimental Results
• We implemented an N-Body simulation in order to study the behavior of our solution with continuously running applications

[Figure: pipeline of processes P exchanging particle data]
Experiment Design
• Throughput of 24x7 applications
• We executed this application simulating 2000 particles in a ten-node pipeline
• We measured the throughput (in simulation steps per minute) in four scenarios:
• Fault-free
• Injecting three faults without spare:
• Processes recovering in different nodes
• Processes recovering in the same node without spare
• Injecting three faults using two initial spares, re-inserting a "repaired" node after the first fault
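The throughput metric used here reduces to a simple rate; the helper below is an assumed illustration, not code from RADICMPI:

```python
# Throughput as simulation steps per minute over a wall-clock interval
# (illustrative helper; the function name is an assumption).

def steps_per_minute(steps_done, elapsed_seconds):
    return steps_done * 60.0 / elapsed_seconds

# e.g. 90 simulation steps completed in 3 minutes of elapsed time:
assert steps_per_minute(90, 180.0) == 30.0
```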
Experimental Results
• Throughput of 24x7 applications with three faults

[Figure: N-Body simulation of 2000 particles in a 10-node pipeline with 3 faults; simulation steps per minute over 53 minutes of elapsed time, for four scenarios: Fault Free, Different Nodes, Same Nodes, Using Spare. Annotations: a spare node is inserted without affecting the performance; recovery overhead is caused by the moment of the fault (a large log to process) and non-optimized code]
Conclusions
• We implemented a dynamic redundancy functionality that avoids or mitigates the recovery side-effects
• This functionality is flexible because:
• It allows the system configuration to be re-established by dynamic insertion of spare nodes – RESTORING
• It allows a transparent management of the spare nodes' use – AVOIDING
• It incorporates a maintenance feature that also allows failures to be prevented by replacing fault-probable nodes – PREVENTING
• It is configurable, allowing from 0 to N spares
• We extended the RADICMPI functionality, implementing a set of MPI non-blocking functions
Conclusions
• Our results show the benefits of the dynamic redundancy solution in different scenarios
• The results also show a strong dependency between the recovery side-effects and the application characteristics, and how we can adapt to each one
Future Work
• To study spare node allocation, considering factors like the acceptable degradation level or the memory limits of a node
• To integrate a fault prediction mechanism into the maintenance feature
• To continue expanding the RADICMPI functionality
Open Lines
• To investigate how to adapt and use RADIC II with new HPC trends, like clusters of multicore computers
• To build an analytical model of RADIC II, allowing better parameter values to be determined
• To develop a RADIC II simulator, allowing its behavior to be assessed in large clusters
• To incorporate other features towards an autonomic fault tolerant system