Challenges and Issues of the Integration of RADIC into Open MPI
Leonardo Fialho
September 8, 2009, 16th Euro PVM/MPI Users’ Group Meeting
Introduction
RADIC and Open MPI Integration
Experimental Evaluation
Conclusions and Future Work
Agenda
1 Introduction
2 RADIC and Open MPI Integration
3 Experimental Evaluation
4 Conclusions and Future Work
Agenda
1 Introduction
  – RADIC Architecture
  – Open MPI Architecture
2 RADIC and Open MPI Integration
  – Uncoordinated Checkpointing
  – Message Logging
  – Fault Detection & Management
  – Recovery and Reconfiguration
3 Experimental Evaluation
  – Message Logging Performance
  – Checkpointing Performance
  – NAS Applications Performance
4 Conclusions and Future Work
Scenario
RADIC Architecture: a fault tolerance proposal designed to be transparent to application developers and system administrators, decentralised and distributed to achieve scalability, and flexible in both its configuration and its implementation.
Open MPI: a well-established MPI library which creates a runtime environment for parallel applications.
Integration: Can RADIC be integrated into Open MPI while keeping its original characteristics? Can Open MPI accommodate RADIC within its Modular Component Architecture?
Objective
To have a version of the RADIC architecture which implements all MPI primitives, works with most of the available resource managers, and becomes an easy-to-use fault tolerance (FT) solution.
RADIC Architecture
RADIC is a rollback/recovery fault tolerance architecture designed to be integrated into existing message passing libraries. It provides:
– Automatic protection, detection, recovery, and reconfiguration.
– Transparency for application developers and cluster administrators.
– Scalability, due to its distributed operation and decentralised storage.
– Flexibility in its state saving strategies and configuration.
RADIC Architecture
RADIC works as a layer between the application and the operating system. From top to bottom, the figure stacks:
– MPI Application (fault-free)
– MPI Standard
– RADIC fault masking operations
– RADIC fault tolerance operations
– Parallel Machine (fault-probable)
To perform fault tolerance tasks RADIC defines two entities: observers and protectors.
Entities perform:
– Protection
– Detection
– Recovery
– Reconfiguration
[Figure: nodes N-1, N, and N+1 run App x/Observer x, App y/Observer y, and App z/Observer z, together with Protector n-1, Protector n, and Protector n+1.]
Open MPI Architecture
Open MPI is a set of frameworks which are organised in three sections:
– Open MPI (OMPI)
– Open Run-Time Environment (ORTE)
– Open Portability Access Layer (OPAL)
Sections are not layers, but there is a dependency relation between them!
A typical Open MPI runtime environment consists of:
– ORTE daemon
– Communication library
[Figure: between the MPI application and the operating system, each application process (App) is linked with the Open MPI library (OMPI, ORTE, OPAL) and runs alongside an ORTE daemon (ORTE, OPAL).]
RADIC Integration into Open MPI Structure
Why Open MPI? Because of its modular architecture and its existing fault tolerance components.
How to do it? By mapping RADIC fault tolerance tasks onto Open MPI’s original components and/or frameworks.
RADIC Fault Tolerance Tasks
– Uncoordinated checkpointing
– Pessimistic receiver-based message logging
– Fault detection during message passing and through a heartbeat/watchdog mechanism
– Process recovery on another node
– Maintaining the distributed and decentralised characteristic
Uncoordinated Checkpointing
Observer:
– Creates a checkpoint file
– Transfers the checkpoint to the protector
Protector:
– Receives the checkpoint file
– Truncates the message log
It requires:
– New PML wrapper component
– New SnapC component
– New BTL component
– Protector daemon
Uncoordinated Checkpointing
How to... allow processes to checkpoint independently?
In RADIC, each observer manages its own checkpoint interval. There is no checkpoint dispatcher.
How to... deal with communication channels and in-transit messages?
Before a checkpoint, the observer waits for pending transfers, blocks new requests, and closes the open sockets.
Service reuse: file transfer uses a native framework, as does checkpoint creation (BLCR).
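The checkpoint sequence described on the two slides above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Observer` and `Protector` classes, their fields, and the step names are hypothetical, the process image is a plain dictionary standing in for a BLCR checkpoint file, and nothing here is Open MPI or RADIC/OMPI code.

```python
from dataclasses import dataclass, field


@dataclass
class Protector:
    """Hypothetical stand-in for the protector daemon."""
    checkpoints: dict = field(default_factory=dict)  # rank -> last checkpoint image
    message_log: dict = field(default_factory=dict)  # rank -> logged messages

    def store_checkpoint(self, rank, image):
        self.checkpoints[rank] = image
        # A new checkpoint makes the older log entries unnecessary for replay,
        # so the log for this rank is truncated.
        self.message_log[rank] = []


@dataclass
class Observer:
    """Hypothetical stand-in for the observer attached to one process."""
    rank: int
    protector: Protector
    state: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # transfers still in flight
    accepting: bool = True
    sockets_open: bool = True

    def checkpoint(self):
        steps = []
        while self.pending:            # 1. wait for pending transfers
            self.pending.pop(0)
        steps.append("drained")
        self.accepting = False         # 2. block new requests
        steps.append("blocked")
        self.sockets_open = False      # 3. close the open sockets
        steps.append("closed")
        image = dict(self.state)       # 4. create the checkpoint (BLCR stand-in)
        steps.append("saved")
        self.protector.store_checkpoint(self.rank, image)  # 5. transfer it
        steps.append("transferred")
        self.accepting = True          # 6. accept requests again
        steps.append("resumed")
        return steps
```

Each observer calls `checkpoint()` on its own schedule, which is what makes the scheme uncoordinated: no step involves any other observer. Reopening sockets after the checkpoint is left out of the sketch.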
Message Logging
Observer:
– Forwards the message and control data to the application’s protector
– Delivers the message to the application process
Protector:
– Writes the log file
It requires:
– New PML wrapper component
– Changes on the MPI interface framework
– Protector daemon
Message Logging
How to... trap in-bound and out-bound messages?
Message interception: a new PML wrapper logs the message on the protector before delivering it to the application process.
How to... assure the consistency of the parallel application?
Message ordering: order information is added to each message and stored on the observer.
Service reuse: logging uses the out-of-band framework.
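A minimal sketch of pessimistic receiver-based logging, as described above: the receiving side logs each message on the protector and waits for the acknowledgment before delivering it. The class and method names (`ProtectorLog`, `LoggingObserver`, `on_receive`) are hypothetical, and the in-memory list stands in for the log file written by the protector.

```python
class ProtectorLog:
    """Hypothetical protector-side log; a list stands in for the log file."""

    def __init__(self):
        self.entries = []

    def append(self, seq, source, payload):
        self.entries.append((seq, source, payload))
        return True  # acknowledgment sent back to the observer


class LoggingObserver:
    """Hypothetical receiver-side wrapper that logs before delivering."""

    def __init__(self, log):
        self.log = log
        self.next_seq = 0    # order information kept on the observer
        self.delivered = []  # what the application process has seen

    def on_receive(self, source, payload):
        seq = self.next_seq
        # Pessimistic: the message must be safely logged first.
        acked = self.log.append(seq, source, payload)
        if not acked:
            raise RuntimeError("message not logged; refusing to deliver")
        self.next_seq += 1
        self.delivered.append((seq, payload))
        return seq
```

Because delivery waits on the acknowledgment, every message the application has consumed is guaranteed to be in the log, at the cost of one extra round trip per message.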
Fault Detection: during communication tries
– Consecutive failures during communication tries
– Query the protector for the application’s state
– Wait for recovery completion
– Retry the communication
It requires:
– New PML wrapper component
– New ErrMgr component
– Protector daemon
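The four steps above can be sketched as a retry loop. This is an illustration only: `send_with_recovery`, its callback parameters, the `"failed"` state string, and the failure threshold are all hypothetical names chosen for the example, not part of Open MPI.

```python
def send_with_recovery(send, query_protector, wait_for_recovery, max_failures=3):
    """Retry `send` until it succeeds, asking for recovery after repeated failures.

    All three callables are hypothetical hooks supplied by the caller.
    """
    failures = 0
    while True:
        try:
            return send()
        except ConnectionError:
            failures += 1
            if failures < max_failures:
                continue  # not yet enough consecutive failures, just retry
            # Consecutive failures: ask the protector about the peer's state.
            if query_protector() == "failed":
                wait_for_recovery()  # block until the peer has been restarted
            failures = 0  # then retry the communication from scratch
```

A caller would pass closures bound to the destination process. The sketch retries indefinitely; a real implementation would bound the retries or surface an error when the protector itself is unreachable.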
Fault Detection: using a heartbeat/watchdog mechanism
From the watchdog side:
– Watchdog expires
– Confirms the failure
– Recovers the failed process
From the heartbeat side:
– Heartbeat unreachable
– Confirms the failure
– Requests recovery
It requires:
– New ErrMgr component
– Protector daemon
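The watchdog side of the mechanism above can be sketched with an explicit clock, so the logic is easy to follow and test. The `Watchdog` class, `watchdog_step`, and the callback names are hypothetical; timestamps are passed in by the caller rather than read from the system clock.

```python
class Watchdog:
    """Hypothetical watchdog timer reset by each heartbeat."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_beat = now

    def heartbeat(self, now):
        self.last_beat = now  # a heartbeat arrived, restart the timer

    def expired(self, now):
        return now - self.last_beat > self.timeout


def watchdog_step(watchdog, now, confirm_failure, recover):
    """One polling step: expire, confirm the failure, then recover."""
    if watchdog.expired(now) and confirm_failure():
        recover()
        return True
    return False
```

The confirmation callback is only consulted after the timer expires, mirroring the order on the slide: expiry alone is not treated as proof of a failure.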
Fault Detection & Management
How to... avoid MPI communication errors?
The observer catches errors from lower frameworks, requests the recovery, and retries the communication.
How to... avoid the fail-stop default behaviour when processes fail?
The new error manager component requests the recovery.
How to... deal with Byzantine situations?
RADIC treats faults as node faults. The malfunctioning node is isolated by killing its application processes.
Recovery and Reconfiguration
Protector:
– Launches the application process and its observer from the last checkpoint
Observer:
– Provides messages from the log
– Discards repeated messages
It requires:
– New PML wrapper component
– Protector daemon
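One way an observer could replay logged messages and discard repeated ones is sketched below. It assumes that each message carries a per-sender sequence number as its order information; that encoding, the `ReplayFilter` class, and its method names are assumptions made for the example, since the slides do not specify the format.

```python
class ReplayFilter:
    """Hypothetical duplicate filter keyed on per-sender sequence numbers."""

    def __init__(self):
        self.last_seen = {}  # source rank -> highest sequence number delivered
        self.delivered = []  # payloads handed to the application, in order

    def replay(self, logged):
        """Deliver messages recovered from the log after a restart."""
        for source, seq, payload in logged:
            self.on_receive(source, seq, payload)

    def on_receive(self, source, seq, payload):
        """Deliver a message unless it repeats one already delivered."""
        if seq <= self.last_seen.get(source, -1):
            return False  # repeated message, discard it
        self.last_seen[source] = seq
        self.delivered.append(payload)
        return True
```

After a rollback, a re-executing sender may resend messages its peers already consumed; the sequence check drops those while letting new messages through.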
Recovery and Reconfiguration
How to... discover the new process location?
Observers run a deterministic algorithm based on the initial protector/observer mapping.
How to... update the failed process's contact info on other processes?
On demand, when a process tries to communicate with the failed process.
How to... avoid Open MPI’s collective operation called MODEX?
By caching contact information on the protectors.
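A deterministic lookup of this kind can be sketched as follows. The slides do not give the actual algorithm, so the mapping used here is an assumption: nodes form a ring, the protector of a process lives on the next node, and a failed process restarts on its protector's node. The function name and the `is_alive` callback are hypothetical.

```python
def locate_process(initial_node, num_nodes, is_alive):
    """Walk the assumed ring mapping until a live node is found.

    Assumes (hypothetically) that a process restarts on the next node in a
    ring, so every observer derives the same answer without a global exchange.
    """
    for hop in range(num_nodes):
        candidate = (initial_node + hop) % num_nodes
        if is_alive(candidate):
            return candidate
    raise RuntimeError("no live node left in the ring")
```

Because the walk depends only on the initial mapping and on which nodes answer, any observer can resolve the new location on demand, at the moment it tries to reach the failed process.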
Summarising
Protector tasks have been integrated into the ORTE daemon.
Observer tasks have been integrated as framework components, dynamically linked to the application process during startup.
Modified and New Components:
– Checkpoint (Un)Coordinator
– Observer P2P Management Layer
– Protector Daemon
– RADIC Error Manager
The integration of RADIC into Open MPI is called RADIC/OMPI.
Experiments Description
The implementation has been tested for its fault tolerance functionality, and experiments have been run to characterise its performance and expose implementation issues.
Three different experiments have been run:
– Message logging performance with 1 and 2 network channels
– Checkpointing performance according to per-process application size
– NAS applications performance in faulty and fault-free scenarios
Message Logging Performance: experiment design
Objective: to compare message latency as a function of message size.
– NetPIPE has been used
– Checkpoints are disabled, as is the heartbeat/watchdog mechanism
– The comparison uses one and two network channels
– The message log is stored on disk
Message Logging Performance
Latency comparison for MPI communication with and without logging while using 1 channel
[Figure: Latency comparison using 1 channel; latency (µs, log scale from 10 to 100,000) versus message size (bytes, from 1 to 8M); series: MPI 1ch and (MPI+LOG) 1ch.]
3x slower because:
– Out-of-band communication is not prioritized
– Acknowledgment message is required for logging
MPI communication and message logging should have the same priority!
Message Logging Performance
Latency comparison for MPI communication with and without logging while using 2 channels
[Figure: Latency comparison using 2 channels; latency (µs, log scale from 10 to 100,000) versus message size (bytes, from 1 to 8M); series: MPI 2ch, (MPI+LOG) 2ch, and MPI 1ch + LOG 1ch.]
At least 3x slower because:
– Out-of-band communication is not prioritized
– Acknowledgment message is required for logging
– Out-of-band communication cannot balance load across the available channels
Checkpointing Performance: experiment design
Objective: to analyse the checkpoint operation according to process size.
– Measurements include checkpoint file creation, transfer, and storage
– Different NAS applications have been used
Checkpointing Performance
Time needed to checkpoint the entire application according to process size, using 4 nodes
[Figure: Checkpointing Operation; time (seconds, 0 to 80) versus process size (MBytes, 0 to 1,000).]
– Linear until network saturation (black line)
– Starts to grow after the state reaches 450 MB per process
– All nodes experience the same checkpointing performance
NAS Applications Performance: experiment design (1)
Objective: to analyse the implementation performance with respect to fault tolerance tasks.
– Applications: BT, LU, and SP class C
– Using 4 to 32 nodes
– Fault-free scenario
– Only 2 checkpoints are performed: the initial one and one more at mid-life
– Heartbeat/watchdog frequency: 1 second
NAS Applications Performance: fault free execution
[Figure: three panels plotting elapsed time (seconds, 0 to 1,200) against # of processors: BT class C (4, 9, 16, 25), LU class C (4, 8, 16, 32), and SP class C (4, 9, 16, 25).]
BT class C: the application does not scale beyond 9 nodes. But... checkpointing time diminishes due to the smaller process size.
LU class C: the application scales from 4 to 32 nodes. But... message logging reduces the scaling gain.
SP class C: the application does not scale beyond 9 nodes. But... checkpointing time diminishes due to the smaller process size.
NAS Applications Performance: experiment design (2)
Objective: to analyse the implementation performance with respect to fault tolerance tasks.
– Applications: BT, LU, and SP class D
– Using 8 to 32 nodes
– Faulty scenario
– Heartbeat/watchdog frequency: 1 second
– Checkpoints are performed according to the table below:
App    #    Running Time   Process Size (MB)   Ckpt. Interval
BT D   16   43.79          1,980               21.58
BT D   25   29.58          1,400               16.28
SP D   16   55.01          1,715               19.17
SP D   25   40.82          1,251               14.90
LU D   8    103.84         1,747               19.46
LU D   16   49.69          1,061               13.13
LU D   32   20.63          722                 9.91
NAS Applications Performance: faulty execution
[Figure: three panels plotting elapsed time (minutes, 0 to 150) and # of faults (0 to 5) against # of processors: BT class D (16, 25), LU class D (8, 16, 32), and SP class D (16, 25).]
BT class D: the comm./comp. ratio is maintained, and the protected application scales almost as well as the unprotected one.
LU class D: big overhead in the faulty scenario (for 8 nodes), caused by the large checkpoint interval and the number of faults.
SP class D: message logging makes the application computation bound, thus no gain is obtained.
Conclusions
Open MPI now has a fault tolerance solution which is:
– Automatic
– Transparent
– Scalable
– Flexible (configurable)
To achieve better performance, MPI communication and message logging should share the same high-performance communication framework.
The overhead introduced by RADIC/OMPI depends on the behaviour of the application being protected; there is no “magic number” to define it.
Future Work
– Integrate spare nodes, according to the RADIC II specification, in order to avoid performance loss after faults.
– Update the current implementation to the newer Open MPI source code in order to make it available for download, testing, and use.
– Analyse application execution to adjust configuration parameters dynamically in order to reduce the overhead introduced by fault tolerance.
Thank You