Internal and external checkpoint/restart request mechanisms. â Command line tools ... The use of more than one checkpoint/restart system to form a consistent.
UNCLASSIFIED
Approaches for Parallel Applications Fault Tolerance Richard L. Graham Advanced Computing Laboratory Los Alamos National Laboratory LA-UR-06-6526 LA-UR-??? UNCLASSIFIED
UNCLASSIFIED
Overview • Problem definition • Introduction to the Open MPI collaboration • Fault Recovery – Data transmission errors – Network failures – Process failure
• Future work
UNCLASSIFIED
1
UNCLASSIFIED
Problem definition
Network Node 2
Node 1 Registry
Disk B
Disk A UNCLASSIFIED
UNCLASSIFIED
Guiding Principles • End goal: Increase application MTBF • Automation is desirable - more likely to be used • No One-Solution-Fits-All – – – – – –
Hardware characteristics Software characteristics System complexity System resources available for fault recovery Performance impact on application Fault characteristics of application
UNCLASSIFIED
2
UNCLASSIFIED
Related Work Non Automatic
Automatic Ckpt Based
Log Based Optimistic Log
Cocheck
Framework Independent of MPI Ste96
Casual Log Pessimistic Log
Optimistic recovery in distributed systems N faults with coherent ckpt SY85
other
Manetho n faults EZ92
Starfish Enrichment of MPI AF99 Clip Semi-trasparant Ckpt Clp97
API
Egida Rav99
LAM/MPI
Comm Layer
MPICH-V/CL
FT-MPI MPI/FT Redundance of Tasks BNC01
Pruitt 98 2 faults sender based Sender based Msg Log 1 fault sender based JZ87
MPICH-V N faults Distributed logging
MPI-FT N faults Centralized Sever LNLE00 LA-MPI RG03
UNCLASSIFIED
UNCLASSIFIED
Open MPI Collaboration • The University of Tennessee
• Cisco
• Indiana University
• Mellanox
• HLRS
• Voltaire
• The University of Huston
• Sun Microsystems
• Sandia National Laboratory
• Myricom
• LANL
• IBM • QLogic
UNCLASSIFIED
3
UNCLASSIFIED
Contributors to this talk • Tim Woodall
• George Bosilca
• Galen Shipman • Brian Barrett • Ralph Castain • Jeff Squyres • Josh Hursey • Mitch Sukalski • Graham Fagg
UNCLASSIFIED
UNCLASSIFIED
Design Features Assisting in Fault Tolerance
LA-UR-??? UNCLASSIFIED
4
UNCLASSIFIED
Components • Formalized interfaces – Specifies “black box” implementation – Different implementations available at run-time – Can compose different systems Interface 1 on the fly
Caller
Interface 2
Interface 3
Fault Tolerance Is Optional
UNCLASSIFIED
UNCLASSIFIED
OpenRTE Architecture
UNCLASSIFIED
5
UNCLASSIFIED
Open MPI Non Automatic
Automatic Ckpt Based
Log Based Optimistic Log
Cocheck
Framework Independent of MPI Ste96
Casual Log Pessimistic Log
Optimistic recovery in distributed systems N faults with coherent ckpt SY85
other
Manetho n faults EZ92
Starfish Enrichment of MPI AF99
API
Clip Semi-trasparant Ckpt Clp97
Egida Rav99
LAM/MPI
Comm Layer
MPICH-V/CL
FT-MPI MPI/FT Redundance of Tasks BNC01
Pruitt 98 2 faults sender based Sender based Msg Log 1 fault sender based JZ87
Working/ In Progress
MPICH-V N faults Distributed logging
MPI-FT N faults Centralized Sever LNLE00
Planned in Near future
New LA-MPI RG03
UNCLASSIFIED
UNCLASSIFIED
Data Transmission Errors
LA-UR-??? UNCLASSIFIED
6
UNCLASSIFIED
Data Transmission Errors
Network Node 2
Node 1 Registry
Disk B
Disk A UNCLASSIFIED
UNCLASSIFIED
Errors Handled • Data corruption on the network • Data corruption between the NIC and main-memory • Dropped Packets
UNCLASSIFIED
7
UNCLASSIFIED
General Point-To-Point Design • Component Architecture:
– “Plug-ins” for different capabilities (e.g. different networks) – Tunable run-time parameters
OB1 - NO FT
DR - FT
•Dynamic Connection Management
Only Differences –On initial send to peer connection is established via OOB connection –Resources allocated upon connection
•Shared Resource Allocation UNCLASSIFIED
UNCLASSIFIED
Implementation features • Refinement of the LA-MPI implementation • Main-memory to Main memory Checksum/CRC – Ack/Nack – Retransmit Corrupt packets
General Point-To-Point Design • Component Architecture:
– “Plug-ins” for different capabilities (e.g. different networks) – Tunable run-time parameters
OB1 - NO FT
DR - FT
•Dynamic Connection Management
Only Differences –On initial send to peer connection is established via OOB connection –Resources allocated upon connection
•Shared Resource Allocation UNCLASSIFIED
UNCLASSIFIED
IB-2 Failure Low Latency BTL
Bulk Data BTL
IB - 1
IB - 1
IB - 2
IB - 2
Myrinet
Myrinet
Before Failure
Ethernet IB - 1
IB - 1 After Failure
Myrinet
Myrinet
Ethernet UNCLASSIFIED
11
UNCLASSIFIED
Implementation Features • Requires error detection - more expensive • Error detection – ORTE – Watchdog timers
• Reconnect • Remove NIC from list of available resources
UNCLASSIFIED
UNCLASSIFIED
Device failover
UNCLASSIFIED
12
UNCLASSIFIED
NIC Failover: Ping-Pong (MB/sec)
UNCLASSIFIED
UNCLASSIFIED
Checkpoint/Restart
LA-UR-??? UNCLASSIFIED
13
UNCLASSIFIED
Goals • Support a variety of checkpoint/restart protocols – Coordinated [First implementation] – Uncoordinated
• Support a variety of checkpoint/restart systems – Berkeley Labs Checkpoint/Restart (BLCR) [First implementation] – User level checkpoint/restart (self) [First implementation] – Others (Condor, libckpt, …)
• Internal and external checkpoint/restart request mechanisms – Command line tools – API
• Support process migration
UNCLASSIFIED
UNCLASSIFIED
Goals • Designed to support fault tolerance research – Extensible set of MCA frameworks with clearly defined interfaces
• Improved interconnect support – tcp, self, Infiniband, Myrinet, …
• Checkpoint/restart system heterogeneity – The use of more than one checkpoint/restart system to form a consistent global checkpoint of an application.
• Improved user interface to support transparency and reduce complexity
– User does not need to know which checkpoint/restart systems or protocols are being used to checkpoint or restart an application
• Attention paid to performance and scalability
UNCLASSIFIED
14
UNCLASSIFIED
Architecture • OPAL Checkpoint/Restart Service (CRS) – Single process checkpoint/restart system interface
• ORTE Snapshot Coordinator (SnapC) – Launch and monitor a distributed checkpoint/restart – Support checkpoint server architecture
• OMPI Checkpoint/Restart Coordination Protocol (CRCP) – Distributed checkpoint/restart coordination protocol interface – Support at least Coordinated and Uncoordinated protocols
UNCLASSIFIED
UNCLASSIFIED
Architecture • Multilevel notification mechanism – Allows all layers in Open MPI to take action around a checkpoint/restart request
• MCA framework design allows for minimal changes to the Open MPI core
• Many mechanisms available for an application to choose (not) to use checkpoint/restart fault tolerance – Compiler option – Runtime option(s)
UNCLASSIFIED
15
UNCLASSIFIED
Implementation status • OMPI CRCP framework still in development • Checkpoint/restart protocol support: – Coordinated
• Checkpoint/restart system support: – BLCR, self
• Interconnects: – self – tcp – Others as time permits
• Command line tools: – ompi-checkpoint, ompi-restart, ompi-ps
• Current development on a branch, with plans to merge to trunk soon