Approaches for Parallel Applications Fault Tolerance - Semantic Scholar
Recommend Documents
In the communication network, several error detection methods can be directly sup- ported by the hardware: such as a parity bit in cacti byte, detection of disconnected links and several ... For software control flow monitoring, the program is.
8421 Wilshire Blvd., Suite 201. Beverly Hills, CA ... The results of Table 1-1 reflect primarily spacecraft ... the following ways: (i) switching from one fault-handling.
cated state machines [13, 15, 4] and Byzantine quorum systems [14]) and ... tacks) that are beyond the current means of fault tolerance and cryptography.
tially implemented as middleware on top of the OS: Windows. CE, VxWorks, Virtuoso [7], TEX [8]. As a pilot application, this FA was deployed on the embedded.
environments to repair each component failure in the robot, the .... ers, algorithms have been developed which reconfigure the data or code ..... Radiation hard-.
for caches are limited either by the number of faults that can be tolerated or by the ... that the processor can operate without a cache under certain circumstances.
wireless coverage it is necessary to integrate fault-tolerance in the design of ... after failure and no integrated wireless and optical fault-tolerance is performed.
Jun 12, 2011 - Permission to make digital or hard copies of all or part of this work for personal or ... operator can individually recover after failure and it can re- cover using a ... of a plan. Since input data comes from disk, it can be read.
Approaches for Parallel Applications Fault Tolerance - Semantic Scholar
Internal and external checkpoint/restart request mechanisms. â Command line tools ... The use of more than one checkpoint/restart system to form a consistent.
UNCLASSIFIED
Approaches for Parallel Applications Fault Tolerance Richard L. Graham Advanced Computing Laboratory Los Alamos National Laboratory LA-UR-06-6526 LA-UR-??? UNCLASSIFIED
UNCLASSIFIED
Overview • Problem definition • Introduction to the Open MPI collaboration • Fault Recovery – Data transmission errors – Network failures – Process failure
• Future work
UNCLASSIFIED
1
UNCLASSIFIED
Problem definition
Network Node 2
Node 1 Registry
Disk B
Disk A UNCLASSIFIED
UNCLASSIFIED
Guiding Principles • End goal: Increase application MTBF • Automation is desirable - more likely to be used • No One-Solution-Fits-All – – – – – –
Hardware characteristics Software characteristics System complexity System resources available for fault recovery Performance impact on application Fault characteristics of application
UNCLASSIFIED
2
UNCLASSIFIED
Related Work Non Automatic
Automatic Ckpt Based
Log Based Optimistic Log
Cocheck
Framework Independent of MPI Ste96
Casual Log Pessimistic Log
Optimistic recovery in distributed systems N faults with coherent ckpt SY85
other
Manetho n faults EZ92
Starfish Enrichment of MPI AF99 Clip Semi-trasparant Ckpt Clp97
API
Egida Rav99
LAM/MPI
Comm Layer
MPICH-V/CL
FT-MPI MPI/FT Redundance of Tasks BNC01
Pruitt 98 2 faults sender based Sender based Msg Log 1 fault sender based JZ87
MPICH-V N faults Distributed logging
MPI-FT N faults Centralized Sever LNLE00 LA-MPI RG03
UNCLASSIFIED
UNCLASSIFIED
Open MPI Collaboration • The University of Tennessee
• Cisco
• Indiana University
• Mellanox
• HLRS
• Voltaire
• The University of Huston
• Sun Microsystems
• Sandia National Laboratory
• Myricom
• LANL
• IBM • QLogic
UNCLASSIFIED
3
UNCLASSIFIED
Contributors to this talk • Tim Woodall
• George Bosilca
• Galen Shipman • Brian Barrett • Ralph Castain • Jeff Squyres • Josh Hursey • Mitch Sukalski • Graham Fagg
UNCLASSIFIED
UNCLASSIFIED
Design Features Assisting in Fault Tolerance
LA-UR-??? UNCLASSIFIED
4
UNCLASSIFIED
Components • Formalized interfaces – Specifies “black box” implementation – Different implementations available at run-time – Can compose different systems Interface 1 on the fly
Caller
Interface 2
Interface 3
Fault Tolerance Is Optional
UNCLASSIFIED
UNCLASSIFIED
OpenRTE Architecture
UNCLASSIFIED
5
UNCLASSIFIED
Open MPI Non Automatic
Automatic Ckpt Based
Log Based Optimistic Log
Cocheck
Framework Independent of MPI Ste96
Casual Log Pessimistic Log
Optimistic recovery in distributed systems N faults with coherent ckpt SY85
other
Manetho n faults EZ92
Starfish Enrichment of MPI AF99
API
Clip Semi-trasparant Ckpt Clp97
Egida Rav99
LAM/MPI
Comm Layer
MPICH-V/CL
FT-MPI MPI/FT Redundance of Tasks BNC01
Pruitt 98 2 faults sender based Sender based Msg Log 1 fault sender based JZ87
Working/ In Progress
MPICH-V N faults Distributed logging
MPI-FT N faults Centralized Sever LNLE00
Planned in Near future
New LA-MPI RG03
UNCLASSIFIED
UNCLASSIFIED
Data Transmission Errors
LA-UR-??? UNCLASSIFIED
6
UNCLASSIFIED
Data Transmission Errors
Network Node 2
Node 1 Registry
Disk B
Disk A UNCLASSIFIED
UNCLASSIFIED
Errors Handled • Data corruption on the network • Data corruption between the NIC and main-memory • Dropped Packets
UNCLASSIFIED
7
UNCLASSIFIED
General Point-To-Point Design • Component Architecture:
– “Plug-ins” for different capabilities (e.g. different networks) – Tunable run-time parameters
OB1 - NO FT
DR - FT
•Dynamic Connection Management
Only Differences –On initial send to peer connection is established via OOB connection –Resources allocated upon connection
•Shared Resource Allocation UNCLASSIFIED
UNCLASSIFIED
Implementation features • Refinement of the LA-MPI implementation • Main-memory to Main memory Checksum/CRC – Ack/Nack – Retransmit Corrupt packets
General Point-To-Point Design • Component Architecture:
– “Plug-ins” for different capabilities (e.g. different networks) – Tunable run-time parameters
OB1 - NO FT
DR - FT
•Dynamic Connection Management
Only Differences –On initial send to peer connection is established via OOB connection –Resources allocated upon connection
•Shared Resource Allocation UNCLASSIFIED
UNCLASSIFIED
IB-2 Failure Low Latency BTL
Bulk Data BTL
IB - 1
IB - 1
IB - 2
IB - 2
Myrinet
Myrinet
Before Failure
Ethernet IB - 1
IB - 1 After Failure
Myrinet
Myrinet
Ethernet UNCLASSIFIED
11
UNCLASSIFIED
Implementation Features • Requires error detection - more expensive • Error detection – ORTE – Watchdog timers
• Reconnect • Remove NIC from list of available resources
UNCLASSIFIED
UNCLASSIFIED
Device failover
UNCLASSIFIED
12
UNCLASSIFIED
NIC Failover: Ping-Pong (MB/sec)
UNCLASSIFIED
UNCLASSIFIED
Checkpoint/Restart
LA-UR-??? UNCLASSIFIED
13
UNCLASSIFIED
Goals • Support a variety of checkpoint/restart protocols – Coordinated [First implementation] – Uncoordinated
• Support a variety of checkpoint/restart systems – Berkeley Labs Checkpoint/Restart (BLCR) [First implementation] – User level checkpoint/restart (self) [First implementation] – Others (Condor, libckpt, …)
• Internal and external checkpoint/restart request mechanisms – Command line tools – API
• Support process migration
UNCLASSIFIED
UNCLASSIFIED
Goals • Designed to support fault tolerance research – Extensible set of MCA frameworks with clearly defined interfaces
• Improved interconnect support – tcp, self, Infiniband, Myrinet, …
• Checkpoint/restart system heterogeneity – The use of more than one checkpoint/restart system to form a consistent global checkpoint of an application.
• Improved user interface to support transparency and reduce complexity
– User does not need to know which checkpoint/restart systems or protocols are being used to checkpoint or restart an application
• Attention paid to performance and scalability
UNCLASSIFIED
14
UNCLASSIFIED
Architecture • OPAL Checkpoint/Restart Service (CRS) – Single process checkpoint/restart system interface
• ORTE Snapshot Coordinator (SnapC) – Launch and monitor a distributed checkpoint/restart – Support checkpoint server architecture
• OMPI Checkpoint/Restart Coordination Protocol (CRCP) – Distributed checkpoint/restart coordination protocol interface – Support at least Coordinated and Uncoordinated protocols
UNCLASSIFIED
UNCLASSIFIED
Architecture • Multilevel notification mechanism – Allows all layers in Open MPI to take action around a checkpoint/restart request
• MCA framework design allows for minimal changes to the Open MPI core
• Many mechanisms available for an application to choose (not) to use checkpoint/restart fault tolerance – Compiler option – Runtime option(s)
UNCLASSIFIED
15
UNCLASSIFIED
Implementation status • OMPI CRCP framework still in development • Checkpoint/restart protocol support: – Coordinated
• Checkpoint/restart system support: – BLCR, self
• Interconnects: – self – tcp – Others as time permits
• Command line tools: – ompi-checkpoint, ompi-restart, ompi-ps
• Current development on a branch, with plans to merge to trunk soon