Reliable Replica-Exchange Molecular Dynamics Simulation in the Grid using SAGA CPR and Migol
Andre Luckow,1 Shantenu Jha,2,3 Andre Merzky,2 Joohyun Kim2 and Bettina Schnor1
1 Institute of Computer Science, University of Potsdam, Germany
2 Center for Computation & Technology, Louisiana State University, USA
3 Department of Computer Science, Louisiana State University, USA
A distributed system is a system on which I cannot get any work done, because some machine I have never heard of has crashed. (Leslie Lamport)
Outline
1 Introduction and Challenges
2 Middleware and Abstractions
3 Replica-Exchange Framework
4 Efficient Job Scheduling – SAGA Glide-In
5 Conclusion
Replica-Exchange Simulations
Replica Exchange: Hello Distributed World
Replica-Exchange (RE) simulations are used to understand important physical phenomena, from protein folding to binding affinity calculations for computational drug discovery.
• Task Level Parallelism
  – Embarrassingly distributable!
  – Loosely coupled
• Create replicas of initial configuration
• Spawn 'N' replicas over different machines
• Run for time t; attempt configuration swap
• Run for further time t; repeat till finish
[Figure: replicas R1, R2, R3, ..., RN]
Pleasingly distributed: in principle loosely coupled, however some ...
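The "attempt configuration swap" step listed above can be sketched as follows. This is a minimal illustration and not the framework's code: it assumes the common temperature-based variant of replica exchange, in which neighbouring replicas swap configurations with the Metropolis probability min(1, exp((β_i − β_j)(E_i − E_j))). The `Replica` struct, `swap_probability`, `attempt_swaps`, and the pre-drawn random numbers are all hypothetical names introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical replica record (assumption, not from the slides).
struct Replica {
    double beta;    // inverse temperature 1 / (k_B * T) of this replica slot
    double energy;  // potential energy of the configuration it currently holds
};

// Metropolis acceptance probability for exchanging the configurations
// held by replicas a and b: min(1, exp((beta_a - beta_b) * (E_a - E_b))).
double swap_probability(const Replica& a, const Replica& b) {
    const double delta = (a.beta - b.beta) * (a.energy - b.energy);
    return std::min(1.0, std::exp(delta));
}

// Attempt swaps between the neighbour pairs (offset, offset + 1),
// (offset + 2, offset + 3), ... Alternating offset 0 and 1 between
// rounds lets every neighbour pair take part over time.
// `uniform[i]` is a pre-drawn random number in [0, 1) for the pair
// starting at index i, which keeps the function deterministic.
// Returns the number of accepted swaps.
std::size_t attempt_swaps(std::vector<Replica>& replicas,
                          const std::vector<double>& uniform,
                          std::size_t offset) {
    assert(uniform.size() + 1 >= replicas.size());
    std::size_t accepted = 0;
    for (std::size_t i = offset; i + 1 < replicas.size(); i += 2) {
        if (uniform[i] < swap_probability(replicas[i], replicas[i + 1])) {
            // Exchange configurations (represented here only by their
            // energies); each slot keeps its own temperature.
            std::swap(replicas[i].energy, replicas[i + 1].energy);
            ++accepted;
        }
    }
    return accepted;
}
```

In a distributed run, the energies would be collected from the remote replica processes after each interval t, which is where the synchronization requirement comes from.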
• Grids are heterogeneous and dynamic.
• RE simulations are loosely coupled but require some synchronization, i.e. a simulation can be blocked by a single stalling process.
• Reliability: Everything fails all the time!
To run efficiently in a distributed environment, RE simulations require:
• A middleware that handles the efficient and reliable execution of jobs.
• A high-level programming abstraction that is suitable for expressing the RE logic and for effectively utilizing distributed resources.
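To illustrate why a single stalling process matters, the sketch below shows one way a manager could flag replicas that have stopped reporting progress before an exchange step. It is an assumption-laden illustration, not how Migol detects failures: the heartbeat timestamps, the timeout, and the function names `find_stalled` and `exchange_can_proceed` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the indices of replicas whose last heartbeat is older than
// `timeout` seconds relative to `now`. All times are in seconds on a
// shared clock (an assumption made for this sketch).
std::vector<std::size_t> find_stalled(const std::vector<double>& last_heartbeat,
                                      double now, double timeout) {
    assert(timeout > 0.0);
    std::vector<std::size_t> stalled;
    for (std::size_t i = 0; i < last_heartbeat.size(); ++i) {
        if (now - last_heartbeat[i] > timeout) {
            stalled.push_back(i);
        }
    }
    return stalled;
}

// The exchange step is a synchronization point: it can only proceed
// once no replica is considered stalled.
bool exchange_can_proceed(const std::vector<double>& last_heartbeat,
                          double now, double timeout) {
    return find_stalled(last_heartbeat, now, timeout).empty();
}
```

With heartbeats at 10, 95 and 99 seconds, a current time of 100 and a timeout of 30, only the first replica is reported, and the exchange is held back until that replica is recovered or replaced.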
The Middleware: Migol
[Figure: Migol architecture. A User Application, the Job Broker Service and the Information Service (WS MDS) interact with three Compute Resources; each Compute Resource offers WS GRAM and GridFTP and runs an Application linked against the Migol Library/SAGA Adaptor.]
The Simple API for Grid Applications (SAGA) is a high-level programmatic abstraction that provides standardised interfaces to the primary functions of distributed environments.
The SAGA Checkpoint Recovery API (CPR) is an extension of the standard SAGA API. SAGA CPR provides an abstraction for starting, monitoring, and recovering checkpoint-restartable jobs.
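The start, monitor, and recover life cycle of a checkpoint-restartable job can be pictured with a small state machine. The sketch below deliberately does not use the SAGA CPR API; `CheckpointedJob`, `JobState`, and every method on them are hypothetical names chosen for illustration, and the "checkpoint" is just a remembered step counter.

```cpp
#include <cassert>
#include <cstddef>

enum class JobState { New, Running, Failed, Done };

// Hypothetical model of a checkpoint-restartable job (not SAGA CPR).
class CheckpointedJob {
public:
    CheckpointedJob(std::size_t total_steps, std::size_t checkpoint_interval)
        : total_(total_steps), interval_(checkpoint_interval) {
        assert(checkpoint_interval > 0);
    }

    void start() { state_ = JobState::Running; }

    // Advance by one step and record a checkpoint every `interval_` steps.
    void step() {
        if (state_ != JobState::Running) return;
        ++current_;
        if (current_ % interval_ == 0) checkpoint_ = current_;
        if (current_ >= total_) state_ = JobState::Done;
    }

    // Simulate a crash of the resource the job runs on.
    void fail() {
        if (state_ == JobState::Running) state_ = JobState::Failed;
    }

    // Restart from the last checkpoint; work done after it is lost.
    void recover() {
        if (state_ != JobState::Failed) return;
        current_ = checkpoint_;
        state_ = JobState::Running;
    }

    JobState state() const { return state_; }
    std::size_t current_step() const { return current_; }
    std::size_t last_checkpoint() const { return checkpoint_; }

private:
    std::size_t total_;
    std::size_t interval_;
    std::size_t current_ = 0;
    std::size_t checkpoint_ = 0;
    JobState state_ = JobState::New;
};
```

A job with 10 steps and a checkpoint interval of 4 that fails after step 6 resumes at step 4, so only two steps are repeated rather than the whole run.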
SAGA CPR Examples
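// Create a CPR service handle for the Migol endpoint.
saga::cpr::service service (saga::url ("migol://flotta.haiti.cs.uni-potsdam.de:8443/wsrf/services/migol/AIS-JGroups"));
// Obtain the handle that represents the calling application itself.
saga::cpr::self self = service.get_self ();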
• The RE-Manager provides a framework for loosely coupled replica-exchange simulations.
• Fault-tolerance services, such as monitoring and automatic recovery, are provided by the Migol middleware.
• SAGA CPR provides the ideal abstraction for the orchestration of replica processes.
• The framework is general purpose and extensible to different usage patterns, deployment scenarios and other simulation codes.