Reliable Replica-Exchange Molecular Dynamics Simulation in the Grid using SAGA CPR and Migol

Andre Luckow,1 Shantenu Jha,2,3 Andre Merzky,2 Joohyun Kim2 and Bettina Schnor1

1 Institute of Computer Science, University of Potsdam, Germany
2 Center for Computation & Technology, Louisiana State University, USA
3 Department of Computer Science, Louisiana State University, USA

1 / 13

A distributed system is a system on which I cannot get any work done, because some machine I have never heard of has crashed. (Leslie Lamport)

2 / 13

Outline

1 Introduction and Challenges
2 Middleware and Abstractions
3 Replica-Exchange Framework
4 Efficient Job Scheduling – SAGA Glide-In
5 Conclusion

3 / 13

Replica-Exchange Simulations
Replica Exchange: Hello Distributed World

Replica-Exchange (RE) simulations are used to understand important physical phenomena, from protein folding to binding affinity calculations for computational drug discovery.

• Task Level Parallelism
  – Embarrassingly distributable!
  – Loosely coupled
• Create replicas of initial configuration
• Spawn 'N' replicas over different machines
• Run for time t; attempt configuration swap
• Run for further time t; repeat till finish

Pleasingly distributed: loosely coupled in principle, but some synchronization is required between tasks.

[Figure: replicas R1, R2, R3, ..., RN on a temperature axis T (300K up to "hot") over time t, with exchange attempts after each interval t.]
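The "attempt configuration swap" step can be sketched as follows. The slides do not say how a swap is accepted; this sketch assumes the standard Metropolis criterion for temperature replica exchange, energies in kcal/mol, and swaps between neighbouring replicas only. The `Replica` struct and both function names are hypothetical and are not part of the RE-Manager.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Boltzmann constant in kcal/(mol*K); assumes energies are in kcal/mol.
constexpr double kBoltzmann = 0.0019872041;

// Hypothetical record for one replica: its temperature and the potential
// energy of the configuration it currently holds.
struct Replica {
    double temperature;  // Kelvin
    double energy;       // kcal/mol
};

// Metropolis acceptance probability for swapping the configurations of a and b:
// min(1, exp((beta_a - beta_b) * (E_a - E_b))), with beta = 1 / (kB * T).
double swap_probability(const Replica& a, const Replica& b) {
    assert(a.temperature > 0.0 && b.temperature > 0.0);
    const double beta_a = 1.0 / (kBoltzmann * a.temperature);
    const double beta_b = 1.0 / (kBoltzmann * b.temperature);
    return std::min(1.0, std::exp((beta_a - beta_b) * (a.energy - b.energy)));
}

// One exchange attempt over neighbouring pairs (0,1), (2,3), ...
// Swapping the energies stands in for swapping configurations here.
// Returns the number of accepted swaps.
template <class Rng>
int attempt_exchanges(std::vector<Replica>& replicas, Rng& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    int accepted = 0;
    for (std::size_t i = 0; i + 1 < replicas.size(); i += 2) {
        if (uniform(rng) < swap_probability(replicas[i], replicas[i + 1])) {
            std::swap(replicas[i].energy, replicas[i + 1].energy);
            ++accepted;
        }
    }
    return accepted;
}
```

A swap that hands the lower-energy configuration to the colder replica is always accepted; the opposite swap is accepted only with an exponentially small probability. The exchange needs the energies of both partners, which is where the synchronization between otherwise independent tasks comes from.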

4 / 13

Distributed Replica-Exchange Simulations Challenges

• Grids are heterogeneous and dynamic.
• RE simulations are loosely coupled but require some synchronization, i.e. a simulation can be blocked by a single stalling process.
• Reliability: Everything fails all the time!

To run efficiently in a distributed environment, RE simulations require:
• A middleware that handles the efficient and reliable execution of jobs.
• A high-level programming abstraction that is suitable for expressing the RE logic and for utilizing distributed resources effectively.

5 / 13
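To make the cost of a single stalling process concrete, here is a small worked sketch. It assumes that every replica must finish its run before the exchange can happen, so the slowest replica determines the wall-clock time of the step. The function names are hypothetical and the numbers in the note below are illustrative, not measurements from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Wall-clock time of one synchronized RE step: the exchange has to wait for
// the slowest replica, so the maximum runtime sets the pace.
double step_wall_time(const std::vector<double>& replica_runtimes) {
    assert(!replica_runtimes.empty());
    return *std::max_element(replica_runtimes.begin(), replica_runtimes.end());
}

// Fraction of the reserved compute time that is spent computing rather than
// waiting for the exchange.
double step_efficiency(const std::vector<double>& replica_runtimes) {
    const double busy =
        std::accumulate(replica_runtimes.begin(), replica_runtimes.end(), 0.0);
    const double reserved = step_wall_time(replica_runtimes) *
                            static_cast<double>(replica_runtimes.size());
    return busy / reserved;
}
```

With four replicas that each need 10 minutes, the step takes 10 minutes at full efficiency. If one replica lands on a slow or overloaded machine and needs 40 minutes, the step takes 40 minutes and the efficiency drops to 70/160, about 44%.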

The Middleware: Migol

[Figure: Migol architecture. Actors and components shown: User; Job Broker Service; Application Information Service; Monitoring Restart Service; WS MDS; three compute resources, each with WS GRAM, GridFTP and an application linked against the Migol Library/SAGA Adaptor. The legend distinguishes Migol components (the three services and the library/adaptor) from Globus components (WS MDS, WS GRAM, GridFTP).]

Numbered interactions shown in the figure:
1) submit
2) registerService
3) query
4) startJob
5) update
6) monitor
7) restart

6 / 13
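The register, update, monitor and restart interactions can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the slide does not describe how Migol detects a failed application, so the timeout-based check below is an assumption, and `JobRecord`, `JobRegistry` and all method names are hypothetical rather than Migol's interface.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry entry, loosely modelled on 2) registerService and 5) update.
struct JobRecord {
    double last_update;          // time of the most recent update, in seconds
    std::string checkpoint_url;  // last checkpoint reported by the application
};

class JobRegistry {
public:
    void register_job(const std::string& id, double now) {
        jobs_[id] = JobRecord{now, ""};
    }

    // Called on behalf of the application; returns false for unknown jobs.
    bool update(const std::string& id, double now, const std::string& checkpoint_url) {
        auto it = jobs_.find(id);
        if (it == jobs_.end()) return false;
        it->second = JobRecord{now, checkpoint_url};
        return true;
    }

    // Assumed form of 6) monitor: jobs that have been silent for longer than
    // `timeout` seconds are reported as candidates for 7) restart.
    std::vector<std::string> stalled(double now, double timeout) const {
        assert(timeout > 0.0);
        std::vector<std::string> result;
        for (const auto& entry : jobs_) {
            if (now - entry.second.last_update > timeout) result.push_back(entry.first);
        }
        return result;
    }

private:
    std::map<std::string, JobRecord> jobs_;
};
```

Keeping the last reported checkpoint next to the liveness information means that whoever restarts a stalled job already knows which state to resume from.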

SAGA CPR

The Simple API for Grid Applications (SAGA) is a high-level programmatic abstraction that provides standardised interfaces to primary functions of distributed environments. The SAGA Checkpoint Recovery API (CPR) is an extension of the standard SAGA API. SAGA CPR provides an abstraction for starting, monitoring and recovering checkpoint-restartable jobs.
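What "starting, monitoring and recovering" means for a checkpoint-restartable job can be pictured as a small state machine. The sketch below does not use the SAGA CPR API; the state names, the restart budget and the `RestartableJob` type are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical job states; the names are illustrative, not SAGA's own.
enum class JobState { New, Running, Failed, Done, Abandoned };

struct RestartableJob {
    JobState state = JobState::New;
    int restarts = 0;
    int max_restarts = 3;          // assumed restart budget
    int last_checkpoint_step = 0;  // simulation step stored in the latest checkpoint

    void start() {
        assert(state == JobState::New);
        state = JobState::Running;
    }

    // Record a checkpoint while the job is running.
    void checkpoint(int step) {
        assert(state == JobState::Running);
        last_checkpoint_step = step;
    }

    void fail() {
        assert(state == JobState::Running);
        state = JobState::Failed;
    }

    void finish() {
        assert(state == JobState::Running);
        state = JobState::Done;
    }

    // Recover a failed job. Returns the step to resume from, or -1 once the
    // restart budget is spent and the job is given up.
    int recover() {
        assert(state == JobState::Failed);
        if (restarts >= max_restarts) {
            state = JobState::Abandoned;
            return -1;
        }
        ++restarts;
        state = JobState::Running;
        return last_checkpoint_step;
    }
};
```

The point of the abstraction is that only the work done since the last checkpoint is lost when a job fails, instead of the whole run.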

7 / 13

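SAGA CPR Examples

// Create a CPR service for the Migol endpoint and obtain a handle to the running job itself.
saga::cpr::service service (saga::url ("migol://flotta.haiti.cs.uni-potsdam.de:8443/wsrf/services/migol/AIS-JGroups"));
saga::cpr::self self = service.get_self ();

// Declare a checkpoint and register the checkpoint file that belongs to it.
saga::cpr::checkpoint remd_chkpt ("remd_chkpt");
remd_chkpt.add_file (saga::url ("gsiftp://qb.loni.org/work/remd/chkpt.dat"));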

9 / 13

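Replica-Exchange Framework

[Figure: Replica-Exchange framework. Components shown: RE-Manager (Replica Manager) with the SAGA-Migol adaptor; Migol; a grid resource with GridFTP and GRAM; a Replica-Agent with SAGA-CPR/Migol; several NAMD processes. The legend distinguishes three layers: RE-Framework, Migol/SAGA CPR, and Globus.]

Steps shown in the figure:
1) File Staging
2a) Register Job
2b) Submit Job
2c) Start job
3) Update Job Metadata
4) mpirun (NAMD)
monitor (unnumbered, shown on two arrows)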
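The numbered launch sequence can be sketched as an ordered pipeline that stops at the first failing step. This is a minimal illustration, not the RE-Manager's code: the `Step` enum mirrors the labels in the figure, while `StepAction` and `launch_replica` are hypothetical names, and the callback stands in for the real staging, registration and submission calls.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Launch steps in the order numbered in the figure.
enum class Step {
    FileStaging,        // 1)
    RegisterJob,        // 2a)
    SubmitJob,          // 2b)
    StartJob,           // 2c)
    UpdateJobMetadata,  // 3)
    Mpirun              // 4)
};

// Hypothetical callback that performs one step for one replica and reports success.
using StepAction = std::function<bool(Step, int /*replica_id*/)>;

// Runs the steps in order and stops at the first failure.
// Returns the steps that completed successfully.
std::vector<Step> launch_replica(int replica_id, const StepAction& action) {
    assert(action);
    const Step order[] = {Step::FileStaging, Step::RegisterJob,       Step::SubmitJob,
                          Step::StartJob,    Step::UpdateJobMetadata, Step::Mpirun};
    std::vector<Step> completed;
    for (Step step : order) {
        if (!action(step, replica_id)) break;
        completed.push_back(step);
    }
    return completed;
}
```

Returning the completed steps lets the caller decide how to react, for example by retrying the submission on another resource while reusing files that were already staged.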

10 / 13

Efficient Job Scheduling – SAGA Glide-In

[Figure: SAGA-based Glide-In framework. The RE application (RE-Manager) uses an enhanced job model, SAGA Glide-In, built on SAGA File, SAGA CPR/Job and SAGA Advert from the SAGA reference implementation. On each of Resource 1 and Resource 2, a Replica-Agent (SAGA Advert, SAGA CPR) hosts several replicas.]
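One way to picture the glide-in model: each Replica-Agent holds a block of resources, and replicas are placed into free slots of an agent instead of being queued individually. The placement policy below (most free slots first) is purely an assumption for illustration; the slide does not state how replicas are mapped to agents, and `assign_replicas` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Assigns each replica to the agent with the most free slots (ties go to the
// lower agent index). slots[a] is the number of replicas agent a can run at
// once. Returns the agent index for each replica, or -1 where no slot is left.
std::vector<int> assign_replicas(std::size_t replica_count, std::vector<int> slots) {
    std::vector<int> agent_of(replica_count, -1);
    for (std::size_t r = 0; r < replica_count; ++r) {
        int best = -1;
        for (std::size_t a = 0; a < slots.size(); ++a) {
            if (slots[a] > 0 &&
                (best < 0 || slots[a] > slots[static_cast<std::size_t>(best)])) {
                best = static_cast<int>(a);
            }
        }
        if (best < 0) break;  // no capacity left; remaining replicas stay unassigned
        --slots[static_cast<std::size_t>(best)];
        agent_of[r] = best;
    }
    return agent_of;
}
```

For two agents with two slots each, four replicas are spread alternately over both agents; a fifth replica would have to wait until a slot is free.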

11 / 13

Conclusion

• The RE-Manager provides a framework for loosely coupled replica-exchange simulations.
• Fault-tolerance services, such as monitoring and automatic recovery, are provided by the Migol middleware.
• SAGA CPR provides the ideal abstraction for the orchestration of replica processes.
• The framework is general purpose and extensible to different usage patterns, deployment scenarios and other simulation codes.

12 / 13

Thank you!

Any questions?

13 / 13