SPICE: Simulated Pore Interactive Computing Environment – Using Federated Grids for "Grand Challenge" Biomolecular Simulations
Shantenu Jha‡, Peter Coveney‡, Matt Harvey‡
‡Centre for Computational Sciences, UCL, London, WC1H 0AJ, UK
Abstract— SPICE aims to understand the vital process of translocation of biomolecules across protein pores by computing the free energy profile of the translocating biomolecule along the vertical axis of the pore. Without significant advances at the algorithmic, computing and analysis levels, understanding problems of this size and complexity will remain beyond the scope of computational science for the foreseeable future. A novel algorithmic advance is provided by a combination of Steered Molecular Dynamics and Jarzynski's Equation (SMD-JE). Grid computing provides the required new computing paradigm as well as facilitating the adoption of new analytical approaches. SPICE uses sophisticated grid infrastructure to couple distributed high-performance simulations, visualization and the instruments used in the analysis within the same framework. This paper outlines the scientific motivation and describes why distributed resources are critical for the project. We describe how we utilize the resources of a federated trans-Atlantic Grid to use SMD-JE to enhance our understanding of the translocation phenomenon in ways that have not been possible until now. Finally, we briefly document the challenges encountered in using a grid-of-grids and some of the solutions devised in response.
I. INTRODUCTION

The transport of biomolecules like DNA, RNA and polypeptides across protein membrane channels is of primary significance in a variety of areas. For example, gene expression in eukaryotic cells relies on the passage of mRNA through protein complexes connecting the cytoplasm with the cell nucleus. Although there has been a flurry of recent activity, both theoretical and experimental [1], [2], aimed at understanding this crucial process, many aspects remain unclear. The details of the interaction of a pore with a translocating biomolecule within the confined geometries are critical in determining macromolecular transport across a membrane. Of the possible computational approaches, classical molecular dynamics (MD) simulations of biomolecular systems have the ability to provide insight into specific aspects of a biological system at a level of detail not possible with other simulation techniques. MD simulations can be used to study details of phenomena that are often not accessible experimentally [3] and would certainly not be available from simple theoretical approaches. However, the ability to provide such detailed information comes at a price: MD simulations are extremely computationally intensive – prohibitively so in many cases. The first fully atomistic simulations of the hemolysin pore capable of capturing the interaction in full have appeared very recently [4]. They address, however, only static properties – structural and electrostatic – and do not attempt to address the dynamic properties of the translocating DNA. The lack of more attempts at atomistic simulations of the translocation process is due in part to the fact that simulations of systems of this size over the required timescales have hitherto been computationally infeasible. A back-of-the-envelope estimate of the required computational resources helps to explain why there has not been any significant computational contribution to understanding the dynamical aspects of the translocation problem. The physical time scale for translocation of large biomolecules through a transmembrane pore is typically of the order of tens of microseconds. It currently takes approximately 24 hours on 128 processors to simulate one nanosecond of physical time for a system of approximately 300,000 atoms. Thus, it takes about 3000 CPU-hours on a tightly coupled machine to simulate 1 ns. Therefore a straightforward vanilla MD simulation will take 3 × 10^7 CPU-hours to simulate 10 microseconds – a prohibitively expensive amount. Consequently, approaches that are "smarter" than vanilla classical equilibrium MD simulations are required. Relying only on Moore's law (a doubling of speed every 18 months), we are still a couple of decades away from a time when such simulations may become routine. Thus, advances in both the algorithms and the computational approach are imperative to overcome such barriers. SPICE aims to understand the vital process of translocation of biomolecules across protein pores in ways that have not been possible until now, by using grid infrastructure that facilitates the implementation of novel algorithms and analysis techniques and the effective utilisation of the computational resources of a federated trans-Atlantic Grid. The ability to use a federated resource was expedited by a
joint call for proposals (CFP) issued by the US National Science Foundation (NSF) and the UK Engineering and Physical Sciences Research Council (EPSRC) towards the end of 2004 and the beginning of 2005. The CFP asked for projects that aimed to utilize the combined computational resources of the US and UK to demonstrate a capability that would not have been possible using just the US or just the UK grid infrastructure. The call that emanated from the Directorate for Computer and Information Science and Engineering (CISE) at the NSF indicated that "This capability will enable scientists and engineers to create applications that are not specifically tied to any geographic location, but rather applications that are able to utilize distributed resources in a transparent manner." Additionally, the NSF CFP asked that proposals "highlight the grid component as well as the benefits of the cross-Atlantic collaboration." In response to the CFP, three computational science groups from UCL, Tufts and Brown Universities teamed up with a middleware team from NIU/Argonne to meet the challenge [5]. Joining forces had the obvious advantage of furnishing combined resources, which admittedly was a major goal, but it also brought together several groups of domain specialists and middleware experts to collectively address the challenges inherent in exploiting the resources in a grid fashion. The successful demonstrations of these projects (SPICE, Vortonics and NEKTAR) at SC05 established that there exist applications that are sufficiently mature to make effective use of resources that span computational grids. Furthermore, it underscored the need to run at even larger scales, i.e., on grids-of-grids. There have been successful efforts at exploiting federated grids previously [6], [7], [8]. It is interesting to note that some of the same problems encountered here were first identified earlier — especially in the TeraGyroid project [6] — and continue to persist, in spite of the fact that grids have ostensibly matured in the recent past. The three scientific projects had motivations for using federated grids that were new and distinct from those of earlier attempts [5]. For example, the NEKTAR and Vortonics projects are examples of a single code instance running on several resources of a federated grid, as opposed to several identical but independently executing codes running on the resources of a federated grid. Similarly, the SPICE project utilizes optical lightpaths to connect computation and visualization, with the visualization in turn being used to steer the simulation. In this paper, we focus on the SPICE project – an application of significant relevance to the life sciences. Having outlined the scientific motivation and aims of the project, along with the "how", we will establish "why" the project required a federated grid. The reasons are novel and have far-reaching implications. We briefly discuss some scientific results in Section IV.
Fig. 1. Snapshot of a single-stranded DNA beginning its translocation through the alpha-hemolysin protein pore, which is embedded in a lipid membrane bilayer. Water molecules are not shown. Fig. 1b is a cartoon representation showing the sevenfold symmetry of the hemolysin protein and the beta-barrel pore.
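The back-of-the-envelope estimate given in the Introduction above is simple enough to restate explicitly. The following minimal sketch (using only the figures quoted in the text) reproduces the ~3 × 10^7 CPU-hour cost of the vanilla MD approach:

```python
# Cost estimate for a straightforward (vanilla) MD run, using the figures
# quoted in the text: ~24 hours on 128 processors per simulated nanosecond
# for a ~300,000-atom system, and a 10 microsecond target timescale.
cpu_hours_per_ns = 24 * 128              # ~3,000 CPU-hours per nanosecond
target_ns = 10_000                       # 10 microseconds = 10,000 ns
total_cpu_hours = cpu_hours_per_ns * target_ns
print(f"vanilla MD estimate: {total_cpu_hours:.1e} CPU-hours")  # ~3.1e7
```

By contrast, the SMD-JE decomposition described in Section II reduces the net computational requirement by a factor of 50-100.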
In Section V, we provide details of how the grid-of-grids was utilized and present our experience in using federated grids – ranging from some of the simple problems encountered to the more difficult issue of grid interoperability.

II. NOVEL ALGORITHMS AND COMPUTING PARADIGM

To enhance our understanding of the translocation of DNA across protein pores, there is now a need to adopt new algorithmic approaches in conjunction with new computing paradigms: without significant advances at the algorithmic, computing and analysis levels, understanding problems of this nature will remain beyond the scope of computational biologists. In this context, a novel algorithmic advance is provided by a combination of Steered Molecular Dynamics (SMD) and Jarzynski's Equation [9]; Grid computing provides the required new computing paradigm as well as facilitating the adoption of new analytical approaches. The application of an external force in an SMD simulation increases the timescale that can be simulated up to and beyond microseconds, whilst Jarzynski's Equation provides a means of computing the equilibrium free-energy profile (FEP) in the presence of non-equilibrium forces. Thus, SMD simulations provide a natural setting in which to use Jarzynski's equality, and hence the combined approach is referred to as the SMD-JE approach [10]. The potential of mean force (PMF, Φ) is defined as the free energy profile (FEP) along a well defined reaction coordinate.
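For reference, the relation underpinning the SMD-JE approach is Jarzynski's equality [9], which relates the equilibrium free-energy difference ΔF to the ensemble of non-equilibrium work values W measured along the steered trajectories; in the stiff-spring regime the PMF along the pore axis, z, can then be estimated from the exponential work average. This is the standard form from Refs. [9], [10], restated here for convenience rather than reproduced from an equation given in this paper:

```latex
% Jarzynski's equality: the equilibrium free-energy difference from an
% ensemble average over non-equilibrium work values W (beta = 1/(k_B T)):
\left\langle e^{-\beta W} \right\rangle = e^{-\beta \Delta F}
% Leading-order (stiff-spring) estimate of the PMF along the pore axis z,
% obtained from the work accumulated up to displacement z in each SMD run:
\Phi(z) \simeq -\frac{1}{\beta} \ln \left\langle e^{-\beta W(z)} \right\rangle
```

The exponential average is dominated by rare low-work trajectories, which helps explain why the statistical and systematic errors discussed in Section IV depend so sensitively on the pulling velocity v and spring constant κ.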
By computing the PMF for the translocating biomolecule along the vertical axis of the protein pore, significant insight into the translocation process can be obtained. Rather than a single detailed, long-running simulation over physical timescales of a few microseconds (the typical time to solution for which would be months to years on one large supercomputer), SMD-JE permits a decomposition of the problem into a large number of simulations over a coarse-grained physical timescale, with limited loss of detail. Multiple SMD-JE non-equilibrium simulations of several million time-steps (the equivalent of several-nanosecond equilibrium simulations) can be used to study processes at the microsecond timescale. In order to realistically benefit from this approach, however, it is critical to find the values of the parameters that provide an "optimal PMF". These parameters are the pulling velocity (v) and the spring constant (κ) coupling the pulling atom to the SMD atoms. A rigorous analytic relationship between the combined statistical and systematic fluctuations of the PMF and the values of v and κ does not exist; thus there is a need to determine the parameter values that minimize systematic and statistical errors. SPICE implements a method (henceforth referred to as SMD-JE; see Ref. [11] for more details) to compute the free energy profile (FEP) along the vertical axis of the protein pore. By adopting the SMD-JE approach, the net computational requirement for the problem of interest can be reduced by a factor of 50-100. The important caveat, however, is that this methodology requires the introduction of two new variable parameters, with a corresponding uncertainty in the choice of the values of these parameters. Fortunately, the computational advantages can be recovered by performing a set of "preprocessing simulations" which, along with a series of interactive simulations, help inform an appropriate choice of the parameters.

Grid computing: Providing a suitable computing paradigm

When formulated as an SMD-JE problem, there is an intrinsic ability to decompose an otherwise very substantial problem into smaller problems of shorter duration. We use the capabilities developed by the RealityGrid project [12], [13], [14] to make the SMD-JE approach amenable to an efficient solution on the Grid, by using the ability provided by the Grid to easily launch, monitor and steer a large number of parallel simulations. The RealityGrid steering framework – the architecture of which is outlined in Fig. 2a – has also enabled us to easily introduce new analysis and intuition-building approaches; in particular, here we make use of haptic devices within the framework for the first time, as if they were just additional computing resources. To benefit from the advantages of the SMD-JE approach and to facilitate its implementation at all levels – interactive simulations for such large systems, the pre-processing
Fig. 2. Schematic architecture of an archetypal RealityGrid steering configuration (shown in Fig. 2a). The components communicate by exchanging messages through intermediate grid services. The dotted arrows indicate the visualizer sending messages directly to the simulation, which is used extensively for interactive simulations.
simulations and finally the production simulation set – we use the infrastructure of a federated trans-Atlantic grid. A distributed environment is required for two primary reasons. Firstly, as explained above, the SMD-JE approach permits the problem of a single, very long-running simulation to be converted into multiple, shorter (in time) simulations. Each simulation still remains "supercomputing" class, i.e., it requires hundreds of processors on an HPC resource. In order to reduce the time-to-solution in any meaningful way, the many smaller simulations need to be launched over a large number of resources. Secondly, as the model being studied is very large and complex, a large number of processors is needed to provide sufficient compute power for the simulation to be interactive. Consequently, it is rather unlikely that all the resources required for interactive simulations will be available locally — that is, with the simulation local to the visualization engine and both of these co-located with the scientist. Thus some mechanism for coordinating high-end distributed resources is required. As a consequence of requiring geographically distributed resources, high-end interactive simulations are dependent on the performance of the network. Interactive simulations use a visualizer as a steerer, e.g., to apply a force to a subset of atoms; such interactive simulations require, almost uniquely, reliable bi-directional communication — there is a flow of data from the simulation to the visualizer as well as from the visualizer to the simulation. Unreliable communication leads not only to a possible loss of interactivity but, equally seriously, to a significant slowdown of the simulation as it stalls waiting for data from the visualization [15]. Thus interactive MD simulations require high quality-of-service (QoS) networks – as defined by low latency, jitter and packet loss – to ensure reliable bi-directional communication. This leads to the interesting situation where large-scale interactive computations require both computational and visualization
resources to be co-allocated with networks of sufficient QoS. Currently, such high-QoS networks are provided through optical lightpaths and the optically networked Global Lambda Integrated Facility [16].

III. SIMULATION METHOD AND ANALYSIS

The first stage is to use "static" visualization (visualizations not coupled to running simulations) to understand the structural features of the pore. However, the need to understand the functional consequences of structure, as well as the desire for information on forces and dynamic responses, requires the coupling of simulations to the visualizations. These are referred to as interactive molecular dynamics (IMD). Given the size of the model, in order that the simulation can compute forces quickly enough to provide the scientist with any sense of interactivity, simulations typically have to be performed on 256 processors. These initial simulations, along with real-time interactive tools, are used to develop a qualitative understanding of the forces and the DNA's response to forces (Fig. 3). This qualitative understanding helps in choosing the initial range of parameters over which we will try to find the optimal values. IMD simulations are then extended to include haptic devices to obtain an estimate of force values as well as to determine suitable constraints to impose. The checkpointing and cloning of simulations provided by the RealityGrid infrastructure can also be used for verification and validation tests without perturbing the original simulation, and for exploring a particular configuration in greater detail. In interactive mode, the user sends data back to the simulation running on a remote supercomputer, via the visualizer, so that the simulation can compute the changes introduced by the user. When using 256 processors (or more) of an expensive high-end supercomputer it is not acceptable for the simulation to be stalled (or even slowed down) due to unreliable communication between the simulation and the visualization – a general-purpose network is not acceptable. Thus advanced networks that ensure capacity for high-rate/high-volume communications, and well bounded quality of service in terms of packet latency, jitter and packet loss, are critical in such interactive scenarios. Once we have gathered sufficient insight from the interactive phase, we proceed to the batch phase. We used the grid infrastructure shown in Fig. 5 to perform to completion 72 parallel MD simulations in under a week, with each individual simulation running on 128 or 256 processors (depending upon the machine used). This required approximately 75,000 CPU-hours: it is unlikely that such computations would be possible in under a week without a grid infrastructure in place.
Fig. 3. Snapshots (a)-(c) of a ss-DNA as it translocates through the alpha-hemolysin pore. The ss-DNA is steered along the direction of the vertical axis of the pore by applying a force to the C3' atom. Water and lipid atoms are not shown. Notice how the strand of DNA stretches as it nears the constriction (near the middle) in the beta-barrel portion of the pore.
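To put the batch-phase numbers quoted above in perspective, a rough sustained-concurrency estimate (a sketch only, assuming for illustration that the quoted 75,000 CPU-hours are spread evenly over a seven-day window) is:

```python
# Sustained concurrency implied by the batch phase described in Section III:
# ~75,000 CPU-hours of MD completed in under a week, in jobs of 128-256
# processors each. Assumes (for illustration only) an even spread over 7 days.
total_cpu_hours = 75_000
wallclock_hours = 7 * 24                        # one week
sustained = total_cpu_hours / wallclock_hours   # processors busy non-stop
print(f"~{sustained:.0f} processors sustained for the whole week")
print(f"~{sustained / 128:.1f} concurrent 128-CPU jobs, "
      f"or ~{sustained / 256:.1f} concurrent 256-CPU jobs")
```

Sustaining several hundred processors continuously for a week through the batch queues of a single machine is exactly the kind of turnaround the text argues is unlikely without the grid infrastructure of Fig. 5.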
Our approach thus advances high-performance computing into a new domain by using the same grid middleware to integrate high-performance computing, visualization and instrumentation within the same framework. This facilitates the solution of a problem that would otherwise probably not be possible. We now present some results to substantiate this claim.

IV. RESULTS

We will discuss the dependence of the PMF on the individual variable parameters. This emphasises the subtle interplay between statistical and systematic errors and the fact that there is no analytical method that provides a direct means to determine the best parameters to use to compute the PMF.

A. Choice of the sub-trajectory length

We are interested in the PMF along the entire axis of the approximately cylindrical pore. In general, the further the center of mass (COM) of the SMD atoms from its initial position, the greater the statistical and systematic errors; hence when the PMF is required over a long trajectory, it is advantageous to break up a single long trajectory into smaller sub-trajectories. Once again there is no information upfront on the optimal length of a sub-trajectory. As the parameter values used in the computation of the final PMF need to be the same for all sub-trajectories, we choose a sub-trajectory of length 10 Å close to the centre of the pore. In addition to helping lower the computational requirement, this has the advantage of being most likely to be free of boundary effects and is probably the most representative sub-trajectory.

B. Choice of the force constant

The proper choice of the force constant (κ) of the spring is important. Intuitively the force constant is a measure of how strongly the SMD atoms are coupled to the "fictitious" pulling atom. It needs to be large enough that the SMD atoms respect the constraints of the pulling atom, but at the same time it must not be too large or else the PMF will become too noisy. It turns out that fluctuations for κ = 1000 pN/Å are extremely large, but far less so for κ = 10 pN/Å. For κ = 10 pN/Å, however, the SMD atoms are almost un-coupled from the pulling atom, which results in a large variation in the space sampled and in the resulting PMFs for the different v values.

C. Choice of the pulling velocity

Given that the initial and final coordinates are well defined, the larger v, the smaller the computational time required for a single simulation. Thus the number of samples that can be simulated for a fixed computational cost is greater for larger v. It would seem to be beneficial to always opt for a larger number of samples, so as to reduce the statistical fluctuations. This proves to be incorrect,
however, for too large a velocity produces "irreversible work" which results in deviations from the equilibrium PMF (the putatively correct PMF). Consequently, too large a velocity can be a major source of systematic error, as the quicker the DNA is pulled through the pore in the simulations, the less time it has to "sample correctly" the possible configurations. In the extreme limit of adiabatic translocation the PMF generated will be accurate. In general the slower the v, the more accurate the sampling; however this cannot be quantified easily, and a doubling of v could result either in an unchanged PMF or possibly in a highly inaccurate PMF. It is important to normalize the statistical error for the difference in computational requirements. In the computational time in which one sample at a v of 12.5 Å/ns can be generated, eight samples at a v of 100 Å/ns can be generated. Thus, the statistical error of a set of samples of the former should be set to be √8 times that of the latter. The statistical errors (σstat) in Fig. 4 have been normalized to account for the difference in the computational costs. From Figs. 4(a-d), the following observations can be made: The PMF for κ = 10 pN/Å has the least σstat, but the largest systematic (σsys) errors. The σstat is largest for κ = 1000 pN/Å. Thus κ = 100 pN/Å provides a tradeoff value. But for κ = 100 pN/Å there is an insignificant difference in PMF values between v = 12.5 and 25 Å/ns, as well as in the values of σsys. Consequently, given that for the same computational cost the number of samples that can be simulated for the former is twice as large as for the latter, an optimal set of values is κ = 100 pN/Å and v = 12.5 Å/ns. We have thus established from Fig. 4 that it is best to use κ = 100 pN/Å and v = 12.5 Å/ns to compute the PMF. A more detailed analysis can be found in [11].

V. DISCUSSION

A. Determining the optimal parameters using SMD-JE

We discuss two potential drawbacks in our implementation of the SMD-JE approach. Firstly, it does not provide any assurance that we have found the "globally optimal" parameters. The optimal values that we have determined are a function of the set over which we have searched – which in turn is partially determined by the insight gained during the real-time visualization and interactive phase of our analysis. More extensive priming may result in a different set of optimized parameters. These priming simulations are computationally expensive as well; thus there is a tradeoff between dedicating more resources to finding parameters that are potentially closer to the global optimum, and consuming fewer resources obtaining parameters that are close enough to the globally optimal parameters. Secondly, our approach does not determine what the optimal length of a sub-trajectory should be. It is possible that our results depend on the choice of length of the sub-trajectory.
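One way to state the cost normalization applied to σstat in Section IV-C above (a restatement of the argument in the text, under the assumption that the number of affordable samples scales linearly with pulling speed at fixed CPU budget):

```latex
% At fixed computational cost the number of affordable SMD samples, N, grows
% linearly with the pulling velocity v, so the statistical error of the mean
% work (and hence of the PMF estimate) scales as 1/sqrt(N(v)):
N(v) \propto v, \qquad \sigma_{\mathrm{stat}}(v) \propto \frac{\sigma_W(v)}{\sqrt{N(v)}}
% Comparing v_1 = 12.5 Å/ns with v_2 = 100 Å/ns at equal cost:
\frac{N(v_2)}{N(v_1)} = \frac{100}{12.5} = 8
\quad\Longrightarrow\quad
\text{scale } \sigma_{\mathrm{stat}}(v_1) \text{ by } \sqrt{8} \approx 2.83 .
```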
Fig. 4. Plots showing the calculation of the optimal parameters (κ, v) based upon an analysis of the statistical and systematic errors. Panels (a), (b) and (c) plot Φ (kcal/mol) against the displacement of the COM (Å) for κ = 10, 100 and 1000 pN/Å respectively, each for pulling velocities v = 12.5, 25, 50 and 100 Å/ns; panel (d) compares κ = 10, 100 and 1000 pN/Å at v = 12.5 Å/ns.

Fig. 5. Diagram representing the federated US-UK grid – comprised of the US TeraGrid and the UK NGS. SPICE used a subset of the TeraGrid nodes (NCSA, SDSC and PSC), but used all nodes on the UK high-end NGS.

B. Setting up a grid-of-grids: Hiding the heterogeneity
Arguably the most important step after securing the resources required to run on a grid is the ability to grid-enable the applications of interest. In general, rather than wholesale refactoring of codes, grid-enablement should be carried out by interfacing the application codes to suitable grid middleware through well defined user-level APIs. This has the advantage that complex parallel code can be grid-enabled without changing the programming model and with minimal changes to the code. Also, this approach has the extremely important advantage of hiding the heterogeneity of the software stack and the site-specific variability of the different resources from the application. Additionally, once the application has been grid-enabled, it is essentially sheltered from future, potentially disruptive changes in the software stack. For the SPICE project, a parallel MD application (NAMD [17]) is interfaced with the RealityGrid (ReG) steering library through the client-side API. The client side of the ReG Steering Library [13] has a well defined, simple interface, is portable, runs easily on grids enabled with the Globus Toolkit 2.x (GT2) (e.g., UK National Grid Service (NGS) resources and the TeraGrid) and provides all the functionality required to computationally steer the application. Thus the ReG steering infrastructure provides a uniform interface to widely differing hardware and software stacks for an application. Further details on the working of the RealityGrid infrastructure can be found in Reference [18]. This serves the application by allowing it to ignore many of the issues — some quite challenging — introduced by grid computing.
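As an illustration of the interfacing pattern described above (and only as an illustration: the class and method names below are hypothetical placeholders, not the actual RealityGrid steering API), a grid-enabled application typically wraps its existing main loop with thin steering hooks rather than restructuring the parallel code itself:

```python
# Minimal sketch of the interfacing pattern: a thin steering layer wrapped
# around an unchanged simulation kernel. All names are illustrative
# placeholders, NOT the RealityGrid steering API.

class SteeringClient:
    """Stand-in for a client-side steering library."""
    def connect(self, registry_url):
        print(f"registered with {registry_url}")
    def emit(self, step, observables):
        pass                     # would publish data to the visualizer/steerer
    def poll_commands(self):
        return []                # would return pending steering messages
    def disconnect(self):
        print("deregistered")

class DummySim:
    """Stand-in for the parallel MD code; its programming model is untouched."""
    def __init__(self):
        self.step_count = 0
    def advance(self):
        self.step_count += 1
    def observables(self):
        return {"step": self.step_count}
    def apply(self, command):
        pass                     # e.g. change pulling force, checkpoint, clone

def run_steered(sim, steering, registry_url, n_steps, check_every=100):
    steering.connect(registry_url)
    try:
        for step in range(n_steps):
            sim.advance()                          # unchanged MD kernel
            if step % check_every == 0:
                steering.emit(step, sim.observables())
                for cmd in steering.poll_commands():
                    sim.apply(cmd)
    finally:
        steering.disconnect()

if __name__ == "__main__":
    run_steered(DummySim(), SteeringClient(),
                "https://registry.example.org/steering", n_steps=1000)
```

Because the simulation kernel is touched only at these hook points, the same wrapper can be reused unchanged across resources with different software stacks, which is the property exploited above.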
C. Federating Grids: Lessons Learnt

Having ensured that our applications are grid-enabled and run satisfactorily on each individual site, the next logical step was to use multiple sites simultaneously from a grid federated from the UK NGS and US TeraGrid, as shown in Fig. 5. We discuss some of our main experiences of using federated grids.

1) Special configuration issues: Due to local priorities, some resources are configured in ways that make the deployment of user-level middleware difficult. In particular, we encountered the "hidden IP address" problem on more than one resource – whereby internal nodes of the compute resource are not network addressable (i.e., not "visible") to other compute sites. This poses a problem, for example, when the master process — which may be running on a node that is not visible to the "external" world — is required to communicate with a visualization process running on a different machine. There are compelling reasons to configure the compute nodes of a single computational resource, such as a large cluster or SMP, with so-called "hidden IP" addresses (e.g., security and IPv4 address depletion), and these make sense if the machine is expected to run only local applications (i.e., applications in which all the processes execute within that single machine). However, when that same computational resource is called upon to operate as part of a computational grid, hidden IP addresses severely undermine the computer's contribution to the grid. The reason for this is clear: grid applications, by their very nature, attempt to harness the power of distributed computational resources. All but embarrassingly parallel problems (e.g., parameter sweeps) require coordination, and therefore communication, between the processes. Message-passing programs (i.e., MPI applications) are an excellent example of a non-trivial programming model designed to harness distributed processing power, and they are particularly vulnerable to hidden IP addresses. Some sites have bridged this utility gap. For example, the Pittsburgh Supercomputing Center (PSC) has implemented a software (their qsocket library) and hardware (their Access Gateway Nodes) solution that allows them to use hidden IP addresses while also accommodating IP-based inter-cluster communication. While their solution goes a long way towards addressing the problems introduced by hidden IP addresses, there are still some minor issues (i.e., it does not support UDP-based traffic, and routing multiple processes through a single, or even a few, gateway nodes can present a bottleneck).

2) Different levels of maturity of the grid infrastructure: In particular, UKLight/GLIF was either not deployed at all, or not deployed in a functionally stable manner, on most UK resources. This eliminated the use of several resources — although in the case of HPCx there were additional problems which contributed to its not being usable (e.g.,
the hidden IP address problem).

3) Lack of Automated Coordination: The level of human intervention needed to coordinate resources is still too high to meet the definition of grid computing. For example, with advance reservations made by hand, schedulers did not always work and required last-minute corrections and tweaking. The current mode of operation is cumbersome, highly prone to error (one of the authors had to exchange about a dozen emails correcting three distinct errors introduced by two different administrators for one reservation request), and is not a scalable solution. Unfortunately it is difficult to co-schedule more than one resource on a single grid; the lack of scalability across grids can be attributed to the fact that a bespoke solution is required for every different grid used.

4) Persistent and Stable Resources: Demonstrations have a proof-of-concept role to fulfill, and most can get by on infrastructure that is semi-permanent or even transient (e.g., set up for a particular conference). Production-scale science, on the other hand, requires infrastructure that is stable over the long term. The challenges of providing a persistent, stable and usable infrastructure are significantly greater, thereby making it difficult to carry through production science. Hardware failures and security issues cause serious disruption, especially if there are single points of failure. For example, for a period close to SC05, the number of UK resources whose utilization could be coordinated with the US TeraGrid nodes was reduced to one. As luck would have it, there was then a security breach on that one UK node. It took several weeks to sanitize that node, during which time there was no UK node that could be used in conjunction with the US TeraGrid nodes. The importance of redundant infrastructure thus cannot be over-emphasized.

5) Advantages of a Collaborative Approach: The SPICE, Nektar and Vortonics projects, along with the middleware specialists and the resource providers, led to the creation of a small community attempting to use the same set of resources over roughly the same period of time. This not only enabled users to share experiences and problems with other users, middleware experts and resource providers, but also enabled resource providers to share expertise and information amongst each other. Variants of the same problem were encountered across different resources and for different applications. For example, the hidden IP problem caused problems for both the RealityGrid steering infrastructure and MPICH-G2 based applications, on different resources on either side of the Atlantic. The resulting shared expertise and 'collective debugging' (e.g., "is it just my application or does this machine have problems?") not only helps save time but, given the absence of more formal grid-wide debugging mechanisms, is critically important. The ability to collectively influence policy should not be underestimated. For
example, partly due to the needs of the three projects, and based upon our experience that manual solutions do not scale well, TeraGrid developed a web interface for advance (cross-site) reservations. Although this does not completely automate the process, it does remove the need for human intervention at one more level. This illustrates how the collaboration was utilized for collective problem solving — from mundane low-level technical details to the higher-level concern of influencing policy.

6) Barriers to Federation: The primary barrier to interoperability is the varying levels of evolution and maturity of the constituent grids. This is probably due to the differing requirements and boundary conditions imposed on the grid infrastructure, reflecting the differing expertise, priorities for the software stacks and funding of the various grid projects and initiatives. Even if deployment somehow ceased to be an issue, there would still remain the purely technical issue of developing the required infrastructure. We appreciate the fact that co-scheduling computational resources is a difficult problem [19] in itself. But complicating matters further is the fact that, sooner or later, demand for lightpaths will increase and we will be faced with the more general situation of having to coordinate and co-schedule lightpaths with compute resources. There are encouraging attempts underway and progress is being made in this direction [20], [21]; however, much remains to be done. And although co-scheduling is critical, the issue of grid interoperability for large-scale applications does not end there. There is a distinct need for additional niche "services": for example, it is important to ensure that resources on grids are configured for user-level middleware (e.g., MPICH-G2) as well as for more application-specific software (e.g., steering infrastructure). It is important to stress that this paper has dealt with an application which has utilized a federated grid without any explicit a priori effort to provide interoperability by resource providers. The ability to federate in this case is the result of effort to adapt an application to the distinct individual grids, primarily at the user and middleware levels. This approach to interoperability, however, is not scalable. In fact, it can be argued that the probability of success is likely to decrease exponentially with every additional independent grid.

VI. CONCLUSION

SPICE provides an example of an important, large-scale problem that benefits tremendously from using federated grids. In particular it is a good example of the advantages — both quantitative and qualitative — that steering simulations of large biomolecular systems provides. It can be argued that SPICE (indeed, all three projects) not only successfully exploited the combined US-
UK computational resources but also required them. SPICE thus demonstrates the value of federated grids. The use of interactive simulations to explore further interesting and important problems, as well as their uptake as a generally useful computational approach, will however require a stable, easy-to-use infrastructure that satisfactorily addresses the issues of co-scheduling lightpaths, compute and visualization resources. It has been stated that "Applications must be the lifeblood (of the grid)!" [22]. There is, however, the classic chicken-and-egg conundrum: before applications can utilize any infrastructure, the infrastructure must be mature, widely and stably deployed, and supported. But before any infrastructure will be widely and stably deployed, there must be obvious projects that require and can use the infrastructure. In this paper we have established that there exist some first-generation scientific grid applications [23] that are able to utilize a grid-of-grids. We maintain that it is thus timely to provide an obliging infrastructure. This, we believe, will require making the basic components of a grid interoperable; in addition, more will need to be done to enable services, such as co-scheduling, that span grids, in order to facilitate the exploitation of federated grids by large-scale applications. It is worth noting that the grid computing infrastructure used here for computing free energies by SMD-JE can easily be extended to compute free energies using different approaches (e.g., thermodynamic integration [14]). This opens up our approach to many different problems in computational biology, e.g., drug design and cell signalling, where the computation of free energies is critical. Equally important, exactly the same approach used here can be adopted to attempt larger and even more challenging problems in computational biology, as there is no theoretical limit to how well our approach scales; the only constraint is the availability of computational resources.

VII. ACKNOWLEDGMENTS

SPICE utilizes infrastructure from the RealityGrid project. In particular we would like to thank Stephen Pickles, Robin Pinning, Andrew Porter, Robin Haines (Manchester) and Radhika Saksena (UCL) from RealityGrid. Special appreciation is due to Sergiu Sanielevici (PSC) for coordinating many aspects of the cross-site runs and the collaboration. We thank Bruce Boghosian, Suchuang Dong, Lucas Finn, Nick Karonis and George Karniadakis for working along with us as part of the Vortonics and Nektar sister projects. This work has been supported by EPSRC Grant EP/D500028/1 and the EPSRC RealityGrid project GR/R67699.

REFERENCES

[1] D. K. Lubensky and D. R. Nelson. Phys. Rev. E, 31917 (65), 1999; Ralf Metzler and Joseph Klafter. Biophysical Journal, 2776 (85),
2003; Stefan Howorka and Hagan Bayley, Biophysical Journal, 3202 (83), 2002.
[2] A. Meller et al., Phys. Rev. Lett., 3435 (86), 2003; A. F. Sauer-Budge et al., Phys. Rev. Lett., 90(23), 238101, 2003.
[3] M. Karplus and J. A. McCammon. Molecular Dynamics Simulations of Biomolecules. Nature Structural Biology, 9(9):646–652, 2002.
[4] Aleksij Aksimentiev et al., Biophysical Journal, 88, pp. 3745–3761, 2005.
[5] B. Boghosian, P. Coveney, S. Dong, L. Finn, S. Jha, G. Karniadakis, and N. Karonis. Nektar, SPICE and Vortonics – Using Federated Grids for Large Scale Scientific Applications. Submitted to the Proceedings of Challenges of Large Applications in Distributed Environments (CLADE) 2006. http://www.realitygrid.org/publications/triprojects clade06.pdf.
[6] R. Blake, P. V. Coveney, P. Clarke, and S. M. Pickles. The TeraGyroid Experiment – Supercomputing 2003. Scientific Programming, 13(1):1–17, 2005.
[7] J. Chin and P. V. Coveney. Chirality and Domain Growth in the Gyroid Mesophase, preprint (2005).
[8] P. Fowler, S. Jha, and P. V. Coveney. Grid-based Steered Thermodynamic Integration Accelerates the Calculation of Binding Free Energies. Phil. Trans. Royal Society of London A, 363(1833):1999–2015, August 2005. http://www.pubs.royalscoc.ac.uk/philtransa.shtml.
[9] C. Jarzynski. Phys. Rev. Lett., 2690 (78), 1997; C. Jarzynski. Phys. Rev. E, 041622 (65), 2002.
[10] Sanghyun Park et al. Journal of Chemical Physics, 3559, 119 (6), 2003.
[11] S. Jha, P. V. Coveney, M. J. Harvey, and R. Pinning. SPICE: Simulated Pore Interactive Computing Environment. Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 70, 2005. dx.doi.org/10.1109/SC.2005.65.
[12] The RealityGrid Project, http://www.realitygrid.org.
[13] S. Pickles et al. The RealityGrid Computational Steering API Version 1.1, http://www.sve.man.ac.uk/Research/AtoZ/RealityGrid/Steering/ReG steering api.pdf.
[14] P. Fowler, S. Jha and P. V. Coveney, Phil. Trans. Royal Soc. London A, pp. 1999–2016, vol. 363, no. 1833, 2005.
[15] S. Jha, M. J. Harvey, P. V. Coveney, N. Pezzi, S. Pickles, R. L. Pinning, and Peter Clarke. Simulated Pore Interactive Computing Environment (SPICE) – Using Grid Computing to Understand DNA Translocation Across Protein Nanopores Embedded in Lipid Membranes. Proceedings of the UK e-Science All Hands Meeting, September 19-22, 2005. http://www.allhands.org.uk/2005/proceedings/papers/455.pdf.
[16] The Global Lambda Integrated Facility (GLIF), http://www.glif.is.
[17] L. Kale et al. Journal of Computational Physics, 151:283–312, 1999.
[18] J. Chin, J. Harting, S. Jha, P. V. Coveney, A. R. Porter, and S. M. Pickles. Contemporary Physics, 44:417–432, 2003.
[19] Grid Resource Allocation Agreement Protocol Working Group, https://forge.gridforum.org/projects/graap-wg, http://www.fz-juelich.de/zam/RD/coop/ggf/graap/.
[20] J. McLaren and M. McKeown. HARC: Highly Available Robust Co-scheduler. e-print, http://www.realitygrid.org/publications/HARC.pdf.
[21] G-Lambda: Coordination of a Grid Scheduler and Lambda Path Service over GMPLS. http://www.gtrc.aist.go.jp/g-lambda/.
[22] G. Allen, T. Goodale, M. Russell, E. Seidel, and J. Shalf. Classifying and Enabling Grid Applications, page 601. In Grid Computing: Making the Global Infrastructure a Reality. Wiley, 2004.
[23] J. Chin, M. J. Harvey, S. Jha, and P. V. Coveney. Scientific Grid Computing: The First Generation. Computing in Science and Engineering, 10(2):24–32, Sept-Oct 2005.