The BioPAUÁ Project: A Portal for Molecular Dynamics Using Grid Environment
Alan Wilter1, Carla Osthoff1, Cristiane Oliveira1, Diego E. B. Gomes2, Eduardo Hill2, Laurent E. Dardenne1, Patrícia M. Barros1, Pedro A. A. G. L. Loureiro2, Reynaldo Novaes3, and Pedro G. Pascutti2
1 LNCC, Laboratório Nacional de Computação Científica, Petrópolis, Brazil {alan, osthoff, cris, dardenne, patricia}@lncc.br http://www.lncc.br/
2 IBCCF, Instituto de Biofísica Carlos Chagas Filho, UFRJ, Rio de Janeiro, Brazil {diego, hill, ploureiro, pascutti}@biof.ufrj.br http://www.biof.ufrj.br
3 HP-Brasil, Hewlett-Packard Brasil, Porto Alegre, Brazil
[email protected]
Abstract. This paper describes the first release of the BioPAUÁ project, a portal for Molecular Dynamics (MD) that runs on a computational grid environment. It unites MD simulation and analysis tools with grid technologies to support in silico experiments on biomolecular structures. The goal of the BioPAUÁ Project is to offer both a tool and a facility to researchers working in several important fields (e.g., bioinformatics, structural biology, biochemistry, medicinal chemistry), even if they have no special skills in MD simulations. To that end, we employ a Web interface that facilitates the use of our services over a network of clusters spread widely across Brazil, which make up the PAUÁ network. Our methodology is based on the MYGRID middleware and uses the GROMACS package to run molecular dynamics simulations. This work is developed by LNCC/MCT in collaboration with IBCCF/UFRJ and is supported by HP Brazil R&D.
1 Introduction
The increasing popularity of the Internet, together with the availability of powerful computers and high-speed networks, is changing the traditional way of doing high performance computing. These new technologies allow a great variety of distributed computational resources, such as supercomputers, high performance storage systems and workstations, to be combined and used as a unified resource, thus giving form to a Grid computing infrastructure. The Grid concept is analogous to the electric power grid: it aims to couple distributed heterogeneous resources so as to offer consistent and cheap access to a service, regardless of their physical location [1].
MYGRID [2, 3] is a Grid middleware that supports the execution of bag-of-tasks (BoT) applications and provides a global execution environment in which several tasks run remotely and in parallel on the machines that the user can access. BoT applications are parallel applications whose tasks are independent of each other [4, 5]. However, very few users of BoT applications currently use computational Grids, despite the potentially dramatic increase in resources that Grids can bring. We believe that this state of affairs is due to the complexities involved in using Grid technology and to the slow deployment of existing Grid infrastructure.
Molecular dynamics simulation is an example of a class of applications that can require a large amount of computational resources. The Molecular Modelling and Dynamics Laboratory of the Biophysics Institute Carlos Chagas Filho (IBCCF) at the Federal University of Rio de Janeiro (the IBCCF/UFRJ group) has been studying intermolecular interactions for a family of variant proteases complexed with some potential inhibitors by means of Molecular Dynamics (MD) techniques [6]. These in silico experiments cover nanosecond time scales and usually consist of a protein in a box filled with water molecules, sometimes with tens of thousands of atoms, under periodic boundary conditions. Since there are several protease polymorphisms and several inhibitors to be studied, the problem becomes both computing and data intensive; moreover, each simulation is independent. The problem can therefore be cast as a BoT application, which makes it suitable for a Grid. Moreover, since MD is a new and fruitful field still in development, the validation of so-called in silico experiments, and their consequent acceptance by the scientific community, can only be achieved when they reproduce, theoretically, the results of biological experiments performed in vitro or in vivo [7, 8].
This condition requires the improvement and testing of MD protocols, which invariably implies running several simulations of a system in order to determine the best methodology. Since MD tasks are computationally very demanding, usually taking several days of simulation, and can be run independently, they are a good fit for a grid of computers.
In this paper, we describe the BioPAUÁ portal, which unites Molecular Dynamics tools and Grid technologies to show that their combined effect can advance experimental science. The goal of the BioPAUÁ Project is to offer both a tool and a facility to researchers working in several important fields (e.g., bioinformatics, structural biology, medicinal chemistry), even if they have no special skills in MD simulations. To this end, the BioPAUÁ Project is developing a Web Portal through which biomolecular modelling researchers can access high performance computing power on the computational grid of the PAUÁ Network. It employs MYGRID (a middleware for computational grids) with a Web interface that integrates the tools needed for studies involving Molecular Dynamics (MD) simulations. The first Portal release is aimed mainly at the scientific community, both specialists and those not yet familiar with MD, and offers them a Web environment for Molecular Dynamics research on a Grid Computing platform. The Portal accepts MD jobs defined by the user and spreads them over a cluster and a grid of computers.
This paper is organised as follows. Section 2 describes the nature of Molecular Dynamics experiments and motivates the Portal tool. Section 3 describes the implementation of the Portal Grid Computing environment. Section 4 presents Grid Computing technology background. Section 5 discusses related works. We conclude in Section 6 and present future works.
2 Molecular Dynamics: in silico Experiments
The natural step after genomics studies is the determination of the three-dimensional structures of possible target proteins, which are important, e.g., for biotechnological processes and the development of new drugs. However, only a small fraction of such possible target proteins may have their 3D structure determined by experimental techniques such as X-ray crystallography and nuclear magnetic resonance (NMR). Among the biomolecules whose spatial structure is available at the Protein Data Bank (http://www.pdb.org) is the class of proteolytic enzymes, which have become a therapeutic target against infectious diseases. This has resulted in the development of new drugs, and much of this work has been done with the help of computer-aided approaches. This methodology is now well known as “rational drug design” (RDD). In this class we find several HIV type 1 proteases complexed with inhibitors [9], and from their structures we are able to build other structural models for different polymorphisms by Comparative Modelling.
In recent years, the rational drug design approach, which explores the concept of structural complementarity to generate specific antagonists for target molecules, has gained acceptance. Knowing the spatial structure of an enzyme in detail, and with the help of computational tools and a structural data bank, we can attempt to identify molecules whose specificity matches the active site of the enzyme and which may therefore become its inhibitors. This procedure not only reduces the time and cost involved in searching for new prototype drugs, but is also less hazardous, since it needs less biological manipulation. For instance, since 1995 the control of AIDS in many countries has relied on protease inhibitors developed by this approach [10, 11].
One area of investigation within RDD is the study of intermolecular interactions in protease/ligand complexes. Structural fluctuations and conformational flexibility are important for enzymatic activity. Because the internal movements involved span different time scales and amplitudes, a dynamical analysis of an enzyme is required in order to understand its function and its interaction with inhibitors at the atomic scale. Thus, theoretical models, as well as experimental ones, complexed with different ligand compounds (prototype inhibitors) need long MD simulations in order to yield valuable information. A system composed of an enzyme/inhibitor complex immersed in aqueous solvent, with the aim of simulating physiological conditions, can easily reach tens of thousands of atoms. Simulating such a system by MD demands a lot of time (days to weeks, even months) on a dedicated high-end computer to reproduce only a few nanoseconds, besides a large amount of space for data storage (tens to hundreds of megabytes). These issues limit the investigation of prototype drugs, which often need to be tested in tens of complex combinations, each requiring long simulations. Moreover, there is also the problem of investigating how proteases with particular mutations, some of which confer drug resistance [12], cope with the ligand. These factors make the application of MD to the investigation of complexes a massive computational problem, which leads us to computational grids as a possible solution.
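The independence of these simulations can be made concrete with a minimal sketch. It enumerates one task per pair of protease variant and ligand; the variant and ligand names and the `MDTask` type are hypothetical placeholders, not data from the studies cited above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MDTask:
    """One independent MD simulation (hypothetical illustration type)."""
    variant: str
    ligand: str


def build_bag_of_tasks(variants, ligands):
    """Return one task per (variant, ligand) pair.

    No task depends on the output of another, which is what makes
    the set a bag-of-tasks (BoT) workload.
    """
    return [MDTask(v, l) for v, l in product(variants, ligands)]


# Placeholder names, chosen only for illustration.
variants = ["variant_A", "variant_B", "variant_C"]
ligands = ["ligand_1", "ligand_2"]

bag = build_bag_of_tasks(variants, ligands)
print(len(bag))  # 3 variants x 2 ligands = 6 tasks
```

Because the number of tasks grows as the product of the two lists, even modest studies quickly produce dozens of long-running simulations, which is the situation a grid is meant to address.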
3 BioPAUÁ Portal Description
The BioPAUÁ portal provides services for the submission and execution of MD simulations, based on the GROMACS package [13], through a grid computing platform. However, much as we would like to, we cannot easily cover every possible use of GROMACS, which can also perform, e.g., normal mode analysis and free energy calculations, besides other sophisticated options. Even in the MD and optimisation protocols that we offer, we have had to restrict the options, based on our experience, in order to ease the user's introduction to MD techniques and to let him/her start running MD simulations immediately and as painlessly as possible. Nevertheless, an expert in GROMACS is also able to use our grid facility via the Portal, as described below. In principle, then, any kind of user can make use of BioPAUÁ.
To access the Portal services, the user must first register and accept our Terms of Use. The general Portal layout is simple and designed to be easily accessible (Fig. 1 illustrates its Welcome page). Once logged in, the user is able to use the MD services in the following ways:
i) By submitting any PDB file containing amino acid residues, with or without a topology file for the ligand (an ITP file in GROMACS format). Any residue or molecule group that is not identified automatically by GROMACS, and whose topology is not provided by the user, will be left out of the simulation. The user must be aware of this.
ii) By submitting any PDB file that is not necessarily a protein (just a ligand, for example), as long as the residue identifiers in the PDB file match those recognised by GROMACS (such as nucleotide bases, some saccharides and lipids) or the file is accompanied by its topology file in ITP format.
iii) By submitting just a TPR file, previously prepared by a user who is expert in GROMACS. In this mode, virtually any GROMACS application is feasible.
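The three submission modes can be summarised as a small decision function. This is only a sketch of the rules listed above; the function name, the mode labels, and the reliance on file extensions are assumptions made for illustration, not the Portal's actual implementation.

```python
from pathlib import PurePath


def submission_mode(filenames):
    """Classify an upload according to the three modes described above.

    Returns "tpr" for a single prepared run input file (mode iii),
    "pdb+itp" for a structure with a user-supplied topology, and "pdb"
    for a structure alone. Telling mode i from mode ii would require
    inspecting the residues inside the PDB file, which is not done here.
    """
    suffixes = sorted(PurePath(name).suffix.lower() for name in filenames)
    if suffixes == [".tpr"]:
        return "tpr"
    if suffixes == [".pdb"]:
        return "pdb"
    if suffixes == [".itp", ".pdb"]:
        return "pdb+itp"
    raise ValueError(f"unsupported combination of files: {filenames}")


# Hypothetical file names.
print(submission_mode(["complex.pdb", "ligand.itp"]))  # pdb+itp
```

Rejecting unknown combinations up front, rather than guessing, mirrors the warning in mode i that unrecognised groups are silently dropped from the simulation.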
It is also possible to build particular topologies whose parameters are not included in the GROMACS topology database by accessing the Dundee PRODRG2 Server [14] (http://davapc1.bioch.dundee.ac.uk/programs/prodrg/). BioPAUÁ itself, however, does not handle the construction of topologies, which inevitably requires expertise in MD theory. Users can monitor their jobs, cancel and/or remove them and, eventually, download the result files, with logs, trajectories, energies, analyses etc. They are also able to extend their simulations, which is particularly appealing because users can plan their simulation projects in several stages, an approach we strongly recommend. In addition, one of the output files lists the GROMACS commands executed sequentially by the BioPAUÁ server, with the intention of allowing users to reproduce the simulations and analyses by themselves on a local computer and thereby acquire some GROMACS skills.
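A command list of the kind found in that output file might look like the sketch below. The sequence is an assumption based on a typical GROMACS 3.x workflow and is not taken from the BioPAUÁ server; the file names are placeholders, and a real protocol would add energy minimisation and equilibration stages.

```python
def gromacs_command_log(pdb_file, mdp_file):
    """Return the shell commands for a bare-bones GROMACS 3.x MD run.

    Assumed sequence (not the Portal's actual protocol):
    topology -> box -> solvent -> run input -> simulation.
    """
    return [
        # Build topology and coordinates from the PDB file.
        f"pdb2gmx -f {pdb_file} -o conf.gro -p topol.top",
        # Place the solute in a periodic box, 1.0 nm from the box edges.
        "editconf -f conf.gro -o box.gro -d 1.0",
        # Fill the box with water molecules.
        "genbox -cp box.gro -cs spc216.gro -o solvated.gro -p topol.top",
        # Combine parameters, coordinates and topology into a TPR file.
        f"grompp -f {mdp_file} -c solvated.gro -p topol.top -o topol.tpr",
        # Run the simulation described by the TPR file.
        "mdrun -s topol.tpr -o traj.trr -c final.gro -e ener.edr -g md.log",
    ]


for command in gromacs_command_log("protein.pdb", "md.mdp"):
    print(command)
```

The last two steps also show why the TPR-only submission mode is so general: everything before `mdrun` can be done by an expert on a local machine.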
Fig. 1. Screenshot of the BioPAUÁ Portal presentation page: user will only have access to MD services after registering and logging in. The Portal web address is http://www.biopaua.lncc.br
3.1 GROMACS Software Package
GROMACS [15] (http://www.gromacs.org) is a programme that solves the Newtonian equations of motion for systems of up to millions of atoms with extremely high performance, since it uses highly optimised code, especially for x86 processors. It is probably the fastest MD programme available. It can be applied not only to biomolecular systems but also to other systems, for instance polymers. It also comes with a large selection of flexible tools for trajectory analysis, whose output is ready for visualisation with graphical tools. In addition, GROMACS is free software, available under the GNU General Public License. Nowadays, a large scientific community uses, develops and supports this package. Because of these remarkable features, GROMACS became our natural choice for running MD simulations in the Portal’s background, although it was of course necessary to adapt it to our web interface and to spread jobs over the grid facility.
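To illustrate what solving Newtonian equations of motion means in practice, the sketch below integrates a single particle on a harmonic spring with the leap-frog scheme, a standard integrator in MD. It is a toy example under stated assumptions (one dimension, unit mass, arbitrary spring constant and time step) and is unrelated to the optimised GROMACS implementation.

```python
def leapfrog(x, v_half, force, mass, dt, steps):
    """Integrate m * d2x/dt2 = F(x) with the leap-frog scheme.

    x      : position at time t
    v_half : velocity at time t - dt/2
    Returns the list of positions visited, including the start.
    """
    trajectory = [x]
    for _ in range(steps):
        v_half = v_half + dt * force(x) / mass  # v(t + dt/2)
        x = x + dt * v_half                     # x(t + dt)
        trajectory.append(x)
    return trajectory


# Toy system (illustrative values): harmonic spring, F(x) = -k x.
k, mass, dt = 1.0, 1.0, 0.01
x0 = 1.0
# Start the velocity half a step back so that v(0) is approximately 0.
v_half0 = 0.0 - 0.5 * dt * (-k * x0) / mass

positions = leapfrog(x0, v_half0, lambda x: -k * x, mass, dt, steps=1000)
print(max(abs(p) for p in positions))  # stays close to the initial amplitude
```

In a real MD run the scalar position becomes the coordinates of every atom and the force comes from the force field, which is where almost all of the computing time is spent.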
4 Grid Computing Technology Background
4.1 PAUÁ Network Project
PAUÁ, which means “everything” in Tupi-Guarani [16], is an initiative created by HP Brazil R&D to build a countrywide Brazilian Grid. PAUÁ currently involves 11 universities and research centres that collaborate with HP Brazil R&D in what we call “HP Brazil’s research ecosystem”. The goals of PAUÁ are twofold. The first goal is to take advantage of the computational resources available at the different research centres, as well as at HP Brazil R&D itself, creating a wide, geographically distributed Grid across the country. The second goal is to foster grid research, so that the solution currently being developed is constantly improved based on its own usage and experience. PAUÁ is a 250-node grid that supports the execution of Bag-of-Tasks (BoT) applications, that is, parallel applications whose tasks are independent of each other. Because of this independence, BoT applications can be executed successfully over widely distributed computational grids. One can argue that BoT applications are the applications best suited to computational grids, where communication can easily become a bottleneck for tightly coupled parallel applications. In PAUÁ parlance, the peer-to-peer resource exchange network is called OURGRID, so the site scheduler is an OURGRID peer. The broker and job scheduler is termed MYGRID.
4.2 OURGRID / MYGRID Project
The OURGRID Project (http://www.ourgrid.org) is a collaborative effort involving Hewlett-Packard (HP) and the Federal University of Campina Grande (UFCG) to research and develop solutions for the usage and management of computational grids. Grid Computing appeared with the enticing promise of turning computing into a utility. The vision is: plug into the grid and solve your problem. However, turning the grid vision into reality is no trivial matter.
Despite the great progress made in recent years, Grid Computing is still far from a reality for most users. The OURGRID Project aims to deliver grid technology that can be used today by current users to solve present problems. The three major components of the OURGRID toolkit [17] are the OURGRID Community, MYGRID and SWAN. MYGRID is the user's broker when dealing with the grid. The OURGRID Community is responsible for assembling the grid to be used by MYGRID instances. It is a peer-to-peer system that addresses the issue of grid assembly for applications that benefit from best-effort resource allocation. It follows an approach often used in peer-to-peer systems: the software is designed as a simple system that stresses ease of deployment in order to ease its adoption. In doing so, it provides, today, grids that are useful, scalable and flexible to a significant portion of the grid community.
Each peer in the community is an institution that owns a number of resources and occasionally needs more computing power than these resources can provide. Whenever a peer needs more power, it requests resources from the community. Whenever it has idle resources, it allocates them to the requesters. To encourage the contribution of resources to the network, OURGRID uses a resource allocation mechanism called the Network of Favours [18]. The Network of Favours is an autonomous reputation scheme that rewards peers that contribute more. In this way, there is an incentive for each peer to contribute as much as possible to the system.
As the user's broker, MYGRID is responsible for providing the user with high-level abstractions for dealing with the grid. The key abstractions provided are tasks, jobs, and grid machines. A job is a collection of tasks, which are the units of work that can be executed in parallel. Each resource in the grid is called a grid machine. The user is responsible for providing a description of his/her jobs and an entry point to the OURGRID Community in the form of the URL of an OURGRID Peer. MYGRID uses this information to request grid machines from the peer and to schedule the user's application on the resources to which it gains access. This allows the user to specify the requirements of jobs and tasks, and MYGRID to match tasks to grid machines.
An instance of the OURGRID Community is composed of a set of peers. Each peer represents a site. We call users at the same site local users and all others community users. Each local user uses a MYGRID instance to interact with the local OURGRID Peer.
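The job, task and grid machine abstractions can be pictured with a few plain data types. The sketch is illustrative only: the class names, the attribute dictionary, and the matching rule are assumptions and do not reproduce MYGRID's job description language or API.

```python
from dataclasses import dataclass, field


@dataclass
class GridMachine:
    name: str
    attributes: dict  # e.g. {"os": "linux"}


@dataclass
class Task:
    command: str
    requirements: dict = field(default_factory=dict)


@dataclass
class Job:
    tasks: list


def satisfies(machine, task):
    """A machine matches when every required attribute has the same value."""
    return all(machine.attributes.get(key) == value
               for key, value in task.requirements.items())


def match(job, machines):
    """Greedily assign each task to the first free matching machine.

    Returns {task index: machine name}; unmatched tasks are left out.
    """
    free = list(machines)
    assignment = {}
    for index, task in enumerate(job.tasks):
        for machine in free:
            if satisfies(machine, task):
                assignment[index] = machine.name
                free.remove(machine)
                break
    return assignment


# Hypothetical machines and commands.
machines = [GridMachine("node1", {"os": "linux"}),
            GridMachine("node2", {"os": "linux"})]
job = Job([Task("run_md variant_A", {"os": "linux"}),
           Task("run_md variant_B", {"os": "linux"}),
           Task("run_md variant_C", {"os": "linux"})])
print(match(job, machines))  # the third task waits for a free machine
```

Since the tasks of a BoT job are independent, a task left unassigned here simply waits; it never blocks the tasks that did receive a machine.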
Each MYGRID instance requests grid machines from the peer according to the jobs its user has submitted. The peer is then responsible for assembling a set of grid machines at its own site that satisfy the requests; after that, it requests resources from the other peers in the OURGRID community. Each peer is responsible for allocating both the local resources and those obtained from the community among its local users. When a peer grants unknown community users access to its resources, several security issues emerge. The resource owners need to be certain that community users cannot damage the resources made available or use them for malicious purposes.
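The allocation order described above, local resources first and the community second, can be sketched as follows. The function and its arguments are hypothetical; real OURGRID peers also apply the Network of Favours and security policies, which this sketch omits.

```python
def assemble_machines(requested, local_idle, community_idle):
    """Pick machines for a request: local site first, then other peers.

    requested      : number of grid machines the MYGRID broker asked for
    local_idle     : names of idle machines at the local site
    community_idle : {peer name: [idle machine names]} for other peers
    Returns the list of granted machines, possibly shorter than requested
    because allocation is best-effort.
    """
    granted = list(local_idle[:requested])
    for peer, machines in community_idle.items():
        missing = requested - len(granted)
        if missing <= 0:
            break
        granted.extend(machines[:missing])
    return granted


# Hypothetical site with one idle machine and one remote peer.
print(assemble_machines(3, ["a1"], {"peerB": ["b1", "b2", "b3"]}))
# ['a1', 'b1', 'b2']
```

Returning fewer machines than requested, instead of failing, reflects the best-effort character of the community: the broker schedules whatever it obtains.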
5 Related Works
The contributions of this work are twofold. First, it contributes to research on Grid Computing application portals; second, it contributes to research on bioinformatics tools.
The application portal contribution is related to the development of a portal for the OURGRID Community. The OURGRID Toolkit design guidelines dictate that all of its components must be extensible, encompassing and easy to deploy. The OURGRID Community is extensible and encompassing through its Grid Machine Interface, which makes it possible to add new types of resources to the system transparently. Because the interface is a simple set of operations, it can be implemented for virtually any resource that can receive data and run a computation. Nevertheless, grid-computing abstractions deal with concepts that are usually of no interest to researchers outside computer science. Therefore, the BioPAUÁ portal interface presents abstractions to grid users that hide the interface issues of the underlying bioinformatics grid computing implementation. While still in development, and hence closed to the community, the Portal has been used by us, and some research results on protease/inhibitor complexes were obtained [6], as well as on falcipain-2 complexed with E64 and Z-LR-AMC [19]. We can also emphasise that the Portal is able to send the user some basic initial analyses of the simulations in graphical form, so that they can be easily visualised. RMSD and energies are examples of analysis outputs already implemented in this first Portal release (Fig. 2).
Fig. 2. An example of RMS fluctuation graphical output generated by GROMACS analysis tool via BioPAUÁ Portal for falcipain-2, as studied by Gomes et al. [19]
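For readers unfamiliar with the RMSD mentioned among the Portal's analysis outputs, the sketch below computes the root-mean-square deviation between two sets of atomic coordinates. It assumes that the two structures are already superimposed and weights all atoms equally; GROMACS analysis tools normally perform a least-squares fit first, a step omitted here. The coordinates are made-up values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate sets.

    Each argument is a list of (x, y, z) tuples, one per atom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in size")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, frame))  # every atom moved by 1.0, so the RMSD is 1.0
```

Evaluating this quantity for every frame of a trajectory against the starting structure gives the RMSD-versus-time curve commonly used to judge whether a simulation has stabilised.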
The contribution to bioinformatics tools research is aimed at the scientific community in the fields of Molecular Biology and Pharmacology, and does not require from its members any extensive knowledge of computing. There are other web-based bioinformatics projects. For instance, BioBrew is an open source Linux cluster distribution based on the popular Rocks (http://www.rocksclusters.org) cluster software and enhanced for bioinformatics, and Biology Workbench (http://workbench.sdsc.edu) is a web-based tool for biologists. WorkBench [20] allows biologists to search many popular protein and nucleic acid sequence databases. Database searching is integrated with access to a wide variety of analysis and modelling tools, all within a point-and-click interface that eliminates file format compatibility problems. Unlike BioPAUÁ, those projects do not address the issues of a grid computing interface.
The closest work to BioPAUÁ with respect to grid computing is “The National Fusion Collaboration”, which focuses on enabling fusion scientists to explore grid capabilities in support of experimental science. The Virtual Control Room [21] unites collaborative, visualisation and Grid technologies to show how their combined effect can advance fusion experiments, engaging more scientists from geographically distributed teams and resources. However, that work is based on the GLOBUS toolkit (http://www.globus.org). The GLOBUS Toolkit includes software services and libraries for resource monitoring, discovery, and management, plus security and file management, and its technologies do not address scheduling or resource management directly, as the OURGRID toolkit does. Rather, they provide the grid building blocks, the common foundation on which grids are built. Furthermore, through the OURGRID toolkit interface it is also possible to wrap other grid software, such as GLOBUS, in order to provide interoperability.
6 Conclusions and Future Works
We have presented the first release of the BioPAUÁ portal, a Molecular Dynamics tool for a Grid environment. Our main contributions are in the fields of Grid Computing application portal research and bioinformatics tools research. The bioinformatics portal is intended for the scientific community in the fields of Molecular Biology and Pharmacology that is not yet familiar with Molecular Dynamics (MD) simulations. The first release is a Web environment for Molecular Dynamics research that serves to investigate the dynamical properties of proteins, with emphasis on protein/ligand complexes. It is not limited to this, however, since an expert GROMACS user is also able to build any specific simulation and submit it to the PAUÁ Grid.
As future work, we are improving the first version to offer more simulation and analysis options, as well as a better interface. Likewise, we are investigating solutions that introduce checkpoints so that failed grid jobs can be resubmitted, through GROMACS and/or through the MYGRID checkpoint facilities. We are also investigating the addition of new Molecular Biology and Pharmacology applications, such as high-throughput drug screening. Finally, the BioPAUÁ project is now part of the PortalGiGa project (http://www.rnp.br), supported by “Rede Nacional de Pesquisas (RNP)”. It will therefore be used to investigate gigabyte network issues and will itself take advantage of such bandwidth, since MD jobs deliver files of many megabytes.
References
1. Foster, I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. To appear in the International Journal of Supercomputing Applications
2. Cirne, W., Marzullo, K.: Open Grid: A User-Centric Approach for Grid Computing. 13th Symposium on Computer Architecture and High Performance Computing (September 2001)
3. Cirne, W., Paranhos, D., Costa, L., Santos-Neto, E., Brasileiro, F., Sauvé, J., Silva, F. B. A., Osthoff, C., Silveira, C.: Proceedings of the ICCP'2003 - International Conference on Parallel Processing (October 2003)
4. Allcock, W., Bester, J., Bresnahan, J., Chervenak, A., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., Tuecke, S.: Data Management and Transfer in High-Performance Computational Grid Environments. Parallel Computing (2001)
5. Silva, F., Scherson, I. D.: Efficient Parallel Job Scheduling Using Gang Service. International Journal of Foundations of Computer Science (June 2001)
6. Batista, P., Wilter, A., Durham, E. H. A. B., Pascutti, P. G.: Molecular Dynamics Simulations Applied to the Study of Subtypes of HIV-1 Protease Common to Brazil, Africa and Asia. Journal Cell Biochemistry and Biophysics (2005) in press
7. Collins, J. R., Burt, S. K., Erickson, J. W.: Flap Opening in HIV-1 Protease Simulated by Activated Molecular Dynamics. Nature Struct. Biol. 2 (1995) 334-338
8. Wang, W., Kollman, P. A.: Computational Study of Protein Specificity: The Molecular Basis of HIV-1 Protease Drug Resistance. Proceedings of the National Academy of Science, USA 98: 26 (2001) 14937-14942
9. Deeks, S. G., Smith, M., Holodniy, M., Kahn, J. O.: HIV-1 Protease Inhibitors. A Review for Clinicians. Journal of the American Medical Association 277 (1997) 145-153
10. Chou, K-C., Tomasselli, G., Reardon, I. M., Heinrikson, R. L.: Predicting Human Immunodeficiency Virus Protease Cleavage Sites in Proteins by a Discriminant Function Method. Proteins: Struct. Funct. Genet. 24 (1996) 51-72
11. Prabu-Jeyabalan, M., Nalivaika, E., Schiffer, C. A.: How Does a Symmetric Dimer Recognize an Asymmetric Substrate? A Substrate Complex of HIV-1 Protease. J. Mol. Biol. 301 (2000) 1207-1220
12. Caride, E., Hertogs, K., Larder, B., Dehertogh, P., Brindeiro, R., Machado, E., De Sá, C. A. M., Eyer-Silva, W. A., Sion, F. S., Passioni, L. F. C., Menezes, J. A., Calazans, A. R., Tanuri, A.: Genotypic and Phenotypic Evidence of Different Drug-Resistance Mutation Patterns between Non-B Subtype Isolates of Human Immunodeficiency Virus Type 1 Found in Brazilian Patients Failing HAART. Virus Genes 23: 2 (2001) 193-202
13. Lindahl, E., Hess, B., van der Spoel, D.: Gromacs 3.0: A Package for Molecular Simulation and Trajectory Analysis. J. Mol. Mod. 7 (2001) 306-317
14. van Aalten, D. M., Bywater, R., Findlay, J. B., Hendlich, M., Hooft, R. W., Vriend, G.: PRODRG, a Program for Generating Molecular Topologies and Unique Molecular Descriptors from Coordinates of Small Molecules. J. Comput. Aided Mol. Des. 10 (1996) 255-262
15. van der Spoel, D., Lindahl, E., Hess, B., van Buuren, A. R., Apol, E., Meulenhoff, P. J., Tieleman, D. P., Sijbers, A. L. T. M., Feenstra, K. A., van Drunen, R., Berendsen, H. J. C.: Gromacs User Manual Version 3.2, www.gromacs.org (2004)
16. Cirne, W., Brasileiro, F., Costa, L.: Scheduling in Bag-of-Task Grids: The PAUÁ Case. In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing (2004)
17. Andrade, N., Costa, L., Germoglio, G., Cirne, W.: Peer-to-peer Grid Computing with the OurGrid Community. Proceedings of the SBRC (2005)
18. Andrade, N., Brasileiro, F., Cirne, W., Mowbray, M.: Discouraging Free Riding in a Peer-to-Peer CPU-sharing Grid. In Proc. 13th IEEE Symposium on High Performance Distributed Computing (HPDC’04) (2004)
19. Gomes, D. E. B., Rössle, S. C. S., Bisch, P. M., Pascutti, P. G.: Molecular Modeling and Dynamics of Falcipain-2 Protease Complexes, a Contribution for Drug Development against Malaria. Submitted to Biophysical Chemistry (2005)
20. Rojnuckarin, A., Livesay, D. R., Subramaniam, S.: Biomolecular Reaction Simulation Using Weighted Ensemble Brownian Dynamics and the University of Houston Brownian Dynamics Program. Biophysical Journal 79 (2000) 686-693
21. Keahey, K., Papka, M. E., Peng, Q., Schissel, D., Abla, G., Araki, T., Burruss, J., Feibush, E., Lane, P., Klasky, S., Leggett, T., McCune, D., Randerson, L.: Grids for Experimental Science: The Virtual Control Room. In Proceedings of the Challenges of Large Applications in Distributed Environments (CLADE), Honolulu, HI (June 2004)