SDSC TR-2006-7
Grid Workflow Challenges in Computational Chemistry
Karan Bhatia∗, Celine Amoreira†, Stephen Mock∗, Sriram Krishnan∗, Jerry Greenberg∗, Brent Stearn∗ and Kim Baldridge∗†
∗ San Diego Supercomputer Center, UC San Diego, MC 0505, 9500 Gilman Dr, La Jolla, CA 92093
† University of Zurich
{karan,mock,sriram,jpg,flujul}@sdsc.edu, {kimb,amoreira}@oci.unizh.ch
Abstract: Traditionally, access to high-end clusters and the development of computational codes optimized for the cluster environment have been the limiting factors for high-performance computing in many scientific domains. However, over the past few years, the commoditization of cluster hardware, the development of cluster management software (e.g., ROCKS, OSCAR), the standardization of application software and its high-performance capabilities (MPI, Condor), and the development of service-oriented infrastructures have made access to large-scale computing commonplace in many disciplines. We argue that the challenge in today’s environment is the integration of capabilities across multiple independent applications, including access to public datasets, and the creation and publication of explicit workflows. In this paper, we describe this specific challenge as it relates to supporting novel science being conducted in the domain of Computational Chemistry.
I. INTRODUCTION

As Grid Computing technology has matured and its adoption in scientific disciplines has advanced, the technical challenges have evolved from an initial focus on algorithm design for parallel and clustered systems to more social challenges such as data publication and search, metadata specification and standardization, and tool integration. The technology trends that have enabled this change include: wide adoption of easy-to-use and easy-to-manage commodity clusters (Rocks [27], OSCAR [4]); wide availability of HPC libraries (MPI) and job scheduling middleware (Globus [24], Sun Grid Engine [11], Portable Batch System [22]); the development of easily deployed grid security components (GAMA [17], PURSE [6]); and the application of service-oriented architectures (SOAs) and their constituent technologies: WSDL for service description, SOAP for communication, and portals for end-user environments (GridSphere [8], OGCE [3], Jetspeed [1], GEMSTONE [7]). Taken together, these technologies allow grid computing infrastructure for any domain to be built rapidly and cost-effectively.

Computational Science, in the meantime, is also evolving from isolated city-states [29] of research towards a multidisciplinary, integrated view that spans different scientific scales and combines multiple toolsets. In [20], Foster describes an evolution towards a service-oriented science: “scientific research enabled by distributed networks of interoperating services”. For example, large-scale biological endeavors, such as
the NIH-funded National Biomedical Computation Resource (NBCR) [9], describe this as multi-scale modeling and list it among their core challenges.

With the current state of technology, a user who wishes to leverage multiple applications in a particular scientific effort must spend significant effort training for each application, and must develop his or her own mechanisms for interoperability across the applications. As the number of applications increases, this effort can easily dominate the day-to-day activities of the user. Some of the training effort is semantic in nature; that is, the proper use of scientific applications requires a good understanding of the underlying science associated with the applications. However, many of the bottlenecks, we argue, are arbitrary in nature, relating to command-line variances, input file formats, data formats, poor or missing documentation, or simply historical accident. These are often not science-based, but related to the particular implementation or runtime environment of the applications. We term these syntactic bottlenecks and focus our infrastructure efforts here.

Service-oriented architectures (SOAs), workflow systems and component architectures are all technologies being developed to support, in part, the integration of separate applications. However, important differences exist. SOAs provide programmatic access to applications and can work in concert with workflow systems to provide a strongly typed dataflow environment. Many workflow systems support SOAs, including Kepler [12] and Taverna [19], while others support grid applications running within Globus or Condor [13], [18]. Component architectures such as the Common Component Architecture [14] integrate applications much more tightly, but require significant code changes and integration effort. Our focus in this paper is on understanding the requirements of a workflow system that leverages an underlying SOA for applications.
The SOA is being built through several combined efforts: the scientific work of the UniZH group, the NSF-NMI team effort, and the National Biomedical Computation Resource (NBCR) infrastructure effort. The first part of this paper focuses on the particular workflow challenges in the domain of Computational Chemistry, illustrating as a case study the tools and procedures needed to leverage community datasets, utility programs and both quantum mechanical and
classical application codes in a scientifically interesting way. The scientific results of such a coupling of codes are beyond the scope of this paper and will be published separately. The second part of the paper discusses the infrastructure requirements resulting from the case study, in particular the technical challenges of building a workflow system. Finally, in Section IV, we conclude with a summary of some lessons learned.

II. CASE STUDY: WORKFLOWS IN COMPUTATIONAL CHEMISTRY

Currently, there are significant research efforts underway in the area of protein-ligand interactions, including the development of many different techniques and associated software tools. The primary objective is the prediction of the optimal orientation and conformation of a small molecule (e.g., ligand, drug prototype, etc.) in a pocket embedded in a protein. Such problems can be approached using a variety of different methods, depending on the specific questions being asked. The ab initio quantum mechanical (QM) method, based on algorithms that solve the Schrödinger equation, is the holy grail for highly accurate results for molecular systems. Unfortunately, most problems in ligand-protein interaction studies are too large to be treated fully by QM methods alone. Therefore, approximations are incorporated that consider less electronic structure detail, often emphasizing molecular energy as a function of the nuclear positions only.

Classical, molecular mechanics-based (MM) methods use less accurate strategies that neglect the electronic motion, but require only modest compute time. The MM-based methods are fully empirical in nature, and are specifically tuned for classes of molecules such as proteins, enzymes, or other macromolecular systems. Molecules are considered as collections of masses held together by classical forces. The equations based on classical mechanics and the parameters defining the energy surface of a molecule are collectively referred to as the force field.
The contributions to the molecular energy include bond stretching, angle bending, dihedral deformations, van der Waals interactions, electrostatic interactions, etc. An important issue for force field applications is the transferability of the model to a wider range of macromolecules. The MM method is largely applied for basic understanding of low-energy conformations on the Potential Energy Surface (PES) through molecular structure refinement, molecular dynamics (MD) simulation, Monte Carlo simulation and/or ligand docking simulation.

Molecular Dynamics, which is also based on the empirical force field and the basic assumptions described for MM, aims to reproduce the time-dependent motion of a molecule by solving Newton’s second law for all the atomic degrees of freedom. New positions and velocities of the atoms are predicted and saved as a trajectory file. Performing this process for a range of time steps, with additional controls on temperature, describes the behavior of the molecular motion as it traverses the PES. In contrast to QM and MM, both of which attempt to reach the minimum in the energy function, molecular
dynamics is able to overcome conformational barriers with the input of temperature. The ability of Molecular Dynamics to search conformational space enables the placement of families of structures on the PES, which can be further minimized to predict local stationary points.

Molecular docking tools attempt to estimate the optimal three-dimensional configurations between a protein and a ligand using a variety of classical means, most typically by invoking an empirical force field function. Typically, docking searches involve a large number of ligands and associated conformations, together with a cost function, called a scoring function, which helps determine the degree of compatibility between the ligand and the protein to form a reaction complex.

A. QM/APBS Strategy

Rational drug design relies on computational models that determine whether and how small drug ligands interact with large molecules like proteins. This is done by calculating the ligand-protein configuration that minimizes the binding energy. The methods are often complex because the code performs both energy calculations and nonlinear optimization. In this work, we have proposed an alternative framework, enabling user-specific exploration using a QM/APBS/MD hybrid method. The ligand is treated using QM methods, and the full complex is treated using classical electrostatics (APBS) and empirical force field (MD) methodology. The first phase of development involves only QM/APBS, with the ultimate goal being to fold the classical electrostatics directly into the QM procedure and to combine it with MD methods.
Computationally, the hybrid methodology involves many individual steps. The electrostatic effects are obtained by numerically solving the Poisson-Boltzmann equation (APBS [23]). The atomic charges and radii for the ligand are determined using QM, employing the software GAMESS [28]; QM provides more accurate structural and energetic information for the ligand, which are important factors in determining which structure is preferred in the pocket of the protein or enzyme. The atomic charges and radii are needed for the ligand/protein complex electrostatics computation.

The overall goal of the hybrid methodology is to understand the binding energy and mechanism associated with the complexation of a ligand and its associated protein, and how this interaction varies with position, substitution (residue mutation), and environmental conditions. In particular, it is desirable to define the large set of parameters that determine the binding energy, where the parameters relate to ligand positioning and environmental conditions, with the added flexibility to run investigations over a wide set of possible mutations in structure. In addition, we are interested in determining more accurate energy functionals for predicting the total binding energy of a protein-ligand complex. Automation of the techniques enables all of these possibilities (Figure 1).

Scientifically, the task of minimizing the binding energy as a function of ligand position in the protein pocket is a search in a 6D parameter space. There are multiple criteria in the
Fig. 1. Various steps involved in the hybrid method for study of ligand-protein interactions.
determination of optimal fit, one being that the ligand fits in the pocket without steric interaction with other atoms. The evolution of the theoretical methods and pipeline development is being guided and refined based on a real research ligand-protein system under investigation with an experimental laboratory. Results so far indicate that the method is a viable alternative to the all-classical methods currently available, in the sense of providing an accurate assessment of ligand conformation in a protein pocket, with overall flexibility in the choice of optimization procedure to find the optimal position and binding energy. We hope to refine our analysis based on significantly more testing.

B. Practical Considerations and Associated Tools

Methods used in the initial workflow involve optimizing the orientation of the ligand to match the cavity of the binding site of a protein. The cavity inside the protein, in our case the active site of the enzyme, is predefined by the orientation of an original cofactor, limiting the extent of translational and rotational degrees of freedom that need to be considered. In our research problem, the different conformers of the ligand are generated from the flexibility applied to specifically defined torsional angle motions. The charge distribution is computed for each conformer generated along this restricted flexibility metric. The quantum chemical software GAMESS is employed to determine optimal ligand hydrogen atom positions, atomic charges, and atomic radii, all required for the subsequent ligand/protein interaction simulations involving APBS. For each orientation of the ligand, the respective protein-ligand complex is created, with the ligand aligned as in the original
cofactor. The binding energy is estimated by solving the Poisson-Boltzmann equation, and further refined by the addition of non-polar terms that are estimated as a function of the Solvent-Accessible Surface Area weighted by a term associated with the surface tension.

1) Initial Structure Generation

Initial protein structures are obtained from the Protein Data Bank [16], which is the largest repository of experimentally determined 3-D structures of biological macromolecules. The structures of small molecules can be obtained from the LigandDepot [2] database. The completeness, format, and general accuracy of these files need to be assured before use in subsequent computational modeling studies. In particular, PDB structure files include networks of water molecules. Because our computations use a dielectric continuum solvent model rather than explicit solvent, the water network attached to the protein must be deleted before the investigation, using tools such as WHATIF. In the PDB file, the crystal structures typically do not contain hydrogen atoms, since crystals do not diffract well enough to enable hydrogen atoms to be resolved. Therefore, it is necessary to have a tool to add hydrogen atoms appropriately to all structures extracted from these databases.

2) Input Type Conversion

OpenBabel [10] is a program to manipulate and convert a wide variety of data types to other formats used in
molecular modeling and computational chemistry. In addition, the OpenBabel tool has the ability to add hydrogen atoms. A checkpoint supported by a molecular graphics tool is required at this step to verify that the protonation state the tool assigned to the ligand is the one required by the user.

3) Protein Structure Preparation

PDB2PQR [5] is a tool that prepares the protein structure for subsequent computation. This tool converts the PDB file into the PQR format, which is required for the continuum electrostatics computation using the APBS software. During the conversion process, the software also adds the missing hydrogen atoms to the protein structure. By using different options available in the tool, one can also “optimize” the protein structure to obtain the most favorable hydrogen bonding network. The PQR file format resembles the PDB format except that the occupancy and temperature parameters are replaced by an atomic charge and radius for each atom. The atomic charges and radii are primarily assigned from experimental data. The data are based on the structures of the regular amino acids, and typically do not account for non-natural residue types. There are many charge schemes one might choose for these assignments. In this case, we have chosen the PARSE charges for the assignment.

4) Ligand Conformation Preparation

Investigations involving the interaction of a ligand in a protein active site require identification of ligand motion paths. In the most exhaustive case, this would be full random 3D motion within the pocket. However, it may be the case that the researcher can identify the trajectory that they wish to study, and can thereby reduce the complexity of the full search space. In addition, it is necessary to have some ability to score the likelihood of a particular conformation in terms of steric and electrostatic fit criteria.
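To illustrate how restricting the ligand motion to a few torsional degrees of freedom shrinks the search space, the following minimal sketch enumerates a regular grid of rotation angles. The function name, the choice of two scanned dihedral angles, and the 30-degree step are assumptions made for illustration; they do not describe the interface of any tool in this workflow.

```python
from itertools import product

def torsion_grid(n_steps, step_deg, n_dihedrals=2):
    """Enumerate rotation-angle tuples (in degrees) for the scanned dihedrals.

    Hypothetical helper: each scanned dihedral takes n_steps values spaced
    step_deg apart, and every combination becomes one ligand conformation.
    """
    angles = [i * step_deg for i in range(n_steps)]
    return list(product(angles, repeat=n_dihedrals))

# Two dihedrals scanned in 30-degree increments over a full turn.
grid = torsion_grid(n_steps=12, step_deg=30.0)
print(len(grid))   # number of conformations to prepare
print(grid[:3])    # first few (angle_1, angle_2) pairs
```

The number of conformations grows as n_steps raised to the number of scanned dihedrals, which is why fixing all but a few torsions keeps the downstream quantum mechanical and electrostatics runs tractable.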
In the present workflow investigation, the orientation space of the ligand inside the protein is simplified based on what is known experimentally, since the ligand motion is limited to rotation around two specific dihedral angles. For this purpose, we have created a general set of tools that enables the automated, on-the-fly identification of the torsional angle motions to be considered when setting up the possible ligand conformations to investigate. The LigPrep service enables one to choose any dihedral angle from a view of the ligand structure. The user defines the number of steps and the degree of rotation, and the tool creates all ligand conformations that will need to be considered. These conformations are additionally placed within the protein pocket to create the complex structure.

5) QM Method for Calculating Atomic Charges

A second step in ligand preparation (Step 5 in Figure 1) involves carrying out a higher level of calculation to obtain more accurate structure and properties, thereby enabling more accurate ligand/protein complex interaction results. In fact, one would like to carry out such computations on both the ligand and the protein; however, such computations can be prohibitively expensive for full protein systems. As a compromise, most of the higher-level treatment is concentrated at the site of the ligand docking, including the ligand itself and the region directly around the ligand. We use the computational chemistry software GAMESS for this step, which performs high-level quantum chemistry structure and property computations for general molecules. For each of the ligand conformations created in Step 4 of Figure 1, a GAMESS computation provides a structure with optimal hydrogen positions and a much better representation of the charge and radii input, both necessary for the subsequent steps involving ligand-protein complex electrostatics computations. Again, there are many options for predicting atomic charges. In our initial work, we invoke the CHELP charge scheme. However, with the ability to create a workflow for the steps involved in docking studies, we can easily envision investigating the effects of different charge schemes.

6) Ligand-Protein Electrostatics Computation

The Adaptive Poisson-Boltzmann Solver, APBS, is software for the evaluation of the electrostatic properties of nanoscale biomolecular systems. This software is invoked to describe the electrostatic interactions between the ligand and the protein. The software enables consideration of the complex in a salty, aqueous environment. Such computations are important for understanding many aspects of biomolecular interactions, including ligand-protein and protein-protein binding kinetics, implicit solvent molecular dynamics, and solvation and binding energies for the determination of equilibrium binding constants as an aid in drug design. The APBS software requires an input file that has all the parameters necessary to solve the Poisson-Boltzmann equation.
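The binding-energy estimate used in this workflow, a Poisson-Boltzmann electrostatic term refined by a non-polar term proportional to solvent-accessible surface area, can be written schematically as follows. The decomposition below is a commonly used form given here as an assumption; the symbols and the exact functional are not taken from this work.

```latex
\Delta G_{\mathrm{bind}} \;\approx\; \Delta G_{\mathrm{elec}} \;+\; \gamma \,\Delta \mathrm{SASA},
\qquad
\Delta G_{\mathrm{elec}} \;=\; G_{\mathrm{elec}}^{\mathrm{complex}}
  \;-\; G_{\mathrm{elec}}^{\mathrm{protein}}
  \;-\; G_{\mathrm{elec}}^{\mathrm{ligand}}
```

Here each electrostatic term is assumed to come from a separate Poisson-Boltzmann solution, \(\Delta \mathrm{SASA}\) is the change in solvent-accessible surface area upon binding, and \(\gamma\) is the weight associated with the surface tension.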
Our future work will involve investigations to increase the accuracy of these estimates using more integrated classical/quantum mechanical strategies and improved methods for estimating electrostatic and non-electrostatic components. Application of the above steps in a high-throughput screening manner will enable one to more easily refine the energy function, which can then be used as a scoring function to find the best conformation of a ligand in a binding site of a protein for a general system.

III. WORKFLOW CHALLENGES

The workflow described in the previous section would require the scientific user to expend significant time and effort managing the syntactic interoperation between the different applications, including file conversions, explicit data management, and execution management, all of which result in a workflow that is tedious, error-prone, and difficult to repeat. Infrastructure tools can and should be developed to enable the end-user to focus on the semantic aspects of the workflow
and automatically manage the execution of the workflow. Additionally, such tools should be general-purpose in order to support other variants of this workflow for other users. In this section, we describe the specific challenges in building general-purpose infrastructure tools to support workflows such as the one described in the previous section.

A. Data Representation

One of the main challenges within the biomedical application space is a lack of community standards for data representation. The data common to a variety of applications in this case is the definition of a molecule and its structure. Data is represented in a number of semantically similar but syntactically different ways. The protein coordinate file traditionally uses a specialized PDB file format, which is more recently moving to PDBML (an XML-based representation of the same information [31]). APBS uses a PQR format, and GAMESS can use either a Cartesian-based or an internal coordinate-based file. While the Protein Data Bank itself stores the molecule without hydrogen atoms, both the APBS and GAMESS software require the hydrogens to be added. OpenBabel, a common data format interchange application, does not support PQR files, but can be used to add hydrogen atoms. PDB2PQR, a separate application that is provided with APBS, also has the option to add missing hydrogen atoms to a PDB file while simultaneously converting the data file to the PQR format.

The PDB format consists of a sequence of atom descriptions, each on a separate line and representing the atom type, residue name, and the 3-dimensional coordinates. The values are separated by spaces, but parsing can be fragile; for example, in some molecules the RType is CA (alpha carbon), which can easily be confused with a calcium atom. Furthermore, not all PDB files are properly entered into the PDB repository. Many proteins have non-integer residue numbers (e.g., 2DBL.pdb) that can break many tools.
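The fragility described above can be made concrete with a short sketch. The PDB format specification assigns each field of an ATOM record to fixed columns, so a reader that slices by column can tell an alpha carbon from a calcium ion through the element field and the alignment of the atom name, and can separate a residue number from a trailing insertion code. The `Atom` class, the helper names, and the sample coordinates below are hypothetical, and reading "non-integer residue numbers" as insertion codes is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    name: str      # atom name with padding stripped, e.g. "CA"
    res_name: str
    res_seq: int
    icode: str     # insertion code ("" if absent)
    x: float
    y: float
    z: float
    element: str   # element symbol, e.g. "C" or "CA"

def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM record using fixed column positions."""
    raw_name = line[12:16]                 # columns 13-16
    element = line[76:78].strip().upper()  # columns 77-78, may be absent
    if not element:
        # Heuristic fallback (an assumption, not part of the format):
        # one-letter elements leave column 13 blank, two-letter ones do not.
        element = raw_name[1] if raw_name[0] == " " else raw_name[:2].strip()
    return Atom(
        name=raw_name.strip(),
        res_name=line[17:20].strip(),
        res_seq=int(line[22:26]),          # columns 23-26
        icode=line[26:27].strip(),         # column 27
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=element,
    )

def format_atom_line(serial, name4, res_name, chain, res_seq, icode, x, y, z, element):
    """Build a fixed-width ATOM record; name4 must already be 4 characters."""
    return (f"{'ATOM':<6}{serial:>5} {name4} {res_name:>3} {chain}{res_seq:>4}{icode:1}   "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}")

# Made-up records: an alpha carbon in residue 52A and a calcium ion.
alpha_carbon = format_atom_line(2, " CA ", "ALA", "A", 52, "A", 11.104, 6.134, -6.504, "C")
calcium = format_atom_line(900, "CA  ", "CA", "A", 301, "", 1.0, 2.0, 3.0, "CA")

for record in (alpha_carbon, calcium):
    atom = parse_atom_line(record)
    print(atom.name, atom.element, atom.res_seq, repr(atom.icode))
```

Both records carry the atom name CA once padding is stripped, which is exactly the ambiguity a whitespace-splitting reader runs into; such a reader would also see the token `52A` where it expects an integer residue number.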
The PQR format for APBS is similarly formatted, with each line describing the characteristics of one atom in the structure. A significant difference is the addition of a charge and a radius attribute for each atom. Calculation of the atomic charges is nontrivial and can be done using a number of methods, some of which are computationally intensive. The molecule data is not the only data format that can be standardized across applications; others include molecular orbital coefficients, atomic energies, and basis sets.

Discussion: There are two main strategies for dealing with the data representation problem at the infrastructure level: support the development of a community standard across the breadth of applications, or develop translators across the different formats. The Chemical Markup Language (CML) [26] takes the first approach and tries to define a common molecule format to be used by all applications. CML, however, is quite broad in scope and doesn’t provide enough details for this particular workflow. In general, standards are difficult in this domain because the applications typically require and
understand only a subset of all the information present within a common global format.

A second approach is to develop translators that map from one format to another (also called shims by workflow projects like Taverna). In the worst case, this would require n² translators, where n is the total number of data formats. Each additional data format would require 2n more translators: n from the new format to the existing formats, and n in the reverse direction. In theory, this approach scales poorly. However, in practice, the value of n is somewhat manageable (on the order of tens), and translators between every permutation are not usually needed. Furthermore, if all the data formats being used are represented as XML, third-party techniques such as XSLT can be used to perform the translations. If the data is not represented in XML, metadata languages such as the Data Format Description Language (DFDL) [15] can be used to describe the data, and translators can operate on the abstract data descriptions these languages provide.

Semantic metadata techniques can aid in the automatic translation of data between formats. Semantic mediators use ontologies created for the different data formats to automatically generate translators between data formats as needed. This approach is applicable to semantically similar, yet structurally different, data. As we mentioned earlier, different data formats typically add different application-specific fields. This implies that semantic mediation techniques may not by themselves be sufficient to perform all the types of translation that we need; custom translations would still have to be written for the fields that must be generated using computational techniques. In practice, for our work, we use a combination of the various techniques.
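The counting argument for translators can be made explicit with a short worked calculation. The comparison with a single intermediate ("hub") format is added here as an illustration of why a shared representation is attractive; it is an assumption about one possible design, not a description of any particular tool.

```latex
T_{\mathrm{pairwise}}(n) = n(n-1), \qquad
T_{\mathrm{pairwise}}(n+1) - T_{\mathrm{pairwise}}(n) = (n+1)n - n(n-1) = 2n, \qquad
T_{\mathrm{hub}}(n) = 2n
```

For example, with n = 10 formats, direct translation between every ordered pair needs 90 translators and an eleventh format adds 20 more, whereas routing every conversion through one intermediate format needs 20 translators in total (one reader and one writer per format).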
Working with the NBCR community, we have defined our own XML-based schema for representing molecule data that works for the applications in this workflow and will likely work with some other applications that we are interested in, such as Amber. The format represents atomic coordinates but uses optional XML parameters for specifying application-specific values. We have modified the GAMESS application to (optionally) use this common molecule format directly, and have written wrappers for the other applications as needed. We are investigating mediation techniques for additional data that can be represented.

B. Application Parameter Formats

Related to the data representation issue is the issue of specifying control parameters for the applications. APBS, GAMESS, AMBER and virtually all computational chemistry codes use custom input formats that are difficult to parse and generate. GAMESS, for example, has more than 110 “control” groups, each of which has several parameters, and some of which have more than 30 parameters. In addition, the parameters are not independent; some parameters only make sense in the presence of other parameters. Text files representing these inputs are not able to specify these dependencies, leading to a steep learning curve and a higher frequency of errors.
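The kind of dependency that a flat text input cannot express can be sketched as a small rule table with a checker. This is a minimal illustration in Python rather than XML Schema; the rule, and the group and keyword names in it, are loosely modeled on GAMESS-style input groups but are assumptions that have not been checked against the GAMESS manual.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (group, keyword)

# Hypothetical rule table: the keyword on the left is only meaningful when
# the keyword on the right holds one of the listed values.
DEPENDS_ON: Dict[Key, Tuple[Key, Set[str]]] = {
    ("STATPT", "NSTEP"): (("CONTRL", "RUNTYP"), {"OPTIMIZE", "SADPOINT"}),
}

def check_dependencies(params: Dict[Key, str]) -> List[str]:
    """Return one message per parameter whose prerequisite is not satisfied."""
    errors = []
    for key, (required_key, allowed) in DEPENDS_ON.items():
        if key not in params:
            continue  # the dependent parameter is not set, nothing to check
        actual = params.get(required_key)
        if actual not in allowed:
            errors.append(
                f"{key[0]}.{key[1]} requires {required_key[0]}.{required_key[1]} "
                f"in {sorted(allowed)}, got {actual!r}"
            )
    return errors

bad = {("CONTRL", "RUNTYP"): "ENERGY", ("STATPT", "NSTEP"): "50"}
good = {("CONTRL", "RUNTYP"): "OPTIMIZE", ("STATPT", "NSTEP"): "50"}
print(check_dependencies(bad))
print(check_dependencies(good))
```

A machine-readable rule table of this kind is what allows an input to be rejected, or a user interface field to be disabled, before a job is submitted.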
Discussion: Since the parameters for each application directly relate to the scientific needs and capabilities of the application, it can be argued that the complexity of the parameters is necessary semantic information for which generic infrastructure tools can provide little help. Part of our work has been to modify the GAMESS application to natively support XML-based input structures and to describe the inputs and their various dependencies using XML Schema. Recently the schema has been revised to accommodate data types that are common across various computational chemistry codes and has been separated into different schemas for input and output. Both the input and output schemas contain elements for coordinates, molecular orbital coefficients and Hessian matrices. In both schemas, the elements defining the Cartesian coordinates are in the molecular data type format, which is also shared by the APBS and AMBER schemas. In the future, we plan to develop strong data types for other common data as well.

In addition to supporting workflows across applications, the XML specification of the application parameters can also be used to automatically generate user interfaces that respect the dependencies. Many efforts like AUIML [25] have previously explored interface creation by defining a technology-neutral XML specification of buttons, menus, and so on, that can be rendered differently based on the platform and application. With XML Schema, the developer can leverage the data’s internal structure to automate the creation of appropriate interfaces in the language or framework needed by the developer. Using the GAMESS XML input schemas, we have developed such auto-generation capabilities for parameter files in this pipeline.

C. Support for Computationally Intensive Applications

Computational hardware clusters are a popular and cost-effective means of speeding up many scientific applications, and in the workflow described here, both APBS and GAMESS have native support for the use of MPI for parallel execution, which can dramatically speed up a calculation. In addition, both software codes can be launched through high-performance job management systems, such as Sun Grid Engine [11] or Portable Batch System [22], thereby supporting a large number of jobs for high-throughput computing. However, for the user, the use of clusters in many ways complicates the workflow: the user must manage login credentials (which can be cumbersome for users with access to many clusters), provide explicit data management between clusters or to and from the desktop, and learn the intricacies of the particular execution management system deployed.

Discussion: Typically, users are given command-line access to clusters and are required to provide their own data management and execution management using the set of tools provided on that cluster. We argue that what is needed is a service-oriented architecture (SOA) on top of the physical cluster that supports application-appropriate interfaces. For example, instead of command-line access, the NBCR SOA provides users with a programmatic interface to the core
calculateBindingEnergy()
calculateBindingEnergyBlocking()
calculateSolvationEnergy()
calculateSolvationEnergyBlocking()
calculateElectrostaticPotential()
calculateElectrostaticPotentialBlocking()
queryStatus()
getOutputs()
destroy()
Fig. 2. APBS Web Service interface
functionality of APBS (see Figure 2), including calculating binding energy, solvation energy and electrostatic potential. This interface is not meant to be used directly by the end-user, but can easily be used as a remote procedure call in end-user tools and from high-level scripting languages. Client applications simply execute the above remote procedure calls (both blocking and non-blocking versions are provided) and await the output. Security is integrated via the various tools and is used for authorization by the service implementation. If the user is authorized, the job parameters (not shown) are used to construct a valid APBS job that is then submitted to the job management system for execution on the cluster or grid.

D. Application Composability

Composing applications together into a workflow as in Figure 1 requires extensive information about each application, including the data formats, the parameter files, and the specifics of managing the execution of each application. Even with the infrastructure tools described above, such as application-level web services and XML-based data and parameter representation, composing these applications into a workflow is difficult because they were not originally designed to fit into a pipeline; the inputs and outputs of the various applications do not fit together neatly. Hence the workflow description language needs to support user-defined combination of inputs from the outputs of the previous stage, because the output type of the current service may not match exactly the input type of the next service. For example, the GAMESS service input requires a GAMESSIN datatype that is composed of a number of parameters, one of which is the molecule produced by the LigPrep service. Multiple data sources may need to be combined to form the input of a task. In addition, programmatic conversions may need to be applied to data to get them into the correct format for the following task.
For example, the charge calculations that GAMESS produces are typically cut by hand out of the GAMESS.out output file and pasted into the PQR file for APBS execution. In addition, the workflow language needs to support some programmatic constructs such as foreach. For this workflow, the LigPrep service generates a series of data files that are used as separate inputs to the next task of the workflow, the GAMESS service. The LigPrep output consists of a set of N rotational conformation molecules, STEP1.xml through STEPN.xml. GAMESS is run once for each of the molecule files generated by LigPrep, after the input parameters for GAMESS have been combined with the molecule derived from the LigPrep service output file.

Fig. 3. Combination of multiple outputs into inputs for the next stage in the workflow. [Diagram labels: User Input; Ligprep Service; Molecule (XML); GAMESS Input Parameters (XML); GAMESS Input XML Document.]

Discussion: The key here is the workflow specification language and its expressiveness. The language needs enough power to allow the user (or a visual tool working on the user's behalf) to express complex pipelines and non-trivial recombination between the steps of the workflow. Since we are firmly in the XML toolspace, we have been experimenting with XSLT [30] and XPATH [30]: we have used XSLT to specify transformations among datatypes, including the transformations needed in this workflow. XPATH enables nested parts of a data document to be extracted and inserted into another data document at a specified point in the data. However, XPATH and XSLT are complex and would require sophisticated visual tools to allow the user to compose workflows.
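The recombination step and the foreach construct discussed above can be illustrated with a short sketch. This is not the XSLT used in the workflow itself: it substitutes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath, and every element name in it (gamessInput, moleculeSlot, ligprepOutput, and so on) is a hypothetical stand-in, since the actual GAMESSIN and LigPrep schemas are not reproduced here.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical GAMESS parameter document with a placeholder marking
# where the molecule must be inserted. Not the real GAMESSIN schema.
GAMESS_PARAMS = """<gamessInput>
  <control runType="energy"/>
  <moleculeSlot/>
</gamessInput>"""


def ligprep_output(step):
    """Stand-in for the LigPrep service output file STEP<step>.xml."""
    return (
        f'<ligprepOutput><result><molecule id="conf{step}">'
        f'<atom element="C"/></molecule></result></ligprepOutput>'
    )


def combine(params_xml, ligprep_xml):
    """Build one GAMESS input document from parameters plus one molecule."""
    params = ET.fromstring(params_xml)
    source = ET.fromstring(ligprep_xml)

    # Extract the nested molecule with a path expression.
    molecule = source.find("./result/molecule")
    if molecule is None:
        raise ValueError("LigPrep output contains no molecule element")

    # Locate the insertion point in the parameter document.
    slot = params.find("./moleculeSlot")
    if slot is None:
        raise ValueError("parameter document has no moleculeSlot placeholder")

    # Swap the placeholder for a copy of the molecule, keeping its position.
    index = list(params).index(slot)
    params.remove(slot)
    params.insert(index, copy.deepcopy(molecule))
    return params


def foreach_conformation(params_xml, ligprep_docs):
    """The 'foreach' construct: one GAMESS input per LigPrep conformation."""
    return [combine(params_xml, doc) for doc in ligprep_docs]


inputs = foreach_conformation(
    GAMESS_PARAMS, [ligprep_output(step) for step in (1, 2, 3)]
)
for doc in inputs:
    print(ET.tostring(doc, encoding="unicode"))
```

Even in this small form, the sketch shows why a visual tool would be needed for end users: the person composing the workflow must know both the path of the molecule inside the producer's output and the insertion point inside the consumer's input.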
E. Repeatability and Publishability

Ideally, a workflow that has been composed and tested by an expert in a scientific domain could be made available to others who wish to perform the same set of tasks, saving new researchers the effort of creating and debugging the workflow themselves and allowing them to benefit from the expertise of the original domain scientist. For example, once the QM/APBS workflow is properly set up, a scientist should be able to simply select a new molecule and ligand of interest, specify a few task options, such as the number of conformations to calculate with LigPrep, and then launch the calculation. The goal is to compose a persistent workflow service that inexperienced end users can use with little instruction, and that can be reused by any number of individuals for their own sets of molecules and ligands.

Publishing a workflow such that it can be easily reused presents a unique set of challenges. When publishing a workflow, the workflow description must also specify which inputs are bound, or generated by other parts of the workflow, and which are unbound and must be provided at runtime. Making this determination requires a generalization step that can be quite complex. We are looking at various methods for this generalization, including simple heuristics and machine learning techniques.

Discussion: In order to publish the workflow, the workflow description must be sent to some factory that can publish and instantiate the workflow as needed. In a SOA, it makes sense to publish the workflow as a service itself. In this case, the challenge becomes the conversion of the workflow description into a published service. Current web service technology for dynamically creating and publishing new services on the fly is still immature, but looks promising [21].

IV. CONCLUSIONS

The contributions of this paper are twofold. First, we describe in detail a particular case study of a workflow across multiple applications, public databases, and utility programs in the domain of Computational Chemistry. Second, we show how this workflow can benefit from infrastructure tools and identify the specific technical challenges in building such tools. Specific workflow infrastructure challenges include
• multiple data representation formats for semantically similar data,
• highly variable application parameter specification with arbitrary dependencies or constraints,
• integrating remote services for computation,
• developing a workflow description language that is powerful enough to manipulate inputs and outputs,
• generalizing and publishing workflows so they can be utilized by other users.

Based on our experiences adding workflow support to the GEMSTONE environment, we argue that infrastructure tools can be developed utilizing many of the XML-based standards: service-oriented architectures for providing access to remote services, XML-Schema for data and parameter representation, and XSLT and XPATH for workflow manipulation.
V. ACKNOWLEDGMENTS

We thank the NSF for supporting the Gemstone project through the Middleware Grant SCI-0438430. We also would like to thank the NIH for supporting NBCR through the National Center for Research Resources program grant P41RR08605 that supports the development of the computational infrastructure.

REFERENCES
[1] Jetspeed: Open Source Enterprise Information Portal. http://portals.apache.org/jetspeed-1/.
[2] The Ligand Depot. http://ligand-depot.rutgers.edu/.
[3] OGCE - Open Grid Computing Environments Collaboratory. http://www.ogce.org/.
[4] OSCAR: Open Source Cluster Application Resources. http://oscar.openclustergroup.org/.
[5] PDB2PQR: An automated pipeline for the setup, execution, and analysis of Poisson-Boltzmann electrostatics calculations. http://pdb2pqr.sourceforge.net/.
[6] PURSe: Portal-Based User Registration Service. http://www.gridscenter.org/solutions/purse/.
[7] The Gemstone Project. http://grid-devel.sdsc.edu/gemstone/.
[8] The Gridsphere Portal Framework. http://www.gridsphere.org.
[9] The National Biomedical Computation Resource (NBCR). http://nbcr.sdsc.edu.
[10] The Open Babel Project. http://openbabel.sourceforge.net.
[11] The Sun Grid Engine (SGE). http://www.sun.com/software/gridware/5.3/index.xml.
[12] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludaescher, and S. Mock. Kepler: An Extensible System for Design and Execution of Scientific Workflows. In 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 2004.
[13] K. Amin and G. von Laszewski. GridAnt: A Grid Workflow System. Argonne National Laboratory, Feb 2003.
[14] R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. R. Kohn, L. McInnes, S. R. Parker, and B. A. Smolinski. Toward a common component architecture for high-performance scientific computing. In HPDC, 1999.
[15] M. Beckerle and M. Westhead. DFDL primer. Technical report, Global Grid Forum, 2004.
[16] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235–242, 2000.
[17] K. Bhatia, S. Chandra, and K. Mueller. GAMA: Grid Account Management Architecture. In IEEE International Conference on e-Science and Grid Computing, 2005.
[18] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M. Su, K. Vahi, and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In Across Grids Conference, 2004.
[19] T. Oinn et al. Taverna: A tool for the composition and enactment of bioinformatics workflows. In Bioinformatics Journal, volume 20(17), 2004.
[20] I. Foster. Service-oriented science. Science, 308(5723):814–817, May 2005.
[21] A. Harrison and I. J. Taylor. WSPeer - an interface to web service hosting and invocation. In 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[22] R. L. Henderson. Job scheduling under the Portable Batch System. In IPPS '95: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 279–294, London, UK, 1995. Springer-Verlag.
[23] M. Holst and F. Saied. Numerical solution of the nonlinear Poisson-Boltzmann equation: Developing more robust and efficient methods. In J. Comput. Chem., 16, 1995.
[24] Argonne National Lab. The Globus Toolkit. http://www.globus.org/.
[25] R. Merrick. AUIML: An XML vocabulary for describing user interface. http://xml.coverpages.org/userInterfaceXML.html#auiml, May 2001.
[26] P. Murray-Rust and H. S. Rzepa. Chemical Markup, XML, and the World Wide Web. 4. CML Schema. In J. Chem. Inf. Comput. Sci., volume 43, 2003.
[27] P. Papadopoulos, M. Katz, and G. Bruno. NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters. Concurrency and Computation: Practice and Experience Special Issue, 2001.
[28] M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, and J.A. Montgomery. GAMESS. In J. Comput. Chem., 14, 1993.
[29] L. Stein. Creating a bioinformatics nation. Nature, 417, May 2002.
[30] W3C. XSL Transformations (XSLT) Version 1.0. http://www.w3.org/TR/xslt, Nov 1999.
[31] J. Westbrook, N. Ito, H. Nakamura, K. Henrick, and H. Berman. PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics, 21(7):988–992, 2005.