Applications of Grid Computing in Genetics and Proteomics

Jorge Andrade1, Malin Andersen1,2, Lisa Berglund1, and Jacob Odeberg1,2

1 Department of Biotechnology, Royal Institute of Technology (KTH), AlbaNova University Center, SE-106 91 Stockholm, Sweden
{jorge, jacob, malina}@biotech.kth.se, [email protected]
http://www.biotech.kth.se
2 Department of Medicine, Atherosclerosis Research Unit, King Gustaf V Research Institute, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
Abstract. The potential for Grid technologies in applied bioinformatics is largely unexplored. We have developed a model for solving computationally demanding bioinformatics tasks in distributed Grid environments, designed to ease usability for scientists unfamiliar with Grid computing. With a script-based implementation that uses a strategy of temporary installations of databases and existing executables on remote nodes at submission, we propose a generic solution that does not rely on predefined Grid runtime environments and that can easily be adapted to other bioinformatics tasks suitable for parallelization. This implementation has been successfully applied to whole-proteome sequence similarity analyses and to genome-wide genotype simulations, where computation time was reduced from years to weeks. We conclude that computational Grid technology is a useful resource for solving computationally intensive tasks in genetics and proteomics using existing algorithms.
1 Introduction
Bioinformatics is a relatively new field of biological research involving the integration of computers, software tools, and databases in an effort to address biological questions. Areas include human genome research, simulations of biological and biochemical processes, and proteomics (for example, protein folding simulations). With the increasing amount and complexity of data in genomics and genetics generated by today's high-throughput screening technologies, and with the development of advanced algorithms for mining complex data, computational power now sometimes defines the practical limit. High-performance computing or alternative solutions are required to undertake the intensive data processing and analysis. Grid computing [1] offers a model for solving massive computational problems by subdividing the computation into a set of small jobs, executed in parallel on geographically distributed resources. However, the current job management process in Grid environments is relatively complex and non-automated. Biologists who want to take advantage of

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 791–798, 2007. © Springer-Verlag Berlin Heidelberg 2007
Grid resources face a process of having to manually submit their jobs, periodically check the resource broker for the status of the jobs ("Submitted", "Ready", "Scheduled", "Running", or "Finished"), and finally retrieve the results with a raw file transfer from the remote storage area or remote worker to the local file system of their user interface. Different solutions for increasing the usability, scalability and stability of computational Grids have recently been proposed [2], [3]. The implementation presented here represents a model by which access to and utilization of Grid resources is greatly facilitated, allowing biologists and other non-Grid experts to exploit the power of the Grid without necessarily having knowledge of Grid-related details and procedures. The utility of this implementation is demonstrated by application to two computationally expensive bioinformatics tasks: whole-proteome sequence similarity analysis and genotype simulations for genome-wide linkage analysis.
2 Methods
In order to make interaction with the complex computational environment of Grids more straightforward for biologically oriented scientists, the following tasks were automated:

Proxy setup handles user authentication as a member of a Virtual Organization (VO) and grants the user access to the Grid resources. By default, a proxy is valid for twelve hours; after it expires, the task of creating a new proxy is automatically scheduled on the local Grid client.

Job submission involves the remote distribution of the split input data files or databases, as well as the executable binary files, to the Grid workers. For each Grid job submitted, a job specification is created using the Resource Specification Language (RSL).

Processing. After job submission, a temporary local installation of datasets and executables is performed on the allocated remote nodes. Parallel execution is then started on the remote nodes, and the status of each job is continuously monitored. Job re-submission in case of job failure, or of excessive delay in Grid queue systems, is also handled.

Job collection. When individual Grid jobs are finished, partial results are downloaded from the remote Grid workers to the local computer. This module is also able to handle parallel retrieval of several finished jobs.

Figure 1 shows a graphical description of the Grid framework configuration used for this implementation.
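To make the job submission step concrete, the sketch below (written in Python as an illustration only; the actual implementation described in this paper is in Perl) partitions the input data and builds one minimal job description per fraction. The xRSL-style attribute names follow NorduGrid's dialect of RSL, but the attribute set, file names, and values shown here are hypothetical simplifications:

```python
def partition(records, n_jobs):
    """Round-robin split of the input records into n_jobs roughly
    equal fractions, one fraction per Grid job."""
    fractions = [[] for _ in range(n_jobs)]
    for i, rec in enumerate(records):
        fractions[i % n_jobs].append(rec)
    return fractions

def job_spec(executable, input_file, output_file):
    """Minimal xRSL-style job description for one data fraction
    (illustrative attribute subset, not a complete specification)."""
    return (f'&(executable="{executable}")\n'
            f' (arguments="{input_file}")\n'
            f' (inputFiles=("{input_file}" ""))\n'
            f' (outputFiles=("{output_file}" ""))')
```

Each fraction would then be submitted as an independent Grid job together with its generated specification.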
3 Implementation
A Perl-script-based Grid broker that ensures unique user authentication was implemented, allowing the user to remotely deploy and execute pre-existing algorithms or software across the available Grid resources at submission time. The presented solution is adjusted to the NorduGrid ARC middleware [4], but can easily be adapted to any Globus-based Grid middleware.
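The broker lifecycle described in this section (submit each job, monitor its status, resubmit failures, and collect finished results) can be sketched as follows. This is a Python illustration of the pattern, not the authors' Perl code; the `submit`, `status`, and `fetch` callables stand in for wrappers around the middleware's client commands, and the `max_retries` limit is an assumption added here:

```python
def broker(specs, submit, status, fetch, max_retries=3):
    """Run a set of Grid job specifications to completion (sketch).

    submit(spec) -> job id; status(job_id) -> "RUNNING", "FINISHED",
    or "FAILED"; fetch(job_id) -> downloaded result.  In a real broker
    these would wrap the middleware's command-line client."""
    jobs = {submit(s): s for s in specs}   # job id -> its specification
    retries = {jid: 0 for jid in jobs}
    results = []
    while jobs:
        for jid in list(jobs):
            st = status(jid)
            if st == "FINISHED":
                results.append(fetch(jid))  # download the partial result
                del jobs[jid]
            elif st == "FAILED":
                spec = jobs.pop(jid)
                tries = retries.pop(jid)
                if tries < max_retries:     # resubmit to another worker
                    new_id = submit(spec)
                    jobs[new_id] = spec
                    retries[new_id] = tries + 1
    return results
```

The collected partial results would finally be concatenated into the overall output, as described below.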
Fig. 1. Grid computing framework for application in bioinformatics
This implementation can be adapted to any task suitable for parallelization for which an existing Linux executable exists. The implementation consists of two Perl scripts:

gridjobsetup.pl manages two main tasks. Firstly, the "big" computationally expensive task is partitioned into a user-selected number of smaller, equally sized atomistic jobs, each corresponding to a fraction of the total data. Secondly, for each data fraction, a Grid job specification is created using the Resource Specification Language (RSL).

gridbroker.pl is the Grid broker. Its function is to manage the submission, monitoring and collection of the Grid jobs. Following node allocation and job submission, gridbroker.pl performs temporary installations of the deployed executable on the Grid nodes/remote workers, and parallel execution of the Grid jobs is started. gridbroker.pl constantly monitors the parallel execution of the distributed tasks; in the case of job failure, or if a job or set of jobs is excessively delayed in the work-queue scheduler, gridbroker.pl manages the resubmission of the affected jobs to different available Grid workers. When jobs reach the status "finished", a forked download of the corresponding job results to the user's local file system is performed. The partial Grid job results are finally concatenated to generate the output file.

A fraction of the Perl implementation of the broker is outlined below. The code shows a loop that manages the submission of a user-defined number of Grid jobs; a vector of Grid job identifiers is created
in memory and in an archive. This vector will then be used to manage the monitoring and downloading of the jobs. A log file that registers submission start and finish times is also created.

Fraction of the Algorithm that Manages the Submission of Grid Jobs

Input: XRSL specification(s) of a number of Grid jobs; for each Grid job, a set of specific input parameters.
Action: Submit the given number of Grid jobs.
Output: Vector of job ids and file with timings.

1. Process the XRSL specification
2. Create a time-log-file and register the start of submission
3. Create and open a job-id-file
4. For each job
   (a) Select the cluster(s) to which the job will be submitted
   (b) Submit the job
   (c) Collect the retrieved job-id
   (d) Push the collected job-id onto a vector
   (e) Push the collected job-id into the job-id-file
5. Register the end of submission in the time-log-file
6. Close the time-log-file
7. Close the job-id-file

Fraction of the Algorithm that Manages the Monitoring and Downloading of Finished Grid Jobs

(The following algorithm shows the constant monitoring of job status using the previously created vector of job identifiers; in case of job failure, re-submission of jobs is performed, and jobs that have successfully reached the status "finished" are downloaded.)
Input: job-id vector and job-id-file.
Action: Monitoring and collection of Grid jobs, and resubmission on job failure.
Output: Collection of finished Grid jobs and time-log-file.

1. While number of downloaded jobs