SciDAC 2008. Journal of Physics: Conference Series 125 (2008) 012052. IOP Publishing. doi:10.1088/1742-6596/125/1/012052
Bringing high-performance computing to the biologist's workbench: approaches, applications, and challenges

Christopher S Oehmen and William R Cannon

Pacific Northwest National Laboratory, Richland, WA 99354, USA

E-mail: [email protected]

Abstract. Data-intensive and high-performance computing are poised to significantly impact the future of biological research, which is increasingly driven by the prevalence of high-throughput experimental methodologies for genome sequencing, transcriptomics, proteomics, and other areas. Large centers (such as NIH's National Center for Biotechnology Information, The Institute for Genomic Research, and the DOE's Joint Genome Institute) have made extensive use of multiprocessor architectures to deal with some of the challenges of processing, storing, and curating exponentially growing genomic and proteomic datasets, thus enabling users to rapidly access a growing public data source, as well as use analysis tools transparently on high-performance computing resources. Applying this computational power to single-investigator analysis, however, often relies on users to provide their own computational resources, forcing them to endure the learning curve of porting, building, and running software on multiprocessor architectures. Solving the next generation of large-scale biology challenges using multiprocessor machines, from small clusters to emerging petascale machines, can most practically be realized if this learning curve can be minimized through a combination of workflow management, data management, and resource allocation, as well as intuitive interfaces and compatibility with existing common data formats.
1. Introduction

The role of high-performance computing across the spectrum of science domains is as varied as the domains themselves. In astronomy, high-energy physics, and climate modeling, to name just a few, one can hardly imagine the fields without advanced computing because of the sheer volume and multiscale nature of the challenges driving those fields, and the complexity of the algorithms used to understand at a mechanistic level the relationships between entities. In domains such as chemistry and physics, computational studies, often driven by high-performance computing, are more and more commonly recognized as the third pillar of investigation, alongside the conventional pillars of experimentation and theory. Using high-performance computing to advance the forefront of computational chemistry, for example by investigating larger and more complex systems and interactions using fundamental theory, is therefore extremely valuable to chemistry at large, in part because computational studies can be validated by and augment the knowledge gained from experimental studies.

In contrast, the application of high-performance computing to biology has been more conservative in scope. Although living systems obey the same natural laws that govern physics and chemistry, brute-force application of the equations describing these laws is an intractable strategy because of the size and complexity of the systems. Biologists have instead focused on solving subproblems that can be reasonably approximated using physics, chemistry, information theory, statistics, cellular automata, or other formulations. High-performance computing in biology has most prominently been used in three directed ways: (1) as a capacity resource to drive throughput for data analysis or data mining [1-6], (2) enabling large-scale simulation and modeling [7-11], and (3) as a mathematical tool driving algorithms aimed at revealing underlying
properties of biological systems [12, 13]. These applications of high-performance computing are described in more detail in the following sections.

1. High-throughput data analysis and data mining includes bioinformatics and image analysis where the primary task is pattern matching or feature extraction. Commonly, the data of interest in this category is growing exponentially with time, so the immediate challenge is simply to provide the throughput required to perform the analysis. Examples of this rapidly growing data include genome sequence data, experimental process data (such as transcriptomics or proteomics), and cell imaging data whose availability is driven by the capture and analysis capacity of engineered components and increasing optical resolution.

2. High-performance Grid and cluster computing has also played a prominent role in large-scale biological simulations. Large-scale biophysical or biosystems models are increasing their demand for high-performance computing as multiscale techniques are devised and implemented. Scalability is a challenge in this area primarily because of the increasing complexity of underlying theoretical process descriptions and the computational challenges associated with integrating models across spatial and temporal scales, a challenge reminiscent of other scientific domains such as astrophysics, where some processes occur at submicroscopic spatial scales and extremely short time scales yet are critically linked to processes that evolve over the distance and time scales of observable celestial objects and beyond.

3. The underlying properties of biosystems are often inferred by using techniques such as network inference or graphical analysis. The goal is to map the behavior of biological systems onto a well-characterized mathematical representation (e.g. a network or graph) so that trends, relationships, and features might be inferred from mathematical properties of the representation. Often, this reduced representation allows for more straightforward analysis of the system by leveraging advances in mathematical theory, and it can reveal underlying structure that is not evident without the mathematical metaphor, as sketched below.

For these three general categories, much work has been done with respect to high-performance algorithm design and implementation. However, making use of these advances is often restricted to specialized users, or else it is implemented using a service-oriented model where the computing infrastructure is not directly controlled by the end user. While this has the benefit of freeing the bench biologist from having to endure the learning curve of using high-performance computing, a remote execution model also prevents potential users from taking full advantage of the increasing availability of local clusters and emerging architectures, as well as the emerging algorithms and software that can run on them. Historically, this has been acceptable to the biology community largely because the scope of computational tasks that a single investigator could realistically be expected to address did not often require advanced computing resources. However, as the cost of high-throughput technologies such as genome sequencing, transcriptomics, and proteomics puts them within reach of a growing community of investigators, the need for increased computational capacity for individual investigators will likely grow as well.
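As a minimal illustration of the graph-based representation described in item (3), the sketch below builds a small interaction graph and reads off two simple structural properties: highly connected "hub" nodes and connected components that suggest candidate modules. The interaction pairs are invented and the use of the networkx library is an assumption made purely for illustration; neither comes from the studies cited above.

```python
# Minimal sketch: mapping pairwise interaction data onto a graph and
# inspecting its structure. The edge list below is invented for illustration.
import networkx as nx

interactions = [            # hypothetical protein-protein interaction pairs
    ("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC"),
    ("geneD", "geneE"), ("geneE", "geneF"),
]

g = nx.Graph()
g.add_edges_from(interactions)

# Degree identifies highly connected "hub" proteins.
hubs = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
print("Most connected nodes:", hubs[:3])

# Connected components suggest candidate functional modules or complexes.
for i, component in enumerate(nx.connected_components(g), start=1):
    print(f"Module {i}: {sorted(component)}")
```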
The goal of this paper is to offer a perspective on how high-performance computing has evolved and will continue to evolve to meet the ever-changing needs of biology, from the standpoint of the changing needs of individual investigators. We also enumerate some future challenges that will need to be addressed to facilitate this evolution. Specific examples are included to illustrate how advanced architectures have traditionally been used in biology and how their increasing availability and demand are leading to new usage models where computing is more tightly integrated into analytical workflows. We focus on bioinformatics and proteomics applications in high-performance computing because of their potential for immediate impact to the bench
biologist and because of the availability of optimized computational implementations in these areas.
2. High-performance computing in genomic analysis

Large-scale genome sequencing centers are among the most visible users of high-performance computing in biology. The volume of published genomic sequence data continues to increase exponentially, driven by ongoing improvements in the throughput of sequencing technologies. Commercialization by Illumina, 454 Life Sciences, and Applied Biosystems has led to relatively inexpensive access to genome sequencing technologies. Even with these high-throughput systems, a genome is sequenced by reading many fragments of the genome and then assembling the final genome by correctly "connecting" the fragments. This is a challenging problem, not only because of the (albeit nominal) uncertainty inherent in the sequencing process, but also because of the presence of confounding sequences (such as highly repetitive regions of a genome). Often, high-performance computing is needed to solve this "assembly problem" because of the combinatorial complexity associated with matching and ordering sequence fragments. Following assembly is the annotation process, where encoding regions (genes) and their corresponding proteins are mapped onto the assembled genome. Assembly and annotation for a typical genome both require substantial computational power, as illustrated by the fact that most sequencing systems are shipped with multiprocessor clusters attached.

Making these sequences available to the general biology community is a driving priority after sequencing, assembly, and annotation are done. It is most often accomplished by providing datasets and analytical tool suites via websites or web services, as in the case of The Institute for Genomic Research (TIGR, http://www.tigr.org/), DOE's Joint Genome Institute (JGI, http://jgi.doe.gov/), or NIH's National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/). Many of these tools allow users to formulate a query on sequence or sequence-related data to find conserved patterns between some set of sequences and a reference set of sequences. Often, when a user submits a query to one of the large-scale genome centers, the task is routed via an Internet transaction to a supercomputing resource that queues the request and processes the data based on the query's priority and the current system load. This procedure results in a high degree of throughput for the sequencing center (i.e., a large number of queries can be handled in a given time interval) and a reasonable turnaround for users who need only a handful of queries processed. In this way, users can interact with a preprocessed large-scale dataset of biosequence data, searching for conserved patterns between species of interest.

One of the most commonly used tools for finding these conserved patterns is pairwise sequence alignment using an algorithm called BLAST [14, 15], which rapidly calculates an optimal alignment for a pair of gene or protein sequences (or translated versions thereof), producing a set of scores that describes the quality of that alignment and the statistical reliability of its being due to common inheritance rather than a chance alignment of the same quality. BLAST searches are at the heart of many areas of bioinformatics analysis, including genome annotation, phylogenetic studies, ortholog identification, and others.
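To make the alignment scoring concrete, the sketch below computes the optimal local-alignment score for two short protein fragments by Smith-Waterman dynamic programming; BLAST approximates this kind of local alignment heuristically so that it scales to whole databases. The match/mismatch/gap scores and the example sequences are illustrative choices, not BLAST's actual substitution matrices or statistics.

```python
# Minimal sketch: Smith-Waterman local alignment score, the exact computation
# that BLAST's heuristics approximate. Scoring values here are illustrative,
# not the BLOSUM matrices and gap penalties BLAST actually uses.
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)   # local alignment: never below 0
            best = max(best, score[i][j])
    return best

# Two invented peptide fragments sharing a conserved core.
print(local_alignment_score("MKTAYIAKQR", "MKTAHIAKQR"))
```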
Given BLAST's prevalence in bioinformatics studies, it is not surprising that accelerating this pattern-matching process to drive multiple-genome-scale analyses has been the focus of much high-performance computing algorithm development. Parallel sequence analysis applications include ScalaBLAST [2], mpiBLAST [16], pioBLAST [17], TurboBLAST [18], BeoBLAST [19], and others [20-25]. What these implementations have in common is that they (1) attempt to accelerate the rate of BLAST calculations in proportion to available compute resources, and, most often, (2) require users to obtain and install specialized software to use those resources. With the exception of applications like Soap-HT-BLAST [24], which allows users to schedule multiple BLAST queries using web services over distributed resources, the prevailing approach of these software tools is to assume that users will run applications on homogeneous clusters or on Grid architectures.
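In its simplest form, the common parallelization strategy is to partition the query set across workers that each run an independent BLAST search; the sketch below shows that pattern using Python's multiprocessing module and the NCBI blastp command-line tool. Tools such as ScalaBLAST and mpiBLAST go further, partitioning the database itself and managing I/O and memory, so this is only a schematic of the query-splitting idea; the file and database names are placeholders.

```python
# Minimal sketch of query partitioning: each worker runs an independent
# BLAST search on a chunk of the query sequences. Real parallel BLAST tools
# (ScalaBLAST, mpiBLAST) also partition the database and manage I/O.
import subprocess
from multiprocessing import Pool

def run_blast_chunk(chunk_path):
    """Run blastp on one chunk of query sequences (paths are placeholders)."""
    out_path = chunk_path + ".hits.tsv"
    subprocess.run(
        ["blastp",
         "-query", chunk_path,        # FASTA file holding this worker's queries
         "-db", "reference_proteins", # preformatted BLAST database (placeholder name)
         "-outfmt", "6",              # tabular output
         "-out", out_path],
        check=True,
    )
    return out_path

if __name__ == "__main__":
    chunks = ["queries_part0.faa", "queries_part1.faa",
              "queries_part2.faa", "queries_part3.faa"]
    with Pool(processes=4) as pool:
        results = pool.map(run_blast_chunk, chunks)
    print("Per-chunk result files:", results)
```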
The two access models to high-performance resources mentioned so far represent polar opposites in terms of user interaction. At one extreme is the Web services model, typified by launching remote tasks on dedicated but remote resources (e.g., submitting a BLAST query through the NCBI BLAST webpage). The advantage of this model is that the details of utilizing and interacting with the high-performance resources are hidden from users. The primary disadvantage is that the Web services that drive these approaches are not always easily modified for use on compute resources when those resources are controlled directly by users. At the opposite end of the spectrum are the software packages that must be compiled and/or installed directly by end users on high-end systems. The advantage of this approach is that these solutions can take advantage of whatever resources users have. However, users are often forced to learn to use high-performance systems and schedulers and, in the worst case, must build the software themselves.

To address the evolving need for biology users to run large-scale calculations on their own dedicated resources, there is a growing trend toward solutions that bring local, dedicated high-performance computing into the analytical pipeline of individual researchers. One way to achieve this is to natively construct a web interface to dedicated resources. This allows multiple users at a single location to use their local resources, but to dedicate those resources in a configuration that is more amenable to a small number of large-scale calculations than to a large number of individual calculations. An example of this approach is illustrated in figure 1. We have deployed a website [26] allowing internal Pacific Northwest National Laboratory users to run ScalaBLAST jobs on a dedicated cluster. The idea is to provide an intuitive interface that allows users to submit ScalaBLAST jobs that are run in parallel. The user model is not the traditional view in which users submit queries in small batches or one at a time. Instead, it is expected that whole-genome or multiple-genome-size queries are requested. The entire local cluster can be dedicated to servicing these requests, effectively serving a local community of researchers who need larger-scale computational tasks performed without requiring any of them to endure the learning curve of high-performance computing and without them having to compete with worldwide users on large-scale centralized resources. Additionally, the priority and scheduling of jobs can be tailored to the specific local community. This approach can be straightforwardly modified to accommodate different partitions of the ScalaBLAST job, enabling large-scale all-vs.-all calculations or collections of genome-specific calculations to be performed. In this way, certain predetermined ways of using BLAST can be encapsulated using a small number of run-time parameters and uploaded files to support a wide range of larger-scale users on a dedicated system. A similar interface can be created for other ScalaBLAST installations, making it possible for other users to more intuitively take advantage of their local computing resources. This same access model can be applied broadly to other computational tools, simplifying access for users and at the same time allowing them to take advantage of the increasing availability of multiprocessor computing resources at their native institutions.
Figure 1. ScalaBLAST launching website for internal users. Users with access to the website can launch parallel ScalaBLAST calculations on a dedicated multiprocessor cluster. Scheduling of the parallel compute task is done automatically when the user clicks the "launch job" button. The system automatically interfaces with the native scheduler to report the status of pending, in-process, and completed jobs.
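The server-side step behind a "launch job" button like the one in figure 1 essentially turns a small set of form parameters into a batch job for the cluster's native scheduler. The sketch below illustrates that idea only; the paper does not specify the scheduler, script layout, or ScalaBLAST command line used at PNNL, so the SLURM directives, paths, parameter names, and the ScalaBLAST invocation here are assumptions made for illustration.

```python
# Minimal sketch of the server-side "launch job" step: turn a user's form
# inputs into a batch script and hand it to the cluster scheduler. SLURM is
# assumed here purely for illustration; the actual PNNL setup is not described
# in the paper.
import subprocess
import tempfile

def launch_scalablast_job(query_fasta, database, nodes=8, user="webuser"):
    """Write a batch script for a parallel ScalaBLAST run and submit it."""
    script = f"""#!/bin/bash
#SBATCH --job-name=scalablast_{user}
#SBATCH --nodes={nodes}
#SBATCH --time=12:00:00
mpirun scalablast {query_fasta} {database}  # placeholder invocation; real ScalaBLAST arguments differ
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
        fh.write(script)
        script_path = fh.name
    # sbatch prints something like "Submitted batch job 12345"; return the id
    # so the website can poll the scheduler for pending/running/completed status.
    submitted = subprocess.run(["sbatch", script_path],
                               capture_output=True, text=True, check=True)
    return submitted.stdout.strip().split()[-1]

# Example (hypothetical file names):
# job_id = launch_scalablast_job("uploaded_genome.faa", "nr_subset")
```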
Taking this concept one step further, to more closely integrate high-performance computing with end users, leads to scientific workflow systems that abstract away many of the details of complex tasks such as scheduling, data movement, and launching analysis tools on high-performance resources. In this approach, analytical data-driven pipelines can be constructed from a sequence of computational tasks, each viewed as a "component." An example of such a pipeline for genome sequencing would link high-performance components for genome assembly with gene prediction tools and automated annotation based on sequence similarity. Each of these components can also be used in other workflows, but together they could form a powerful core capability for genome sequencing, yet be available and accessible to individual investigators.
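A minimal way to picture such a component-based pipeline is as a chain of interchangeable steps, each wrapping one tool behind a common interface; the sketch below chains assembly, gene prediction, and similarity-based annotation stages. The component names and the idea of wrapping each tool in a plain callable are illustrative assumptions, not the interface of any particular workflow system.

```python
# Minimal sketch of a component-based workflow: each stage wraps one tool
# behind a uniform callable interface, and the pipeline is just the ordered
# composition of stages. Stage internals are stubs for illustration only.
from typing import Callable, Dict, List

Component = Callable[[Dict], Dict]

def assemble(data: Dict) -> Dict:
    # In a real workflow this stage would launch an assembler on the HPC resource.
    data["contigs"] = ["contig1", "contig2"]          # placeholder result
    return data

def predict_genes(data: Dict) -> Dict:
    # Gene prediction over the assembled contigs (stubbed).
    data["genes"] = [c + "_gene1" for c in data["contigs"]]
    return data

def annotate_by_similarity(data: Dict) -> Dict:
    # Similarity search (e.g. a parallel BLAST run) to attach annotations.
    data["annotations"] = {g: "hypothetical protein" for g in data["genes"]}
    return data

def run_pipeline(stages: List[Component], data: Dict) -> Dict:
    for stage in stages:
        data = stage(data)          # each component consumes and extends the record
    return data

result = run_pipeline([assemble, predict_genes, annotate_by_similarity],
                      {"reads": "raw_reads.fastq"})
print(result["annotations"])
```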
3. High-performance computing in proteomics

A second area where high-performance computing has had a strong presence in biology is proteomics, defined as the high-throughput analysis of expressed proteins from cell populations, here measured by mass spectrometry (MS). A typical approach in MS-based proteomics is to analyze the peptides derived from cellular proteins rather than the proteins themselves, because the smaller peptides are more amenable to mass measurement. The challenge, however, is that the thousands of proteins derived from a cell can result in millions of potential peptides, each of which must be identified before proteins can be reliably identified. As a result, the speed of computational analysis of MS proteomics data has always been at the forefront of priorities for analytical chemists. Because many of the early proteomics facilities used desktop servers for data analysis, farms of servers were the most straightforward approach to increasing the throughput of MS data processing. For example, the PNNL proteomics center (ncrr.pnl.gov) now holds over 70 TB of MS data from hundreds to thousands of experiments. Because of the sheer size of the data and the elaborate computational pipeline needed to analyze it, interpretation of the data by a bioinformaticist may be limited to working with the processed and stored results contained in the center's database, rather than real-time interaction or even re-analysis of minimally processed MS data. This limitation can restrict the type of analysis that can be performed by an analyst.
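To make the peptide-centric approach concrete, the sketch below performs an in silico tryptic digest of a protein sequence and computes the monoisotopic mass of each resulting peptide, the quantity that is compared against measured spectra during identification. The example sequence is invented, and the simple cleave-after-K/R-except-before-P rule and rounded residue masses are textbook approximations rather than details taken from MSPolygraph.

```python
# Minimal sketch: in silico tryptic digestion of a protein followed by
# monoisotopic mass calculation for each peptide. Residue masses are rounded
# textbook values; the cleavage rule is the simple "after K or R, not before P"
# approximation.
RESIDUE_MASS = {  # monoisotopic masses of amino acid residues (Da), rounded
    "G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053, "V": 99.068,
    "T": 101.048, "C": 103.009, "L": 113.084, "I": 113.084, "N": 114.043,
    "D": 115.027, "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.040,
    "H": 137.059, "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}
WATER = 18.011  # mass of H2O added to the residue sum for a free peptide

def tryptic_peptides(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def monoisotopic_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

protein = "MKWVTFISLLFLFSSAYSRGVFRR"   # invented example sequence
for pep in tryptic_peptides(protein):
    print(f"{pep:>20s}  {monoisotopic_mass(pep):9.3f} Da")
```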
As discussed above, there is a growing trend toward solutions that bring local, dedicated high-performance computing into the analytical pipeline of individual analysts. As with ScalaBLAST, PNNL has deployed a website [26] allowing internal users to run MSPolygraph [6], a high-performance peptide identification software tool, on a dedicated cluster. Our goal is to incorporate the high-performance MSPolygraph interface into a desktop workflow used by bioinformatics analysts to interpret and mine the proteomics data. To that end, we are incorporating analysis tools into a computational environment for this purpose, the Bioinformatics Resource Manager (BRM) [27]. BRM is a software environment that provides the user with data retrieval and integration capabilities and user-directed access to a variety of analysis tools. Using BRM to access the supercomputing resource, analysts will be able to reanalyze and mine large sets of proteomics data quickly.

In one example workflow, output from a large-scale MSPolygraph analysis run can be visualized by overlaying identified peptides onto a visual representation of the species' chromosome. Using the visual analysis tool Platform for Proteomics Peptide and Protein data exploration (PQuad) [28], the mined peptides can be overlaid on any of the six open reading frames located anywhere on the chromosome and color-coded by experimental condition to enable users to browse for peptides or proteins of interest. The rapid analysis of the large datasets and facile visualization of the results also allow analysts to identify miscalled genes and misidentified peptides. This analysis model enables bioinformatics analysts to use supercomputing resources that are integrated into a workflow, and as a result it brings a new level of analysis capability that would otherwise be out of reach.
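As a toy illustration of the overlay step in such a workflow, the sketch below groups identified peptides by protein and experimental condition, the kind of summary that a viewer like PQuad renders along the chromosome. The tab-separated input format and column names are invented for illustration; they are not MSPolygraph's actual output format.

```python
# Toy sketch of the overlay/summary step: group peptide identifications by
# protein and experimental condition so they can be drawn along the genome and
# color-coded by condition. The input format here is invented, not the real
# MSPolygraph output.
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical identifications: protein, peptide, condition
results_tsv = """protein\tpeptide\tcondition
geneA\tMKTAYIAK\tcontrol
geneA\tLFSSAYSR\theat_shock
geneB\tGVFRDNQK\tcontrol
geneA\tLFSSAYSR\tcontrol
"""

coverage = defaultdict(lambda: defaultdict(set))
for row in csv.DictReader(StringIO(results_tsv), delimiter="\t"):
    coverage[row["protein"]][row["condition"]].add(row["peptide"])

for protein, by_condition in sorted(coverage.items()):
    for condition, peptides in sorted(by_condition.items()):
        print(f"{protein}  {condition:>10s}  {len(peptides)} peptide(s)")
```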
4. Bringing high-performance computing to the workbench: present and future challenges

If enabling scientists to harness the power of ever-growing resources like sequence data and high-throughput experimental methodologies is the goal, then a natural priority is helping bench biologists incorporate high-performance computing into their normal routine. As we transition further into the metagenome era, typical investigations are more and more likely to focus not on single genes or proteins, but on complex interactions at the community level and beyond. Certainly, software tools will need to be developed to address some of these emerging challenges, but it is likely that on some level these new solutions will be driven by large-scale analysis, based on some combination of sequence analysis, experimental processing, modeling and simulation, graph or network theory, and other areas. Implementing these solutions efficiently will necessitate evolving paradigms for access to and interaction with high-performance architectures. The historical notion of web-based access to remotely controlled resources will need to be augmented by more tightly coupled interactions between compute resources and end users. We have presented a spectrum of alternatives to this conventional model, including access to local, dedicated resources through web interfaces or web services; application workflow environments that interface with high-performance computing through a component-based model; and a tightly integrated application model where users can dynamically direct high-performance computing resources for iterative hypothesis refinement. These approaches are not meant to be an exhaustive list, but rather to underscore the shifting user scenario in modern biology, as the landscape of questions that can be addressed by bench biologists continues to expand rapidly.

The cost of obtaining large-scale data continues to decrease to well within the reach of individual investigators, and the volume of public data continues to grow exponentially. At the same time, the availability of multiprocessor architectures is increasing. As a consequence, high-performance computing is poised to play an increasingly prominent role in the biological sciences. However, significant challenges remain for bringing high-performance computing all the way to the workbench. A few of these challenges are enumerated below.

Computational science challenges

Keeping up with the evolving data deluge in biology will require constantly evolving computational algorithms and data management techniques. Analysis that copes with current data
sizes will quickly be overwhelmed in a matter of months. Emphasis should be placed on efficiently scaling algorithms so that continued advances in computational infrastructure have the greatest impact in biology. To support those algorithms, advances in high-performance data management, including provenance capture, archiving, accessibility, metadata management, and storage and retrieval, will become even more crucial for the repeatability of results and the dissemination of new information.

Mathematical challenges

Current analysis that is reliable at a given statistical confidence will quickly become unacceptable as datasets continue to grow, overwhelming search problems with false positives, for instance. In tandem with improvements in the performance of computational algorithms, biology will need increasingly powerful statistical models just to maintain the current capacity to evaluate biological hypotheses.

Technical challenges

One concern facing computational biology is that many computational applications, including pattern matching, feature searching, graph analysis, and others, are often not limited by the floating-point capacity of the systems on which they are running. Rather, they are limited by the availability, latency, or bandwidth of memory, disk, or interconnect. New computing paradigms and new approaches to hardware are needed that more naturally map to the core computational tasks associated with biological applications.

Policy challenges

The model of tightly integrating high-performance computing into an analytical pipeline presumes the availability of dedicated compute resources. The prevailing way in which clusters, high-end systems, and emerging architectures are deployed does not match this model: most systems are multi-user batch systems. We have presented workflows that interact with such systems, but real-time access will always be preferred when analysis is meant to be interactive rather than offline. Unlike many other computational domains, evaluating a biological hypothesis is often approached in a highly iterative way, making the traditional batch mode of operation a less than desirable access model. In addition, conventional batch supercomputing is often granted only through highly controlled channels to individuals trusted to interact intimately with the compute resource. Transparently integrating high-performance computing into biological workflow models presumes a lower degree of interaction between the end user and the resource (preferably no direct interaction), opening the possibility for a more relaxed user access policy. However, such user access is challenging for most conventional security models.

Pushing scientific frontiers in biology

Increasing the utility of high-performance computing for the biology community has the potential to enable a revolution in the scope and breadth of biological investigation. Solving some of the scientific, technical, and procedural issues to enable this revolution will pave the way for practical and impactful advances in biology to address large-scale challenges in renewable energy, climate management, advancing human health, and many other areas.

Acknowledgments

Dr. Oehmen and Dr. Cannon are supported by the Data-Intensive Computing for Complex Biological Systems project funded by the Office of Advanced Scientific Computing Research, and under the LDRD Program at the Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S.
Department of Energy under Contract DE-AC06-76RL01830; and by the Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, U.S. Department of Energy under Contract No. DE-AC03-
76SF00098. The Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL) is a national scientific user facility sponsored by the U.S. DOE, OBER, and located at PNNL. The authors thank Lee Ann McCue and Terence Critchlow for helpful discussions, and Darren Curtis, Daniel Crawl, Elena Peterson, Anuj Shah, and Douglas Baxter for technical work on web interfaces and workflow design for the ScalaBLAST and MSPolygraph applications.

References
[1] Muin M and Fontelo P 2006 Technical development of PubMed Interact: an improved interface for MEDLINE/PubMed searches BMC Med. Inform. Decis. Mak. 6:36
[2] Oehmen C and Nieplocha J 2006 ScalaBLAST: a scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis IEEE Trans. Parallel Distrib. Syst. 17:740-9
[3] Singh D, Trehan R, Schmidt B and Bretschneider T 2008 Comparative phyloinformatics of virus genes at micro and macro levels in a distributed computing environment BMC Bioinformatics 9 suppl 1:S23
[4] Thallinger G, Trajanoski S, Stocker G and Trajanoski Z 2002 Information management systems for pharmacogenomics Pharmacogenomics 3:651-667
[5] Wang C and Lefkowitz E 2004 SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters BMC Bioinformatics 5:171
[6] Cannon W, Jarman K, Webb-Robertson B-J M, Baxter D, Oehmen C, Jarman K, Heredia-Langner A, Auberry K and Anderson G 2005 Comparison of probability and likelihood models for peptide identification from tandem mass spectrometry data J. Proteome Res. 4:1687-1698
[7] Gao J, Ma S, Major D, Nam K, Pu J and Truhlar D 2006 Mechanisms and free energies of enzymatic reactions Chem. Rev. 106:3188-3209
[8] Hereld M, Stevens R, Lee H and van Drongelen W 2007 Framework for interactive million-neuron simulation J. Clin. Neurophysiol. 24:189-196
[9] Lins R, Vorpagel E, Guglielmi M and Straatsma T 2008 Computer simulation of uranyl uptake by the rough lipopolysaccharide membrane of Pseudomonas aeruginosa Biomacromolecules 9:29-35
[10] Shaikh S, Jain T, Sandhu G, Latha N and Jayaram B 2007 From drug target to leads - sketching a physicochemical pathway for lead molecule design in silico Curr. Pharm. Des. 13:3454-3470
[11] Sun Y, Shen B, Lu Z, Jin Z and Chi X 2008 GridMol: a grid application for molecular modeling and visualization J. Comput. Aided Mol. Des. 22:119-129
[12] Zhang B, Park B, Karpinets T and Samatova N 2008 From pull-down data to protein interaction networks and complexes with biological relevance Bioinformatics 24:979-986
[13] Zhu M and Wu Q 2008 Transcription network construction for large-scale microarray datasets using a high-performance computing approach BMC Genomics 9 suppl 1:S5
[14] Altschul S, Gish W, Miller W, Myers E and Lipman D 1990 Basic local alignment search tool J. Mol. Biol. 215:403-10
[15] Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W and Lipman D 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res. 25:3389-3402
[16] Darling A, Carey L and Feng W-C 2003 The design, implementation, and evaluation of mpiBLAST. In: Proc. ClusterWorld (San Jose, CA)
[17] Lin H, Ma X, Chandramohan P, Geist A and Samatova N 2005 Efficient data access for parallel BLAST. In: 19th International Parallel and Distributed Processing Symposium (IPDPS) (Denver, CO: IEEE CS Press)
[18] Bjornson R, Sherman A, Weston S, Willard N and Wing J 2002 TurboBLAST: a parallel implementation of BLAST built on the TurboHub. In: 16th International Parallel and Distributed Processing Symposium (IPDPS) (Fort Lauderdale, FL: IEEE CS Press)
[19] Grant J, Dunbrack R J, Manion F and Ochs M 2002 BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster Bioinformatics 18:765-766
[20] Braun R, Pedretti K, Casavant T, Scheetz T, Birkett C and Roberts C 2001 Parallelization of local BLAST service on workstation clusters Future Generation Computer Systems 17
[21] Camp N, Cofer H and Gomperts R 1998 High-throughput BLAST
[22] Hokamp K, Shields D, Wolfe K and Caffrey D 2003 Wrapping up BLAST and other applications for use on UNIX clusters Bioinformatics 19:441-442
[23] Kim H-S, Kim H-J and Han D-S 2003 Hyper-BLAST: a parallelized BLAST on cluster system. In: International Conference on Computational Science, pp 213-222
[24] Wang J and Mu Q 2003 Soap-HT-BLAST: high-throughput BLAST based on web services Bioinformatics 19:1863-1864
[25] Muriki K, Underwood K and Sass R 2005 RC-BLAST: towards a portable, cost-effective open source hardware implementation. In: HiCOMB 2005, 4th IEEE International Workshop on High Performance Computational Biology (Denver, CO)
[26] Curtis D, Peterson E and Oehmen C 2008 A secure web application providing public access to high-performance data intensive scientific resources. In: 4th Annual International Conference on Web Information Systems and Technologies (Funchal, Madeira, Portugal)
[27] Shah A, Singhal M, Klicker K, Stephan E, Wiley H and Waters K 2007 Enabling high-throughput data management for systems biology: the Bioinformatics Resource Manager Bioinformatics 23:906-909
[28] Webb-Robertson B-J M, Peterson E, Singhal M, Klicker K, Oehmen C, Adkins J and Havre S 2007 PQuad - a visual analysis platform for proteomic data exploration of microbial organisms Bioinformatics 23:1705-1707