Distributed and Parallel Databases, 13, 99–127, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

Database Allocation Strategies for Parallel BLAST Evaluation on Clusters

ROGÉRIO LUÍS DE CARVALHO COSTA  [email protected]
SÉRGIO LIFSCHITZ  [email protected]
Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rua Marquês de São Vicente 225, Rio de Janeiro, RJ 22453-900, Brasil

Abstract. In this work we investigate the parallel evaluation of BLAST, the most popular tool for comparing biological sequences. Our goal is to study database distribution issues that, besides workload balancing, could improve the performance of a set of BLAST processes running in a workstation cluster. We consider different partitioning strategies within actual BLAST executions against a few relevant molecular databases. We have implemented multiple database and input sequence configurations and show that there are many important parameters, such as the fragment generation method and sequence similarities, that must be taken into account in order to make full use of such a parallel environment.

Keywords: database design, data distribution, BLAST, parallelism, workstation clusters

1. Introduction

Despite the great number of active genome-like projects, dealing with large volumes of DNA and protein sequences (and annotations) still remains a problem. Database technology is present only to a small extent: even if some of the projects include some sort of DBMS software to store data, most do not make effective use of DBMS functionalities. Users mostly work with flat text-based files downloaded from the web.

There are many interesting open issues for the database research community, such as a suitable data model for this kind of data and applications; appropriate user interfaces and interaction between different data collections; data integration from multiple and heterogeneous data sources, including the characterization and definition of a widely accepted ontology; efficient data access through storage structures and indices; and so on (e.g., [10, 25, 29, 31]).

In the database research area, molecular biology databases demand a closer look due to the peculiarities of the basic molecular biology operations, such as nucleotide or amino acid sequence comparison and alignments [18]. The most common operation is to use a sequence comparison algorithm to compare a recently mapped sequence to a database of thousands of sequences. Such operations are like (concurrent) queries that must execute as fast as possible against large molecular biology or genomic databases. They are meant to identify similarities and, in this way, to give some clue for finding new sequence functions [18].

In this work we are interested in parallel strategies for evaluating one of the most popular operations in bioinformatics, namely BLAST processing.


BLAST is a popular family of algorithms for (bio)sequence comparison and alignment operations. Such operations are widely and frequently used in laboratories that carry out genome sequencing and analysis. Apart from some minor problems, such as load balancing, the use of shared-nothing machines can lead to a good cost/performance relation [22]. In fact, although there are strategies to parallelize some sequence comparison algorithms based on special-purpose processors, the use of general-purpose machines and database distribution may also lead to good performance when a large database is being searched [9].

Our goal is to study database distribution design in order to improve the parallel execution of BLAST processes when running on a workstation cluster. It is not our intention to prove that the use of a cluster is better than the use of a special-purpose machine, nor to make any change to the BLAST algorithm. Instead, we focus on the design problem, studying the application of traditional approaches, like database replication and fragmentation, and determining the important parameters to consider when choosing a database allocation strategy for parallel BLAST execution. In fact, due to the sensitivity of each strategy to many of the studied parameters, we argue that a workload balancing technique is needed. We also give some implementation results on an actual BLAST evaluation on a workstation cluster with up to 32 nodes.

BLAST processing is a frequent and time-consuming operation in bioinformatics research. We claim that our parallel considerations can be effectively used when efficiency is the goal. It is important to note that the actual execution of multiple BLAST processes against the same database may suffer performance degradation, even when running on farms of computers. Without a DBMS, the underlying operating system (e.g., UNIX) is responsible for page replacement policies and memory management in general. So far, when evaluating the performance of the execution of BLAST processes, it is usually assumed that there is a large main memory available, so that usual database services are not taken into consideration [13].

However, all the experimental sciences are experiencing an unprecedented increase in the amount and complexity of available data. Scientific databases with terabytes of data are now commonplace. In the field of computational biology, this is due in part to the relatively recent new technologies for sequencing and to the dissemination of Internet services like the WWW. The so-called molecular biology—or simply genomic—databases are, then, getting bigger, and issues like efficient storage and efficient data retrieval must be seriously considered (see [14] for an overview of many of these so-called genomic databases). For example, the EMBL Nucleotide Sequence database has 28 gigabase pairs1 and the GenBank DNA sequence database has approximately the same size.2 These genome sequences increase at a rate that corresponds to doubling in size every six months. Another type of research that has also become popular, known as microarray analysis, is expected to produce about 1 petabyte of data every year. Research laboratories like the Sanger Centre3 already have about 20 terabytes of data being processed [24].

This paper is organized as follows: in the next section we first summarize basic concepts for molecular biology databases and sequence comparison. We review some related work and discuss some parallel alternatives and relevant questions that should be properly answered.
In Section 3 we describe the replicated and fragmented strategies that may optimize BLAST processing. We also give an overview of the methodology we will use for the practical experiments. Then, in Section 4, we give some preliminary practical results using our parallel strategies on different database allocation models tested with different input files. In Section 5 we make some comparisons and discuss how to use the results obtained. Finally, in Section 6, we conclude by listing the main contributions and future work.

2. BLAST and parallel approaches

In this section, we review the basic molecular biology definitions and concepts that are needed to introduce the problem of sequence comparison and, particularly, BLAST search. Then, we present some important issues that must be considered before adopting a parallel environment, such as a workstation cluster, to deal with BLAST processing. Finally, we discuss selected works published in the literature that are closely related to our research, including academic and industrial proposals.

2.1. Molecular biology and BLAST search

Nucleotide (DNA or RNA) and amino acid (bio)sequences are finite strings on a restricted alphabet, like {A, C, G, T} for DNA, hereafter simply referred to as "sequences". Many computational biology applications deal with large data sets over sequences, such as (a) sequence comparison, (b) DNA fragment assembly, (c) chromosome (or DNA) physical mapping, (d) phylogenetic tree construction and (e) protein structure prediction [18]. Among these, the sequence comparison problem is surely one of the most important, and many other complex applications are based on it.

It is important to note that, in molecular biology databases, the comparison task does not correspond to an exact pattern matching problem. It is somewhat closer to a SQL query with a select statement and a "like" clause for comparison. Indeed, it is a process that allows mismatches and gaps, which have particular biological meanings (like the occurrence of mutations or the insertion of non-coding material). Considering mismatches and gaps, the algorithms can lead to totally different alignments, which may have useful biological meaning. In figure 1, we present two sequences and two possible alignments between them. The first one does not consider gaps; the second one does. Which one has more significance? That is a question for the biologists to answer. The algorithms decide which alignment to show by giving different weights to the occurrence of exact correspondences, gaps and a few allowed mismatches.

The best known algorithms for determining optimal alignments are the Needleman and Wunsch [21] and the Smith and Waterman [30] algorithms. Optimal algorithms for these problems have a computational complexity that makes them unfeasible for certain applications. This happens, for example, when comparing a given new sequence with all of those already in the database [18]. For this reason, many alternative and faster methods have appeared. The first popular program was FASTP [15], which searches protein sequences. Later, another program called FASTA was proposed, which handles both protein and nucleotide sequences [27]. These programs are based on local comparison and give only the best local alignments.


Sequence A: QEKNSDGDMANDNYVTQGDDPFZDRKSNZWAAFGANMSAATMYPPLCQPNLGFHKRAR
Sequence B: MEKNSDGDMANDNYKSNZWAAFGANMSAATMYPPLCQPNLGFHKRARQAARRQAARRM

Figure 1. Some alignments: two possible alignments between sequences A and B, one without gaps (13 correspondences) and one with gaps (48 correspondences).
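To make the weighting idea concrete, the sketch below scores an alignment given as two equal-length strings in which '-' marks a gap; the weights (+1 for a correspondence, -1 for a mismatch, -2 for a gap) are illustrative assumptions, not the values used by any particular program.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings of equal length; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # insertion or deletion
        elif x == y:
            score += match      # exact correspondence
        else:
            score += mismatch   # allowed mismatch
    return score

# Two ways of aligning the same pair of toy sequences (cf. figure 1):
no_gaps = ("ACGTACGTAA", "ACGTTTACGT")        # 4 correspondences, 6 mismatches
with_gaps = ("ACGT--ACGTAA", "ACGTTTACGTAA")  # 10 correspondences, 2 gaps
print(alignment_score(*no_gaps))    # -2
print(alignment_score(*with_gaps))  #  6: here the gapped alignment wins
```

With different weights the ranking can change, which is precisely why the choice of weights carries biological meaning.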

Programs that involve multiple local alignments (e.g., LFASTA, PLFASTA) were later incorporated into the FASTA program family. A discussion of these programs can be found in [26]. The BLAST (Basic Local Alignment Search Tool) [1, 2] programs appeared in the 90s. BLAST was developed as an answer to the performance problems of the FASTA programs and soon became popular. Like its predecessor, BLAST has many versions, such as BLASTP, for searching protein databases, and BLASTN, for nucleic acids. In short, BLAST is a heuristic that performs an exhaustive search on a sequence database to find sequences homologous to a given input sequence, using local alignments for comparisons.

The inputs for a BLAST program are a query sequence and a sequence database. The query is a sequence in the FASTA format (see figure 2 for an example), with a first identification line followed by many other lines containing the sequence itself.

>gi|1789275 (AE000374) proline aminopeptidase P II [Escherichia coli]
MSEISRQEFQRRRQALVEQMQPGSAALIFAAPEVTRSADSEYPYRQNSDFWYFTGFNEPEAVLVLIKSDD
THNHSVLFNRVRDLTAEIWFGRRLGQDAAPEKLGVDRALAFSEINQQLYQLLNGLDVVYHAQGEYAYADV
IVNSALEKLRKGSRQNLTAPATMIDWRPVVHEMRLFKSPEEIAVLRRAGEITAMAHTRAMEKCRPGMFEY
HLEGEIHHEFNRHGARYPSYNTIVGSGENGCILHYTENECEMRDGDLVLIDAGCEYKGYAGDITRTFPVN
GKFTQAQREIYDIVLESLETSLRLYRPGTSILEVTGEVVRIMVSGLVKLGILKGDVDELIAQNAHRPFFM
HGLSHWLGLDVHDVGVYGQDRSRILEPGMVLTVEPGLYIAPDAEVPEQYRGIGIRIEDDIVITETGNENL
TASVVKKPEEIEALMVAARKQ

Figure 2. Input file sample.


The so-called database used for BLAST operations is basically a text file with no DBMS functions. This file usually keeps multiple sequences in the FASTA format. The output of a BLAST execution shows, in general, the database sequences that present the best alignments with the query sequence. Alignments, program execution parameters, and statistics on the program execution and on the database are also given in the output. A partial result of a BLAST evaluation can be observed in figure 3.

The search cost depends on the size of the molecular biology database, as well as on the size of the input sequence. Actually, the crucial points are: (i) BLAST performs an exhaustive search of the molecular biology database and the search order is irrelevant; (ii) there is no obvious access (indexing) method that helps speed up BLAST considerably; and (iii) in practice, a large number of BLAST processes are simultaneously submitted against the database.

2.2. Parallel strategies and related work

There are many articles available in the literature concerned with BLAST or, more generally, sequence comparison, alignment algorithms and parallelization issues. However, the parallelization of BLAST has not yet been studied carefully, especially in the database context. To the best of our knowledge, there are only a few published works with a database context and the emphasis on database distribution that concerns us here. Most works have focused on specific parallel architectures, and parallel and distributed database issues are sometimes even ignored. There are, still, some exceptions.

The work most closely related to ours is the one proposed in [4]. The authors present three different levels of parallelism for local BLAST processing:

1. Fine grained: all the alignments of the comparison are performed in parallel;
2. Medium grained: a fragmented sequence database with multiple input sequences examined in parallel; and
3. Coarse grained: multiple input sequences processed in batch mode against a database replicated at all processing nodes.

Only the last approach, coarse grained, with the proposed architecture to achieve it, was discussed in more detail, but no implementation issues or practical results were presented. An important point is that the replicated approach requires a large amount of memory to allow the storage of the complete database at each node. On the other hand, fragmentation must deal with the coordination of distributed jobs and the final pooling of results from each site. The authors apparently left their work at that stage, which was a primary motivation for the work presented in this paper.

At the time BLAST was still being developed, the work in [11] compared 4 sequence alignment algorithms based on dynamic programming. The author concludes that none of the algorithms studied is unequivocally better than the others, all having strengths and weaknesses. The one that should be used depends, clearly, upon the nature of the application.

Figure 3. Partial BLAST output result.

Some relatively old material discusses BLAST parallelization: in [12] BLAST is ported to a CRAY shared-memory machine, to nCUBE and Intel iPSC/860 distributed architectures, and also to a workstation cluster. With very few details shown, the author claims that the parallelization methods work well only for a moderate number of processing nodes. In [33] the authors claim that using self-scheduling and buffering strategies may considerably improve performance and show some results on a cluster with a PVM implementation. Unfortunately, many details are omitted. The same happens in [20], which implements a parallel FASTA and discusses the I/O bottleneck problem that appears when the number of processors increases. In the latter work, the authors used the Linda parallel language to implement the algorithm.

Another interesting study is described in [23]. The implementations were done on a DEC Alpha workstation cluster, using the C language and PVM as the parallel interface, simulating a shared-disk configuration. The sequence database is accessed via NFS and, thus, there is no real need for transmission of database partitions over the network. However, there is a need to monitor each node's workload and to coordinate the distribution of the input sequence tasks. Two relevant questions in their project were pointed out: (1) whether it would be better to distribute the database partitions so that there would be local access at each node, and (2) how to deal with requests that could not be serviced immediately when all servers were busy.

In [6] there is a performance analysis—both throughput and response time—of BLAST when 3 different Shared-Memory Multiprocessor (SMP) machines are used. There are results on the algorithm's sensitivity to input sequence content and also to sequence lengths. The authors conclude that throughput performance is linearly scalable with the number of processors, with little performance degradation. They also argue that other processors can be added to improve performance when dealing with continuously growing sequence databases. However, nothing is said about the interference that these parallel architectures usually present when multiple nodes share a common (memory) resource.

In [35] the authors propose a parallel method for homologous sequence search that would apply to BLAST and other searches. The strategy is based on an assignment of database sequences to buckets in order to obtain an even amount of work at each processor. Good load balancing may be achieved by distributing subsequences according to their sizes so that each bucket has approximately the same size. The focus was on load balancing issues and the parameters were the input query sequence lengths and the number of processors available (2 to 128 nodes). The basic problem, not addressed by the method in question, is that the degree of similarity between an input sequence and some database sequences influences execution times. This kind of skew, and consequently uneven workload balancing, may happen even for buckets with exactly the same size. Many other important parameters were also not even mentioned in this paper.

We can find in the literature some other works that evaluate sequence comparison parallelization in terms of architectural issues. In [9] there is a comparison of 5 different types of hardware architectures, which include workstation clusters, supercomputers, single-purpose VLSI, reconfigurable hardware and programmable co-processors.
The author concludes that an effective comparison is very hard when several parameters are taken into account. These are, for example, different problem (input sequence and database) sizes, complex algorithms, speedup versus performance and specific hardware compositions, which are often not clearly configured.

More recently, the ParAlign parallel algorithm for sequence alignment has been proposed [28]; the authors claim that it is as sensitive as the Smith-Waterman algorithm [30], although the NCBI BLAST 2 program performs better. The author claims that the approach in [9] is not a good one since it is based on special-purpose hardware that achieves good speed at a very high cost. The ParAlign algorithm uses a "semi-heuristic" method that exploits the advantages of SIMD technology.

Some other interesting results are proposed in [16], which uses a multithreaded architecture (EARTH—Efficient Architecture for Running Threads) for implementing sequence comparison algorithms that are efficient enough so that heuristics like BLAST and FASTA do not have to be used. It is a parallel MIMD approach and they use off-the-shelf workstation clusters with a "collective memory" that distributes data evenly. The idea is to compare whole genomes faster than with dynamic programming based strategies, which demand high computational power and huge memories. The work in [17] is similar to that of [16] but uses a specific parallel machine called MANNA. Still in the architectural context, the work in [19] argues that machine-specific parallelizations (in this case, for the Intel Hypercube parallel computer) are usually worse than machine-independent parallelizations. The programs, implemented in the (machine-independent) parallel language Linda, investigated the comparison of single biological sequences against a sequence database and also database-to-database comparison.

Last but not least, there also exist commercial solutions for the parallelization of BLAST and other sequence alignment and comparison strategies. As usual, most details are not available and it is hard to evaluate how good these products are. The first one we chose is SGI HT-BLAST (HT for "high-throughput"), described in [5]. It is said that, rather than using existing parallel structures, the concurrency (of BLAST processes) is handled at a higher level. That is, the input sequence list is divided into small blocks (from 10 to 25 sequences each) and these blocks are distributed to multiple processors, each running a copy of the (multiple BLAST: BLASTn, BLASTx, etc.) wrapper. The authors claim that, as the assignment of work to processors is done dynamically, load balancing is optimized. Many performance and scalability results obtained on a shared-memory multiprocessor are presented, showing an overall better performance of HT-BLAST when compared to the original BLAST.

TimeLogic compares in [32] its Tera-BLAST software, running on a hardware-accelerated DeCypher server, against NCBI BLAST. Even for single-processor situations, their solution shows better performance than an original BLAST being executed on top of 4 CPU-servers. It is a hardware-based product that is not described in detail. Mostly, only practical results are given and these are used to argue in favor of their approach.

Finally, TurboBLAST [34], from Turbo Genomics, Inc., is implemented in a three-tier architecture that captures BLAST submissions at the client side and clusters them into a "BLAST jobs" database at a BLAST master application layer. Eventually, these are sent to BLAST workers at the last layer, which execute BLAST against their local databases.
They adopt a fragmented approach that initially partitions the databases in such a way that everything fits in the main memory of every worker node. All BLAST requests are thus executed against all databases, possibly partitioned. They mention that workload balancing of the tasks being executed is done dynamically, but no further details are given.

Actually, as already stated, the work presented in this paper is original in several respects. Either some parameters are not considered, at least explicitly, or there are ad-hoc solutions based on specific architectures, or the ideas cannot be adapted to BLAST processing. In view of all the points stated in this section, it is still an open problem, in the database research area, to determine the main parameters for running BLAST in a workstation cluster against a database. We would like to investigate whether traditional distributed database techniques [22] can be used here. Indeed, we will study some replication and fragmentation strategies, and their important parameters, that are applicable to molecular biology databases.

2.3. BLAST parallelization issues

As in many different applications, one could think about using a parallel environment and system to considerably improve program execution performance. This environment does not have to be complex or based on some "exotic" architecture. Rather, it could be a cluster of workstations (or even a grid), which yields a good price-performance machine already widely used in this area. However, there is no simple and direct approach that makes full use of the available computational power of a workstation cluster. The assignment of data to processing nodes has a substantial influence on response times.

Indeed, a straightforward strategy could be to replicate the whole sequence database at all nodes and split the input sequences into pieces to be passed to the multiple BLAST processes. Some issues must be explored further, such as the appropriate number of processing nodes and the strategy for assigning sequences to nodes. Furthermore, we would like to verify whether the initialization costs are relevant and whether the input sequence submission order affects performance. It is also important to know whether processing priorities must be defined. As we will see, these are questions that do not have trivial answers.

When database fragmentation is taken into account, extra questions must be answered. Considering a non-realistic situation, where only one input sequence will be blasted, the same sequence will be compared to multiple partitions of the database in parallel and, as such, there is a potential gain in efficiency. However, we would like to know, besides the number of nodes that should be used, which data distribution policy (e.g., round-robin) is best suited. Moreover, we must know whether disjoint fragments are allowed and also whether uneven partition sizes may lead to load imbalance. When the database is fragmented, there is an extra cost associated with gathering a single final result from all distributed sites.

Actually, multiple input sequences are submitted at a time, not only one. In addition to the previous questions, one may wonder whether different sequence submission orders would lead to different execution performances. It is true that all these are very well known issues in the distributed and parallel database communities but, still, there are a number of specific issues related to the biological context [7]. Furthermore, the general problem of distributed database allocation is known to be NP-complete and, in practice, one must consider heuristics, besides some known "business" (in our case, biological) rules, to achieve a reasonable database distribution [22].

3. Database allocation strategies

In this section, we describe our database distribution approaches for processing BLAST in parallel. We recall that the BLAST algorithm is kept unchanged and our main focus is to evaluate the availability of sequences (i.e., data) from a database server point of view. We will describe next our alternatives and what will be implemented and tested within a workstation cluster environment. Then we review some of the questions regarding replication and fragmentation strategies posed in Section 2.3 and give an idea of the different situations that will be analyzed in practice. The implementation results will be shown in Section 4.

3.1. Distribution design

The alternatives studied here for implementing BLAST on a workstation cluster are the replicated model and the partitioned model, similar to the ones studied in [4] and discussed in Section 2.3. In both schemes we consider a master-slave model, using different numbers of slave nodes.

In the replicated approach, the database file is fully copied to all participating slave nodes. The master node receives the batch file, generates the BLAST input files according to the number of available slave nodes and distributes the files. At each slave node, the chosen BLAST program is executed for each input file. Then, the output files generated by the local executions are merged with the batch input file at the master node, in order to generate a single file with the results of the BLAST executions on all query sequences. This mechanism is represented in figure 4. We have assumed initially a round-robin strategy to distribute the input sequences onto the slave nodes.

Figure 4. Replicated schema: the master node splits the input query file (sequences 1 to k) into one batch per slave node; each of the N slave nodes runs BLAST on its batch against a full database replica and sends its results back to the master node, which assembles the output file.

For the partitioned strategies, we must first create partitions according to a given policy and place them on different nodes. One could think of creating fragments based on the similarity of sequences, or on groups of sequences with a common biological meaning. As we will discuss later in this article, this can lead to data skew problems and, consequently, to load unbalancing. We will consider here a simpler situation. As a single BLAST run tries to match the input sequences against all the database sequences, each input query sequence must be an input to BLAST at all slave nodes. Therefore, for each sequence in the input batch file, a BLAST process is executed locally at each slave node. After the last execution, each slave node sends its results to the master node. The partitioned database model is presented in figure 5.

The master node has to assemble the partial results from each site to create a unique result file. We should note that this final stage, when all slaves end their local execution, is much more complex in the fragmented database model than it is in the replicated database model. In the replicated database model, the master node only has to group the results for different input sequences. In the fragmented database model, the master node has to create a final result for each input sequence before grouping the results for all input sequences.

When the master node creates the final result for each input sequence, another important point arises. The BLAST programs take the database size into account when choosing the minimum score that an alignment must reach in order to be reported. As the slave nodes hold only fragments of the original database, the database size considered by the BLAST programs on the slave nodes is not the same as that of the original database. Consequently, some slave nodes report alignments that would not be considered if the original database were used. These alignments should not be present in the final result for the sequence created by the master node.

The initial stage should also take more time for a partitioned database than for the replicated database model, as we have to transmit each input sequence to each slave node, increasing the communication costs. Furthermore, in the case of a fragmented database, we must address the problem of how to create each fragment. The first issue to decide upon is the unit of distribution. The whole input database file contains only one "type" of data that is actually used during BLAST processing: the "sequence" character string (this is the description of a BLAST database—other biological databases may have other structures, including annotations, for example, as described in [14]). However, we can divide it into two units, the sequence part and a single line containing some information about it (like a name) that identifies the sequence (see figure 2 for a sample).
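As a concrete illustration of the two schemes in figures 4 and 5, the sketch below shows how a master could prepare the per-node work: round-robin splitting of the query batch for the replicated model, and a full copy of the batch for every node in the partitioned model. The function and variable names are ours; the real implementation (C with LAM-MPI, Section 3.3) additionally ships the files to the nodes and launches BLAST, so this only sketches the assignment logic.

```python
def split_round_robin(queries, n_nodes):
    """Replicated model: node i gets queries i, i+n, i+2n, ... of the batch."""
    batches = [[] for _ in range(n_nodes)]
    for i, q in enumerate(queries):
        batches[i % n_nodes].append(q)
    return batches        # one input file per node, full database replica everywhere

def broadcast_batch(queries, n_nodes):
    """Partitioned model: every node receives the whole batch and searches
    only its local database fragment."""
    return [list(queries) for _ in range(n_nodes)]

# Example: 80 query sequences, 8 slave nodes.
queries = [f"query_{j}" for j in range(80)]
print([len(b) for b in split_round_robin(queries, 8)])   # [10, 10, ..., 10]
print(len(broadcast_batch(queries, 8)[0]))               # 80 queries on every node
```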

Figure 5. Partitioned schema: the master node sends the complete input query file (sequences 1 to k) to every slave node; each of the N slave nodes runs BLAST against its local database fragment and sends its results back to the master node, which assembles the output file.
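With a fragmented database, the master must merge the per-fragment hit lists and discard alignments that would not have been reported against the whole database, since BLAST statistics depend on the database size. The sketch below assumes Karlin–Altschul-style statistics, in which the expectation value of a hit grows roughly in proportion to the size of the search space, so a per-fragment E-value can be rescaled by the ratio of total to fragment size before filtering; the record fields, function names and cutoff are our assumptions, and real BLAST output would need parsing and finer corrections.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query_id: str
    subject_id: str
    score: float
    evalue: float        # E-value as reported against one fragment

def merge_fragment_hits(per_fragment_hits, fragment_chars, total_chars,
                        evalue_cutoff=10.0):
    """Rescale per-fragment E-values to the whole database, keep only the
    hits that would still pass the cutoff, and return a single sorted list."""
    merged = []
    for hits, frag_chars in zip(per_fragment_hits, fragment_chars):
        scale = total_chars / frag_chars      # larger search space -> larger E
        for h in hits:
            rescaled = h.evalue * scale
            if rescaled <= evalue_cutoff:
                merged.append(Hit(h.query_id, h.subject_id, h.score, rescaled))
    merged.sort(key=lambda h: h.evalue)
    return merged

# A hit that passes against a small fragment may be dropped once rescaled.
hits = [[Hit("q1", "s1", 52.0, 4.0)], [Hit("q1", "s2", 80.0, 1e-5)]]
print(merge_fragment_hits(hits, fragment_chars=[4.5e6, 4.5e6], total_chars=3.6e7))
```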

The identification line is only used at the results presentation stage. If we put the information line apart from the sequence itself, we would have to perform some kind of join operation to get the final complete result. Therefore, as we see no obvious advantage in separating the sequence from its information line, whenever we refer to a sequence we will be considering both its identification part and the sequence string.

If we make an analogy with the relational data model, we could say that the entire database contains only one relation and that the tuples of this relation are the sequences. Then we can choose a horizontal fragmentation: fragments containing a subset of the database sequences are created. These fragments entirely satisfy the completeness, reconstruction and disjointness rules that ensure correct fragmentation [22]. On the other hand, we cannot do anything analogous to a vertical partitioning: fragments which do not contain entire sequences cannot be created, because this would affect the correctness of the BLAST executions.
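Treating each record (identification line plus sequence string) as the unit of distribution, a horizontal fragmentation that never splits a record can be sketched as below. The parser is simplified (it assumes well-formed FASTA), and the chunking shown is just an even split by record count, not one of the generation methods evaluated later.

```python
def read_fasta(path):
    """Yield (identification_line, sequence_string) records from a FASTA file."""
    header, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def horizontal_fragments(records, n_fragments):
    """Split the list of whole records into n_fragments subsets. Completeness
    and disjointness hold because every record lands in exactly one fragment
    and no record is ever cut."""
    records = list(records)
    size, extra = divmod(len(records), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        end = start + size + (1 if i < extra else 0)
        fragments.append(records[start:end])
        start = end
    return fragments
```

Writing each subset back out as its own FASTA file would then give one fragment per slave node.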

3.2. Evaluation methodology

We recall that in Section 2.3 we have listed some relevant issues that must be carefully evaluated if we want to make full use of a parallel environment to process BLAST or any other sequence alignment or comparison program.

Table 1. Tests.

Line  Slave nodes  Input sequences  Sequence lengths  Sequence sort  Fragment generation  Database  Database design
1     =            =                =                 =              ≠                    =         F
2     ≠            =                =                 =              =                    =         F and R
3     ≠            =                =                 =              =                    ≠         F and R
4     =            =                ≠                 ≠              =                    =         F and R
5     =            ≠                =                 =              =                    =         F and R

Some of the aspects to be discussed depend on a few parameters that are not always known before the actual BLAST execution starts. Indeed, when one or a "small" number of input sequences (less than or equal to the number of available processing nodes) is submitted, a simple yet good strategy would be to replicate the sequence database at all nodes and evaluate all BLAST processes in parallel. However, when a not previously determined number of input sequences is present (which is the usual situation for the existing BLAST servers), the replication model may not be the best solution and fragmentation strategies must be studied.

In order to detect the most relevant parameters that affect the performance of parallel BLAST processing, some tests were made. A summary of these tests is presented in Table 1. Each line of Table 1 corresponds to a group of tests. Each column presents a parameter to be tested. The last column presents the database distribution method used in the group of tests: fragmented databases (F), replicated databases (R), or F and R, which means that the group of tests was realized in both the replicated and the fragmented situations. An equal sign means that there is no variation of the parameter in the group of tests. A not-equal sign means that there were some tests in the group in which the parameter varied in some way.

As presented in Table 1, in the fragmented situation, two fragment generation methods were tested using the same database and input files. In both the replicated and the fragmented situations, we examine the speedup when varying the number of nodes available for processing, as represented in line 2 of Table 1. The database size may influence the distribution strategy to be adopted; the results of the tests on different databases and with different numbers of slave nodes (line 3 of Table 1) will show that.

The groups of tests in lines 4 and 5 of Table 1 have somewhat different objectives. We also wanted to test some alternatives for grouping the input sequences and for the order in which they are submitted to the processing nodes. By varying the input sequences' sizes and the total number of sequences, we would be able to detect performance differences, if any, together with some kind of priority inside a given group. This is quite similar to what a DBMS does for optimal query plans.

Finally, we were interested in two other issues that appear in any parallel analysis one should make. First, we want to measure both the initialization and the final output pooling costs, to see whether these costs are relevant with respect to the overall cost of the whole BLAST processing. Second, and maybe even more interesting, we would like to detect load balancing problems that may happen. This last point is a crucial one if we want to compare cost/performance between single and parallel environments. As we will see, some of the answers are far from obvious and the results obtained in our experiments help us understand why.

All the experimental results obtained are further discussed in Section 4. First, we detail the computational environment that is used in all experiments described in this paper.

3.3. Environment

All implementations were executed on a 32-PC cluster, running Linux RedHat 6.2 (Zoot) with kernel version 2.2.14-5.0. All coding was done in C and the GNU gcc compiler v2.9 was used. The parallel interface was LAM-MPI v6.3b. Each node is a Pentium II 400 MHz IBM PC 300 GL with its own 6.3 GB disk, connected through an Ethernet network via an IBM 8274 switch, which provides 32 segments at 10 Mbps. The node used as master has 64 MB of RAM, while all others have only 32 MB of RAM.

We used WU-BLAST 1.4, which is freely available for download at Washington University's site.4 We recall that our approach was not to make important changes to BLAST's code, but to study how to use DBMS techniques in BLAST's execution. Nevertheless, we made a few changes to the WU-BLAST code in order to output BLAST's results to a file whose name depends on the name of the input query file.

The BLAST family contains several programs: BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX. We have arbitrarily chosen the BLASTP program for our implementations, which compares protein queries to protein databases. Our discussion is easily extendable to the other BLAST programs.

Three data sources were used in our experiments: Ecoli.aa, SwissProt and nr (Table 2). These can be obtained at the NCBI site.5 The first one contains Escherichia coli genome translations, while the second one holds the SWISS-PROT amino acid database, in its latest version, downloaded from EMBL (European Molecular Biology Laboratory) and SIB (Swiss Institute of Bioinformatics). The last source, nr, is a complete NCBI GenBank [3] translation that includes SWISS-PROT in a non-redundant form. We have considered here the February 2002 version. The number of sequences, the total number of characters, the file size and the longest sequence for all three data sources are presented in Table 2.

Table 2. Input databases characteristics.

Database                 Ecoli.aa     SwissProt     nr
Number of sequences      4,289        100,225       887,672
Number of characters     1,358,987    36,448,443    277,466,903
Longest sequence         2,383        6,486         34,350
File size (MB)           1.8          45.6          436.7

Table 3. Input files.

File name   Total number of sequences   Sequences' lengths                                  Sort
IBF1        80                          500
IBF2        80                          100, 300, 700, 900 (20 sequences of each length)    Interleaved
IBF2b       80                          100, 300, 700, 900 (20 sequences of each length)    Ascending sorted
IBF3        160                         500
IBF4        240                         500
IBF5        320                         500

We have taken as input load multiple files with different numbers of sequences, each with different lengths, as presented in Table 3. The input file sequences were randomly generated. We will present next the results obtained from the execution of our tests when the main parameters are changed.

4. Implementation results

In this section we comment on all relevant parameters and discuss the most important results obtained, shown in a series of graphs and tables. We start with the replication scheme and then we describe the fragmentation strategy. In Section 5 we make some comparisons and evaluate how the results obtained could be used in practice.

4.1. Replicated database model

Our initial goal with the replicated scheme was to evaluate the speedup. We have used a file called IBF1, with 80 randomly generated sequences, each with 500 characters. The sequence length chosen was based on the average values in [4, 5]. We have considered the SwissProt database here. We first submitted each of the sequences to the BLASTP program to obtain the elapsed time equivalent to a serial single-processor execution. Then, the replicated parallel scheme was applied for 2, 4, 8, 12, 16, 20, 24 and 30 slave nodes. The serial execution time was approximately 1 hour and 7 minutes, and for 30 processing nodes this was reduced to about 2 minutes and 20 seconds. The results for all configurations are given in figure 6. It can be observed that we have obtained results very close to a linear speedup in some configurations.

Figure 6. Replicated schema: elapsed time vs. number of nodes.

For all configurations the startup time—covering message receiving and input file writing—took less than 1 second. The same holds for the final time, measured from the moment the last slave node stopped working until the master node also stopped working.

We can see in figure 7 the elapsed time and the idle time measured at each slave node for the 8-node configuration. Each node processed the same number of sequences, each with the same length.

Figure 7. Elapsed time (execution and idle) on each slave node for an 8-node configuration.

One can notice that no two nodes had the same execution time. The idle time of nodes 2, 3 and 4 is about 10% of the execution time at these nodes. This difference between execution times occurred in all tested configurations.

Now we consider a new input file, called IBF2. Like IBF1, this file has 80 randomly generated sequences and 40,000 characters in all. However, the input sequences in IBF2 do not all have the same length. We have built 4 groups of sequences with different lengths: 100, 300, 700 and 900 characters. In IBF2 these sequences are placed in an interleaved fashion, i.e., the first sequence has length 100, the second 300, the third 700 and the fourth 900, restarting in the fifth sequence with another sequence of length 100, and so on.

This IBF2 file was submitted for BLAST execution under an 8-node configuration. The sequences were distributed among the nodes following a round-robin strategy. The results obtained for elapsed time and idle time are presented in figure 8. As we can observe, a clearly uneven execution happened in this situation, with a large idle time for nodes 1 and 5 and a large execution time for nodes 4 and 8.

Figure 8. Replicated database and input file IBF2: elapsed and idle time on each node.

We have then modified IBF2 in such a way that the first 20 sequences were the shortest ones (length 100 characters), followed by 20 sequences of length 300, then 20 sequences of length 700, and finally the 20 longest sequences. This new file was called IBF2b. For the same execution as described above for IBF2—replicated scheme with 8 nodes, SwissProt database, round-robin sequence distribution over the nodes—we obtained another uneven workload, as we can see in figure 9. However, this workload imbalance was not so critical. With IBF2b, the total execution time was reduced by more than 30%.

In order to evaluate the architecture's response to an increase in the number of input sequences, we have tried out test files with 160, 240 and 320 sequences. This was done for the same 8-node replicated configuration, still with the SwissProt database. These new input files were created by repeating the sequences present in IBF1, in order to guarantee that any different results would not be due to new randomly generated sequences. Indeed, in the 160-sequence file, called IBF3, each IBF1 sequence appears twice, and in the 240- and 320-sequence files we have 3 and 4 copies, respectively, of each IBF1 sequence. The elapsed times obtained for the 8-node configuration are given in figure 10. As could be expected, the IBF5 input file execution was the slowest one. The initialization time, as well as the termination time, for all executions is still insignificant, being less than 1 second.

Figure 9. Replicated database and input file IBF2b: elapsed and idle time on each node.

Figure 10. Execution time for input files IBF3, IBF4 and IBF5—Replicated database schema.

As one may wonder whether there is a possible skew in our strategy when the database size changes, we have submitted the IBF1 input file against the Ecoli.aa and nr databases, for 8, 16 and 24 nodes. We show in Table 4 the elapsed time for all three configurations. It can be noted that only for the Ecoli.aa database, and only when moving from 8 to 16 nodes, does the execution time decrease proportionally to the increase in the number of nodes. When the configuration changes from 16 to 24 processing nodes (a 50% increase in available resources), we get a 12.5% decrease in the execution time for the Ecoli.aa database, while for nr there was a 19.2% reduction.

Table 4. Execution time (in seconds)—Databases Ecoli.aa and nr—Replicated schema.

Configuration   Ecoli.aa   nr
8 nodes         16         3,574
16 nodes        8          1,769
24 nodes        7          1,428
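As a quick check of the figures quoted above, the relative reductions can be recomputed directly from Table 4 (a throwaway snippet with the table's values hard-coded):

```python
def reduction(t_before, t_after):
    """Percentage decrease in execution time."""
    return 100.0 * (t_before - t_after) / t_before

ecoli = {8: 16, 16: 8, 24: 7}         # elapsed times from Table 4 (seconds)
nr    = {8: 3574, 16: 1769, 24: 1428}

print(round(reduction(ecoli[16], ecoli[24]), 1))  # 12.5 % when going from 16 to 24 nodes
print(round(reduction(nr[16], nr[24]), 1))        # 19.3 % (the 19.2 % above, up to rounding)
```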

4.2. Fragmented database strategy

In the fragmented database approach, an important issue is the fragment generation policy. We have tried basically 2 methods. In both methods we have adopted criteria that lead to fragments with approximately the same number of sequences.

4.3. Fragment generation process

The first method, which we call FGM1 (Fragment Generation Method 1), generates fragments in such a way that the smallest sequences are in the first fragment, the next smallest sequences are in the second fragment, and so on, so that sequences of about the same length end up in the same fragment. This strategy was applied to the SwissProt database to generate 8 fragments. Indeed, the first 7 fragments had 12,528 sequences and the last one got 12,529 sequences. The maximum sequence length and the total number of characters at each fragment are given in Table 5. As 6 fragments contain only sequences that are shorter than 500 characters, about 75% of the database consists of sequences with up to 500 characters. The fragment sizes vary from about 1.7 MB (first fragment) to almost 14 MB for the eighth fragment.

Table 5. Sequence distribution into fragments: FGM1.

Node   Longest sequence in fragment   Total number of characters in fragment
1      101                            765,628
2      154                            1,604,122
3      220                            2,346,554
4      293                            3,195,608
5      370                            4,152,980
6      464                            5,180,067
7      628                            6,645,145
8      6,486                          12,598,338

Table 6. Sequence distribution into fragments: FGM2.

Node   Longest sequence in fragment   Total number of characters in fragment
1      6,486                          4,564,274
2      5,217                          4,558,389
3      5,255                          4,559,206
4      5,263                          4,559,815
5      5,327                          4,560,442
6      5,376                          4,561,099
7      5,430                          4,561,785
8      6,359                          4,563,432

The second method, called FGM2 (Fragment Generation Method 2), distributes the sequences over the fragments with a round-robin strategy applied to the sequence set already ordered by size. Now, not only is the number of sequences about the same at each fragment but the total number of characters at each fragment is also similar. As we did for the previous method, we applied it to the SwissProt database in order to generate 8 fragments of homogeneous overall size (in this case the first fragment had 12,529 sequences and all the others 12,528). As before, we show in Table 6 the maximum sequence length and the total number of characters at each fragment. Each fragment has approximately 5.7 MB.

We have, therefore, tried out our environment for a fragmented database and BLASTP execution. In the first experiment, with the same IBF1 file used in the replicated situation, we used fragments generated by FGM1. The results obtained for both elapsed and idle time at each node are given in figure 11. Note the low execution time obtained for node 1 and the high execution time for node 8; for node 1, specifically, the idle time was about 1,000% of the execution time.

Figure 11. Fragmented strategy, IBF1 and fragments generated with FGM1: elapsed and idle time on each node.

Next, an analogous test was run, using the FGM2 method to generate the fragments assigned to each site. The practical results are presented in figure 12. As can be observed, very effective load balancing is obtained. The total elapsed time was approximately 35% of the elapsed time obtained with the fragments generated by the FGM1 method. Due to the good results obtained with FGM2 for both workload balancing and elapsed execution time, this was the fragment generation method chosen for the experiments that followed.

Figure 12. Fragmented strategy, IBF1 and fragments generated with FGM2: elapsed and idle time on each node.
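The two generation methods can be sketched as follows, assuming the database is available as a list of (identification, sequence) records (for instance as produced by the parser sketched in Section 3.1); the helper names are ours and the tie-breaking details may differ slightly from the fragments reported in Tables 5 and 6.

```python
def fgm1(records, n_fragments):
    """FGM1: sort by sequence length and place consecutive runs of similar-length
    sequences in the same fragment (fragments get very uneven character counts)."""
    ordered = sorted(records, key=lambda r: len(r[1]))
    size, extra = divmod(len(ordered), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        end = start + size + (1 if i < extra else 0)
        fragments.append(ordered[start:end])
        start = end
    return fragments

def fgm2(records, n_fragments):
    """FGM2: deal the length-sorted sequences out round-robin, so every fragment
    gets a similar number of sequences and a similar number of characters."""
    ordered = sorted(records, key=lambda r: len(r[1]))
    fragments = [[] for _ in range(n_fragments)]
    for i, rec in enumerate(ordered):
        fragments[i % n_fragments].append(rec)
    return fragments

def characters(fragment):
    """Total number of sequence characters in a fragment."""
    return sum(len(seq) for _, seq in fragment)
```

Applied to SwissProt, FGM1 yields the very uneven character counts of Table 5, while FGM2 yields fragments that all hold close to 4.56 million characters, as in Table 6.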

4.4. Performance evaluation

We have tested many different configurations in order to evaluate the speedup for the fragmented database strategy. Initially we considered the IBF1 file and the SwissProt database, as was done in the replicated case. In figure 13 we show the elapsed time for these configurations. It can be noted that the performance improvements obtained were proportionally greater than the increase in slave nodes. That is, in some situations we have observed a superlinear speedup. It is interesting to note that for the fragmented database we obtained good workload balancing, while the replicated database strategy resulted in an uneven workload for the same configuration.
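Speedup and efficiency are computed here in the usual way. The sketch below uses the serial and 30-node replicated times quoted in Section 4.1 (roughly 4,020 and 140 seconds) purely to illustrate the computation; an efficiency above 1 is what we refer to as superlinear speedup.

```python
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_nodes):
    return speedup(t_serial, t_parallel) / n_nodes

# Replicated model, IBF1 on SwissProt (Section 4.1):
# ~1 h 07 min serial, ~2 min 20 s with 30 slave nodes.
t_serial, t_30 = 67 * 60, 2 * 60 + 20
print(round(speedup(t_serial, t_30), 1))        # ~28.7, close to linear
print(round(efficiency(t_serial, t_30, 30), 2)) # ~0.96; above 1 would be superlinear
```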

Figure 13. Speedup for the fragmented strategy, IBF1 and SwissProt (elapsed time vs. number of nodes).

We have also examined the IBF2 and IBF2b files with the fragmented database strategy. In the first case the execution time was 401 seconds, while for the latter it was a little higher, 409 seconds.

In order to verify the architecture's behavior when the number of input sequences increases, we have further explored the IBF3, IBF4 and IBF5 files with the 8-node configuration. These files are the same ones already detailed in the previous subsection, and we have kept SwissProt as the database. We present in figure 14 the practical results obtained. Again, as one could have expected, the execution times for files with more sequences were greater than those for files with fewer sequences.

Figure 14. Speedup for the fragmented strategy, input files IBF3, IBF4 and IBF5, SwissProt database.

Table 7. Execution time (in seconds)—Databases Ecoli.aa and nr—Fragmented schema.

Configuration   Ecoli.aa   nr
8 nodes         19         3,278
16 nodes        12         1,604
24 nodes        10         1,058

However, this time we could observe an increase in the initialization time, which was approximately 2 seconds for the IBF5 file. The termination time may still be ignored.

Once again, we have repeated some experiments already done with the replicated database strategy. Indeed, we have submitted IBF1 to be processed by BLAST with different databases—Ecoli.aa and nr—and distinct environment configurations. The results obtained are given in Table 7. It should be noted that for each database the behavior was quite different as the number of nodes increased. For Ecoli.aa the speedup is less than linear, but for nr the opposite occurred, with a performance increase that is proportionally greater than the number of nodes added to the configuration. Furthermore, for Ecoli.aa and all configurations, the master node's total execution time was higher than the times observed for the slave nodes, which shows the overhead due to the final result pooling and assembly. Finally, it can also be observed that, as happened when the database was replicated, there is an execution time skew between processing nodes. However, this skew is smaller for the fragmented database strategy than the values obtained for the replicated database schema.

5. Discussion

In this section we analyse and discuss the main aspects related to the results presented in the previous section. We first evaluate the speedup results for both the replicated and the fragmented strategies. In figure 15 we present execution times for the IBF1 input file compared against the SwissProt database for different configurations. We can observe that the fragmented database strategy's performance was always better than that of the replicated database strategy. In particular, for the replicated schema there is no performance gain when moving from 20 to 24 nodes.

This is due to intrinsic characteristics of a replicated database strategy. Indeed, when an input sequence is assigned to a node, the entire sequence is copied to the node. When the number of sequences is not an exact multiple of the number of nodes, some nodes receive one more sequence than others. The execution time for a group of sequences is, at least, the highest execution time among all slave nodes. When the number of sequences assigned to the processing nodes is not the same, the longest time will occur at one of the nodes that received the most sequences, assuming that execution times are not skewed by the similarity between some input sequences and some sequences in the database.

Figure 15. Elapsed time for different configurations (2 to 30 nodes)—Replicated and fragmented database schemes.

Therefore, with the IBF1 input file (80 sequences), for a 20-node configuration 4 sequences are allocated to each node. When we have 24 nodes, 16 nodes get 3 sequences and 8 nodes receive 4 sequences. As the maximum number of sequences at any node is the same for both configurations (80 sequences spread over either 20 or 24 nodes still puts 4 sequences on the most loaded node), we obtain similar execution times. To confirm our expectations, we tried out other configurations, with 9, 10, 11, 13, 15, 26 and 27 nodes, and other thresholds appeared. The position and size of each threshold is directly dependent on the relation between the number of input sequences and the number of processing nodes. In some cases, as in the current one, a threshold implies that a larger increase in the number of nodes is needed before performance improves.

For the fragmented database strategy we have obtained a linear speedup. This situation will persist for our implementation as long as the mean execution time of a BLAST process is greater than the communication costs related to the messages received by the master node from each slave and to the generation, by the master node, of each sequence's final results. This means that few nodes are needed for small databases (e.g., the Ecoli.aa database, as will be seen next) and a large number of nodes for large data sources. Otherwise, the generation of each sequence's final results at the master node will dominate the total execution time.

It must be noted that the number of sequences allocated to a given node is not the only factor that may influence the performance. The degree of similarity between the sequences in the input file and those belonging to the database may also considerably change the execution time of a BLAST search. This is due to intrinsic characteristics of the BLAST algorithm, which is divided into two phases: the first one searches for small regions (called words) with good correspondence between the database and the input sequence; then, the algorithm tries to extend these regions. If in the first step the number of corresponding regions is large, the BLAST execution will take longer.
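The two phases can be illustrated with a toy version of the heuristic: exact word (k-mer) hits between the query and a database sequence are located first, and each hit is then extended in both directions while the running score keeps improving. This is only a didactic sketch with assumed parameters (word length 3, +1/-1 scoring, no X-drop, no gapped extension, no statistics), not the actual BLAST implementation.

```python
def word_hits(query, subject, w=3):
    """Phase 1: positions of exact w-letter matches (words) shared by both."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1):
    """Phase 2: ungapped extension of a word hit while the score improves."""
    best = w * match
    qs, qe = i, i + w
    # extend to the right
    cur, qi, sj = best, i + w, j + w
    while qi < len(query) and sj < len(subject):
        cur += match if query[qi] == subject[sj] else mismatch
        qi, sj = qi + 1, sj + 1
        if cur > best:
            best, qe = cur, qi
    # extend to the left
    cur, qi, sj = best, i - 1, j - 1
    while qi >= 0 and sj >= 0:
        cur += match if query[qi] == subject[sj] else mismatch
        if cur > best:
            best, qs = cur, qi
        qi, sj = qi - 1, sj - 1
    return best, query[qs:qe]

query, subject = "MSEISRQEFQRRR", "AAMSEISRQAFQRRRAA"
best = max(extend_hit(query, subject, i, j) for i, j in word_hits(query, subject))
print(best)   # (score, best locally extended segment)
```

The more word hits phase 1 produces—which is exactly what happens when query and database sequences are similar—the more extensions phase 2 has to run, which is why similarity skews execution times.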


We can see in figure 7 that, although the same number of sequences is assigned to the available nodes and the sequences have the same length, the execution times were very different. This is not a general rule and the values depend on the degree of similarity. Thus, we have a situation that calls for a good dynamic load balancing mechanism if one wants to make full use of the parallel environment. Another important factor for system performance, especially for the replicated database strategy, is the length of the input sequences and their allocation among the processing nodes. Indeed, the length of a sequence may affect BLAST processing in three ways: (i) more time is needed to search for similar words in the sequence and in the database, (ii) the extension of words also takes more time, even for those already found in smaller sequences, and (iii) a larger number of words is probably determined and, consequently, a larger number of extensions is executed. The results presented for the IBF2 and IBF2b files illustrate well how the replicated database strategy is influenced by the division of tasks with sequences of different sizes. We can see in figure 16 that the overall execution time for IBF2b was about 40% lower than for IBF2. The only difference between the tests with IBF2 and IBF2b was the order of the sequences in the input files. We adopted a round-robin distribution strategy, so the sequence ordering in each file generated distinct task sets at each node. For the IBF2 input file, the smallest sequences were allocated to nodes 1 and 5, the largest to nodes 4 and 8, and the mid-sized sequences were assigned to all other available nodes. Therefore, nodes 1 and 5 executed quickly while nodes 4 and 8 were very slow, which explains the uneven workload across the nodes. For the IBF2b file, the sequence ordering, together with the round-robin strategy, assigned tasks with sequences of different sizes to each processing node. Thus, we obtained a shorter total execution time, and the workload was not as unbalanced as before.

Figure 16. Elapsed time for each node in an 8-node replicated database scheme—Input files IBF2 and IBF2b (y-axis: elapsed time in seconds; x-axis: nodes 1 to 8; series: IBF2, IBF2b).
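As an illustration of the ordering effect just discussed, the following sketch (with made-up sequence lengths, since the actual contents of IBF2 and IBF2b are not reproduced here) distributes the same set of sequences round-robin over 8 nodes under two different orderings and reports the total sequence length, a rough proxy for BLAST work, assigned to each node.

    def round_robin(lengths, n_nodes):
        """Total sequence length (a rough proxy for BLAST work) per node
        when sequences are handed out in round-robin order."""
        totals = [0] * n_nodes
        for i, length in enumerate(lengths):
            totals[i % n_nodes] += length
        return totals

    # Hypothetical lengths: 4 long, 10 mid-sized and 2 short sequences.
    ibf2_like  = [100, 500, 500, 2000, 100, 500, 500, 2000,
                  500, 500, 500, 2000, 500, 500, 500, 2000]
    # Same sequences, different order in the input file.
    ibf2b_like = [2000, 500, 100, 500, 500, 500, 2000, 500,
                  500, 2000, 500, 100, 500, 500, 500, 2000]

    for name, lengths in (("IBF2-like", ibf2_like), ("IBF2b-like", ibf2b_like)):
        per_node = round_robin(lengths, 8)
        print(f"{name:10s} per-node totals {per_node}  max {max(per_node)}")

    # The first ordering piles the long sequences onto nodes 4 and 8, so the
    # slowest node (and hence the elapsed time) is much worse than with the
    # second ordering, even though the sequences themselves are identical.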


Figure 17. Fragmented scheme: Elapsed time for input file IBF1 and fragment generation with FGM1 and FGM2 (y-axis: elapsed time in seconds; x-axis: master node and nodes 1 to 8; series: FGM1, FGM2).

It would be too demanding to expect the user to organize the input files in such a way that good performance is achieved. This should be a master node task, together with the distribution of sequences among the processing nodes. This issue is further discussed in [7]. With respect to the fragmented database schema, the fragment generation method is a relevant parameter that must be considered. The results presented here for configurations that use our proposed fragment generation methods are good examples. In figure 17 we can see that the FGM1 method produced an uneven workload. The execution times for fragments generated with this method were almost 3 times greater than those obtained when FGM2 was used to generate the fragments, and the latter also presented good workload balancing. With the results for the IBF1, IBF3, IBF4 and IBF5 files and the 8-node configuration, we may infer how each strategy behaves as the input volume increases. Both strategies scaled linearly as the number of sequences increased, exhibiting close to uniform behavior. We must also evaluate the changes in the initialization time. For the replicated database strategy, increasing the number of input sequences has no further consequences for the initialization time. However, for the fragmented strategy we could observe some change. Even though it was not large, it at least confirms that this parameter is important for a fragmented database. This may be explained by communication costs. Indeed, for a set of m input sequences and n processing nodes, with the replicated strategy the master node sends m messages, while with the fragmented database strategy we have m × n messages. In our implementation, the messages are all sent in a first stage: only after receiving all of its messages does a slave node start its BLAST processing. If we want to reduce the effect of message passing on performance, the algorithm may be changed so that only a small number of messages is transmitted initially, with all the other messages being sent and received while the slave nodes execute their BLAST programs.
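As a rough illustration of this suggested change (our sketch, not the implementation described in this paper), the fragment below lets a master thread stream sequences to a slave through a queue so that BLAST processing can start before all messages have arrived; in the real system the in-process queue would be replaced by the cluster's message-passing layer, and the sleeps stand in for actual BLAST runs and network transfers.

    import queue
    import threading
    import time

    def slave_worker(inbox: "queue.Queue[str]") -> None:
        """Consume sequences as they arrive; None signals end of stream."""
        while True:
            seq = inbox.get()
            if seq is None:
                break
            time.sleep(0.01)          # stand-in for running BLAST on `seq`
            print(f"processed {seq}")

    inbox: "queue.Queue[str]" = queue.Queue()
    worker = threading.Thread(target=slave_worker, args=(inbox,))
    worker.start()

    # Master side: send sequences one by one; processing overlaps with sending.
    for i in range(5):
        inbox.put(f"sequence-{i}")
        time.sleep(0.005)             # stand-in for message transmission time
    inbox.put(None)                   # end-of-stream marker
    worker.join()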


Figure 18. Elapsed time for different databases—Replicated and fragmented database schemes—Input file IBF1 (y-axis: elapsed time in seconds, logarithmic scale; x-axis: 8, 16 and 24 nodes; series: replicated and fragmented schemes for the Ecoli.aa, SwissProt and nr databases).

Finally, we would like to discuss what happens when the database size changes. The tests with the Ecoli.aa and nr databases, together with the SwissProt database, show that this is also (as expected) an important parameter. Indeed, figure 18 presents the execution times for the IBF1 input file, the 3 databases and the 8-, 16- and 24-node configurations, with both the replicated and the fragmented database strategies. We can observe that the curves behave in a very similar way, showing that both strategies react similarly to configuration changes. For the SwissProt and nr databases, the fragmented schema presented better results. However, for the smaller Ecoli.aa database, better results were obtained with the replicated database strategy. This is due mainly to the costs of communication and output result assembly, which become more relevant for the fragmented schema.
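A back-of-the-envelope calculation (ours, with made-up constants that are not fitted to the measurements) helps to see why these fixed costs matter mostly for small databases: the per-sequence communication and assembly overhead of the fragmented scheme stays roughly constant, while the search component grows with the database size, so the overhead dominates only when the database is small.

    # Share of the fragmented scheme's elapsed time spent on communication and
    # result assembly, under assumed (illustrative) per-sequence costs.
    N_NODES = 8
    M_SEQ = 80                           # sequences in an IBF1-like input file
    OVERHEAD_PER_SEQ = N_NODES * 0.1     # messages + partial-result merging (s), assumed

    # Assumed per-sequence search times against the full database (seconds).
    for db, full_db_search in (("Ecoli.aa-like", 0.2),
                               ("SwissProt-like", 10.0),
                               ("nr-like", 60.0)):
        search = M_SEQ * full_db_search / N_NODES    # search work split over n fragments
        overhead = M_SEQ * OVERHEAD_PER_SEQ
        share = overhead / (search + overhead)
        print(f"{db:15s} overhead is {share:6.1%} of the fragmented elapsed time")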

6. Final comments

In this paper, we have addressed sequence comparison, one of the basic operations every molecular biology database must support. We discussed the problem in the context of the BLAST algorithm. BLAST performs an exhaustive search of the database to find the best matches, and there is no obvious (if any) access method that helps speed it up. We focused, therefore, on parallel strategies that optimize simultaneous BLAST executions. A few database allocation models were described and implemented. The practical results obtained have helped us evaluate the main parameters that must be taken into account. Whenever possible, these should be tuned in order to obtain the best running time and throughput in this biocomputing context. The main contributions of this work are: (i) a detailed discussion of the distributed database design issues that are important for a genomic database application, (ii) an actual implementation on a popular parallel environment, a workstation cluster, and (iii) an analysis of the parameters that have been shown to be relevant in practice.


As side results, we can cite an initial discussion of the load balancing and data skew problem and also a preliminary survey of existing parallel BLAST strategies (and of some other parallel (bio)sequence comparison strategies), including academic and industrial solutions. We will carry out further experiments to gain better insight into solutions for this problem. We want to obtain further results for skewed situations and to continue our studies on load balancing methods (as initially done in [7]). Also, we have not so far been primarily concerned with measuring high-throughput performance; this is a straightforward next step. This work is part of a research effort that aims to evaluate database techniques that can contribute to this bioinformatics area. We have already been working with data source and application integration through object-oriented frameworks and also with ad-hoc buffer management policies for sequence comparison algorithms, such as BLAST. We believe that genomic databases have become a new database research area in their own right, much like spatial, temporal and text databases in the past, with specific algorithms, storage structures, services and functionalities that still have to be developed.

Notes
1. September 2002, in http://www3.ebi.ac.uk/Services/DBStats/.
2. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.
3. http://genomic.sanger.ac.uk/.
4. http://blast.wustl.edu/.
5. http://ftp.ncbi.nlm.nih.gov/blast/db/.

References

1. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, "A basic local alignment search tool," Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.
2. S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, "Gapped blast and psi-blast: A new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
3. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, B.A. Rapp, and D.L. Wheeler, "Genbank," Nucleic Acids Research, vol. 28, no. 1, pp. 15–18, 2000.
4. R.C. Braun, K.T. Pedretti, T.L. Casavant, T.E. Scheetz, C.L. Birkett, and C.A. Roberts, "Three complementary approaches to parallelization of local BLAST service on workstation clusters," in Proceedings of the 5th International Conference on Parallel Computing Technologies (PaCT), Lecture Notes in Computer Science (LNCS), vol. 1662, 1999, pp. 271–282.
5. N. Camp, H. Cofer, and R. Gomperts, "High-throughput BLAST," SGI white paper, available at http://www.sgi.com, September 1998.
6. E.H-h. Chi, E. Shoop, J. Carlis, E. Retzel, and J. Riedl, "Efficiency of shared-memory multiprocessors for a genetic sequence similarity search algorithm," Technical Report TR97-05, University of Minnesota, Minneapolis, CS Dept, January 1997.
7. R.L.C. Costa, "Data allocation and load distribution for the parallel execution of BLAST," MSc Dissertation, Departamento de Informática, PUC-Rio (in Portuguese), March 2002.
8. R.F. Doolittle (Ed.), Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences (Methods in Enzymology), Academic Press: New York, p. 183, 1990.
9. R. Hughey, "Parallel hardware for sequence comparison and alignment," CABIOS, vol. 12, no. 6, pp. 473–479, 1996.


10. E. Hunt, M.P. Atkinson, and R.W. Irving, "A database index to large biological sequences," in Proceedings of the International Conference on Very Large Data Bases (VLDB), 2001, pp. 139–148.
11. A.K. Iyengar, "Parallel characteristics of sequence alignment algorithms," in Proceedings of the ACM International Conference on Supercomputing, 1989, pp. 304–313.
12. A. Julich, "Implementations of BLAST for parallel computers," Bioinformatics, vol. 11, pp. 3–6, 1995.
13. M. Lemos and S. Lifschitz, "Buffer management for BLAST biological sequence comparison," Technical Report MCC18/02, Departamento de Informática, PUC-Rio, 2002.
14. S. Letovsky (Ed.), Bioinformatics: Databases and Systems, Kluwer: Dordrecht, 1999.
15. D.J. Lipman and W.R. Pearson, "Rapid and sensitive protein similarity search," Science, vol. 227, pp. 1435–1441, 1985.
16. W.S. Martins, J.B. del Cuvillo, W. Cui, and G.R. Gao, "Whole genome alignment using a multithreaded parallel implementation," in Proceedings of the Symposium on Computer Architecture and High Performance Computing (SBAC), 2001, pp. 1–8.
17. W.S. Martins, J.B. del Cuvillo, F.J. Useche, K.B. Theobald, and G.R. Gao, "A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison," in Proceedings of the Pacific Symposium on Biocomputing, 2001, pp. 311–322.
18. J. Meidanis and J.C. Setúbal, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997.
19. P.L. Miller, P.M. Nadkarni, and W.R. Pearson, "Comparing machine-independent versus machine-specific parallelization of a software platform for biological sequence comparison," Bioinformatics, vol. 8, pp. 167–175, 1992.
20. P.L. Miller, P.M. Nadkarni, and N.M. Carriero, "Parallel computation and FASTA: Confronting the problem of parallel database search for a fast sequence comparison algorithm," Bioinformatics, vol. 7, pp. 71–78, 1991.
21. S.B. Needleman and C.D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two sequences," Journal of Molecular Biology, vol. 48, pp. 443–453, 1970.
22. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, Prentice-Hall: Englewood Cliffs, NJ, 1999.
23. A. Pappas, "Parallelizing the blast applications on a network of Dec Alpha workstations," Internal Project Report, available at http://www.cslab.ece.ntua.gr/˜pappas/.
24. N.W. Patton and C.A. Goble, "Information management for genome level bioinformatics," Tutorial Notes, International Conference on Very Large Data Bases (VLDB), pp. 213–295, 2001.
25. N.W. Patton, S.A. Khan, A. Hayes, F. Moussoni, A. Brass, K. Eilbeck, C.A. Goble, S.J. Hubbard, and S.G. Oliver, "Conceptual modeling of genomic information," Bioinformatics, vol. 16, no. 6, pp. 548–557, 2000.
26. W.R. Pearson, "Rapid and sensitive sequence comparison with FASTP and FASTA," in Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences (Methods in Enzymology), R.F. Doolittle (Ed.), Academic Press: New York, pp. 63–98, 1990.
27. W.R. Pearson and D.J. Lipman, "Improved tools for biological sequence comparison," Proceedings of the National Academy of Sciences of the USA, vol. 85, pp. 2444–2448, 1988.
28. T. Rognes, "ParAlign: A parallel sequence alignment algorithm for rapid and sensitive database searches," Nucleic Acids Research, vol. 29, no. 7, pp. 1647–1652, 2001.
29. L.F.B. Seibel and S. Lifschitz, "A genome databases framework," in Proceedings of the 12th Conference on Database and Expert Systems Applications (DEXA), 2001, pp. 319–329.
30. T.F. Smith and M.S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.
31. R. Stevens, C. Goble, P. Baker, and A. Brass, "A classification of tasks in bioinformatics," Bioinformatics, vol. 17, no. 2, pp. 180–188, 2001.
32. Tera-BLAST, TimeLogic Inc., available at http://www.timelogic.com.
33. O. Trelles-Salazar, E.L. Zapata, and J.M. Carazo, "On an efficient parallelization of exhaustive sequence comparison algorithms on message passing architectures," Bioinformatics, vol. 10, pp. 509–511, 1994.
34. TurboBLAST, Turbo Genomics Inc., available at http://turbogenomics.com.
35. T.K. Yap, O. Frieder, and R.L. Martino, "Parallel computation in biological sequence analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 3, pp. 283–294, 1998.
