PARALLEL MOTIF SEARCH USING PARSEQ*

Jun Qin, Simon Pinkenburg and Wolfgang Rosenstiel
Wilhelm-Schickard-Institut für Informatik, Department of Computer Engineering
University of Tübingen, Sand 13, 72076 Tübingen, Germany
email: {qinj,pinkenbu,rosen}@informatik.uni-tuebingen.de

ABSTRACT
Modern methods in molecular biology produce a tremendous amount of data. As a consequence, efficient methods have to be developed to retrieve and analyze these data. In this article, a parallel motif search algorithm based on ParSeq, a software tool for motif search, is put forward, and its performance is analyzed and discussed. Numerous experiments show that this parallel algorithm achieves a considerable improvement over the sequential motif search algorithm.

KEY WORDS
Motif Search, Parallel, Sequence, Load Balancing, Bioinformatics
1 Introduction

ParSeq [1, 2] is a bioinformatics tool for searching motifs with structural and biochemical properties in DNA or protein sequences. It combines the search for motifs with certain structural properties, the verification of biochemical properties, and an approximate search mechanism. ParSeq uses extended regular expressions as its query language, for example a query of the form

@(regex2)/func1(a1, ..., an), ..., funck(b1, ..., bm)@

where the biochemical constraints "func1(a1, ..., an), ..., funck(b1, ..., bm)" add biological restrictions on regex2, and each biochemical function corresponds to a test of one biochemical constraint. The search for approximate patterns can be considered a special case of the search for motifs with biochemical constraints, where only either the edit distance (ed) or the Hamming distance (hd) may appear as a biochemical function; both functions take the number of allowed errors as their argument. The search process can be sketched as follows. First, the query is processed by extracting the biochemical constraints, leaving a pure regular expression. Second, matching hits for the remaining pure regular expression are searched in the sequence database by calling the standard regular expression libraries. In the final step, all obtained hits are tested for validity with respect to the biochemical constraints. ParSeq also supports incremental search by keeping track of previous searches and allowing the user to take any of them as the basis for further searches; the history of searches is represented graphically as a tree in the GUI.

Modern methods in molecular biology, on the other hand, produce a tremendous amount of data. In the past few years, the number of molecular biology databases has grown constantly; one might estimate their number between 500 and 1,000. Well-known examples are GenBank [3], EMBL [4] and DDBJ [5]. Most molecular biology databases are very large, e.g. GenBank contains more than 4 × 10^16 nucleotide sequences with about 3 × 10^12 occurrences of nucleotides. Furthermore, they grow exponentially. With the continuous increase of these biological data, the execution time grows remarkably when searching against a large data set with ParSeq. To overcome this lack of performance, a parallel motif search algorithm was developed based on the sequential search algorithm in ParSeq.

This paper is organized as follows. In section 2, the architecture of ParSeq is introduced; section 3 discusses some problems in implementing the parallel motif search algorithm. The implementation is presented in section 4. The achieved performance and the comparison between the parallel and the sequential algorithm are discussed in section 5. Finally, based on the discussions in the previous sections, conclusions are drawn and directions for future work are outlined in the last section.

* This work was supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg (Az: 23-7532.24-3-18/6, Sequenzanalyse-Algorithmen auf hochparallelen PC-Cluster).
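The third step above treats approximate matching as a constraint check on candidate hits. The following C++ sketch illustrates the hd(k) case for a plain-string pattern: a sliding window reports every position where the pattern matches with at most k mismatches. The function names and the windowing approach are illustrative assumptions, not ParSeq's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hamming distance between a candidate window and the pattern
// (both strings are assumed to have equal length).
static std::size_t hammingDistance(const std::string& a, const std::string& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Approximate search as a special case of a biochemical constraint:
// report all start positions where the pattern matches the sequence
// with at most maxErrors mismatches (the hd(k) constraint).
std::vector<std::size_t> searchWithHd(const std::string& sequence,
                                      const std::string& pattern,
                                      std::size_t maxErrors) {
    std::vector<std::size_t> hits;
    if (pattern.empty() || sequence.size() < pattern.size()) return hits;
    for (std::size_t i = 0; i + pattern.size() <= sequence.size(); ++i) {
        if (hammingDistance(sequence.substr(i, pattern.size()), pattern) <= maxErrors)
            hits.push_back(i);
    }
    return hits;
}
```

In the real tool the candidate hits come from the regular expression engine first; the constraint test then only filters them, which is cheaper than scanning every window.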
2 Architecture of ParSeq

Figure 1 shows the architecture of ParSeq. The client, the proxy server and the parallel application are the three layers of the architecture. The communication protocol between each pair of successive layers is XML-RPC [6]. We chose XML-RPC as our remote procedure call protocol because (1) XML-RPC is simple and easy to use for transferring search request and search response objects in ParSeq, since they contain only basic data types, (2) XML-RPC is an extremely lightweight mechanism (compared with alternatives like CORBA or SOAP) for invoking remote services and exchanging data in a platform-, language- and vendor-neutral manner, and (3) XML-RPC is secure because it uses HTTP as its transport, which already offers basic authentication, SSL/TLS and cookie authentication. Furthermore, XML-RPC over HTTP avoids firewall problems.

Our parallel application is developed on the basis of TPO++ [7], an object-oriented message-passing library implemented in C++ on top of the well-known message passing standard MPI. The key features of TPO++ are easy transmission of objects, type safety, MPI conformity and integration of the C++ Standard Template Library (STL).

The sequence database in ParSeq is stored in flat files in FASTA format, the most widely used sequence file format. Each processor is able to access these database files. The database can be updated by importing sequence files from local disks or from GenBank through the client GUI of ParSeq. The sequence import procedure in ParSeq consists of three activities: (1) parse the sequence files to obtain information about the contained sequences, such as sequence names, sequence start/stop positions and sequence lengths, and save it into SequenceInfo objects, (2) copy or upload the sequence files to the database (the database directory), and (3) save the sequence information into a metadata database.
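Activity (1) of the import procedure can be sketched as a single pass over a FASTA file. The SequenceInfo field names and the parsing details below are illustrative assumptions, not ParSeq's actual class layout; the sketch also assumes every line ends with a single newline character.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for ParSeq's SequenceInfo metadata record.
struct SequenceInfo {
    std::string name;
    std::size_t start;   // byte offset of the first residue in the file
    std::size_t length;  // number of residues
};

// Parse a FASTA-formatted stream and collect per-sequence metadata.
std::vector<SequenceInfo> parseFasta(std::istream& in) {
    std::vector<SequenceInfo> infos;
    std::string line;
    std::size_t offset = 0;
    while (std::getline(in, line)) {
        std::size_t consumed = line.size() + 1;    // +1 for the stripped '\n'
        if (!line.empty() && line[0] == '>') {     // header starts a new sequence
            SequenceInfo info;
            info.name = line.substr(1);
            info.start = offset + consumed;        // residues begin on the next line
            info.length = 0;
            infos.push_back(info);
        } else if (!infos.empty()) {
            infos.back().length += line.size();    // accumulate residue count
        }
        offset += consumed;
    }
    return infos;
}
```

Recording start positions and lengths at import time is what later allows the parallel search to partition the data without re-reading whole files.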
Figure 1. ParSeq three-layer architecture

3 Problems

Several problems have to be considered in the implementation of the parallel motif search algorithm. Two of them affect the performance of the application remarkably: the I/O effects and the load balancing strategy. The search overlap is a third problem, which influences the completeness of the results.

3.1 I/O Effects

Due to the tremendous amount of data produced, bioinformatics applications are not only computing intensive but also data intensive. These applications usually have large I/O requirements, in terms of both the size of the data and the number of files or data sets. Thus, the I/O part of an application has to be parallelized to reach a high efficiency and a good scalability of the parallel implementation. Collective data access routines essential for efficient parallel I/O are provided by MPI-IO [8], the I/O part of the widely accepted message passing standard MPI-2 [9]. Nevertheless, MPI-IO lacks integration with real object-oriented concepts, making the interface unsuitable for our efforts in providing an object-oriented implementation. To close the gap between MPI and object-oriented concepts, our group recently extended TPO++ by an object-oriented parallel I/O interface [10], which is set up on top of MPI-IO. It provides the same functionality in an MPI-conform way and easily enables reading and writing objects. In the parallel motif search algorithm proposed in this paper, we partition both the I/O tasks and the search tasks into n subtasks and distribute one subtask to each processor, where n is the number of processors involved in the parallel motif search.

3.2 Load Balancing

In order to present the load balancing strategy, we make the following assumptions: the search space consists of p sequence files f[0], f[1], ..., f[p-1]. A sequence file f[i] consists of q_i sequences s[i][0], s[i][1], ..., s[i][q_i - 1]. L(f[i]) denotes the total length of all sequences in file f[i], and L(s[i][j]) is the length of sequence s[i][j]. The number of processors involved in the parallel computation is n. The search space Ω and its total length L(Ω) are then:

Ω = { f[0], ..., f[p-1] }
  = { s[0][0], ..., s[0][q_0 - 1], ..., s[p-1][0], ..., s[p-1][q_{p-1} - 1] }

L(Ω) = Σ_{i=0}^{p-1} L(f[i]) = Σ_{i=0}^{p-1} Σ_{j=0}^{q_i - 1} L(s[i][j])
Considering the structure of the search space, several solutions are available for partitioning it:

1. File level: the search space Ω is partitioned into n sub partitions based on the number of sequence files. Each sub partition consists of approximately p/n files.

2. Sequence level: the search space Ω is partitioned into n sub partitions based on the number of sequences in all sequence files. Each sub partition consists of approximately (Σ_{i=0}^{p-1} q_i)/n sequences.

3. Sequence chunk level: each sequence in the search space Ω is partitioned into n sub partitions based on its length. In this case, all processors search against all sequences, and within one sequence each processor searches against a part of approximate length L(s[i][j])/n (0 ≤ i < p, 0 ≤ j < q_i).

4. Global chunk level: the search space Ω is partitioned into n sub partitions based on the total length (the number of characters) of all sequences in all sequence files. In this case, the search space Ω is viewed as one large string of length L(Ω). This string is partitioned into n pieces, and each processor simply searches from its start position to its stop position.

Obviously, solutions (1) and (2) have coarse granularity and a high load imbalance, since the lengths of the sequence files and of the sequences vary and may differ considerably from each other. The disadvantage of solution (3) is that every processor has to read all sequences, and all processors always read the same sequence in the same file at the same time to reach their start positions; the many I/O operations and the communication overhead would slow down the search process. We therefore chose solution (4) because of its lower load imbalance and fewer I/O operations: each processor only has to process its own partition, i.e. read the sequence data from its start position to its stop position and search the part it read. In order to partition the search space Ω correctly, we defined a search pointer triple in ParSeq, R = {f, s, p}, which indicates a start or stop search position: the position p in sequence no. s of file no. f. For example, for a search space consisting of 7 sequences in 2 sequence files, the search pointer {1, 2, 1203}^1 is labeled in Figure 2.
Figure 2. Example of a search pointer

As a consequence, the length of each sub partition L(Ω_sub_i) (0 ≤ i < n) can easily be expressed as:

L(Ω_sub_i) = L(Ω)/n   (0 ≤ i < n)

The sub partition for processor no. i is then:

Ω_sub_i = L(Ω) · i/n → L(Ω) · (i+1)/n
        = R_start_i → R_stop_i
        = {f_start_i, s_start_i, p_start_i} → {f_stop_i, s_stop_i, p_stop_i}   (0 ≤ i < n)

That means processor no. i processes the sequence data from position p_start_i of sequence s_start_i in file f_start_i to position p_stop_i of sequence s_stop_i in file f_stop_i.

^1 Supposing the position of the arrow in Figure 2 is just before character no. 1203 of sequence no. 2 in file no. 1.
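The mapping from a global character offset L(Ω)·i/n to a search pointer triple can be sketched as follows, assuming the per-sequence lengths are known from the metadata. The flat linear scan and the names below are illustrative, not ParSeq's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Search pointer triple R = {f, s, p}.
struct SearchPointer {
    std::size_t file;      // file number f
    std::size_t sequence;  // sequence number s within the file
    std::size_t position;  // character position p within the sequence
};

// lengths[i][j] = L(s[i][j]). Walk the sequences in order, subtracting
// each length until the remaining offset falls inside a sequence.
SearchPointer offsetToPointer(const std::vector<std::vector<std::size_t>>& lengths,
                              std::size_t offset) {
    for (std::size_t i = 0; i < lengths.size(); ++i)
        for (std::size_t j = 0; j < lengths[i].size(); ++j) {
            if (offset < lengths[i][j]) return {i, j, offset};
            offset -= lengths[i][j];
        }
    // offset == L(Omega): one past the end of the last sequence
    return {lengths.size(), 0, 0};
}

// Start pointer of processor i's sub partition: global offset L(Omega)*i/n.
SearchPointer partitionStart(const std::vector<std::vector<std::size_t>>& lengths,
                             std::size_t totalLength, std::size_t i, std::size_t n) {
    return offsetToPointer(lengths, totalLength * i / n);
}
```

Because the sequence lengths are stored in the metadata database at import time, this computation needs no file I/O at all.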
3.3 Overlapping Search

Another notable aspect is that when we partition the search space across multiple processors, some sequences may be divided between partitions, and as a result some motif occurrences may be broken apart. Therefore, there must be an overlap between each pair of successive partitions whenever the divider position between the two partitions falls in the middle of a sequence (see Figure 3). The same holds for the search within each processor, since some sequences may be too large to be loaded into memory as a whole; each processor then has to search sequence chunk by sequence chunk. To get all occurrences of the query motif in the whole search space Ω, the minimum length of this overlap is the maximum matching length of the current query (the extended regular expression). At the same time, redundancy has to be removed when merging all partial results into the final result, since identical hits may be returned within these overlaps by different processors.
Figure 3. The search overlap
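The chunked search with overlap described above can be sketched as follows. Here findExact() over a plain string stands in for the real regular-expression search, the overlap is taken as the pattern length (an upper bound on the maximum matching length for an exact pattern), and a set removes hits reported twice inside an overlap; all names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Stand-in for the regular-expression search: all exact occurrences.
static std::vector<std::size_t> findExact(const std::string& text,
                                          const std::string& pattern) {
    std::vector<std::size_t> hits;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1))
        hits.push_back(pos);
    return hits;
}

// Search a long sequence chunk by chunk. Each chunk is extended
// backwards by `overlap` characters so that matches spanning a chunk
// boundary are not lost; the std::set deduplicates hits found twice.
std::vector<std::size_t> chunkedSearch(const std::string& sequence,
                                       const std::string& pattern,
                                       std::size_t chunkSize) {
    std::size_t overlap = pattern.size();  // maximum matching length
    std::set<std::size_t> hits;
    for (std::size_t start = 0; start < sequence.size(); start += chunkSize) {
        std::size_t from = start > overlap ? start - overlap : 0;
        std::size_t to = std::min(start + chunkSize, sequence.size());
        for (std::size_t h : findExact(sequence.substr(from, to - from), pattern))
            hits.insert(from + h);  // convert back to global positions
    }
    return {hits.begin(), hits.end()};
}
```

The same dedup-on-merge idea applies across processors, where the master removes hits reported by two workers sharing an overlap.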
4 Implementation

As shown in the architecture of ParSeq (see Figure 1), the local search is implemented in Java and runs sequentially. We implemented the remote parallel search in C++. In the parallel implementation, three crucial aspects were addressed: (1) the inter-communication among the three involved instances, i.e. the client, the proxy server and the parallel application, (2) the intra-communication within the parallel application, and (3) the parallel motif search algorithm itself.

The communication protocol among the client, the proxy server and the parallel application is XML-RPC. In the client, we implemented a class called SearchRequest, which contains all the parameter information necessary to start a parallel search. Before being sent to the proxy server, the SearchRequest object is serialized into a Java Hashtable object, which can be sent directly via XML-RPC. The proxy server acts as an XML-RPC web server, where the parallel application can register its parallel search methods. When receiving a new search request from the client, the proxy server calls its own corresponding search method with the received search request as the parameter. This method is actually a wrapper function, which calls the registered parallel search method of the parallel application, transfers the SearchRequest object to it via XML-RPC, and returns the SearchResponse object received from the parallel application to the client.
The communication within the parallel application is based on our TPO++ library, which is capable of easily transmitting objects. When implementing the parallel search algorithm, we migrated the Java classes of the client to C++, reusing the design of the sequential search method and the class hierarchy as closely as possible. However, several classes had to be extended by two methods called serialize() and deserialize() to make them transmittable with TPO++. In the parallel application, the master processor communicates with the proxy server and represents the interface between Java and C++. When receiving a new search request from the proxy server, the master processor deserializes it into a new C++ SearchRequest object and distributes it to all other processors by calling the TPO++ function

CommWorld.bcast(SearchRequest, Rank(0));

The parallel algorithm implemented in the parallel application follows the master/worker principle. After receiving a new search request, the master processor (master) and all other processors (workers) start to search for the query motif in parallel. Within each worker, the search proceeds sequentially, one sequence chunk after another. As mentioned above, since we have the same class hierarchy in C++ as in Java, we can reuse the sequential search algorithm of the client within each worker. First, the sub partition Ω_sub is computed to obtain the search pointers R_start = {f_start, s_start, p_start} and R_stop = {f_stop, s_stop, p_stop}. Next, each worker iterates over all sequence chunks between R_start and R_stop and searches for occurrences of the query motif. The search overlap is taken into account by starting the search at the position obtained by subtracting the maximum matching length from the currently computed start position. When a worker finishes searching, it returns its results to the master by calling a similar TPO++ method to the one mentioned above.
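The serialize()/deserialize() pair added to the migrated classes can be sketched as below. The field set and the simple line-based wire format are illustrative assumptions, not the actual ParSeq SearchRequest layout or TPO++'s marshalling format.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Minimal sketch of a class made transmittable by adding a
// serialize()/deserialize() pair (hypothetical fields and format).
struct SearchRequest {
    std::string query;      // the extended regular expression
    std::size_t maxErrors;  // allowed errors for approximate search

    void serialize(std::ostream& out) const {
        out << query << '\n' << maxErrors << '\n';
    }
    void deserialize(std::istream& in) {
        std::getline(in, query);
        in >> maxErrors;
    }
};
```

In TPO++ such methods let the library transmit whole objects through MPI without the caller packing buffers by hand.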
After receiving all partial results, the master merges them and removes the redundant hits. Finally, the complete results are packed into an XML-RPC data package and returned to the client via the proxy server, using the same serialize/deserialize mechanism as mentioned above.
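The master's final merge-and-deduplicate step can be sketched as follows; hits are represented here simply as global positions, whereas the real implementation merges hit objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merge the partial hit lists returned by the workers and drop
// duplicate hits found inside the search overlaps.
std::vector<std::size_t> mergeResults(std::vector<std::vector<std::size_t>> parts) {
    std::vector<std::size_t> all;
    for (const auto& p : parts) all.insert(all.end(), p.begin(), p.end());
    std::sort(all.begin(), all.end());                       // bring duplicates together
    all.erase(std::unique(all.begin(), all.end()), all.end()); // remove them
    return all;
}
```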
5 Performance Results

Our experiments were conducted on the Kepler cluster [11], a self-made cluster supercomputer built from commodity hardware. It has two front end nodes, each a dual-processor system with two Pentium III processors at 733 MHz sharing 2 GB of memory. The computing nodes consist of two parts: an older diskless part with 96 nodes, each with two Pentium III processors at 650 MHz and 1 GB of memory (512 MB per processor), and a newer part with 32 nodes, each with two AMD Athlon processors at 1.667 GHz sharing 2 GB of memory and equipped with an 80 GB disk. The whole system has two interconnects: a Fast Ethernet used for booting the nodes, for administration purposes and as a storage area network (SAN), and a Myrinet network for the communication between parallel applications. Both have a multi-staged hierarchical switched topology organized as a fat tree. The Myrinet has a nominal bandwidth of 133 MB/s, which is the maximum PCI transfer rate; measurements give about 115 MB/s effective bandwidth and 7 µs latency. The nominal transfer rate of the Fast Ethernet is 12.5 MB/s, with about 11 MB/s effective due to TCP/IP overhead.

The parallel I/O architecture consists of the MPI-IO implementation ROMIO [12], which is set up on top of the file system using an abstract device interface for I/O (ADIO) [13]. The underlying parallel file system is PVFS [14], configured as follows: the 32 disks of the AMD nodes act as I/O nodes, and all 128 nodes are PVFS clients. PVFS uses the SAN for striping the data onto these disks. Both the sequence database and the metadata database of ParSeq are located on PVFS.

We tested the performance with a local sequential search and with remote parallel searches on 1, 2, 4, 8, 16 and 32 processors. The local sequential search ran on one of the two front end nodes, and the remote parallel searches ran on the computing nodes of the newer part. In the local sequential search, the execution time is the real time for reading and searching the sequences. In the remote parallel searches, the execution time additionally covers the communication and computing overhead. We evaluated the performance in the following two cases:

• In case I, the query motif is CC(A|T){6}GG, a consensus motif in the Serum Response Element (SRE) sequence, also called the CArG box. We search for this motif in a search space with large sequences but a small number of sequences (see Table 1).
• In case II, the query motif (denoted as an extended regular expression) is M(X{2,9})/chg(>,0)@(X{7,36})/hdp_kd(15,>,-0.9)@(L|V|I)(A|S|T|V|I)(G|A|S)C, which represents a probable lipoprotein signal sequence from a bacterial genome. We search for this motif in a search space with small sequences but a large number of sequences (see Table 1). Since the size of the real protein sequence file is only 2.1 MB, we increased the size of the search space by using 20 copies of this file.

The hit results presented in the following sections include all non-overlapping and overlapping hits that are significant with respect to the biochemical constraints. Finding the overlapping hits requires additional validation and thus additional time, which is why we cannot directly compare our results to those of other solutions.
Table 1. Two search spaces

Case I: files human 01.fa, human 02.fa, ..., human 22.fa, human X.fa, human Y.fa (all chromosomes of the human genome); number of sequences: 24 × 1; total size: 3121 MB.

Case II: files E coli orfs 01.txt, E coli orfs 02.txt, ..., E coli orfs 20.txt (20 copies of a protein sequence file: Escherichia coli Open Reading Frames, ORFs); number of sequences: 5379 × 20; total size: 42 MB (2.1 MB × 20).

Figure 5. Case I: Speedup for the remote parallel searches with 1, 2, 4, 8, 16 and 32 processors, with parallel I/O (experimental) and without parallel I/O (theoretical)
5.1 Results of Case I
In case I, we get 178321 hits in all 24 sequences. The execution time for the local sequential search is 2279.03 s, slightly more than that for the remote parallel search with 1 processor (1913.71 s). We attribute this to several factors. First, the remote search ran on a faster computing node, while the local search ran on a front end node. Second, the remote search method is implemented in C++, the local search method in Java. Third, the faster processor in the remote search more than compensates for the performance lost to communication and computing overhead. Another interesting result is that when we search only against the first chromosome, the execution time is 145.51 s for the local sequential search and 153.41 s for the remote parallel search with 1 processor. Here the remote search with the faster processor is slower because the very small search space reduces the proportion of matching operations in the whole search process, so the communication and computing overhead dominates the execution time.

Figure 4 and Figure 5 present the execution times and the speedup curves for the remote parallel searches. Note that the execution times plotted in Figure 4 are the average times over three tests; the same holds for Figure 6 in the next section. As the experimental results show, we get a reasonably good speedup up to 32 processors: for the remote parallel searches with 8, 16 and 32 processors, the speedups are 6.47, 11.23 and 16.99 respectively, and the efficiency on 32 processors is 53.08%. To facilitate comparison, a theoretical speedup curve for the case without parallel I/O, based on Amdahl's law, is drawn in the speedup figure. We obtain the execution time for the sequential part by dividing the total size of the search space (3121 MB) by the effective data transfer rate of Fast Ethernet; instead of the 11 MB/s mentioned before, we use 10 MB/s to account for communication and computing overhead. Compared with the theoretical curve without parallel I/O, we get a considerable speedup with parallel I/O, and the benefit becomes more prominent as the number of processors increases: on 32 processors, the speedup with parallel I/O is 16.99, whereas the speedup without parallel I/O is only 5.37. Note also that the efficiency decreases as more processors are involved in the parallel search; we attribute this to the growing communication and computing overhead.
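The theoretical curve without parallel I/O follows Amdahl's law with the I/O time as the sequential fraction. The sketch below uses an assumed split of the 1-processor time of 1913.71 s into about 312 s of sequential I/O (3121 MB at 10 MB/s) and the remainder as parallelizable work; since the authors' exact split is not stated, the result only approximately reproduces the reported 5.37.

```cpp
#include <cassert>

// Amdahl's law: speedup on n processors when serialTime cannot be
// parallelized and parallelTime divides evenly among the processors.
double amdahlSpeedup(double serialTime, double parallelTime, int n) {
    return (serialTime + parallelTime) / (serialTime + parallelTime / n);
}
```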
5.2 Results of Case II
Figure 4. Case I: Execution time for the remote parallel searches with 1, 2, 4, 8, 16 and 32 processors
Open Reading Frame (ORF) describes a part of a genome coding for one gene. ORF libraries normally are sets of thousands of short sequences. As a result, when determining whether a genome contains a DNA sequence coding for a protein with certain defined properties, we search against a search space with thousands of short sequences, as in case II. In this case, we get 290 hits in 228 of the 5379 sequences in each sequence file of the 20 copies. The remote parallel search with 1 processor is much slower than the local sequential search (2642.21 s vs. 1429.32 s). Besides factors such as the different languages and the different processor speeds, we attribute this to the fact that the communication accounts for a large proportion of the whole search process. Compared with case I, the total size of all sequences in the search space is only about 2.1 × 20 MB, while the total number of sequences is 5379 × 20. That means we have to pass 5379 × 20 SequenceInfo objects to the remote application via XML-RPC when searching in parallel, which is the worst case.
Figure 6. Case II: Execution time for the remote parallel searches with 1, 2, 4, 8, 16 and 32 processors
Figure 7. Case II: Speedup for the remote parallel searches with 1, 2, 4, 8, 16 and 32 processors

The execution times and the speedup curves for all remote parallel searches are shown in Figure 6 and Figure 7. It can be seen that we get a reasonable improvement in performance. The highest performance is reached on 8 processors: the speedup is 3.95 and the efficiency is about 50%. The execution time for the remote parallel search on 16 processors (689.08 s) is slightly higher than that on 8 processors (668.75 s). We attribute this to the increase of the intra-communication overhead and of the computing overhead, such as merging partial results, as more processors are involved. We conclude that, in case II, with more than 16 processors the execution time for the remote parallel search will increase further and the speedup curve will drop. Because of the small search space and the large number of sequences, the sequential part is dominated mainly by the communication and computing overhead, not by the disk I/O; that is why we did not draw a theoretical curve for the case without parallel I/O.

6 Conclusion and Future Work

Bioinformatics applications are both computing intensive and data intensive. In this article, we proposed a parallel motif search implementation and presented its performance results on the Kepler cluster. The results show that the parallel implementation is very effective: we obtain a significant speedup in moderately sized cluster environments when searching against large data sets. However, search spaces with large numbers of fine-grained sequences, such as ORF libraries, lead to a tremendous communication overhead and demand a different parallelization strategy.

In terms of future work, we plan to compare the performance of our parallel ParSeq with that of other tools solving the same problems. Future work can also progress in the database direction: it would be interesting to integrate a relational database such as MySQL into ParSeq to store the metadata, so that the communication overhead can be decreased by avoiding the transfer of large amounts of metadata to the parallel application. In addition, it would be valuable to adapt the parallel implementation to a cluster of heterogeneous workstations, or to supercomputers like the Hitachi SR8000-F1 at LRZ in Munich or the Cray Opteron Cluster at HLRS in Stuttgart.

References

[1] M. Schmollinger, I. Fischer, C. Nerz, S. Pinkenburg, F. Götz, M. Kaufmann, K.-J. Lange, R. Reuter, W. Rosenstiel, and A. Zell. ParSeq: Searching motifs with structural and biochemical properties. Bioinformatics, 20(9), 2004, 1459-1461.

[2] University of Tübingen. ParSeq Homepage. Online, URL: http://www-ti.informatik.uni-tuebingen.de/parseq/index.html, 2004.

[3] D. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, B.A. Rapp, and D.L. Wheeler. GenBank. Nucleic Acids Research, 28(1), 2000, 15-18.

[4] W. Baker, A. van den Broek, E. Camon, P. Hingamp, P. Sterk, G. Stoesser, and M.A. Tuli. The EMBL Nucleotide Sequence Database. Nucleic Acids Research, 28(1), 2000, 19-23.

[5] Y. Tateno, S. Miyazaki, M. Ota, H. Sugawara, and T. Gojobori. DNA Data Bank of Japan (DDBJ) in Collaboration with Mass Sequencing Teams. Nucleic Acids Research, 28(1), 2000, 24-26.

[6] Dave Winer. XML-RPC Specification. Online, URL: http://www.xmlrpc.com/spec, June 1999.
[7] T. Grundmann, M. Ritt, and W. Rosenstiel. TPO++: An object-oriented message-passing library in C++. In Proceedings of the 2000 International Conference on Parallel Processing, Toronto, Canada, 2000, 43-50.

[8] Peter Corbett, Dror Feitelson, Yarsun Hsu, Jean-Pierre Prost, Marc Snir, Sam Fineberg, Bill Nitzberg, Bernard Traversat, and Parkson Wong. MPI-IO: A Parallel File I/O Interface for MPI, Version 0.3. Technical Report NAS-95-002, NASA Ames Research Center, 1995.

[9] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Online, URL: http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html, July 1997.

[10] S. Pinkenburg and W. Rosenstiel. Parallel I/O in an Object-Oriented Message-Passing Library. In Proceedings of the 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, 2004.

[11] University of Tübingen. Kepler Cluster website. Online, URL: http://kepler.sfb382-zdv.uni-tuebingen.de/kepler/start.shtml, 2001.

[12] R. Thakur, W. Gropp, and E. Lusk. Users Guide for ROMIO: A High-Performance, Portable MPI-IO Implementation. Technical Memorandum ANL/MCS-TM-234, Argonne National Laboratory, Argonne, IL, 1997.

[13] R. Thakur, W. Gropp, and E. Lusk. An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces. In Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, Maryland, 1996, 180-187.

[14] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, Georgia, 2000, 317-327.