BIOINFORMATICS
Vol. 18 no. 3 2002, Pages 446–451
D-ASSIRC: distributed program for finding sequence similarities in genomes

Pierre Vincens 1,2,∗, Anne Badel-Chagnon 2, Cécile André 3 and Serge Hazout 2

1 Département de Biologie (FR 36), École Normale Supérieure, 46 rue d'Ulm, 75230 Paris Cedex 05, France, 2 Équipe de Bioinformatique Génomique et Moléculaire, INSERM U436, Université Paris 7, case 7113, 2 Place Jussieu, 75251 Paris Cedex 05, France and 3 Département de Biomathématiques, CHU Pitié-Salpêtrière, 91 boulevard de l'Hôpital, 75634 Paris Cedex 13, France
Received on April 4, 2001; revised on July 26, 2001; accepted on September 19, 2001
ABSTRACT
Motivation: Locating the regions of similarity in a genome requires appropriate tools such as 'Accelerated Search for SImilar Regions in Chromosomes' (ASSIRC; Vincens et al., Bioinformatics, 14, 715–725, 1998). The aim of this paper is to present different strategies for improving this program by distributing the operations and data to multiple processing units, and to assess the efficiency of the different implementations in terms of running time as a function of the number of processing units.
Results: The new version D-ASSIRC is based on three alternative strategies of task sharing: (1) a distributed search based on splitting the studied sequences into large overlapping subsequences (strategy ASS); (2) two distributed searches for repeated exact motifs of fixed size, either managed by a central processor (strategy AGD) or managed locally by the individual processors (strategy ALD). The strategy ASS is suitable for a large number of processing units (the time was divided by a factor of 12 when the number of processing units was increased from 1 to 16), whereas the strategy ALD is better for a small set of processors (typically four or six). The proposed strategies are efficient for various applications in genomic research, particularly for locating similarities of nucleic sequences in large genomes.
Availability: D-ASSIRC is freely available by anonymous FTP at ftp://ftp.ens.fr/pub/molbio/dassirc.tar.gz. Sources and binaries for Solaris and Linux are included in the distribution.
Contact:
[email protected]
∗To whom correspondence should be addressed.

INTRODUCTION
Understanding genome organization and evolution (Achaz et al., 2000; Seoighe and Wolfe, 1998) will be a major
scientific challenge in the coming years. One investigative approach consists of locating the similar regions of a genome and analyzing the relationships between them. The classical applications for finding homologous genes are FASTA (Pearson and Lipman, 1988) and BLAST (Altschul et al., 1990, 1997). The primary goal of these software tools is to find target sequences similar to a single specified source. The search for similarities in a complete genome therefore requires it to be split into overlapping fragments, each fragment being successively used as the source for the analysis. We have previously proposed another efficient approach (Vincens et al., 1998). The software 'Accelerated Search for SImilar Regions in Chromosomes' (ASSIRC) implements a heuristic that optimizes the search for similar regions between two large sequences (i.e. greater than 100 kb) or within one sequence. A comparison of ASSIRC (Vincens et al., 1998) with conventional methods (Altschul et al., 1990, 1997; Pearson and Lipman, 1988) has been carried out. It showed that ASSIRC found the same set of similar regions reported by BLAST and a large proportion of the pairs discovered by FASTA; furthermore, many extra regions were detected. With respect to running speed, ASSIRC is faster than BLAST and FASTA. However, the time required to find the similarities in a genome may still be long, especially when searching for less-similar regions. To avoid this drawback, one possibility is to distribute the work to several processors. Several hardware options are available to carry this out: parallel computers with shared or distributed memory, or a cluster of workstations connected through a network. Guided by the hardware generally available in laboratories, we have chosen the last solution and developed a new distributed version of the ASSIRC software. This approach has been successfully applied to accelerate
© Oxford University Press 2002
Redundancy in complete genome sequences
the search for target sequences in databanks (Penotti, 1994). Tools such as BioSCAN (Singh et al., 1996), dtask (ftp://ftp.ifi.uio.no/pub/molbio/dtask11.tar.gz) and PARALLEL BLAST (Jülich, 1995) significantly reduce the waiting time required to obtain the results. Consequently, an extraction of regions of lower similarity can easily be carried out. The aim of this paper is to present different strategies for distributing the operations and data to the different processing units and to assess the efficiency of the implementations in terms of running time in the case of a search for pairs of similar regions. The different configurations were tested using the genome of Saccharomyces cerevisiae (Goffeau, 1998).
SYSTEM AND METHODS
The ASSIRC strategy
The strategy of ASSIRC is detailed in a previous paper (Vincens et al., 1998); we recall here the main principles. The first step consists of locating exact common patterns of fixed size k, called 'seeds', using the Rabin–Karp algorithm (Sedgewick, 1989). In the second step, each 'seed' is extended in order to define the bounds of a putative region of similarity. When its length exceeds a user-fixed threshold, the region is recorded within an internal database. To reduce the CPU time, seeds included in pairs previously registered in the database are not checked again. To perform the extension, a 'random walk' is computed for each sequence encompassing the common pattern. A random walk translates the nucleic sequence into a two-dimensional graph (Gates, 1985; Leong and Morgenthaler, 1995). The bounds of the putative region are determined by thresholding a function characterizing the proximity of the two trajectories, each displaying one sequence. After the extension step, the regions of similarity are checked by aligning the sequences. The final results are summarized as all the fragment pairs whose aligned length is greater than a user-defined Lmin value and whose percent identity is greater than a given Pmin value.

The main principles of task sharing
Several approaches to sharing the computation of ASSIRC can be envisaged. (i) The sequences can be split into sections of length l (a user-defined parameter), each one treated in one processing unit. The drawback lies in the need to introduce overlaps between the sections of the sequence and to manage the results obtained from each computer. (ii) The ASSIRC algorithm is based on an exhaustive search for the exact common words (i.e. seeds) of
fixed size k. One method of dividing these searches between several processors is to break the seed of size k into two parts of sizes k1 and k2. In this process, k1 is used as a prefix size to divide the seed space and k2 as a suffix size to fill the data structure for a fixed prefix. The advantage of this decomposition is the decreased memory requirement. Of course, it is then necessary to scan the sequences A^k1 times (A: number of different symbols coding the sequence, i.e. A = 4 for nucleotide sequences). This decomposition can be applied to allocate the work to many processors, each receiving the task of computing the seeds corresponding to a sublist of prefixes and computing the extension phase for each putative pair of regions of similarity. (iii) Final validation operations can easily be dispatched to different processors because the alignment of each pair of putative regions of similarity can be performed independently.

Another point concerns database management. A first solution consists of central database management, in which a single processor collects all the pairs of regions of similarity and answers all the requests. A second solution relies on the use of a duplicated database: the central database collects all the pairs and dispatches them to local databases running on each processing unit. Each local database then answers the requests generated by the local task.
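The seed decomposition in point (ii) above can be sketched as follows. This is a minimal illustration in Python, not the Ada implementation used by D-ASSIRC; the function name `seeds_for_prefix` is ours. For one assigned prefix, a processor indexes only the k-mers beginning with that prefix and reports the matching positions between two sequences.

```python
from collections import defaultdict
from itertools import product

def seeds_for_prefix(seq_a, seq_b, k, prefix):
    """Find exact common words ('seeds') of size k whose first k1
    symbols equal `prefix`, in two nucleotide sequences.
    Returns a list of (position_in_a, position_in_b) pairs."""
    k1 = len(prefix)
    # Index every k-mer of seq_a that starts with the prefix.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        if seq_a[i:i + k1] == prefix:
            index[seq_a[i:i + k]].append(i)
    # Scan seq_b for the same restricted set of k-mers.
    pairs = []
    for j in range(len(seq_b) - k + 1):
        if seq_b[j:j + k1] == prefix:
            for i in index.get(seq_b[j:j + k], ()):
                pairs.append((i, j))
    return pairs

# The A^k1 sublists of seeds (A = 4 for nucleotide sequences) can
# then be dealt out to the processors, one prefix sublist at a time:
prefixes = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 for k1 = 2
```

Each prefix defines an independent unit of work, which is what makes the AGD and ALD strategies possible; the memory saving comes from only ever indexing one prefix's share of the seed space at a time.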
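Returning to the extension step described in the previous subsection, the 'random walk' representation can be illustrated in the same spirit. The base-to-direction mapping below is one common convention and is an assumption on our part, as is the simple squared-distance proximity measure; the thresholding function actually used by ASSIRC is described in Vincens et al. (1998).

```python
# Illustrative base-to-step mapping (an assumption; the exact
# convention used by ASSIRC is given in the original paper).
STEP = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}

def random_walk(seq):
    """Translate a nucleotide sequence into a 2D trajectory."""
    x = y = 0
    path = [(0, 0)]
    for base in seq:
        dx, dy = STEP[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def max_divergence(path_a, path_b):
    """A simple proximity measure between two trajectories: the
    largest squared point-to-point distance over their common
    length. Thresholding such a function bounds the putative
    region of similarity."""
    n = min(len(path_a), len(path_b))
    return max((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(path_a[:n], path_b[:n]))
```

Two similar sequences trace nearby trajectories and keep `max_divergence` small; the extension stops where the proximity function crosses the user threshold.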
Description of the different models of distributed processing
Figure 1 shows the models of distributed processing. Different tasks are assigned to the processing units: (i) the manager task (MNGR) is unique and handles the requirements of the user (parameter checks, loading of the sequences to be analyzed, printing of the results); (ii) the dispatcher task (DPCH) receives the orders from the previous one and distributes the work to the builder tasks; (iii) the builder task (BLDx) carries out the computing work, one task being assigned to each available processing unit; (iv) optionally, associated with each BLDx, a local database task (LDBx) maintains a database of the pairs of similar regions in order to respond to the local requests; (v) the global database service (GDBS) collects all the pairs of regions of similarity found and ensures the uniqueness of each pair; (vi) finally, a monitor (MONI) allows the user to follow the progress of the computation. On the basis of the properties described above, and according to this general model, three different strategies of task sharing were defined: • The first one, called 'Assirc using Sequence Splitting' (ASS) (Figure 1a), implements a splitting of the sequences into large sections, each BLDx searching for
P.Vincens et al.
the regions of similarity within pairs of sections. Short overlaps (e.g. a few hundred bases) are introduced to take into account putative pairs located at the boundaries of sections. The test for determining whether a seed is included in pairs previously registered in the database is performed locally by the LDBx. New pairs are registered in both the local and global databases. The local database is cleaned after each treatment of a section pair, reducing both memory usage and the computing cost of checking the presence of a seed. Computing redundancy only exists for the pairs found in the overlaps of sections; given the small size of the overlaps relative to that of the sections, the excess computing time is slight. • The second approach takes advantage of the decomposition of the seeds of size k, each BLDx computing the seeds according to assigned values of k1 and expanding them. In this case, checking whether a seed is included among previously registered pairs can be performed by referring either to a central database or to a local database. In the first case, called 'Assirc using Seed Decomposition and Global Database' (AGD) (Figure 1b), the BLDx collects a set of seeds and sends them to the GDBS, which performs the necessary checks and returns to the BLDx the seeds to be expanded. The first drawback of this method is the introduction of a waiting time in BLDx during the checks by GDBS. However, these operations are independent and can be performed separately. Besides, no constraint requires that the seeds found by a specific BLDx be expanded by the same BLDx. Hence, the BLDx can be decomposed into two types of independent threads: one (S-PRDx) producing seeds and sending them to GDBS, the other (S-EXPx) expanding seeds received from GDBS. The second drawback of the method is the huge amount of data which must be exchanged between BLDx and GDBS, introducing an overload of the network and a potential traffic jam at the GDBS.
An alternative is possible: the check of the pairs can be performed locally using the LDBx, the central database then recording all the pairs and managing the local copies distributed to each BLDx. Such an approach is labelled 'Assirc using Seed Decomposition and Local Database' (ALD) (Figure 1c). For the three strategies, the alignment of each pair of similar regions is dispatched to the available processing units. The validated pairs are collected by the MNGR, which produces a file summarizing the results.
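The skeleton of the ASS strategy can be sketched as follows. This is a schematic in Python using a local process pool in place of the paper's Ada/GLADE tasks distributed over a network; `compare_sections` stands in for the real BLDx work and the parameter values are illustrative.

```python
from itertools import combinations_with_replacement
from multiprocessing import Pool

def split_with_overlap(seq, section_len, overlap):
    """Split a sequence into sections of `section_len` bases, each
    extended by `overlap` bases so that pairs of similar regions
    straddling a section boundary are not lost."""
    return [seq[start:start + section_len + overlap]
            for start in range(0, len(seq), section_len)]

def compare_sections(pair):
    # Placeholder for the real BLDx work on one section pair:
    # seed search, random-walk extension and alignment.
    a, b = pair
    return []  # would return the validated pairs of similar regions

if __name__ == "__main__":
    genome = "ACGT" * 50_000                  # a 200 kb toy sequence
    sections = split_with_overlap(genome, 50_000, 500)
    jobs = list(combinations_with_replacement(sections, 2))
    with Pool(processes=4) as pool:           # one worker per unit
        results = pool.map(compare_sections, jobs)
```

The section pairs are mutually independent, which is why the ASS strategy keeps a high efficiency coefficient: the only redundant work is the re-examination of seeds falling inside the short overlaps.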
Assessment of the strategies
The assessment of the different strategies can be computed using the efficiency coefficient (Jülich, 1995), which is a measure of the hardware consumption. It is defined
[Figure 1 diagrams: (a) ASS strategy, (b) AGD strategy, (c) ALD strategy — data flows between MNGR, DPCH, MONI, GDBS, the BLDx tasks and, where applicable, the LDBx or S-PRDx/S-EXPx tasks.]
Fig. 1. Description of the distributed strategies. Panels (a)–(c) describe the data flows in the ASS, AGD and ALD strategies. MNGR: main process ensuring the interpretation of program options and the reading and writing of the data. DPCH: process associated with MNGR and distributing the jobs. MONI: process allowing the user to follow the progress of the computation. GDBS: process managing the central database which records the putative pairs of regions of similarity. BLDx: computing process generating the seeds, extending them and performing their alignment. LDBx: process managing the local database. For the ASS and ALD strategies, it is associated with BLDx and runs on the same processing unit. For the AGD strategy, the process is split into two independent threads, S-PRDx and S-EXPx, performing the generation and the expansion of seeds, respectively.
as the ratio of the theoretical minimum execution time achievable with a given number p of processors to the actual elapsed time te. The theoretical time is obtained by dividing the time ts required using one processor by the number of parallel processors used, p. Therefore, the efficiency coefficient is ts/(p · te). The higher the efficiency coefficient, the better the distribution of the program.
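As a minimal sketch in Python (the function name is ours), the coefficient is simply:

```python
def efficiency(t_single, t_elapsed, p):
    """Efficiency coefficient ts / (p * te): the theoretical minimum
    time (ts / p) divided by the actually observed elapsed time te."""
    return t_single / (p * t_elapsed)

# With one processor the coefficient is 1 by construction. A run
# whose time falls by a factor of 12 in going from 1 to 16 units
# corresponds to an efficiency of 12/16 = 0.75.
```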
IMPLEMENTATION
The software was written in the Ada 95 programming language. It was compiled using the GNAT compiler with the GLADE implementation of the 'Distributed Systems' annex of the Ada language (http://www.adahome.com/rm95). In the GLADE extension, sockets are used to support the communication between processors. This choice is largely motivated by the fact that the supplementary code to be written consists only of Ada instructions. Thus, a large part of the program is common with the ASSIRC program, which facilitates future software evolution. Alternatively, other approaches using grid systems such as PVM (http://www.epm.ornl.gov/pvm/pvm_home.html), GLOBUS (http://www.globus.org) or NQS (http://www.gnqs.org) are possible, but these tools require more code modifications and impose on the end user the installation of large packages on each workstation. On this last point, D-ASSIRC only requires the installation of one executable file on each computer. D-ASSIRC was developed on the Linux and Solaris operating systems; it would run on other operating systems with slight modifications. The assessment was performed on a heterogeneous cluster of processors composed of PC DELL 210 Pentium II 400 MHz workstations, Sun Ultra 5 (UltraSPARC II, 333 MHz) workstations and a Sun E450 computer connected by a local area network (100 Base-TX).

RESULTS AND DISCUSSION
Assessment of the strategies for a small dataset
In order to assess the strategies, the programs were run on a subset of the S. cerevisiae genome (chromosomes I, VI and IX, corresponding to 940 kb in total) using identical parameters (k = 12, Lmin = 150, Pmin = 60%). These sequences are available at the Martinsried Institute for Protein Sequences (MIPS, http://www.mips.biochem.mpg.de; Mewes et al., 1997). The work was carried out on PC DELL 210, 400 MHz computers.
The parameters used for splitting the work were: section length l = 200 kb for the ASS strategy, and prefix size k1 = 2 for the AGD and ALD strategies. Under these conditions, 190 pairs of similar regions were identified. The non-distributed version (ASSIRC) required 328 s to complete the search. Figure 2 shows the performance obtained by the three distributed strategies (D-ASSIRC) with an increasing number of identical processing units. As a first remark, the times required by the distributed AGD and ALD strategies using one processing unit are close to those of the non-distributed method. When two or more processing units are used, the benefit of these distributed strategies is significant: the time needed with six processing units is decreased by a factor of 4, the ALD method
implementing local copies of the database performing slightly better than AGD. The efficiency coefficient is higher for the AGD strategy than for the ALD strategy for a small number of processing units; the converse is observed with a greater number of processing units. For the AGD strategy, the efficiency coefficient decreases linearly with an increasing number of processing units, and for more than five units the efficiency decreases drastically. A complementary experiment confirmed this finding by adding new processing units (e.g. an efficiency of 0.52 for nine processing units). In fact, this approach induces an overload of the communication network. Moreover, for the AGD strategy, the decrease of efficiency is probably also related to the overload of the process managing the central database. The ALD strategy exhibits another drawback: the number of independent tasks depends on the size of the seed prefix. For instance, a prefix of size 2 only allows 16 (i.e. 4^2) different patterns. When all these patterns have been dispatched, the remaining processing units must wait. The independence between the seed generation task and the extension phase task in the AGD strategy avoids this problem. The ASS strategy is hugely time consuming when only one processor is used: the required time is about a factor of 10 higher than with the other strategies. The explanation for this increase is the successive analysis of all the pairs of sections, requiring the systematic study of all seeds of length k. However, the efficiency coefficient for this strategy is always greater than 0.90. These results suggest that the ASS strategy will be suitable for a large number of processing units.
Assessment of the strategies for a medium dataset
We searched for the similarities between the first five chromosomes (I to V, representing a total of 3469 kb) of S. cerevisiae using 1–16 heterogeneous processing units (8 Sparc/Solaris and 8 PC Pentium/Linux). The different types of processing units were introduced alternately to facilitate comparison of the results; the hardware configurations used for assessing the three strategies with a given number of processing units were identical. The results are displayed in Figure 3. Concerning the running time (Figure 3a), we observe different behaviours according to the strategy used. The ASS strategy is the most efficient when using a large number of processing units: the time was divided by a factor of 12 when the number of processing units was increased from 1 to 16. The AGD strategy appears efficient for an intermediate number of processing units (four or six). However, the waiting time does not continue to fall with the addition of further processing units but reaches a plateau. This is because the time consumed by the central database process is around 8000 s for this strategy, whatever the number of processing units; hence the waiting time is a fortiori more than this value. For the ALD strategy, the minimal time is around 4000 s and is reached for six or more units. For 12 units, we observe a surprising value (5957 s): in fact, the splitting of the seeds into 16 prefixes (k1 = 2) introduces an imbalance in the assignment of the work to each unit.

Fig. 2. Assessment of computing time and efficiency of the three strategies for different numbers of processing units for a small dataset. Locating the pairs of regions of similarity was performed for chromosomes I, VI and IX of S. cerevisiae (940 kb). All the processing units were identical (PC Pentium 400 MHz/Linux). For the ASS strategy (+), the sequences are split into fragments of 100 kb. For the AGD (o) and ALD (x) strategies, the parameters are: k = 12 (size of the seed), k1 = 2 (size of the prefix), Lmin = 150 (minimal length) and Pmin = 60% (minimal proportion of identity). (a) Plot of the waiting time for the user in seconds as a function of the number of processing units. (b) Plot of the efficiency as a function of the number of processing units.

Fig. 3. Assessment of computing time and efficiency of the three strategies for different numbers of processing units for a medium dataset. Locating the pairs of regions of similarity was performed for chromosomes I–V of S. cerevisiae (3469 kb). The processing units were PC Pentium/Linux and Sparc/Solaris. The hardware configurations for a given number of processing units were identical. The parameters are identical to those of Figure 2. (a) Plot of the waiting time for the user in seconds as a function of the number of processing units. (b) Plot of the efficiency as a function of the number of processing units.

Concerning the efficiency coefficient (Figure 3b), we observe large differences of behaviour according to the
number of processing units. The ASS strategy keeps a high level of efficiency (around 0.8) as the number of units increases. This strategy successfully exploits the independence of the tasks executed by the different builder units. The efficiency of the AGD strategy decreases drastically with a large number of units (0.1 for 16 units); this was explained in the previous remarks concerning
the use of the central database and the overload of the network in this strategy. To corroborate this, we assessed the maximal speed of exchange of pairs to be checked between the central database and all the BLDx, and observed that a limit is reached at six processing units. Moreover, when the number of processing units is greater than six, the percentage use of each processor running the database or a BLDx decreases (data not shown). For the ALD strategy, the efficiency coefficient is around 0.2 for more than 12 units. This can be explained by the fact that this strategy maintains both a global database and a local database for each builder unit, increasing the data exchange between units.
Assessment of the strategies for a large dataset
To complete this study, we applied our strategies to determine the similarities existing in the whole genome (nuclear and mitochondrial) of S. cerevisiae. The hardware configuration was composed of four processing units (2 Sparcs and 2 PCs). The parameters were set to 12 for k, 55% for Pmin and 150 for Lmin. For the ASS strategy, the waiting time was 44 530 s (12 h 22 min 10 s) for locating 25 610 pairs of similar regions, and the CPU times consumed by each processor were similar. For the ALD strategy with k1 set to 2, the waiting time was shorter, 29 060 s (8 h 4 min 20 s), confirming its efficiency when a low number of processing units is available. In this case, we observed an unbalanced use of the processors (range 10 460–28 958 s), even though the number of prefixes was a multiple of the number of processing units. The explanation for this imbalance is that the number of generated seeds depends on the prefix, leading to an unbalanced sharing of the tasks. The solution to this problem is therefore to increase the granularity of the program: when we performed the search with a k1 value of 3, the waiting time decreased to 23 847 s (6 h 37 min 27 s). Finally, we ran the AGD strategy and obtained a waiting time of 56 140 s (15 h 35 min 40 s), confirming the lower efficiency observed for it in the previous analyses.

CONCLUSION
The ASS strategy is the most suitable for a large number of processing units. The ALD strategy is better for a small number of processing units; its k1 parameter specifying the splitting of the work must be set according to the size of the sequences to be analyzed. The AGD strategy might prove useful for specific hardware configurations with high communication rates: a high-speed machine for the central database processing and many slower machines for running the BLDx. In conclusion, the approach makes possible the location of pairs of similar regions in very large sequences, without
any restrictions on those sequences. The availability of a distributed version supporting different hardware configurations reduces the time required for data collection for a given set of parameters. The greater computing power of the distributed approach allows searches for weaker similarities, using smaller seed sizes or selecting smaller pairs of regions. Finally, this approach completes the identification of pairs of similar regions and offers a new way of analyzing complete genomes.
ACKNOWLEDGEMENTS
This work was supported by the 'Ministère de la Recherche et de la Technologie' within the scope of the 'Génopole Ile de France'. The authors wish to thank Dr Boris Barbour for constructive reading of the manuscript.

REFERENCES
Achaz,G., Coissac,E., Viari,A. and Netter,P. (2000) Analysis of intrachromosomal duplications in yeast Saccharomyces cerevisiae: a possible model for their origin. Mol. Biol. Evol., 17, 1268–1275.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Gates,M.A. (1985) Simpler DNA sequence representations. Nature, 316, 219.
Goffeau,A. (1998) The yeast genome. Pathol. Biol. (Paris), 46, 96–97.
Jülich,A. (1995) Implementations of BLAST for parallel computers. Comput. Appl. Biosci., 11, 3–6.
Leong,P.M. and Morgenthaler,S. (1995) Random walk and gap plots of DNA sequences. Comput. Appl. Biosci., 11, 503–507.
Mewes,H.W., Albermann,K., Heumann,K., Liebl,S. and Pfeiffer,F. (1997) MIPS: a database for protein sequences, homology data and yeast genome information. Nucleic Acids Res., 25, 28–30.
Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Penotti,F.E. (1994) A distributed system for DNA/protein database similarity searches. Comput. Appl. Biosci., 10, 277–280.
Sedgewick,R. (1989) Algorithms. Addison-Wesley, Reading, MA.
Seoighe,C. and Wolfe,K.H. (1998) Extent of genomic rearrangement after genome duplication in yeast. Proc. Natl Acad. Sci. USA, 95, 4447–4452.
Singh,R.K., Hoffman,D.L., Tell,S.G. and White,C.T. (1996) BioSCAN: a network sharable computational resource for searching biosequence databases. Comput. Appl. Biosci., 12, 191–196.
Vincens,P., Buffat,L., Andr´e,C., Chevrolat,J.P., Boisvieux,J.F. and Hazout,S. (1998) A strategy for finding regions of similarity in complete genome sequences. Bioinformatics, 14, 715–725.