BIOINFORMATICS APPLICATIONS NOTE
Vol. 18 no. 3 2002 Pages 484–485
Assembly of fingerprint contigs: parallelized FPC
S. R. Ness, W. Terpstra, M. Krzywinski, M. A. Marra and S. J. M. Jones∗
Genome Sequence Centre, British Columbia Cancer Agency, 600 West 10th Avenue, Vancouver, BC V5Z 4E6, Canada
Received on August 20, 2001; revised on October 20, 2001; accepted on October 24, 2001
ABSTRACT
Summary: One of the more common uses of the program FingerPrint Contigs (FPC) is to assemble random restriction digest ‘fingerprints’ of overlapping genomic clones into contigs. To improve the rate of assembling contigs from large fingerprint databases we have adapted FPC so that it can be run in parallel on multiple processors and servers. The current version of ‘parallelized FPC’ has been used in our laboratory to assemble mammalian BAC fingerprint databases, each containing more than 300 000 BAC fingerprints.
Availability: This parallelized version of FPC is available under the GNU GPL licence, and can be downloaded from ftp://ftp.bcgsc.bc.ca/pub/fpcd.
Contact:
[email protected]
∗ To whom correspondence should be addressed.
FPC OVERVIEW
FingerPrint Contigs (FPC; Soderlund et al., 1997, 2000) is a program designed to assemble related restriction digest fingerprints of DNA clones into contiguous cloned genomic segments (contigs). FPC groups related clones into contigs by using a pair-list algorithm to compare all fingerprints within a database to each other and generate a sparse matrix of comparison scores. Of the two scoring methods offered by FPC, we most commonly use the one devised by Sulston et al. (1988, referred to as the ‘Sulston score’). Above-threshold scores identify similar clones, which are then grouped into contigs using a greedy algorithm. With the original single-processor FPC implementation, building contigs from a large fingerprint database takes on the order of weeks on currently available processors, which makes it difficult to optimize build parameters by running multiple FPC builds.
PARALLEL PROCESSING
Of the two algorithms in FPC, the pair-list algorithm is the more computationally expensive, representing 90% or more of the computational time required by FPC builds. In the pair-list algorithm, the fingerprint pattern of each
clone is compared to every other clone in the database. For a typical mammalian BAC-fingerprint mapping project, generating 15-fold redundant coverage of a genome, approximately 300 000 BAC fingerprints are required. An all-versus-all comparison of BAC fingerprints for a 300 000 fingerprint database requires 90 billion comparisons, and the number of comparisons required increases as the square of the number of fingerprints. The large amount of computation required has proven a bottleneck in the construction of mammalian physical maps and has hindered the ability to experiment in building maps at different stringencies and parameters.
We implemented modifications to FPC, creating FPC server daemons that allow the program to distribute the building of the pair-list across multiple computers (distributed computation) and across multiple processors on each of these machines (threaded computation). We have incorporated both of these disparate tasks into a common client/server architecture.
Before starting an FPC contig assembly job, FPC server daemons are started on all available nodes within the computer network subnet. The client FPC program sends out multicast UDP broadcast packets across the subnet and the server daemons respond, initiating a TCP connection with the client. Data is apportioned to the FPC server daemons in batches of 6 rows of clones. Additional batches of clone data are sent on the successful completion of each batch. This piecemeal approach to apportioning data means that the speed of the entire process is not limited by the slowest FPC server. Error checking and recovery are carried out so that if a server daemon is terminated its data is re-sent to another node. Nodes that crash or generate errors are not sent any more data during the current build. Each node computes the pair-list score data using at least two threads, taking advantage of dual-processor machines.
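The batch-wise apportioning and recovery behaviour described above can be illustrated with a minimal Python sketch. It is not the FPC server daemon code: the names `make_batches` and `run_build` are hypothetical, each "server" is modelled as a plain callable running in a local thread rather than a daemon reached over UDP/TCP, and the per-row scoring work is left to the callable.

```python
import queue
import threading

def make_batches(n_rows, batch_rows=6):
    """Split row indices 0..n_rows-1 into consecutive batches (6 rows, as in the text)."""
    return [list(range(start, min(start + batch_rows, n_rows)))
            for start in range(0, n_rows, batch_rows)]

def run_build(n_rows, servers, batch_rows=6):
    """Hand batches to servers on demand and collect their results.

    `servers` is a list of callables (hypothetical stand-ins for server
    daemons); each takes a batch of row indices and returns a dict
    {row: result}, or raises to simulate a crashed node.
    """
    batches = make_batches(n_rows, batch_rows)
    pending = queue.Queue()
    for batch in batches:
        pending.put(batch)

    results = {}
    completed = [0]            # number of batches finished so far
    lock = threading.Lock()

    def serve(compute):
        while True:
            with lock:
                if completed[0] == len(batches):
                    return     # every batch has been computed
            try:
                batch = pending.get(timeout=0.01)
            except queue.Empty:
                continue       # a failing server may still re-queue its batch
            try:
                rows = compute(batch)
            except Exception:
                pending.put(batch)   # re-send the batch to another node
                return               # this node gets no more data in this build
            with lock:
                results.update(rows)
                completed[0] += 1

    threads = [threading.Thread(target=serve, args=(c,)) for c in servers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if completed[0] != len(batches):
        raise RuntimeError("all servers failed before the build completed")
    return results
```

Because each server asks for a new batch only when it has finished the previous one, a slow server simply processes fewer batches, and a batch held by a failed server is picked up by whichever server is free next.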
The client assembles all the results from the nodes into the sparse matrix pair-list, a matrix of scores for all clones that match above a given Sulston score cutoff. After all rows of the pair-list are calculated, the client runs the normal FPC greedy algorithm on a single processor to establish the relative order of the clones within contigs.
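As an illustration of a sparse pair-list followed by a grouping step, here is a small Python sketch under loudly stated assumptions. The score used is a simple count of shared bands within a tolerance, a stand-in and not the Sulston score, and the grouping is a plain connected-components pass rather than FPC's greedy algorithm, which also orders clones within each contig. All function names are hypothetical.

```python
from itertools import combinations

def shared_bands(a, b, tolerance):
    """Count bands of `a` that pair with a band of `b` within `tolerance`.

    Both inputs are sorted lists of band sizes. Stand-in score only.
    """
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches

def build_pair_list(fingerprints, cutoff, tolerance):
    """Return a sparse {(clone1, clone2): score} dict of pairs scoring >= cutoff."""
    pairs = {}
    for c1, c2 in combinations(sorted(fingerprints), 2):
        score = shared_bands(fingerprints[c1], fingerprints[c2], tolerance)
        if score >= cutoff:
            pairs[(c1, c2)] = score
    return pairs

def group_contigs(clones, pairs):
    """Group clones linked by any stored pair (union-find; no ordering step)."""
    parent = {c: c for c in clones}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c

    for c1, c2 in pairs:
        parent[find(c1)] = find(c2)
    groups = {}
    for c in clones:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical band sizes for four clones.
fingerprints = {
    "A": [100, 200, 300, 400],
    "B": [201, 299, 400, 500],
    "C": [400, 500, 600, 700],
    "D": [1000, 1100, 1200],
}
pairs = build_pair_list(fingerprints, cutoff=2, tolerance=2)
contigs = group_contigs(fingerprints, pairs)
```

Only pairs that pass the cutoff are stored, which is what keeps the matrix sparse: here A and C share a single band and are not stored, yet they end up in the same group because both are linked to B.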
© Oxford University Press 2002
[Figure 1 plot: Parallelized FPC build time (min) versus number of CPUs (n)]
Fig. 1. FPC contig assembly time using increasing numbers of processors. Times shown are for a database consisting of 50 000 mouse BAC fingerprint clones. For the above tests, dual 1 GHz Pentium III computers with 1 GB of RAM were used.
The speedup observed when using the distributed version of FPC is shown in Figure 1. At the beginning of each job, all clones must be transmitted to all processors in turn. Because of this overhead and other synchronization issues, as increasing numbers of processors are utilized, the rate of speedup diminishes.
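One rough way to see why the speedup flattens, offered as a sketch under assumptions that are not stated in the paper: suppose the pair-list work $T_{\mathrm{pair}}$ divides evenly over $n$ processors, the greedy step $T_{\mathrm{greedy}}$ stays on one processor, and transmitting the clone data to each processor in turn costs a hypothetical fixed time $t_{\mathrm{send}}$ per processor.

```latex
T(n) \approx n\,t_{\mathrm{send}} + \frac{T_{\mathrm{pair}}}{n} + T_{\mathrm{greedy}},
\qquad
S(n) = \frac{T(1)}{T(n)}
```

Under this model the $n\,t_{\mathrm{send}}$ term grows with $n$ while $T_{\mathrm{pair}}/n$ shrinks, so each additional processor saves less time, and $T(n)$ reaches its minimum near $n = \sqrt{T_{\mathrm{pair}}/t_{\mathrm{send}}}$. The symbols and the linear communication cost are illustrative assumptions; the measured timings are those in Figure 1.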
This parallelized implementation of FPC was used to successfully build contigs for both the human and mouse genomes (McPherson et al., 2001; unpublished, available from www.bcgsc.bc.ca/projects/mouse mapping). Our studies indicate that for building physical maps of mammalian genomes (approximately 300 000 clone fingerprints), 512 MB of RAM is sufficient for both the client and server programs. The software was developed and tested under Linux using SMP kernel versions 2.2.12 and 2.2.14, VA Linux version 6.2.3 and Mandrake version 7.2. The implementation of the FPC server daemon is based on FPC 5.0.2.
ACKNOWLEDGEMENTS
This work was funded by the National Human Genome Research Institute grant no. 3U01 HG021 55-01S1. We gratefully acknowledge the financial assistance and support of all members of the BCCA Genome Sequence Centre. We would sincerely like to thank Cari Soderlund and Fred Engler for their comments, support and FPC code base.
REFERENCES
McPherson,J.D., Marra,M., Hillier,L., Waterston,R.H. et al. (2001) A physical map of the human genome. Nature, 409, 934–941.
Soderlund,C., Humphray,S., Dunham,A. and French,L. (2000) Contigs built with fingerprints, markers, and FPC V4.7. Genome Res., 10, 1772–1787.
Soderlund,C., Longden,I. and Mott,R. (1997) FPC: a system for building contigs from restriction fingerprinted clones. Comput. Appl. Biosci., 13, 523–535.
Sulston,J., Mallett,F., Staden,R., Durbin,R., Horsnell,T. and Coulson,A. (1988) Software for genome mapping by fingerprinting techniques. Comput. Appl. Biosci., 4, 125–132.