Proc. Natl. Acad. Sci. USA Vol. 91, pp. 11094-11098, November 1994 Genetics
A fast random cost algorithm for physical mapping
YUHONG WANG, ROLF A. PRADE, JAMES GRIFFITH, W. E. TIMBERLAKE, AND JONATHAN ARNOLD†
Department of Genetics, University of Georgia, Athens, GA 30602
Communicated by Norman H. Giles, June 30, 1994
ABSTRACT Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational/statistical problem in genetics. Here we present a physical mapping algorithm for creating ordered genomic libraries or contig maps by using a random cost approach [Berg, B. A. (1993) Nature (London) 361, 708-710]. This random cost algorithm is 5-10 times faster than existing physical mapping algorithms and has optimization performance comparable to existing procedures. The speedup in the algorithm makes practical the widespread use of bootstrap resampling to assess the statistical reliability of links in the physical map as well as the use of more elaborate physical mapping criteria to improve map quality. The random cost algorithm is illustrated by its application in assembling a physical map of chromosome IV from the filamentous fungus Aspergillus nidulans.
Since nearly the beginning of genetics a central problem has been creating maps of whole chromosomes. These maps are central to understanding the structure of genes, their function, and their evolution. The advent of molecular genetics has led to a wealth of DNA markers along a chromosome, making feasible the mapping of entire genomes, like that of humans (1). These chromosomal maps fall into two broad categories, genetic maps and physical maps. The large number of DNA markers and the ease with which molecular markers are assayed have shifted the problem from one of collecting the mapping data to the computational and statistical problem of making the maps. In 1987 Lander and Green (2) provided an efficient computational/statistical tool to assemble genetic maps with many markers. This breakthrough presented new conceptual and empirical opportunities to hunt down genes of interest. Here we present an efficient computational/statistical tool to assemble physical maps. While genetic maps narrow the search for genes to a particular chromosomal region, ultimately a physical map allows the recovery and molecular manipulation of genes. A physical map is defined here as a partial ordering of distinguishable DNA fragments by their position along the chromosome. Examples of chromosomal physical maps include cytological maps, radiation-hybrid maps, ordered clonal libraries ("contig maps"), and ultimately a chromosome's entire DNA sequence. Here we present an efficient physical mapping algorithm using a random cost approach (3) and illustrate its application in assembling a physical map of chromosome IV from the filamentous fungus Aspergillus nidulans.
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
†To whom reprint requests should be addressed.
Physical Mapping Problem
We illustrate the application of a random cost algorithm for physical mapping on data generated by STS (sequence tagged site) content mapping (4, 5), in which individual clones in a chromosome-specific library are selected as probes according to a particular random sampling scheme described in Materials and Methods and hybridized to all clones in the library. The result is a binary clone/probe hybridization matrix (with "1" indicating hybridization and "." indicating no hybridization) of n (= 593) clones (rows) and m (= 115) probes (columns). Each row of the clone/probe hybridization matrix is the digital "call number" of a particular clone. When clones overlap, they tend to hybridize to the same probes, and their digital call numbers tend to be similar. Probes (columns in Fig. 1) link clones into contiguous blocks or contigs by their shared pattern of clonal hybridization. As a consequence, the similarity between call numbers of clones can be used to reconstruct their true ordering along the chromosome in much the same way as books are organized by call number on shelves in a library. Clones (rows) are permuted into their inferred order along the chromosome by the algorithm described here, as in Fig. 1, to visualize the physical map (6). Contig boundaries are demarcated by horizontal lines.
With the relative ordering of clones unknown within the library, we faced the computational and statistical problem of reconstructing their true order along the chromosome (the physical map). This ordering problem is equivalent to the traveling salesman problem (7). Clones are assigned identification numbers i (or j) = 1, ..., n, which may indicate, for example, a grid location on a particular Petri plate in a clonal library. The clone in position l in the order has identification $i_l$. A Hamming distance $d(c_i, c_j)$ is defined between clones $c_i$ and $c_j$, which counts the number of differences between clones in their m-long digital call numbers or fingerprints. An ordering of the clones was then selected to minimize the pairwise total linking distance between them along the chromosome:

$$D = \sum_{l=1}^{n-1} d(c_{i_l}, c_{i_{l+1}}), \tag{1}$$

where $i_1, \ldots, i_n$ is the inferred ordering of n (= 593) clones. If there were faster methods to minimize physical mapping criteria like Eq. 1, this increased speed could be used either to make physical mapping tools based on Eq. 1 more widely available (i.e., on IBM PCs and Macintoshes) or to make practical the optimization of more complicated physical mapping criteria to improve map quality. The intuitive rationale behind Eq. 1 is that clones with similar digital call numbers tend to overlap and should be placed next to each other on the map. It is natural to extend this principle to groups of neighboring clones, i.e., to those that are several steps away along the map as well. As neighbors become more separated by intervening clones on the map, the requirement of similarity in digital fingerprints should be relaxed to reflect the expectation of declining overlap with distance. A physical mapping criterion that captures these two ideas is the triplewise total linking distance,

$$D = \sum_{l=1}^{n-1} d(c_{i_l}, c_{i_{l+1}}) + \sum_{l=1}^{n-3} \frac{d(c_{i_l}, c_{i_{l+3}})}{3} + \sum_{l=1}^{n-5} \frac{d(c_{i_l}, c_{i_{l+5}})}{5}. \tag{2}$$

Minimizing this physical mapping criterion will require much more computing because it forces the evaluation of a more complicated criterion at each minimization step.
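As an illustration of the two criteria, the following Python sketch computes the Hamming distance between fingerprints and the total linking distances of a given ordering. It is not the authors' C implementation. The function names are hypothetical, and the triplewise version assumes that Eq. 2 adds to the pairwise term the distances between clones three and five positions apart, weighted by 1/3 and 1/5.

```python
def hamming(a, b):
    """Number of probes at which two clone fingerprints (rows) differ."""
    return sum(x != y for x, y in zip(a, b))


def linking_distance(order, clones, step=1):
    """Sum of d(c_{i_l}, c_{i_{l+step}}) over every l for which both clones exist."""
    return sum(hamming(clones[order[l]], clones[order[l + step]])
               for l in range(len(order) - step))


def pairwise_D(order, clones):
    """Pairwise total linking distance (Eq. 1)."""
    return linking_distance(order, clones, 1)


def triplewise_D(order, clones):
    """Triplewise total linking distance (assumed reading of Eq. 2)."""
    return (linking_distance(order, clones, 1)
            + linking_distance(order, clones, 3) / 3
            + linking_distance(order, clones, 5) / 5)


# Toy hybridization matrix: 6 clones (rows) by 4 probes (columns),
# listed here in their true chromosomal order.
toy_clones = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
true_order = [0, 1, 2, 3, 4, 5]
shuffled_order = [0, 3, 1, 5, 2, 4]
```

On this toy matrix the true ordering gives a smaller pairwise D than the shuffled ordering, which is the behavior that minimizing Eq. 1 relies on.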
Materials and Methods
Nucleic Acid Manipulations and DNA-DNA Hybridization. Recombinant plasmids and cosmids were maintained in and isolated from XL1-Blue cells (Stratagene). Manipulation, duplication, and stamping of genomic libraries, radiolabeling of DNA probes, and related reactions were performed as described in refs. 8 and 9 or as recommended by the suppliers of reagents. DNA-DNA hybridization conditions were as follows: (i) hybridization at 65°C in 0.5 M NaCl/0.1 M sodium phosphate buffer/6 mM EDTA/1% SDS at pH 7.0 in the presence of denatured salmon sperm DNA at 100 μg/ml and >1 × 10^6 cpm of [32P]dCTP-labeled probe; (ii) blots were washed twice with 150 ml of 2× SSC/0.1% SDS and twice with 150 ml of 0.5× SSC at 65°C before autoradiography (1× SSC = 150 mM NaCl/15 mM sodium citrate, pH 7.0).
Libraries. Two independent libraries (8), constructed in the λ-based cosmid vector pLORIST2 and the pBR-based cosmid vector pWE15, were used in this study. Libraries were made chromosome-specific by hybridizing contour-clamped homogeneous electric field (CHEF)-gel-isolated chromosomal DNA with the cosmid clone collection (8). Partitioning of the primary library with 5134 cosmids into chromosome-specific subcollections was described elsewhere (8). Briefly, clones were grouped according to hybridization patterns with whole chromosomes into S, R, and 0 classes according to hybridization to one, more than one (but not all eight), and all eight chromosomes, respectively.
Probe Selection Strategy and Map Reconstruction. Initially the entire chromosome IV S collection was sampled without replacement (10, 11) in batches of 4-6 clones. In the end 70 probes were needed so that every member of the chromosome IV subcollection hybridized to at least one probe. Clones and probes were ordered by minimizing Eq. 1 with the random cost algorithm described below, and the matrix in Fig. 1 was further extended by selecting probes at the ends of contigs (chromosome walking), bringing the total number of probes to 115.
Probes were produced by incorporating [32P]dCTP with T4 DNA polymerase (Klenow fragment) using the following: (i) random primers and agarose gel purified Not I inserts as templates or (ii) primer extension from the pair of T3 (or SP6) and T7 promoters and pWE15 (or pLORIST2) cosmids as templates in primer extension reactions.
Computation. C programs available from [email protected] implementing simulated annealing and random cost algorithms were compiled with the Digital Equipment Corp. (DEC) V3.1 C compiler and executed on a DEC local area VAXcluster running VAX/VMS 5.5-2.
Random Cost Algorithm
Under the random cost approach, random reversals in segments of the current ordering of clones may be used to search for a minimum in the total linking distance D in Eq. 1 or 2. A random reversal in the current ordering of the clones, $c_{i_1}, \ldots, c_{i_n}$, is obtained by first selecting two distinct clones at random to define the endpoints of a segment, $c_{i_{l+1}}, \ldots, c_{i_{l+k}}$, in the current ordering and second performing a reversal in the ordering of the clones on the segment to $c_{i_{l+k}}, \ldots, c_{i_{l+1}}$.
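A minimal Python sketch of the segment reversal follows, with hypothetical function names. It also shows a shortcut that is not stated in the text: because the Hamming distance is symmetric, reversing a segment changes only the two links at the segment boundaries, so the change in the pairwise criterion of Eq. 1 can be obtained from at most four distances rather than by recomputing the whole sum.

```python
import random


def hamming(a, b):
    """Number of probes at which two clone fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_D(order, clones):
    """Pairwise total linking distance (Eq. 1) of an ordering."""
    return sum(hamming(clones[order[l]], clones[order[l + 1]])
               for l in range(len(order) - 1))


def reverse_segment(order, a, b):
    """Return a copy of `order` with positions a..b (inclusive) reversed."""
    return order[:a] + order[a:b + 1][::-1] + order[b + 1:]


def random_reversal(order, rng):
    """Pick two distinct positions at random and reverse the segment between them."""
    a, b = sorted(rng.sample(range(len(order)), 2))
    return reverse_segment(order, a, b), (a, b)


def delta_pairwise(order, clones, a, b):
    """Change in pairwise D from reversing positions a..b, using boundary links only."""
    def d(p, q):
        return hamming(clones[order[p]], clones[order[q]])

    delta = 0
    if a > 0:                      # link entering the segment on the left
        delta += d(a - 1, b) - d(a - 1, a)
    if b < len(order) - 1:         # link leaving the segment on the right
        delta += d(a, b + 1) - d(b, b + 1)
    return delta
```

The shortcut applies to Eq. 1 only; for a criterion with longer-range terms such as Eq. 2, more links near the boundaries change and a full or wider local recomputation is needed.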
To begin, a random permutation $i_1, \ldots, i_n$ is selected to initialize the algorithm. Then 100 random reversals of the initial random clonal ordering, $c_{i_1}, \ldots, c_{i_n}$, are generated, and the 100 changes in total linking distance $\Delta D$ relative to the initial order are recorded. The recorded maximum $\Delta D$ (max $\Delta D$) is used to define a uniform random variable U on the interval [-max $\Delta D$, max $\Delta D$]. A uniform random number is generated on this interval. The minimization process is begun by performing a random reversal on the current ordering and ascertaining whether or not the following random cost inequality is satisfied:

$$\Delta D + U \le 0.$$

A random reversal satisfying this inequality is used to update the current ordering; otherwise the current ordering is left unchanged. The cycle is completed by reducing the range of the random number generator U by $\delta = 10^{-6}$ (until the range can no longer be positive). The cycle is repeated, and a new value of the uniform random variable U is generated with each cycle. After each block of 100,000 cycles the maximum and minimum values of D within the block are recorded, and if max D = min D in four consecutive blocks of 100,000 cycles, the algorithm terminates.
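The cycle described above can be sketched in Python as follows. This is an illustrative reading, not the authors' C program, and it rests on two assumptions that the text leaves open: "max ΔD" is taken as the largest absolute change among the 100 trial reversals, and the range of U is reduced by subtracting `delta` from its half-width in every cycle. All function and parameter names are hypothetical, and at least two clones are assumed.

```python
import random


def hamming(a, b):
    """Number of probes at which two clone fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_D(order, clones):
    """Pairwise total linking distance (Eq. 1) of an ordering."""
    return sum(hamming(clones[order[l]], clones[order[l + 1]])
               for l in range(len(order) - 1))


def random_reversal(order, rng):
    """Reverse the segment between two distinct, randomly chosen positions."""
    a, b = sorted(rng.sample(range(len(order)), 2))
    return order[:a] + order[a:b + 1][::-1] + order[b + 1:]


def random_cost(clones, delta=1e-6, block=100_000, stable_blocks=4, seed=0):
    """Minimize pairwise D by random reversals accepted when delta_D + U <= 0."""
    rng = random.Random(seed)
    order = list(range(len(clones)))
    rng.shuffle(order)                       # random initial permutation
    d = pairwise_D(order, clones)

    # Calibrate the noise from 100 trial reversals of the initial ordering.
    # Assumption: "max delta D" is the largest absolute change observed.
    half_width = float(max(
        abs(pairwise_D(random_reversal(order, rng), clones) - d)
        for _ in range(100)))

    stable = 0
    while stable < stable_blocks:
        d_min = d_max = d
        for _ in range(block):
            candidate = random_reversal(order, rng)
            d_candidate = pairwise_D(candidate, clones)
            u = rng.uniform(-half_width, half_width)
            if (d_candidate - d) + u <= 0:   # random cost inequality
                order, d = candidate, d_candidate
            # Assumption: the range shrinks by subtraction until it reaches zero.
            half_width = max(0.0, half_width - delta)
            d_min, d_max = min(d_min, d), max(d_max, d)
        # Count consecutive blocks in which D never changed.
        stable = stable + 1 if d_min == d_max else 0
    return order, d
```

For readability the sketch recomputes D in full for every candidate; with the paper's settings (593 clones, blocks of 100,000 cycles) pure Python would be slow, so a call such as `random_cost(clones, delta=0.01, block=200)` on a small matrix is the intended use.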
Results and Discussion
Table 1. Comparison of the random cost algorithm with simulated annealing

                          Simulated annealing   Random cost
Genome       Random seed  Time,* sec   D        Time,* sec   D
1            1103527590   332          485      38           481
2            377401575    264          537      47           510
3            662824084    239          473      63           490
4            1147902781   255          544      49           503
5            2035015474   296          468      42           499
6            368800899    247          531      42           500
7            1508029952   310          466      56           495
8            486256185    253          531      42           504
9            1062517886   301          466      40           496
10           267834847    316          533      57           498
11           180171308    262          469      45           475
12           836760821    261          530      52           500
13           595337866    310          464      40           485
14           790425851    316          533      40           506
15           2111915288   310          466      49           483
16           1149758321   254          533      44           493
17           1644289366   308          471      42           493
18           1388290519   347          518      58           500
19           1647418052   253          468      52           475
20           1675546029   301          544      45           480
Mean                      287          502      47           493
SE                        7            7        2            2
A. nidulans  595337866    275          520      57           520

*Time is measured in central processor unit seconds on a VAXstation 4000, model 90.

FIG. 1. Physical map of chromosome IV in A. nidulans. The following information is available on the electronic version of this map, but space limitations do not permit complete presentation here. The lower part of the figure shows a magnified section of the complete map. Clone names are given in the margin of the map in their inferred order along the chromosome down the rows. Probes are assigned columns. Contigs are numbered 1-20 on the right margin of the map. The number of differences in the digital call numbers between cosmid $c_{i_l}$ and the next cosmid $c_{i_{l+1}}$ on the physical map, i.e., the pairwise Hamming distance $d(c_{i_l}, c_{i_{l+1}})$, is also given in the column next to the clone name; their sum is the total linking distance D in Eq. 1. Next to the column of linking distances are the columns of confidence statistics C1, C2, and C3 in that order for each link on the map generated by 100 bootstrap resamplings.

Advantages of the Algorithm. We compare the ability of the random cost algorithm and simulated annealing (7) to minimize the pairwise total linking distance on 20 simulated data sets and 1 real binary hybridization matrix for chromosome IV in Table 1. The annealing schedule was slightly modified from the tuned geometric annealing schedule in ref. 7 to one in which inverse temperature (1/T = 0.0, 0.2, 0.4, ...) is used, to enhance further the performance of simulated annealing. The stopping rule under simulated annealing (7) was
also modified to be identical to the random cost algorithm for comparative purposes. A simulation scheme was chosen to mirror the mapping of chromosome IV in A. nidulans. The 20 artificial binary matrices were generated by sampling without replacement (10, 11) from a 2900-kb chromosome (8) so that no current probe hybridized to a previous probe selected until all n (= 593) clones (with 40-kb inserts) were hit at least once. At this point the remaining probes were selected at random to bring the probe number m up to 115. False positives and false negatives were introduced into the resulting 20 simulated binary matrices at a frequency of 0.2% each.
In general the random cost algorithm is 5-10 times faster than simulated annealing, with the mean from Table 1 being 6 times faster. While on average the random cost algorithm outperforms simulated annealing in Table 1, the means of final pairwise total linking distances are not significantly different by a pairwise t test at the 0.05 level. Performance of the two algorithms was also compared on the real data by computing the final pairwise total linking distances resulting from the 20 seeds in Table 1. Again the means of the final pairwise total linking distances are not significantly different by a pairwise t test at the 0.05 level.
To generate the mapping data a chromosome-specific cosmid library (8) was sampled without replacement, and each probe selected was hybridized to the whole library, producing an unordered binary matrix. The random cost algorithm was used to construct a contig map of chromosome IV in A. nidulans (an unlabeled version is shown in Fig. 1; the full version is available electronically from the authors at arnold@bscr.uga.edu). The resulting map has 20 contigs with a mean contig size of 145 kb (= 2900 kb/20 contigs) and markers every 13 kb [= 2900 kb/(2 × 115)], as with the Schizosaccharomyces pombe contig map (11).
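The simulation scheme used for the artificial genomes can be sketched roughly as below. The text does not give the details, so several choices here are assumptions: clone positions are drawn uniformly along the chromosome, two clones "hybridize" whenever their inserts overlap, and the 0.2% false-positive and false-negative rates are applied by flipping each matrix entry independently with that probability. Function and parameter names are hypothetical.

```python
import random


def simulate_hybridization(n=593, m=115, chrom_kb=2900.0, insert_kb=40.0,
                           error_rate=0.002, seed=0):
    """Return (matrix, starts, probes) for one artificial genome."""
    rng = random.Random(seed)
    # Assumption: left ends of the clones are uniform along the chromosome.
    starts = [rng.uniform(0.0, chrom_kb - insert_kb) for _ in range(n)]

    def overlap(i, j):
        """Two equal-length inserts overlap when their starts are < insert_kb apart."""
        return abs(starts[i] - starts[j]) < insert_kb

    # Phase 1: sample without replacement, taking as a probe only a clone
    # that no earlier probe has hit, until every clone is hit at least once.
    candidates = list(range(n))
    rng.shuffle(candidates)
    hit = [False] * n
    probes = []
    for c in candidates:
        if not hit[c]:
            probes.append(c)
            for j in range(n):
                if overlap(c, j):
                    hit[j] = True

    # Phase 2: add randomly chosen unused clones until there are m probes
    # (nothing is added if phase 1 already produced m or more).
    chosen = set(probes)
    rest = [c for c in candidates if c not in chosen]
    probes.extend(rest[:max(0, m - len(probes))])

    # Build the clone/probe matrix and inject errors.
    # Assumption: each entry is flipped independently with probability error_rate.
    matrix = []
    for i in range(n):
        row = []
        for p in probes:
            x = int(overlap(i, p))
            if rng.random() < error_rate:
                x = 1 - x
            row.append(x)
        matrix.append(row)
    return matrix, starts, probes
```

Returning `starts` alongside the matrix keeps the true clone positions available, which is what allows the reconstruction error of an inferred ordering to be scored on simulated data.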
There are two immediate consequences of accelerating map reconstruction by a factor of 5-10. First, it now becomes computationally practical to assess the statistical reliability of each link between clones in the physical map by bootstrap resampling (12) on PCs and Macintoshes. Second, it becomes computationally practical to optimize more elaborate physical mapping criteria such as Eq. 2 with the random cost algorithm to improve map quality.
Map Reliability. To assess map reliability, first a physical map is assembled as in Fig. 1 from the original binary clone/probe hybridization matrix X by minimizing the pairwise total linking distance with the random cost algorithm. Then probes (columns of the binary clone/probe hybridization matrix) used to construct the map are sampled randomly with replacement (in the computer) to generate a new hybridization matrix X*, in which some probes (columns) are selected multiple times, and some, not at all. Distances between clones are calculated only on the basis of the selected probes, and a new map is generated by the random cost algorithm. We then score whether or not each linkage in the original map reappears. Hybridization matrices X* are repeatedly generated (i.e., 100 times) by this resampling process, and the resulting physical maps are compared with the original physical map. The result is analogous to having repeated the whole physical mapping experiment many times and then assessing how often each link in the physical map recurs.
Three simple measures of the confidence of each link in the physical map are calculated. The first confidence statistic C1 is the percentage of times two neighboring clones on the original map reappear as neighbors under resampling. It would also be useful to have a supplementary measure which sets aside the degeneracy in the physical map resulting from many clones having identical digital call numbers. (The criterion of total linking distance D in Eq. 1 will not resolve the relative order of identical clones.) The second confidence statistic C2 between neighboring clones i and j on the original map is the percentage of times that clone i or a clone with identical call number to clone i appears next to clone j. The third statistic, C3, is the percentage of times two neighboring clones on the original map reappear in the same contig under resampling. All of these confidence statistics are reported in the electronic version of the physical map in Fig. 1.
Arratia et al. (13) suggest a rule of thumb of two or more probes supporting a link in the physical map to avoid false joins due to experimental problems, such as chimerism in vector inserts (1). In total, 73% of the links in this map are supported by two or more probes, and the average number of probes supporting a link is 3. The confidence statistics provide additional information. The average value of C1 (26%) would indicate there is low confidence (i.e., 10-39% on average within contigs) in the correct local placement of clones, even when most of the clones are doubly linked. Even when the confidence is corrected for the degeneracy in the physical map due to contiguous blocks of clones with identical "call numbers" via C2, the confidence statistic C2 is typically in the 2-64% range within contigs with an overall mean of 38%. Typically the second confidence statistic is greater than C1. Where the statistic C2 exceeds C1 is an indicator of where the map is degenerate because of neighboring clones with identical hybridization profiles (e.g., contig 2 in Fig. 1). Both statistics C1 and C2 are highly correlated and drop precipitously near contig boundaries. The confidence with which two neighboring clones belong to the same contig is much higher (with an overall average of 72%), exceeding 48% on average within contigs (with the exception of contigs 7, 9, 10, 12, and 14 in Fig. 1). All five exceptional contigs are linked by only one probe on average. All three confidence statistics tend to drop precipitously near potential false joins between contigs (results not shown).
Improving Map Quality.
To assess whether or not map quality can be improved by utilizing the random cost algorithm, experiments leading to Table 1 were repeated with the new physical mapping criterion, the triplewise total linking distance in Eq. 2.

Table 2. Map quality after minimizing triplewise versus pairwise total linking distance with the random cost algorithm

                     Minimizing pairwise D            Minimizing triplewise D
Genome  Random seed  Time,* sec  e      False joins   Time,* sec  e      False joins
1       1103527590   133         1712   1             36331       3559   4
2       377401575    70          2774   7             35578       3390   2
3       662824084    90          1134   4             36689       2638   5
4       1147902781   162         3017   7             36757       2394   3
5       2035015474   76          4199   11            35835       3516   5
6       368800899    258         2130   6             37545       3244   4
7       1508029952   72          2601   9             35668       1948   2
8       486256185    90          2963   11            36727       2840   2
9       1062517886   150         2062   7             37372       1394   5
10      267834847    392         2365   4             39909       2579   1
11      180171308    68          3303   11            37524       4183   10
12      836760821    133         3252   10            38186       3484   5
13      595337866    76          1615   4             38053       2343   4
14      790425851    72          3891   9             39168       3809   3
15      2111915288   96          3736   12            39391       1561   3
16      1149758321   131         1775   3             37643       3557   3
17      1644289366   92          3319   14            38829       3967   4
18      1388290519   51          3759   17            39022       3659   5
19      1647418052   62          3530   11            39130       3601   2
20      1675546029   43          2938   16            35235       3604   2
Mean                 116         2803   9             37530       3064   4
SE                   18          192    1             316         180    0

*Time is measured in central processor unit seconds on a VAXstation 4000, model 90.

For each artificial genome generated and mapped in Table 2, we compared the ability of the pairwise total linking distance in Eq. 1 and of the triplewise total linking distance in Eq. 2 to recover the true ordering of clones (known by virtue of the genomes being artificially generated). For simplicity, the clones were indexed by their true order, l = 1, ..., n, in each simulation (as opposed to a grid location on a Petri plate), and the inferred ordering of clones $i_1, \ldots, i_n$ was obtained by minimizing Eq. 1 or 2 with the random cost algorithm. If there are $n_c$ clones within a contig, the map reconstruction error within each contig was measured by

$$\sum_{l=1}^{n_c-1} |i_{l+1} - i_l| - (n_c - 1) \tag{3}$$

and summed over contigs to yield the map reconstruction error (e) within contigs. If the true ordering of clones within the first contig were recovered during map reconstruction, then $i_1, \ldots, i_{n_c}$ would be the true ordering, 1, ..., $n_c$, and the error in summation 3 would be zero. The error is the number of additional steps taken by the inferred ordering (relative to the true ordering) in walking along a contig and was introduced originally to assess the quality of maps produced by minimizing the pairwise total linking distance (7).
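The within-contig error of Eq. 3 is simple to compute once clones are indexed by their true order. The short Python sketch below uses hypothetical function names and assumes each contig is given as the list of true indices in inferred order.

```python
def contig_error(inferred):
    """Eq. 3 for one contig: extra steps relative to walking the clones in true order."""
    n_c = len(inferred)
    steps = sum(abs(inferred[l + 1] - inferred[l]) for l in range(n_c - 1))
    return steps - (n_c - 1)


def map_error(contigs):
    """Map reconstruction error e: Eq. 3 summed over all contigs."""
    return sum(contig_error(contig) for contig in contigs)
```

For example, `contig_error([1, 2, 3, 4])` is zero, and so is the fully reversed contig `[4, 3, 2, 1]`, since a map has no preferred direction; swapping the two middle clones to give `[1, 3, 2, 4]` costs two extra steps.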
There is also one other complementary measure of map reconstruction error between contigs, which is important to assess because of its effect on the whole perceived progress of a mapping experiment. The progress of a mapping experiment is usually monitored by charting the number of contigs observed with each probe added (13). The number of false joins between clones will affect perceived mapping progress and is one important measure of the map reconstruction error between contigs. Because the triplewise total linking distance is a more complicated criterion, we decreased the noise more slowly by setting $\delta = 10^{-7}$. The numbers of false joins as well as the map reconstruction error (e) are reported for 20 artificial genomes in Table 2 when the random cost algorithm ($\delta = 10^{-7}$) is used to minimize pairwise total linking distance (Eq. 1) and triplewise total linking distance (Eq. 2). While the map reconstruction errors within contigs are not significantly different between the two criteria (Eqs. 1 and 2) by a pairwise t test at the 0.05 level, the map reconstruction error between contigs is significantly reduced by the triplewise total linking distance according to a pairwise t test at the 0.05 level. Use of the triplewise total linking distance cuts in half the number of false joins. The price paid is to run the random cost algorithm overnight.
In summary, this computational approach provides an adaptable tool for a variety of physical mapping experiments, which can be run on PCs and Macintoshes with reasonable performance even on large chromosomal clone/probe hybridization matrices like Fig. 1. This approach also makes feasible the assessment of the reliability of physical maps by bootstrap resampling and the consideration of more sophisticated physical mapping criteria, such as linking distances involving clonal triples, to improve map quality while maintaining computational performance comparable to existing algorithms.
1. Cohen, D., Chumakov, I. & Weissenbach, J. (1993) Nature (London) 366, 698-701.
2. Lander, E. S. & Green, P. (1987) Proc. Natl. Acad. Sci. USA 84, 2363-2367.
3. Berg, B. A. (1993) Nature (London) 361, 708-710.
4. Green, E. D. & Olson, M. V. (1990) Science 250, 94-98.
5. Foote, S., Vollrath, D., Hilton, A. & Page, D. C. (1992) Science 258, 60-66.
6. Maier, E., Hoheisel, J. D., McCarthy, L., Mott, R., Grigoriev, A. V., Monaco, A. P., Larin, Z. & Lehrach, H. (1992) Nat. Genet. 1, 273-277.
7. Cuticchia, A. J., Arnold, J. & Timberlake, W. E. (1992) Genetics 132, 591-601.
8. Brody, H., Griffith, J., Cuticchia, A. J., Arnold, J. & Timberlake, W. E. (1991) Nucleic Acids Res. 19, 3105-3109.
9. Sambrook, J., Fritsch, E. F. & Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Lab. Press, Plainview, NY), 2nd Ed.
10. Hoheisel, J. D., Maier, E., Mott, R., McCarthy, L., Grigoriev, A. V., Schalkwyk, L. C., Nizetic, D., Francis, F. & Lehrach, H. (1993) Cell 73, 109-120.
11. Mizukami, T., Chang, W. I., Garkavtsev, I., Kaplan, N., Lombardi, D., Matsumoto, T., Niwa, O., Kounosu, A., Yanagida, M., Marr, T. G. & Beach, D. (1993) Cell 73, 121-132.
12. Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans (SIAM, Philadelphia).
13. Arratia, R., Lander, E. S., Tavaré, S. & Waterman, M. S. (1991) Genomics 11, 806-827.