Ordering Clone Libraries in Computational Biology

0 downloads 0 Views 250KB Size Report
Feb 16, 1998 - chromosome there are on the order of 108 nucleotides, and the ... Assuming this model, the question was posed in 2] and 7] as to how many ... the correct ordering of the clones C1; C2;:::;Cn. We wish to ... Di;j = j i;jj for i 6= j 2 V . Let Di = f j;i : j 2 V n figg for i 2 V . We will .... enough K and su ciently small ,.
Ordering Clone Libraries in Computational Biology Martin Dyer School of Computer Studies University of Leeds, Leeds, UK Alan Friezey Department of Mathematics Carnegie Mellon University, Pittsburgh PA, USA Stephen Suen Department of Mathematics University of South Florida, Tampa FL, USA February 16, 1998 Abstract We consider a probabilistic model, due to Lander and Waterman and to Alizadeh, Karp, Newberg and Weisser, for the physical mapping of DNA molecules. Within this model, we answer precisely a question of Alizadeh et al concerning the minimum number of probes required to reconstruct the entire ordering of a given clone library with high probability. We also examine the related problem of determining the least number of probes required to construct a \tiling" for the library. We give a fairly precise characterization for this number.  y

Supported in part by Esprit Working Group RAND. Supported in part by NSF grant CCR-9225008

1

1 Introduction The objective of many e orts in molecular biology, including the Human Genome project, is to sequence chromosomal DNA, i.e. to obtain the sequence of A, C, T or G nucleotides which constitute one strand of each molecule. This may be a complex task since, for example, in a typical human chromosome there are on the order of 10 nucleotides, and the whole genome comprises 23 pairs of chromosomes. Modelling this combinatorial complexity leads to interesting mathematical and computer science problems. See [7]. 8

A sub-goal of DNA sequencing is often to construct a physical map, which speci es the location of speci c identi able fragments on the molecule. In the usual procedures for physical mapping, it is necessary to extract information from fragments of the molecule called clones. These might typically contain about 10 nucleotides. A clone library is a collection of clones covering one or more molecules of interest, for example the human genome. One approach to physical mapping is to locate (by hybridization) the occurrence of short sequences called probes within the clones. See [2]. In practice, the identi cation process may also involve errors, which further complicates matters. 4

Lander and Waterman [5] proposed a probabilistic model for the location of clones in physical mapping. Their model was re ned by Alizadeh, Karp, Newberg and Weisser [2] to encompass the occurrence of probes within clones. We will describe the model in detail in Section 2. 2

Assuming this model, the question was posed in [2] and [7] as to how many probes are required to correctly order a given clone library, even assuming that the data is error-free. From the probabilistic viewpoint, the question concerns the existence of a \threshold" [1]. In Section 6 we will answer this question precisely. From a practical viewpoint, the numbers required are fairly discouraging. Therefore, in Section 7, we will describe and examine a natural related problem, having a more modest objective than that of ordering the entire clone library. Again we can give reasonably precise characterizations for the numbers of probes required. For this second problem, in contrast with the rst, the numbers required are rather more encouraging for the practitioner.

2 The model Let V = f1; 2; : : : ; ng. Then we will consider the following model: (a) The genome is represented by the interval [0; L]. (b) There is a library of n clones, each being an interval Ci (i 2 V ) of unit length with Ci  [0; L]. (c) The left end-points of the clones Xi (i 2 V ) form a Poisson process of density = n=L on [0; L ? 1] { see [3]. n will be our expected number of 3

clones and whp the actual number of probes  satis es  = n + o(n). We assume that the Xi are ordered in increasing order. Note that with probability 1 no two Xi are equal in this model. We will say Ci is \to the right" of Cj if Xi > Xj , otherwise \to the left". clones and their left hand endpoints will be uniformly distributed once we condition on the value of  . 1

2

(d) There are m probes, and the occurrences of each probe form a Poisson process with rate  on [0; L]. These m Poisson processes are mutually independent. We are interested in the case where L is large, and order-of-magnitude statements will be as L ! 1. The problem is to correctly identify the ordering of the clones in the library using only the information contained in the probes. Thus for each clone and each probe we are told whether the clone contains the probe. The information can be represented as a n  m 0-1 matrix. (See, for example, [2] for more details.) Given only this information the hope is that one can nd the correct ordering of the clones C ; C ; : : : ; Cn. We wish to determine the minimum number of probes for which this happens whp. Assuming the 1

2

With high probability i.e., with probability 1-o(1) as ! 1 The actual model proposed in [2] and [5] was uniformly randomly chosen points. Our analysis was originally done in this model. We have opted here for the Poisson model, as suggested by a referee, since (i) it yields essentially the same result, and (ii) the computations are generally easier. 1

L

2

n

4

:

above model, we answer this question precisely in Section 6, as follows. Let n = n(L) tend monotonically to in nity with L. Let p (n) be the probability that n clones form one connected component, and p (n) be the probability that their correct ordering can be found from the data. Clearly p (n)  p (n). Let us call n(L) minimal if p (n) ! 1, but, for any n0(L) such that limL!1 n0 (L)=n(L) < 1, we have p (n0) ! 0. Then we show that 0

1

1

0

0

0

Theorem 1 Let n(L) be minimal, and let m = (n)n log n as n ! 1, then 8 > if (n) ! 0; < 0 (1) p (n) ! > expf?e =(2 ) if (n) ! ; a constant, : 1 if (n) ! 1: 1

Furthermore the algorithm order, described below, satis es

lim Pr(order nds the correct sequence) = nlim !1 p (n):

L!1

1

Note that this number m is in fact rather large, about n log n. Therefore, in Section 7 we examine a more modest objective: that we correctly order a connected subset of the library which covers \almost all" of the genome. We call this a tiling of the clone library.

Remark We will see that we need n = L(log L +log log L + !) where ! ! 1 with n. Our proof of Theorem 1 is given only for ! = o(log L), but the reader may check that when ! = log L, Theorem 1 remains true with the exception that the limit in case (ii) is increased by a factor ( + 1). 5

3 Clone Ordering Algorithm Let Ii denote the set of probes incident with clone i. Let i;j = Ii n Ij and Di;j = ji;j j for i 6= j 2 V . Let Di = fj;i : j 2 V n figg for i 2 V . We will assume that ! = o(log n) and let  = n? = : 1 3

Let

D(x) = me? (1 ? e? )(1 ? x): We show later (Lemma 1) that whp Di;j  D = D() whenever Ci \ Cj = ; and that Di;i < D. +1

Next let

Mi = f 2 Di :  is minimal and jj < Dg: and consider the graph G = (V; E ) where

E = f(i; j ) : j;i 2 Mi or i;j 2 Mj g: In ideal circumstances G will be a path with vertices in clone order. Whp this will be close to being true. Disjoint clones tend to have Di;j  D and so not yield edges. Most of the remaining edges are of the form (i; i + 1). Algorithm order

(0) Construct G. (1) Find a Hamilton path in G. 6

Of course Step 1 is intractable unless we can prove that whp G has a special structure. We prove that whp either

(a) 9i 6= j such that Ii = Ij , in which case there is more than one sequence of clones compatible with the data, OR

(b) Ii 6= Ij for i 6= j and G consists of the path H = (1; 2; 3; : : :; n) plus a set of edges (i; i + 2) for i 2 K , where i 2 K implies that i ? 1 62 K . In Case (b) G consists of a sequence of jK j triangles joined by paths. each triangle (x; y; z) contains a unique vertex of degree 2, y say. Delete the corresponding edge (x; z) for each triangle. We are left with the path H .

4 Outline of Proof The choice of m = n log n is determined by the fact that for constant, the expected number of pairs i; j for which Ii = Ij tends to e=(2 ) as L (and n) tend to 1. In fact the number of pairs is asymptotically Poisson (see Lemma 5), which explains the form of (1). Let us now work under the assumption that Ii 6= Ij for i 6= j . We must show that whp G has the particularly simple form described above.

 Lemma 1 shows that whp Di;i < D < Dj;i for Ii \ Ij = ;. Then G will not contain any edges for which Ii \ Ij = ;. +1

7

 Lemma 2 rules out edges (i; k) for jk ? ij > 2. This is done by showing that whp k;i will not be a minimal member of Di, assuming Ik \Ii 6= ;.  Lemma 3 shows that whp (i; i + 1) is an edge of G for 1  i  n ? 1.  Lemma 4 shows that conditional on Ii 6= Ii , whp at most one of (i ? 1; i + 1) and (i; i + 2) is an edge of G. +1

 At this stage we know that if the Ii's are distinct then whp G has the stated structure and the correct clone ordering can easily be found.

 Lemma 5 then determines the asymptotic probability that the Ii's are distinct.

5 Preliminaries Let Zi; i = 1; 2; : : : ; be an in nite sequence of independently and identically distributed exponentials with parameter . Let Xi = Z + Z +    + Zi and let  = maxfi : Xi  L ? 1g. Thus X ; X ; : : : ; X are distributed as the left hand endpoints of the clones in our model. 1

1

2

2

Now observe that there is a \gap" in the coverage if there exists i   such that Zi > 1. However, Pr(9i   : Zi  1) 

 8

Z L?1

x=0 ne? :

e? dx

Now suppose n = L(log L + log log L + !) for some !, we have Pr(9i : Zi  1)  (1 + o(1))e?! : Thus, if ! ! 1, then whp there will be no gap. On the other hand, if ! ! 1, then since  is Poisson with mean n,   n=2 whp. Furthermore the number Z of indices i  n=2 such that Zi > 1 is distributed as a binomial with mean ne? =2 ! 1. Hence Z 6= 0 whp. Hence constant ! is the threshold for connectivity of the clones. This result is well known in di erent contexts, see [4] and the references contained therein. gap

gap

Note that ordering will be impossible if the clone library is not connected since, no matter how many probes are used, there will be no way of detecting the order in which the disconnected pieces should be placed. Thus, to ensure connectivity with high probability, we must have ! ! 1. Note now that, in our earlier de nition, the clone library is minimal if and only if ! = o(log n). However, most of our proofs are easily modifed for the case ! = (log n). The details are left to the reader. However, in practice, clone libraries are designed to provide small constant coverage of the genome, with a coverage factor of about ve, say. In the model, k-times coverage of an interval including most of the genome corresponds to taking n(L) at around L(log L + k log log L), or ! at about (k ? 1) log log L. (See, by comparison, [4].) Thus this region of greatest interest for ! falls well within the scope of our results. It will be observed, that in this model, short segments at either end of the 9

genome will (with probability 1) be uncovered by any clone. This is a small de ciency in the model which we ignore, assuming that these segments are to be handled separately.

6 Threshold Let m = n log n. We show that the threshold for ordering the entire set of clones occurs at constant , and hence prove Theorem 1. We show rst that with high probability adjacent clones have smaller di erence sets i;j than disjoint clones. Let  = n? = , as before. 1 3

Lemma 1 Let D() = me? (1 ? e?)(1 ? ). Then whp, for every i 2 V , Di

;i

+1

< D and minfDj;i : Cj \ Ci = ;g > D:

Proof For any constant K > 0, Pr(9i   : Zi > 1 ? K) 

Z L?1

x=0

expf?(1 ? K) gdx

 n expf? log L ? log log L + ! ? O( log L)g = o(1): Thus assume that Zi < 1 ? K for i   . Then probes fall in i ;i independently with probability dominated by e?(1 ? e? ?K ). But, for large enough K and suciently small , +1

(1

)

me? (1 ? e?)(1 ? )  (1 + )me? (1 ? e?

?K) ):

(1

10

Thus Pr(9i : Di

;i

+1

> D)  E( expf?  me? (1 ? e? 1 3

2

?K) )g) = o(1):

(1

Now, if Cj \ Ci = ;, then probes fall in Ij n Ii independently Hence we have Pr(9i; j : Dj;i < D)  E( expf?  me? (1 ? e?)g) = o(1): 2

1 3

2

2 We now prove the crucial properties of G.

Lemma 2 With high probability fi; kg is not an edge of G for all jk ? ij > 2. Proof Suppose the Lemma is false, and let us assume without loss that k > i + 2 and k;i 2 Mi. (The case k < i ? 2 is symmetric.) Hence Ck \ Ci = 6 ; by Lemma 1. Suppose i < j < k. Then Cj  Ck [ Ci. Hence Ij  Ik [ Ii. Thus Ij n Ii  Ik n Ii, i.e. j;i  k;i and we must have j;i = k;i by minimality. Hence r;i = i ;i for all i < r  k. We have +1

four successively overlapping intervals Ci; Ci ; Ci ; Ci . +1

+2

+3

Given such a quadruple, let w = Zi ; x = Zi ; y = Zi . If the quadruple satis es i ;i = i ;i = i ;i, then every probe which falls in [Xi + 1; Xi + 1] must also fall in [Xi; Xi + 1]. This has probability +1

+1

+2

+2

+3

+1

+3

+1

(1 ? e?

+3

w

(1+ )

(1 ? e? x y ))m : ( + )

11

Thus the event E that there exists any quadruple meeting this condition satis es 1

Pr(E )  1

Z L?1 Z

1

Z

Z

1

1

=0 w=0 x=0 y=0 n4 Z 1 Z 1 Z 1

 L

3

= Ln

4 3

dw

0

Z

1

0

dw

dx

0

Za 0

dx

0

Za 0

e? w 4

x y

( + + )

dy(1 ? e?

(1 ? e?

w

(1+ )

w

(1+ )

(1 ? e? x y ))mddwdxdy ( + )

(1 ? e? x y ))m ( + )

dy(1 ? e? (x + y))m + o(n? ) 2

1 2

1

where a = n? = ; 1 2

 Ln

4 3

 Ln

4 3

Z

1

0

Z

0

1

dw dw

Z

1 0

Z

0

1

dy(1 ? e?  (x + y))m

2 Z a e?bmz dz

 = n O 1 L m ! = O log n : n

1 2

2

where b = e? =2; 2

0

4

3

dx 

2

2

Lemma 3 With high probability fi; i + 1g is an edge for i = 1; 2; : : : ; n ? 1. Proof Otherwise there exists an i such that i ;i 62 Mi and i;i 62 Mi . This only occurs if, for some i, i? ;i  i ;i and i ;i  i;i . Let us call this event E , and let w = Zi; x = Zi ; y = Zi . As in Lemma 2, we have a quadruple of sets which overlap in succession, and if E occurs, every +1

1

+1

2

+1

+2 +1

+1

+1

+1

+2

2

probe which falls in either [Xi? ; Xi] or [Xi + 1; Xi + 1] must also fall in [Xi ; Xi + 1]. This has probability (1 ? e? x (1 ? e? w y ))m. Hence, 1

+1

+2

(1+ )

+1

12

( + )

similarly to Lemma 2, Pr(E ) < Ln

4

2

3

Z 0

1

dw

Z

1 0

dx

Z 0

1

dy(1 ? e?(1+x) (1 ? e?(x+y) ))m

!

= O logn n :

2 From Lemmas 2 and 3, we see that G will be a path with the possible exception of cases where an edge of the form fi; i + 2g exists. If such an edge exists then, unless fi ? 1; i + 1g is also present, G decomposes into two smaller graphs joined by the bridge fi ? 1; ig. Hence we can recursively subdivide the order-reconstruction problem. The only diculty occurs when both fi ? 1; i +1g and fi; i +2g are present, since then it is not clear whether i +1 should be placed before or after i. Let E be the event that this happens for some i, let E i be the event fIi = Ii g and E = Sni E i . Note that E i = fi ;i = i;i = ;g. Clearly, if E occurs, we cannot order all the clones, since there will be a pair Ci; Ci which could be ordered incorrectly but consistently with the data. On the other hand, if E does not occur, then we have argued that ordering is possible. (And is, in fact, achievable by a simple polynomial time algorithm). Clearly E implies E , since if Ii = Ii then i;i? = i ;i? and i;i = i ;i . We now show that the converse is almost always true. 3

( ) 4

( ) 4

+1

+1

4

+1

=1

( ) 4

4

+1

3

4

1

+1

1

+2

3

+1

+1 +2

Lemma 4 Pr(E n E ) = o(1). 3

4

Proof We will use the notation from the proof of Lemma 3. If E occurs, 3

13

then for some i, the following two events occur. (i) Either i? ;i 2 Mi or i ;i? 2 Mi? , (ii) Either i ;i 2 Mi or i;i 2 Mi . Reasoning as in Lemmas 2 and 3, this gives four possible cases: 1 +1

+1

+2

+1

1

+2

1

+2

(a) For some i, i? ;i = i;i and i ;i = i ;i. Call this event Ea . we see that every probe which falls in either [Xi? ; Xi] or [Xi + 1; Xi + 1] must also fall in [Xi ; Xi + 1]. The probability of this was bounded by O(log n=n) in the proof of Lemma 3. Thus Pr(Ea) = O(log n=n). 1 +1

+1

+2

+1

1

+1

+2

+1

(b) For some i, i? ;i = i;i and i;i = i ;i . Call this event Eb. We see that every probe which falls in [Xi? ; Xi] falls in [Xi; Xi + 1] and every probe which falls in [Xi; Xi ] falls in [Xi ; Xi + 1]. Thus every probe which falls in [Xi? ; Xi ] falls in [Xi ; Xi + 1]. But the probability of this event can bounded as in the proof of Lemma 2. Thus 1 +1

+1

+2

+1 +2

1

+1

+1

1

Pr(Eb) < Ln

4 3

Z 0

1

dw

Z

1 0

dx

+1

Z 0

1

+1

+1

+2

+2

dy(1 ? e?(1+y) (1 ? e?(w+x)))m = O

!

log n : n

(c) For some i, i ;i? = i;i? and i ;i = i ;i. This event, Ec say, is simply the reverse (in clone ordering) of Eb, and hence Pr(Ec) = O(log n=n). +1

1

1

+2

+1

(d) For some i, i ;i? = i;i? and i;i = i ;i . Call this event Edi , for given i, and let Ed = Sni Edi . Note that E i  Edi , so we must estimate more carefully than above. If Edi occurs, then every probe which falls in [Xi + 1; Xi + 1] falls in [Xi? ; Xi + 1] and every probe which falls +1

1

1

( )

+2

( )

( ) 4

=1

( )

+1

1

14

+1 +2

( )

in [Xi; Xi ] falls in [Xi ; Xi + 1]. The probability of this event is +1

+1

+2



1 ? (1 ? e?x )(e?

w

(1+ )

+ e?

y

(1+ )

m

) ;

since the following two events are disjoint: A given probe falls (i) in [Xi + 1; Xi + 1] and not in [Xi? ; Xi + 1]. +1

1

(ii) in [Xi; Xi ] and not in [Xi ; Xi + 1]. +1

+1

+2

Let Nd be the number of events Edi which occur. Then, ( )

E(Nd )  n

Z 1Z 1Z 1 0

=

 =

Z 1Z 1

0



+ e?

y

(1+ )

m

) e? w



3

exp ?me? (e?w + e?y )x e? w 3

x y

( + + )

x y

( + + )

dw dx dy

dw dx dy

e? w y me? (e?w + e?y ) + dw dy n Z 1 Z 1 e? w y dw dy me? e?w + e?y n Z 1 Z 1 e? w y me? e?w= + e?y= dw dy n : 1 (since ! 1 with n) me? 2 e (since  log n, m = n log n): 2 

= n



0

0

w

(1+ )

Z 1Z 1Z 1 0

0

 n

1 ? (1 ? e?x)(e?

0

2

( + )

0

2

0

( + )

0

( + )

0

0

Now let N be the number of events Ei which occur. The event E i occurs if and only if there is no probe in [Xi ; Xi ] which is not in [Xi ; Xi + 1] ( ) 4

+1

15

+1

+1

and no probe in [Xi + 1; Xi + 1] which is not in [Xi; Xi + 1]. Thus, using disjointness as in (d) above, +1

Z1

E(N )  n (1 ? 2e?(1 ? e?z )m e?z dz Z1  n exp(2me?z) e?z dz n  2me ? 0

0

  2e  :

But this is asymptotically equal to E(Nd ). Now E i  E i , and hence ( ) 4

n  [ (i)

Pr(E \ E )  Pr 3

4



n X i=1 n X

i=1

E \ E

4

3

! i

( ) 3

( )

Pr(E i \ E i ) ( ) 3

4

( )

fPr(E i ) ? Pr(E i )g i = E(Nd ) ? E(N )

=

( ) 3

( ) 4

=1

= o(1): We have therefore shown that, asymptotically, ordering depends on the occurence or otherwise of E , and that the expected number of such events is e=(2 ). We now complete the proof of Theorem 1. 2 4

Lemma 5 Pr(E ) ! expf?e =(2 )g as n ! 1. 4

Proof Let  = e=(2 )  E(N ). For the remainder of the proof of the lemma, we condition on the value of  . So we may assume that   n left 16

hand endpoints are chosen uniformly from [0; L ? 1]. Then Pr(Ei)  =n. Let us say Cj misses Ci if Cj [ Cj does not meet Ci [ Ci . Let i be the number of Cj which do not miss a given Ci. Then if t = d10 log ne, say, +1

+1

!

 1 < n? : Pr(max   40 log n j  )  4 i i t Lt 10

Thus we may assume maxi i < 40 log n. But, for given i and j such that Ci does not miss Cj , a calculation similar to that in part (a) of the proof of Lemma 4 gives Pr(Ei \ Ej ) = O(1=m ). Thus Pr(Ej j Ei) = O(n=m ) However, if Ci misses Cj then Ei; Ej are independent. Consider PS Pr(Ti2S Ei) over all sets S size k for constant k. If Ci misses Cj for all i; j 2 S then Pr(Ti2S Ei)  (=n)k . Hence the contribution to the sum from these terms is k =k!, since there are about nk =k! such sets. However if S contains two sets that do not miss, let r be the maximum number of i 2 S such that the Ci miss one another. The contribution from such sets is then at most 2

kX ?1  r r=1

r!

(40r log n)k?r

 r 

2

!

n  = O (log n)k? :  O n m n 

3

2

Hence, for k = 1; 2; 3; : : :, we have Pr( Ei)   : k! i2S S jS j k X

:

k

\

=

Then, by inclusion-exclusion, Pr(

[

i2V

Ei ) !

1 X

k (?1)k?  = 1 ? e? : k! 1

k=1

2 17

Hence, we have Theorem 1 for constant . For ! 1 slowly enough, we therefore have Pr(E ) ! 1, and for ! 0 slowly enough, we have Pr(E ) ! 0. Theorem 1 now follows by observing that the probability that we can reconstruct the clone ordering is monotone in the number of probes for a xed number of clones. 4

4

7 Constructing a tiling In this section, we consider the problem of constructing a tiling, i.e. a subset of the clone library which is correctly ordered and covers the genome, with the exception only of regions of length o(1) at either end as L ! 1. We will show that this requires considerably fewer probes than in Section 6, since we are not obliged to give any ordering information on the clones which are not in the tiling. We will show that m = o(log L) probes will always suce, and that (log L) probes are both necessary and sucient for this task. 3

Suppose Ci \ Cj 6= ;, and Xj ? Xi = x. Then Dj;i is binomial with parameters m, e? (1 ? e?x). Let us write (x) = me? (1 ? e?x) for the expected value. Clearly (x) is an increasing function of x. Let now  = ! = = log n. Then, for each i 2 V , de ne 1 2

Si = fj : ( )  Dj;i  (1 ? )g: 2

Now consider the following algorithm for nding an ordered subset of the 18

clones. (0) Pick any clone Ci1 , and nd a j 2 Si such that Dj1;i1 is maximum. Place clone Cj1 adjacent to Ci1 in the chosen subset. Let i i ; j j . 1

1

1

(1) If there is k 2 Sj such that Di;k > Dj;k and Dk;i > Dj;i, pick one such that Dk;j is maximum. Place clone Ck adjacent to Cj on the opposite side from Ci in the chosen subset. Let i j , j k and repeat this step. (2) Otherwise reverse direction. Let j i , i the loop terminates again, then stop. 1

j . Repeat step (1) until 1

Theorem 2 If m(n) is such that log n= = o(m), then with high probability 2

the algorithm will succeed in correctly determining a linear order on a subset of the clones. Futhermore, if S is the set of clone numbers in the chosen subset, then jS j = O(L) and

min X ? X <  ; X ? max X D(K), if i 6=  , there is a least one j (i.e. i + 1) such that Dj;i  (1 ? ). Also, ( )  (2 ). Thus, if Xj ? Xi > 2 for 1 2

2

2

2

any i; j , then Pr(Dj;i < ( ))  exp(? (1 ? o(1))( ) (2 )): 2

1 2 2

1 3

2

Hence Pr(9i; j : Xj ? Xi > 2 and Dj;i < ( ))  n exp(? (m )) = o(1); 2

2

2

2

since log n = o(( )). 2

Thus if Si = ;, we may assume Xj ? Xi < 2 for all j such that Ci \ Cj 6= ;. Let Xj be maximum such that Xj  Xi +1. Clearly Xj 2= [Xj ; Xi +1], and thus if j <  we have Zj > 1 ? 2 , contradicting the proof of Lemma 1. 2 2

+1

2

+1

Lemma 8 If X ? Xj  2 , then whp there will exist a choice for k in 2

step (1) of the algorithm.

Proof We have ( )  2( =2). If Xj ? Xi <  =2 for any i, j with j 2 Si, 2

2

2

then Pr(Dj;i  ( ))  exp(? (1 ? o(1))( )): 2

2

1 3

Hence, Pr(9i; j : Xj ? Xi <  =2 and Dj;i  ( ))  n exp(? (m )) = o(1): 2

2

20

2

2

Thus we may assume that Xj ? Xi >  =2. If j 6=  , from Lemma 7 there will exist a k 2 Sj , and we will have Xk ? Xj > 2 . 2

2

First suppose Ci \ Ck 6= ;, so Ci \ Cj \ Ck 6= ;. Then j;k  i;k , thus Dj;k  Di;k only if j;k = i;k . But since Xj ? Xi >  =2 and Xk ? Xj < 1, the existence of such an i, j and k has probability less than 2

n (1 ? e?  (1 ? e?2 = ))m = o(1): 3

2

2

Similarly, since Xk ? Xj >  and Xj ? Xi < 1, n Pr(Dj;i  Dk;i) = o(1). 2

3

Suppose nally, that Ci \ Ck = ;. Now, since j 2 Si, Dj;i < (1 ? ) < D(=K ). But, from Lemma 1, Dk;i > D(=K ), so Dk;i > Dj;i. Similarly Di;k > D (=K ). If Xk ? Xj > 1 ? =K , then 2

Pr(Dk;j < (1 ? )) < exp(? (m )) = o(n? ): 2

2

In ating the right side of the above inequality by n deals with the existence of any such pair k; j . Since k 2 Sj we may now assume Xk ? Xj  1 ? =K . But D(=K )  (1 + )(1 ? =K ) for some xed  > 0. So 2

2

Pr(Dj;k > D(=K ))) < exp(? (m )) = o(n? ); 2

2

2

and in ation by n can be used as previously. 2

2

Lemma 9 It is correct to place Ck on the opposite side of Cj from Ci. Proof Suppose Xk < Xi < Xj . Then, since k 2 Sj , we have Ck \ Cj 6= ; and hence Ci \ Cj \ Ck 6= ;. But then Dj;k  Di;k , a contradiction. 21

Thus suppose Xi < Xk < Xj . Then, since j 2 Si, we have Ci \ Cj 6= ; and hence Ci \ Cj \ Ck 6= ;. But then Dj;i  Dk;i, again a contradiction. 2 Theorem 2 now follows. We see that when we are building the subset from \left to right", we cannot terminate until we have encountered a Cj such that X ? Xj < 2 . Similarly, by symmetry, when we are building the subset from right to left, we cannot terminate until we have encountered a Cj such that Xj ? X < 2 . Thus we have only to verify the claim concerning the number of clones selected. 2

2

1

Lemma 10 The algorithm will whp select O(L) clones. Proof We will merely sketch the method, leaving the routine calculations to the reader. Let j 2 B if and only if there is no k for which Xj + < Xk < Xj + . Then,

E(jB j) 

Z L?1

x=0

1 3

2 3

e? = dx = o(n = ): 3

2 3

Thus Pr(jB j > n = ) = o(1). If j 2= B , let k be such that Xj + < Xk < Xj + . By straightforward calculations we obtain 1 3

2 3

2 3

Pr(Dk;j < ( )) = o(n? ); Pr(Dk;j > (1 ? )) = o(n? ): 2

2

2

Thus with high probability we will have k 2 Sj in all such cases. Now, if Xj + < Xk < Xj + and Xi < Xj , similar calculations give 1 3

2 3

Pr(Di;k  Dj;k) = o(n? ); Pr(Dk;i  Dj;i) = o(n? ): 3

3

22

Thus k will be a valid selection in step (1). However, if ` 2 Sj is such that X` < Xj + , then letting t = (( ) + ( )), further calculations show 1 4

1 2

1 4

1 3

Pr(D`;j  t) = o(n? ); Pr(Dk;j  t) = o(n? ); 2

2

so with high probability D`;j < Dk;j . Thus if j 2= B , the maximization in step (1) will ensure that we choose k so that Xk ? Xj  . 1 4

Now suppose the algorithm chooses g clones Ci with i 2= B , and b with i 2 B . Then these cover an interval of length at least (g ? 1)  L, so g  4L + 1. Thus the total number selected is 1 4

g + b  4L + 1 + n = = O(L): 2 3

2 Note that if m(n) = !0 log n= and  = !0= log n, where !0 = o(!) but !0 ! 1, then m = log n=!0 = o(log n): 2

3

3

Thus o(log n) clones will always suce. On the other hand, suppose ! = p log n= !0, where !0 tends arbitrarily slowly to in nity. Thus the the number of clones is \maximally minimal". Then letting  = 1=!0, say, gives m = (!0) log n, so we need a little more than log n probes. 3

3

We also have the following 23

Corollary 1 If m(n) = log n, then whp Sj2S Cj = Sni Ci. 3

=1

Proof First suppose ! = O(plog n). Take  = !0= log n, where !0 = o(!) but !0 ! 1. Thus log n probes suce, and  = o(1= log n). Hence 3

2

Pr(Z <  ) = 1 ? exp(?(1 + o(1)) log n) = o(1); 2

2

and thus, by Lemma 8, the loop in algorithm must terminate with j =  . Similarly for the reverse direction.

p

Now if log n = o(!), take  = 1=!. Then a little greater than ! log n = o(log n) probes suce, and again  = o(1= log n). The remainder of the argument is as before. 2 2

3

2

Thus, with log n probes, we guarantee to cover the entire interval represented in the clone library with high probability . 3

Although our proofs are given only for ! = o(log L), similar methods extend to faster growing values of !. However, we may deduce from the above that, if we take ! = (log L), O(log n) probes will suce. We use the following simple Lemma.

Lemma 11 Let f (n); g(n) be positive functions. If f = o(h) for all positive h(n) such that g = o(h) as n ! 1, then f = O(g). Proof If f 6= O(g), there is an increasing sequence ci ! 1 such that for all i, there exists ni such that f (ni) > cig(ni). Assume without loss that ni 24

is nondecreasing, and de ne h by h(n) = cig(n) if ni  n < ni . Clearly g = o(h), so f = o(h). Thus f (ni) < ci g(ni) for large i, a contradiction. 2 +1

Lemma 12 Let " > 0. If n(L)  (1 + ") log L, then O(log n) probes suce to determine the tiling.

Proof For a xed number of probes m, the algorithm above clearly cannot have smaller probability of success with a larger number of clones. Thus the number m(n) required for any given probability of success is nondecreasing with n. Thus, if m is the number required when n = (1 + ")L log L, m  m(n) for all n such that ! = o(log n). Let h be arbitrary such that log n = o(h), and let  = (h= log n) = . Thus  ! 1. Let ! = log n= ,  = 1= , and m =  log n. Then Theorem 2 applies, and hence m  m = o(h). We now apply Lemma 11, with g = log n, f = m to complete the proof. 2 1 6

2

5

On the other hand, we have the following simple lower bound.

Lemma 13 At least L log L clones and log L probes are necessary to deter2

mine a tiling.

Proof We need L log L clones for connectedness, without which we clearly cannot determine a tiling. Any tiling must obviously have at least L clones. Since it is unambiguously ordered, no two clones can contain the same set of probes. Now, with m probes, there are at most 2m di erent sets. Thus 2m  L. 2 25

We may sumarise these results in the following \optimality theorem".

Theorem 3 n = (L log L) clones and m = (log L) probes are necessary and sucient to determine a tiling.

References [1] N. Alon, J. Spencer and P. Erd}os, The Probabilistic Method, John Wiley and Sons, New York, 1992. [2] F. Alizadeh, R. M. Karp, L. A. Newberg and D. K. Weisser, \Physical mapping of chromosomes: A combinatorial problem in molecular biology", Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (1993), 371{381. [3] R.Arratia, E.S.Lander, S.Tavare and M.S.Waterman, \Genomic mapping by anchoring random clones: a mathematical analysis", Genomics 11, 806{827. [4] L. Holst, \On multiple covering of a circle with random arcs", Journal of Applied Probability 17 (1980), 284{290. [5] E. S. Lander and M. S. Waterman, \Genomic mapping by ngerprinting random clones: A mathematical analysis", Genomics 2 (1988), 231{239.

26

[6] W. Feller, An introduction to probability theory and its applications Volume II, Second edition, John Wiley and Sons, New York, 1971. [7] R. M. Karp, \Mapping the genome: Some combinatorial problems in molecular biology", Proceedings of the 25th Annual ACM Symposium on Theory of Computing (1993), 278{285.

27