Kinship under the coalescent

Kinship under the coalescent Michael Blum Olivier Fran¸cois∗ TIMC-TIMB, Faculty of Medicine, F38706 La Tronche cedex, France

Abstract We study kinship in a sample of genes taken from a large population under the coalescent model of neutral evolution. We describe the level of the genealogy at which a given lineage coalesces with the rest of the sample. Then we study the time of coancestry and the number of pairwise differences with the sequence of a closest parent using the infinitely-many sites model of mutation. Finally, we give the probability distribution for the size of the smallest clade that contains a given gene.

∗

email: [email protected]

1

1

Introduction

This article describes kinship in a sample of n individuals (or genes) taken from a large population. The evolution of the population is assumed to be selectively neutral. In other words, genetic drift is the only factor responsible for the variation within the population. In this context, the coalescent approximation [4, 5] proved to be useful for describing the evolutionary relationships of DNA sequences of the individuals in the sample. In particular, the time since the most recent common ancestor of the sample has been extensively investigated both theoretically and experimentally (see [9, 2, 6] for surveys). This contribution studies the time since the coalescence of the lineage of an arbitrary individual with the lineages of its closest parents in the genealogy. The smallest family of this individual in the sample is studied as well. The organization is as follows. Section 2 recalls basic facts about the coalescent model. Section 3 describes the distribution of the number of lineages present in the sample at the moment of the coalescence. Section 4 establishes Laplace transform and probability density formulae for the coalescence time. Using the infinitely-many sites model of the DNA molecule, Section 5 describes the conditional distribution of the coalescence time given the number of substitutions between the sequences of a gene and one of its closest parents. This distribution corresponds to an explicit mixture of gamma distributions. In the last Section, we describe the size of the minimal family of an individual in the sample. The minimal family is a clade which contains the individual himself plus the subset of his closest parents. As a 2

remarkable result, we obtain that the asymptotic average number of closest parents is equal to 2. Throughout the article, we provide comparisons with a more traditional approach that studies the coalescence for two randomly chosen individuals.

2

Notations and Background

The coalescent is a random model that describes the behavior of the genealogy of a sample from the neutral Wright-Fisher model when the population size N grows to infinity [4]. The approximation arises as a diffusion limit when time is measured in units of N generations. Let n denote the size of a sample of individuals taken from the large population. Here, we study the distribution of the time necessary to find a common ancestor of an arbitrary individual with the rest of the sample. This time corresponds to the coalescence with a closest parent. The question of how many of “parents” is also addressed in the sequel. In the coalescent, one wishes to record information about the number of ancestors at various times and also information about which individuals share common ancestors. The ancestral process An (t) records the number of distincts ancestors of the sample a time t in the past. It can be described as a continous-time Markov chain on [n] = {1, . . . , n} such that An (0) = n, 1 is an absorbing state, and the rate of transition from k to k − 1 is equal to λk =

k(k − 1) , 2 3

k = n, . . . , 2.

This means that the times (Tk ), k = n, . . . , 2, separating coalescence events are independent and exponentially distributed with mean 2/k(k − 1). To understand the topology of the tree, one possibility is by labeling individuals in the sample from the set [n] and defining random equivalence relations so that i and j are in the same class at time t iff they share a common ancestor at this time. Denote by C(t) the random partition obtained from the former equivalence relation at time t. According to Kingman [4], the process C(t) is a continuous-time Markov chain on the set of all partitions of [n] (denoted En ) for which C(0) ≡ {{1}, . . . , {n}}. The transition rates of this Markov chain can be described as follows 1 if α ≺ β for all α, β ∈ En , qαβ = 0 otherwise, where the symbol α ≺ β means that α is a nested partition of β which may be obtained by merging two classes in α. The observation that 1 is an absorbing state of An (t) means that C(t) converges almost surely to {[n]}, the final state in which a single class remains. The embedded discrete-time Markov chain {Ck }, k = n, . . . , 1, moves through the sequence Cn ≡ C(0) ≺ Cn−1 ≺ Cn−2 ≺ . . . ≺ C1 ≡ {[n]} and has transition probabilities P(Ck−1 = αk−1 | Ck = αk ) = 4

2 , k(k − 1)

k = n, . . . , 2.

This happens if αk ≺ αk−1 and αk (resp. αk−1 ) has exactly k (resp. (k − 1)) classes, otherwise the probability is zero.

3

Coalescence levels

In this Section, we consider a specific realization of the random genealogy C(t). This realization can be represented as a rooted tree with all the individuals at the tips and the most recent common ancestor of the sample at the root. Consider an arbitrary individual in the sample. For example, we take the individual labeled 1. Define the coalescence time τn = sup{t ≥ 0 , {1} element of C(t)}. The coalescence level K of the individual 1 with the rest of the sample is defined as the random variable taking values k = n, n − 1, . . . , 2 so that K=k

iff

τn = Sk

where Sk ≡ Tn + · · · + Tk . In a more compact definition, K is related to the ancestral process as follows K = 1 + An (τn ). For sake of comparison, we will also describe the distribution of the level of coalescence for two arbitrarily chosen individuals. The two individuals can be labeled 1 and 2. Then let us define their coalescence time as τñ = sup{t ≥ 0 , {1, 2} elements of C(t)}. 5

We set ˜ = 1 + An (˜ K τn ). ˜ below. We give the distributions of K and K Proposition 1 For n ≥ 2, we have P(K = k) =

Proof.

2(k − 1) , n(n − 1)

for all k = 2, . . . , n.

Recall that, for all k = 2, . . . , n, we have P(Ck = α) =

(n − k)!k!(k − 1)! n1 ! . . . nk ! n!(n − 1)!

(1)

where α ∈ En is a partition in k classes such that each class has cardinality #αi = ni and n1 + · · · + nk = n. In the sequel, we shall use the notation |α| = k. For a proof of this classical result, see eg [9] p. 39, proposition 2.2.2. For k = 2, . . . , n, the probability that the coalescence occurs at a level of the genealogy lower than k is equal to P(K ≤ k) = P({1} ∈ Ck ). To compute the probability of this event, we can write X

P(K ≤ k) =

P(Ck = {1} ∪ α)

α:|α|=k−1

where the sum runs over all partitions of {2, . . . , n} into (k −1) classes. Using equation (1), we find that P(K ≤ k) =

X α:|α|=k−1

(n − k)!k!(k − 1)! n1 ! . . . nk−1 ! n!(n − 1)! 6

and since n1 + · · · + nk−1 = n − 1, we have X α:|α|=k−1

(n − k)!(k − 1)!(k − 2)! n1 ! . . . nk−1 ! = 1. (n − 1)!(n − 2)!

Therefore, we obtain that P(K ≤ k) =

k(k − 1) , n(n − 1)

k = 2, . . . , n,

which yields the desired result. The distribution of K has been obtained in a different way by Wiuf and Donnelly [13]. The mean and the variance of K can be computed from elementary algebra. We find that the expectation is equal to 2 E[K] = (n + 1), 3

n ≥ 2.

In order to compute the variance, recall that n X k=2

k3 =

1 (n + 1)4 − 2(n + 1)2 + (n + 1)2 − 1. 4

Then we have Var[K] =

1 2 (n + 31n − 2). 18

The interpretation is that the average level at which the coalescence occurs is closer to the tips of the tree than to the root. However the variance is relatively large with respect to the mean for large sample sizes. As far as the coalescence of two random lineages in the sample is concerned, we have the following probability distribution.

7

Proposition 2 For n ≥ 2, we have ˜ = k) = P(K Proof.

2 n+1 n − 1 k(k + 1)

for all k = 2, . . . , n.

The argument is based on the bivariate coalescent (see eg [10],

p.87) and a result of Saunders et al [7] that describes the joint distribution of B(t) = (Am (t), An (t)),

t ≥ 0,

where An (t) is the ancestral process at time t and Am (t) is the ancestral process of a subsample of size m ≤ n. Using [7], we find for k = 2, . . . , n, P(A2 (t) = 1, An (t) = k − 1) = P(An (t) = k − 1)

2(n − k + 1) . k(n − 1)

For 2 ≤ k ≤ n − 1, we have P(˜ τn ≤ Sk ) = Pr(A2 (Sk ) = 1) ≡ F (k) =

2n−k+1 k n−1

and then ˜ = k) = P(˜ P(K τn = Sk ) = F (k) − F (k + 1) =

n+1 2 . n − 1 k(k + 1)

For k = n, we have ˜ = n) = P(˜ P(K τn ≤ Sn ) = F (n) =

2 . n(n − 1)

The probability that two individuals share their most recent ancestor with the whole sample has previously been calculated by Watterson [12] whose result agrees with the fact that ˜ = 2) = P(K 8

1n+1 . 3n−1

˜ is very different from K. The nodes close to the root The distribution of K are given more important weights in the distribution of K. This can also be ˜ which is O(log n) in contrast with the O(n) seen with the average level E[K] result obtained for E[K].

4

Coalescence times

The main result in this Section is the description of the distribution of the coalescence time τn . It corresponds to the length of an external branch length in the terminology of Fu and Li [3]. Notice that the distribution of τñ is a well-known result in coalescent theory. This distribution is exponential with rate 1. Before giving the distribution of τn , we remark that the mean and variance of this random variable can be computed from proposition 1. Fu and Li [3] provided a different proof for thess results.

Proposition 3 Let n ≥ 2. Consider the coalescence time τn . We have E[τn ] =

2 , n

and Var[τn ] =

Proof.

4 . n2

Use the fact that τn = Tn + · · · + TK ,

and apply proposition 1. 9

For k = 2, . . . , n − 1, and k ≤ j ≤ n, let us denote Y

c(j, k) =

(`(` − 1) − k(k − 1)) .

j≤`6=k≤n

In addition, we set c(n, n) = 1. Then we have the following result. Theorem 1 Let n ≥ 2. The Laplace transform of τn is given by −sτn

Lτn (s) = E[e

]=

n X k=2

where

ak

λk , s + λk

s ≥ 0,

k

(n − 1)!(n − 2)! X c(j, k)−1 ak = . λk (j − 2)!2 j=2

(2)

The probability density function can be described as a mixture of exponential distributions f (t) =

n X

ak λk exp(−λk t),

t ≥ 0.

k=2

Proof.

According to proposition 1, we have −sτn

E[e

n n X 2(k − 1) Y λj ]= n(n − 1) j=k s + λj k=2

Using decomposition in simple elements, we have n Y

n

X 2n−k−1 c(k, j)−1 1 = . s + λj s + λj j=k j=k 10

Let −1

b(k, j) = c(k, j)

n (k − 1) Y (`(` − 1)) . n(n − 1) `=k

Reordering the sums, we find that n X n X b(k, j) k=2 j=k

s + λj

=

n k X X k=2

and λ k ak =

k X

! b(j, k)

j=2

1 s + λk

b(j, k).

j=2

5

Pairwise differences

In this section, a mutation process is superimposed on the coalescent. We suppose that DNA sequences are observed at the tips of the genealogical tree. The times at which mutations occur are modeled in the coalescent by assuming that these times form a Poisson process of constant rate θ/2, for some θ > 0. If a branch of the tree has length t, then the number of mutations has a Poisson distribution with mean θt/2, independently of the other branches. Among the various models that describe the mutation types, the infinitelymany sites model may be one of the most appropriate [11]. In this model, each DNA sequence consists of completely linked sites (ie, no recombination occurs). Each mutation occurs on a site of the DNA sequence that had not been mutated before, so that a new segregating site arises. The number

11

of segregating sites corresponds to the number of substitutions of ancestral bases since the most recent common ancestor in the sample. We study the number of substitutions ∆ in the DNA sequence of an arbitrary individual compared with the sequence of a closest parent. Recall that, for two arbitrary individuals, the number of pairwise differences has a shifted geometric distribution of parameter p = 1/(1 + θ). The mean is θ and the variance is θ + θ2 . Here we obtain the following result. Proposition 4 Let n ≥ 2 and assume the infinitely-many sites model for mutations in the coalescent. Then, the number of substitutions between a gene and one of its closest parent is distributed as a convex combination of shifted geometric distributions P(∆ = `) =

n X

ak Gs (pk , `),

` = 0, 1, . . . ,

k=2

where ak is given in Theorem 1 equation (2), Gs (pk , `) = (1 − pk )` pk ,

` = 0, 1, . . . ,

and pk =

Proof.

1 , 1 + θ/λk

k = 2, . . . , n.

Let θ > 0. Conditionally on τn = t, t ≥ 0, we have P(∆ = ` | τn = t) =

θ` t` −θt e , `!

12

` = 0, 1, . . . ,

and then (−1)` θ` (`) Lτn (θ) `!

P(∆ = `) = (`)

where Lτn is the `th derivative of the Laplace transform Lτn . Using Theorem 1, we are led to the result. The moments of ∆ were found be Fu and Li following a different method [3].

Corollary 1 We have E[∆] =

2 θ n

and Var[∆] =

Proof.

4 2 θ + 2 θ2 . n n

We use the fact that E[τn ] =

n X ak k=2

and that E[τn2 ]

=2

λk

n X ak

λ2 k=2 k

=

2 n

=

8 . n2

Under the infinitely-many sites assumption, Tajima [8] studied a sample of size two. After observing ` substitutions, it follows from Bayes Theorem that the conditional distribution of the coalescence time is gamma of shape parameter (1 + `) and scale 1/(1 + θ). Therefore, the conditional expectation 13

of τñ is equal to (1 + `)/(1 + θ) and the conditional variance is equal to (1 + `)/(1 + θ)2 . Similarly, the conditional distribution of the coalescence time τn given that ` substitutions are observed can be deduced from the Bayes formula as follows fτn |∆=` (t) =

θ` t` , `!P(∆ = `)

t ≥ 0,

Using proposition 4, the conditional density can be reformulated as a convex combination of gamma distributions fτn |∆=` (t) =

n X

a(k, `)G(1 + `, λk + θ)(t),

t ≥ 0,

k=2

where G(a, λ)(t) =

λa a−1 −λt t e , Γ(a)

t ≥ 0,

and, for k = 2, . . . , n a(k, `) =

ak Gs (pk , `) , P(∆ = `)

` = 0, 1, . . .

The exact computations of the conditional expectation and the conditional variance may not be so explicit. In order to provide numerical values, we may keep the following formulae E[τn | ∆ = `] =

(1 + `) P(∆ = ` + 1) , θ P(∆ = `)

` = 0, 1, . . .

Table 1 reports the values of P(∆ = `) for n = 10, 30, ` = 0, . . . , 9 and θ = 1, 10. Figure 1 displays the curves of fτn and fτn |∆=` for ` = 0, 2, 5 and θ = 10. Table 2 reports the values of the conditional expectation for θ = 1, 10 and ` = 0, . . . , 10. We remark that the distribution of ∆ is concentrated on 14

small integers, with a rapid decrease as the number of substitutions increases. As well, larger sample sizes lead to more concentrated distributions. On the other hand, the conditional expectations become rather independent on the sample size as the number of substitutions increases.

6

The minimal family

This Section deals with the size Xn of the minimal family of an arbitrary individual in the sample. By definition, the minimal family consists of the equivalence class α1 in the partition C(τn ) which contains the individual labeled 1 at the instant of coalescence of his lineage with the rest of the sample. Formally, Xn is then defined as Xn = #α1 ,

α1 ∈ C(τn ).

The main result in this Section is the exact description of the law of the random variable Xn .

Theorem 2 Let n ≥ 2. The random variable Xn has the following probability distribution P[Xn = x] =

4 , (x − 1)x(x + 1)

x = 2, . . . , n − 1,

and P[Xn = n] =

15

2 . n(n − 1)

A remarkable observation is that the distribution does not depend on n except for the event (Xn = n). Hence the small discrepancy between the finite and asymptotic probability values is created by the chance that the family consists of the whole sample. As a corollary of this result, the average family size can be computed easily. We find that E[Xn ] = 3 −

2 , n

n ≥ 3,

and the expected value converges to 3. In addition, the variance is equal to Var[Xn ] = 4

n X 1 i=3

i

− 6 + o(1),

n ≥ 3,

which is of order log n, ie converges to infinity slowly. Before giving a proof of the Theorem, we establish a useful combinatorial identity in the next lemma.

Lemma 1 Let n ≥ 4, and x an integer so that n > x ≥ 3. We have n−3 X

k(k − 1) · · · (k − x + 3)(n − k − 1)(n − k − 2) = 2

k=x−2

Proof.

n(n − 1) · · · (n − x) . (x − 1)x(x + 1)

First, use induction to prove that n−1 X

k(k − 1) · · · (k − x + 3) =

k=x−2

n(n − 1) · · · (n − x + 2) , x−1

and then use a similar recursion argument to prove that n−2 X

k(k − 1) · · · (k − x + 3)(n − k − 1) =

k=x−2

16

n(n − 1) · · · (n − x + 1) . x(x − 1)

Applying the same recursion scheme, the lemma follows from the above equations. Proof of Theorem 2.

Consider the coalescent starting with n lineages.

A standard result in coalescent theory states that if we pick one lineage at random when there are k ≤ n lineages, then the probability it will contain m of the n starting lineages is n−m−1 k−2 , P(Mkn = m) = n−1 k−1

1 ≤ m ≤ n − k + 1.

For a proof of this result, see eg [2] chap 1., eq. (3.14). At the moment of the coalescence with the rest of the sample, the lineage of individual 1 coalesces with a random subset of size M n−1 . Conditional to K = k, this means that Xn has the same distribution as n−1 Xn ∼ 1 + Mk−1 n−1 where the coalescent starts with (n − 1) lineages in Mk−1 . Then, for all

k = 3, . . . , n, we have P(Xn = 1 + m | K = k) =

n−m−2 k−3 n−2 k−2

,

and, for k = 2, we have P(Xn = n | K = 2) = 1.

17

m = 1, . . . , n − k + 1,

Then, we have P(Xn = n) =

2 , n(n − 1)

and, for all m = 1, . . . , n − 2, P(Xn = 1 + m) =

n−m+1 X k=3

2(k − 1) n(n − 1)

n−m−2 k−3 n−2 k−2

.

For x = 2, we obtain n X 2(k − 1)(k − 2) , P(Xn = 2) = n(n − 1)(n − 2) k=3

which is equal to 2/3. Similarly, for n > x ≥ 3, we obtain n−3 X 2k(k − 1) · · · (k − x + 3)(n − k − 1)(n − k − 2) P(Xn = x) = . n(n − 1)(n − 2) · · · (n − x) k=x−2

Using Lemma 1, we find that P(Xn = x) =

4 (x − 1)x(x + 1)

for all x = 2, . . . , n − 1. The limiting distribution of Xn has mode 2 and is long-tailed. This might be an expected result because this behaviour is a typical feature of the number of species in a genus in traditional hierarchical phylogenetic taxonomy (see eg [1]). In the Markov linear growth model (or Yule branching process), this number has a Yule distribution P(X 0 = x) = ρ Γ(1 + ρ)

Γ(x) , Γ(x + 1 + ρ)

18

x≥1

for some parameter ρ > 0 [14]. As n grows to infinity, the asymptotic distribution of Mn = Xn − 1 is actually given by P(M = m) =

4 , m(m + 1)(m + 2)

m ≥ 1,

and corresponds to the Yule distribution of parameter ρ = 2. Following this approach, a useful comparison for Theorem 2 may be given by the distribution of the number of individuals in a random clade in the coalescent model. Recall that a clade is an equivalence class for some Ci , i = n − 1, . . . , 1, such that the time since the most recent common ancestor is Si+1 . In these notations, i corresponds to the number of ancestors at the instant of coalescence. Choose I = i from the uniform distribution on [n − 1] and consider the number of individuals Yn in the (unique) clade of CI . We have the following result.

Theorem 3 Let n ≥ 2 and Yn be the number of individuals in a random clade of the coalescent. We have P(Yn = y) =

n 2 , (n − 1) y(y + 1)

f or all y = 2, . . . , n − 1,

and P(Yn = n) =

Proof.

1 . n−1

Given that there are i ancestors at time t, the conditional joint

distribution of the numbers of individuals in each class of C(t) is uniform over the set of i-uples (n1 , . . . , ni ) such that n1 + · · · + ni = n. From this, 19

we deduce the conditional distribution of Yn given that there are i ancestors. We obtain, for i = 2, . . . , n − 1,

n−y−1 i−2 , P(Yn = y | I = i) = (y − 1) n−1 i

y = 2, . . . , n − i + 1.

Hence, for 2 ≤ y ≤ n − 2, we have n−2 n(y − 1) X (j − 1)(j − 2) · · · (j − y + 2)(n − j)(n − j − 1) P(Yn = y) = (n − 1) j=y−1 n(n − 1) . . . (n − y)

which yields the result. We remark that the average size of a random clade grows as log n E[Yn ] =

2n (Hn − 1), (n − 1)

n ≥ 3,

where Hn is the harmonic number, while it remains bounded by 3 for Xn . The variance of Yn grows as n. The asymptotic distribution of Yn is given by P(Y = y) =

2 , y(y + 1)

y ≥ 2,

which corresponds to the Yule distribution of parameter ρ = 1 given that the variable is greater than 2.

Acknowlegdments.

We want to thank Sophie Dijoud for her help

during the writing of her report on the subject ”Relations de parenté dans le coalescent” at TIMC. This work was supported by the Institut National Polytechnique de Grenoble.

20

References [1] Aldous, D.J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statistical Science, 16 (2001) 23-34. [2] Durrett, R. Probabilistic models of DNA sequences. Springer-Verlag, NewYork, 2003. [3] Fu, Y. X., and W. H, Li. Statistical tests of neutrality of mutations. Genetics, 133, 693-709. [4] Kingman, J.F.C. The coalescent. Stochastic Processes and their Applications, 13 1982, 235–248. [5] Kingman, J.F.C. On the genealogies of large populations. Journal of Applied Probability, 19A 1982, 27–43. [6] Nordborg, M. Coalescent theory. In Handbook of statistical genetics, D.J. Balding et al eds, Wiley and sons, inc, New-York, 2001, 179–208. [7] Saunders, I.W., S. Tavaré and G.A. Watterson. On the genealogy of nested subsamples from a haploid population. Advances in Applied probability, 16 1984, 471–491. [8] Tajima, F. Evolutionary relationship of DNA sequences in finite populations. Genetics, 105, 437–460. [9] Tavaré, S. Ancestral inference in molecular biology, Lecture notes SaintFlour Summer School, Springer-Verlag, to appear.

21

[10] Tavaré, S. Ancestral inference from DNA sequence data. In Mathematical modeling in Ecology, Physiology and cell biology, chap 5, Othmer at al eds, 1997, 81-96. [11] Watterson, G.A. On the number of segregating sites in genetical models without recombination, Theoretical Population Biology, 7 1975, 256–276. [12] Watterson G.A. Mutant substitutions at linked nucleotide sites. Adv Appl Probab.(1982)14, 206–224. [13] Wiuf, C. and P. Donnelly. Conditional genalogies and the age of a neutral mutant, Theoretical Population Biology, 56 1999, 183–201. [14] Yule, G.U. A mathematical theory of evolution, based on the conclusions of Dr J.C. Willis. Philos. Trans. Roy. Soc. London Ser. B, 213 1924, 21–87.

22

θ = 1 θ = 1 θ = 10 θ = 10 ∆ n = 10 n = 30 n = 10 n = 30 0 0.852 0.942 0.426 0.673 1 0.114 0.050 0.218 0.193 2 0.022 0.005 0.120 0.067 3 0.006 0.001 0.071 0.028 4 0.001 − 0.044 0.013 5 − − 0.029 0.007 6 − − 0.020 0.004 7 − − 0.014 0.002 8 − − 0.010 0.001 9 − − 0.007 0.001 sum 0.999 0.999 0.963 0.994 Table 1: Probability distribution of the number of segregating sites ∆. The symbol “−” indicates values below 0.001. The last row gives the sum of the ten probabilities.

θ = 1 θ = 1 θ = 10 θ = 10 ∆ n = 10 n = 30 n = 10 n = 30 0 0.13 0.05 0.05 0.02 1 0.39 0.20 0.11 0.07 2 0.84 0.59 0.17 0.12 3 1.46 1.23 0.25 0.19 4 2.12 1.98 0.32 0.27 5 2.76 2.68 0.41 0.35 6 3.36 3.31 0.49 0.44 7 3.92 3.89 0.58 0.53 8 4.45 4.44 0.67 0.63 9 4.97 4.96 0.76 0.72 10 5.48 5.48 0.86 0.82 Table 2: Conditional expectations of the coalescence time given the number of segregating sites ∆ = 0, . . . , 10.

23

Figure 1: The unconditional coalescence time distribution (solid line) and the conditional distributions given that ∆ = 1, 2, 5 segregating sites are observed. The sample size is n = 10 and the mutation rate is θ = 10.

24