Coalescent-based inference from DNA sequence data using MCMC

David Balding, Michael Weale, and Ian Wilson
University of Reading, Department of Applied Statistics, PO Box 240, Earley Gate, Reading RG6 6FN, UK. [email protected]

1. Introduction

Underlying a sample of homologous DNA sequences is a complex pattern of dependencies. These dependencies can be represented by a genealogical tree, for which each tip, or leaf, corresponds to a sequence at the present time. Travelling down the tree towards the root corresponds to going backwards in time, and branches merge, or "coalesce", when the corresponding DNA sequences last had a common ancestor. The root of the tree represents the Most Recent Common Ancestor (MRCA) of all the sequences in the sample. Although the genealogical tree is crucial to modelling the dependence structure of a DNA sample, it is effectively ignored by traditional methods of analysing DNA sequence data which, for example, are based on averaging pairwise statistics over all pairs in the sample. In recent years, however, important advances have been made towards the goal of fully likelihood-based statistical inference from DNA sequences (Griffiths and Tavaré 1994; Kuhner et al. 1995, 1998). The key developments underpinning these advances involve (i) a flexible class of stochastic models for genealogical trees, based on the coalescent model of Kingman (1982), and (ii) computational techniques such as Markov Chain Monte Carlo (MCMC) and methods based on importance sampling. However, implementing these models and algorithms remains challenging because of the complexity of the processes underlying the data, which include historical patterns of migration, mating behaviour, and population growth, as well as mutation, recombination and selection. Given complex models, many parameters and substantial background information, the Bayesian paradigm would appear to provide the most appropriate approach to statistical inference. A Bayesian approach is particularly natural since coalescent models specify distributions for the genealogical tree underlying the sample before the data are observed.
We present an MCMC algorithm for drawing Bayesian inferences from DNA sequence data, within the coalescent modelling framework allowing for population growth and subdivision. Inevitably, fully realistic models for the historical patterns of human mating and migration remain outside our grasp. However, we are able to implement models which capture at least some, and possibly most, of the demographic effects underlying a sample of DNA sequences. We also implement a range of mutation models: for microsatellite, or short tandem repeat (STR) data, for single nucleotide polymorphism (SNP) data, and for unique event polymorphism (UEP) data. However, we do not here consider the effects of selection or recombination. Our methods are thus applicable either to haploid DNA sequence data, or to sequences which are sufficiently short that the effects of recombination can be neglected. For humans, sequences of up to a few thousand base pairs (bp) are likely to satisfy this requirement.

2. Demographic models

The coalescent is a stochastic model for the genealogical tree representing the ancestral relationships among a sample of $n$, unobserved, DNA sequences. It has two particularly attractive features: (i) it is mathematically tractable, and (ii) it approximates (in the sense of convergence as the population size increases) an important class of neutral population genetics models, including the Wright-Fisher model of a random-mating population of constant size $N$ sequences. To recover these approximations, one unit of "coalescent" time must be interpreted as $N/\sigma^2$ generations, where $\sigma^2$ denotes the variance in the number of "offspring" of a sequence. It is convenient to allow coalescent time to run backward, with time $t_0 \equiv 0$ denoting the present, corresponding to the tips of the tree, while $t_j$, $j \in \{1, 2, \ldots, n-1\}$, denotes the time of the $j$th most recent coalescence event. In particular, $t_{n-1}$ denotes the time of the root, or MRCA. Under the coalescent model, the between-coalescence intervals $t_j - t_{j-1}$ have independent exponential distributions with rate parameter $(n-j)(n-j+1)/2$. At each coalescence event, all pairs of extant lineages are equally likely to be the pair that coalesce. Mutations in the standard coalescent occur along the branches of the tree at the points of a homogeneous Poisson process with rate $\theta/2$, corresponding to a mutation rate of $\theta/2N$ per locus per generation in a population of $N$ sequences.

In addition to the constant population-size approximation, the standard coalescent also approximates the genealogy of a sample drawn from a random-mating population of size $N\nu(t)$ at time $N\int_0^t \nu(s)\,ds$ generations ago, where $\nu(t)$, $t \ge 0$, is a smooth function taking values in $\mathbb{R}^+$ and with $\lim_{t \uparrow \infty} \nu(t) = 1$. Intuitively, an increment of coalescent time corresponds to more generations when the population size is large than when it is small. When the coalescent represents a varying-size population, the mutation process is inhomogeneous, with mutations occurring at rate $\theta\nu(t)/2$. It is often convenient to employ a time-change function $\Lambda(t) \equiv \int_0^t ds/\nu(s)$ to make the mutation process homogeneous (Donnelly & Tavaré 1995).
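The distribution of coalescence times under the standard (constant-size) coalescent can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: the function name `coalescence_times` and its arguments are hypothetical, and only the times (not the tree topology or mutations) are simulated.

```python
import random


def coalescence_times(n, rng):
    """Simulate [t_1, ..., t_{n-1}] for a sample of n sequences.

    Times are in coalescent units and run backward from the present (t = 0).
    """
    times = []
    t = 0.0
    for j in range(1, n):
        lineages = n - j + 1  # lineages present just before the j-th coalescence
        rate = lineages * (lineages - 1) / 2  # equals (n-j)(n-j+1)/2
        t += rng.expovariate(rate)  # exponential waiting time with this rate
        times.append(t)
    return times


if __name__ == "__main__":
    rng = random.Random(2000)
    print(coalescence_times(5, rng))
```

Because the intervals are independent exponentials, the expected time to the root is the sum of the reciprocal rates, which equals $2(1 - 1/n)$ coalescent units. Averaging the last simulated time over many replicates should come close to that value.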
Marjoram & Donnelly (1994) consider a two-parameter coalescent model for which

$$\nu(t) = \begin{cases} e^{R(t_a - t)}, & 0 < t < t_a, \\ 1, & t > t_a, \end{cases}$$

corresponding to a population of constant size $N$ until $N t_a$ generations ago, after which it grew at rate $R/N$ per generation to reach its current size, $N(1 + R/N)^{N t_a} \approx N e^{R t_a}$. We adopt this model, and for convenience refer to it as the "coalescent with growth", even though other formulations of population growth are possible.

Populations are often subdivided in nature, whether by geographical, social, or other barriers. To model the effects of population subdivision, we implement a "coalescent with splitting" model, corresponding to a random-mating ancestral population of size $N$ until $N t_a$ generations ago, when it split into isolated subpopulations, the $i$th population having size $\rho_i N$, with $\sum_i \rho_i = 1$. We also allow exponential growth after the split, to give the "coalescent with splitting and growth".

3. Mutation models

The most widely-adopted model for STR mutation is the symmetric ladder, or stepwise, mutation model (SMM) in which the mutant allele differs from its parent by one repeat unit, with steps in each direction being equally likely. Under the SMM, the probability $v_t(i,j)$ of a change from $i$ to $j$ repeats during a (coalescent) time interval of length $t$, is given by

$$v_t(i,j) = e^{-t\theta/2} \sum_{k=0}^{\infty} \frac{(t\theta/4)^{2k+|i-j|}}{k!\,(k+|i-j|)!}. \qquad (1)$$

The SMM has no equilibrium distribution, but a prior for the STR repeat number at the root of the genealogical tree which is uniform on the positive integers, although improper, leads to a proper posterior distribution and is adopted below.

Models for substitutions of one base for another at an SNP site typically assume that these occur independently of the states of other SNP sites. Under this assumption, an SNP mutation model may be specified by the $4 \times 4$ substitution matrix of a Markov chain on the four states $A$, $C$, $G$, and $T$. We adopt here the F84 mutation model, implemented in the program DNAML in the PHYLIP suite of programs (described in Felsenstein & Churchill 1996). Under the F84 model, a parameter, $\alpha$, specifies the nominal rate of mutations restricted to be transitions, while $\beta$ specifies the rate of substitutions which may be either transitions or transversions. The stationary distribution $\{\pi_A, \pi_C, \pi_G, \pi_T\}$ can be specified arbitrarily, leading to a total of five parameters. The overall, effective mutation rate is

$$\mu_{\mathrm{SNP}} = \beta\left(1 - \pi_A^2 - \pi_G^2 - \pi_C^2 - \pi_T^2\right) + 2\alpha\left[\frac{\pi_A \pi_G}{\pi_A + \pi_G} + \frac{\pi_C \pi_T}{\pi_C + \pi_T}\right] \qquad (2)$$

per nucleotide per generation.

If some sequences share a specific subsequence which others lack, it may be reasonable to assume that the subsequence was inserted at only one instance in the past (called a unique event polymorphism, UEP). Although the UEP assumption is not required in our framework, making it can lead to more precise inferences about the topology of the genealogical tree, because any two haplotypes with the UEP insert must be more closely related than two haplotypes which differ at the UEP site. If the substitution rate is low, most or all SNP sites will correspond to UEPs. The assumption that no more than one substitution event can occur at a site leads to the "infinite sites" mutation model for DNA sequences.
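The SMM transition probability of equation (1) can be evaluated by truncating the infinite series. The sketch below assumes mutations occur at rate $\theta/2$ per unit of coalescent time, so that the series argument is $t\theta/4$. The function name `smm_transition_prob` and the truncation length `terms` are hypothetical choices for illustration.

```python
import math


def smm_transition_prob(i, j, t, theta, terms=200):
    """Probability of moving from i to j repeats in coalescent time t under the SMM."""
    d = abs(i - j)
    if t == 0 or theta == 0:
        return 1.0 if d == 0 else 0.0  # no time or no mutation: state unchanged
    x = t * theta / 4.0
    total = 0.0
    for k in range(terms):
        # Work on the log scale so that large factorials do not overflow.
        log_term = (2 * k + d) * math.log(x) - math.lgamma(k + 1) - math.lgamma(k + d + 1)
        total += math.exp(log_term)
    return math.exp(-t * theta / 2.0) * total


if __name__ == "__main__":
    # Probabilities over a wide window of end states should sum to about 1.
    print(sum(smm_transition_prob(20, j, 1.0, 2.0) for j in range(-40, 81)))
```

The probability depends on $i$ and $j$ only through $|i - j|$, so the function is symmetric in its first two arguments, and summing over all reachable end states gives a value close to one.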

4. MCMC algorithm

We extend the MCMC algorithm of Wilson & Balding (1998) to incorporate the demographic and mutation models described above. The algorithm is of Metropolis-Hastings type, and employs data-augmentation: the ancestral DNA sequence is inferred at each node of the genealogical tree. This approach simplifies likelihood calculations and hence permits a wide class of mutation models to be employed, including many models not discussed here. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that proposal distributions leading to good mixing properties can be implemented. For models involving population splits, or "infinite sites" mutation, the obvious modifications of the basic algorithm are employed: the starting tree must satisfy the constraints of the model and proposals inconsistent with the model are disallowed. Further details will be published elsewhere (Wilson, Weale, & Balding, 2000).
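The accept/reject step of a Metropolis-Hastings sampler can be illustrated on a deliberately simplified problem. The sketch below is not the tree-updating algorithm described above. It assumes a fixed genealogy with known total branch length, a Poisson number of mutations with mean $\theta \times$ (branch length)$/2$, and an Exponential(1) prior on $\theta$. All function and parameter names are hypothetical.

```python
import math
import random


def log_posterior(theta, num_mutations, total_branch_length):
    """Unnormalised log posterior of theta for the toy model (assumed here)."""
    if theta <= 0:
        return -math.inf
    log_prior = -theta  # Exponential(1) prior, up to an additive constant
    log_lik = num_mutations * math.log(theta) - theta * total_branch_length / 2.0
    return log_prior + log_lik


def metropolis_hastings(num_mutations, total_branch_length, iterations, step, rng):
    """Random-walk Metropolis-Hastings sampler for theta."""
    theta = 1.0
    current = log_posterior(theta, num_mutations, total_branch_length)
    samples = []
    for _ in range(iterations):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        candidate = log_posterior(proposal, num_mutations, total_branch_length)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


if __name__ == "__main__":
    draws = metropolis_hastings(10, 4.0, 20000, 1.0, random.Random(1))
    kept = draws[2000:]  # discard burn-in
    print(sum(kept) / len(kept))
```

In this toy setting the posterior is a gamma distribution with shape (number of mutations + 1) and rate (1 + branch length / 2), so the sampler output can be checked against a known mean. The full algorithm additionally proposes changes to the tree and to the ancestral states, which this sketch omits.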

5. Forensic match probabilities

Suppose that a DNA profile is recovered from a crime scene sample, and found to match the DNA profile obtained from s, a suspect. This observation supports the hypothesis that s is the source of the crime sample. In order to assess the strength of this support, we wish to evaluate the probability that x would also match the crime scene DNA profile, where x is an alternative possible culprit whose DNA profile is not available. For a discussion of the role of match probabilities in the assessment of forensic identification evidence, see Balding & Donnelly (1995). If we knew the DNA sequence of any of the ancestors of x, this would be informative about the DNA profile of x. Ancestral information isn't usually directly available, but can be inferred within the coalescent modelling framework, via our MCMC algorithm which is modified and applied to the DNA profiles available in a reference sample, together with that of the suspect.

At each iteration of the MCMC algorithm, candidate values are assigned both to the genealogical tree underlying the observed sequences, and to the ancestral sequence types at each node in the tree. We now introduce a branch connecting the DNA sequence of x with the tree, writing z for the new node thus introduced, together with a DNA profile state for z. The additional branch and state are updated in the same way as for the other branches, except that, since no data are available for the DNA profile of x, there is no contribution to the likelihood from the branch connecting z with x. There is, however, the usual likelihood contribution from the branches linking z with its ancestor node, and with its other descendant node. At each iteration of the modified algorithm, the probability that the mtDNA sequence of x matches that of s, conditional on the location and state of z, is readily calculated. The average of these conditional probabilities approximates the match probability, given the observed DNA profiles and the standard coalescent model.

6. Results and discussion

Our MCMC algorithm performs well on data simulated under the various models discussed above, and has been applied to a number of previously-published, human DNA sequence datasets, involving Y-chromosome STR haplotypes, autosomal -globin sequences, and mitochondrial SNP sites. Sample sizes of up to several hundred can readily be handled on a desktop workstation. Detailed results will be discussed at the oral presentation, and subsequently published elsewhere (Wilson, Weale, & Balding, 2000). For the forensic DNA profiles, a naive estimator based on relative frequencies, ignoring within-sample correlations, is found to be conservative in most cases; i.e. it tends to overstate the more precise (single locus) match probabilities calculated using our algorithm. However, it can be non-conservative when the observed profile is rare in the reference sample.

References

Balding, D.J. & Donnelly, P. 1995. Inferring identity from DNA profile evidence. Proc. Natl Acad. Sci. USA 92: 11741-11745.
Donnelly, P. & Tavaré, S. 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 410-421.
Felsenstein, J. & Churchill, H.A. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93-104.
Griffiths, R.C. & Tavaré, S. 1994. Simulating probability-distributions in the coalescent. Theor. Pop. Biol. 46: 131-159.
Kingman, J.F.C. 1982. The coalescent. Stoch. Proc. Appl. 13: 235-248.
Kuhner, M.K., Yamato, J. & Felsenstein, J. 1995. Estimating effective population size from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421-1430.
Kuhner, M.K., Yamato, J. & Felsenstein, J. 1998. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149: 429-434.
Marjoram, P. & Donnelly, P. 1994. Pairwise comparisons of mitochondrial DNA sequences in subdivided populations and implications for early human evolution. Genetics 136: 673-683.
Wilson, I.J. & Balding, D.J. 1998. Genealogical inference from microsatellite data. To appear, Genetics.
Wilson, I.J., Weale, M.E. & Balding, D.J. 2000. Inferences from DNA sequence data: population histories, evolutionary processes, and forensic match probabilities. In preparation.
