Copyright 2002 by the Genetics Society of America
Letter to the Editor On a Misconception About Irreducibility of the Single-Site Gibbs Sampler in a Pedigree Application C. Cannings and N. A. Sheehan1 Division of Genomic Medicine, Royal Hallamshire Hospital, University of Sheffield, Sheffield S10 2JF, United Kingdom and Department of Epidemiology and Public Health, University of Leicester, Leicester LE1 6TP, United Kingdom Manuscript received May 3, 2002 Accepted for publication August 13, 2002
T
HE analysis of genetic data on groups of related individuals, or pedigrees, frequently necessitates the calculation of probabilities and likelihoods. There are wellknown algorithms such as the peeling algorithm (Elston and Stewart 1971), which was extended to include arbitrarily complex pedigrees and genetic models by Cannings et al. (1978), and the expert systems algorithms such as that of Lauritzen and Spiegelhalter (1988) for large Bayesian networks (Jensen 1996), which can perform these calculations in theory. However, in practice, exact methods break down either when the pedigree structure or the genetic model under consideration becomes too complicated. A complex pedigree is one that has too many interconnecting undirected cycles or loops that force large cutsets in any peeling sequence, leading to impossible storage requirements. In a general graph, a sequence of edges for which each edge has a node in common with both preceding and succeeding edges forms an undirected cycle if it begins and ends at the same node. There are many ways of forming loops in a general graph but in pedigrees, loops are typically caused by inbreeding or intermarital relationships. A complex genetic model can pose similar computational challenges, even on a simple pedigree structure, because the underlying graphical model for the computational problem may be highly looped (Sheehan et al. 2002). In these situations, the required probabilities and likelihoods must be estimated. One approach is to approximate the problem by sacrificing a sufficient amount of the complexity to enable exact calculation (e.g., Wang et al. 1996). Alternatively, the complexity of the problem can be preserved and the calculations in question approximated using Markov chain Monte Carlo (MCMC) methods (see Thompson 2001 for an overview). The first application of MCMC methods in pedigree analysis (Sheehan 1990) exploited the single-site Gibbs
1 Corresponding author: Department of Epidemiology and Public Health, University of Leicester, 22–28 Princess Rd. W., Leicester LE1 6TP, UK. E-mail:
[email protected]
Genetics 162: 993–996 (October 2002)
sampler (Geman and Geman 1984) for a simple singlelocus trait on a large complex human pedigree of Greenland Eskimos where the statespace of the underlying Markov chain was the set of genotypic configurations on the pedigree. Reasonable estimates of marginal posterior distributions of founder genotypes were obtained in this way (Sheehan 1992). By noting that the genetic model is Markovian, a neighborhood system can be defined on a pedigree such that, conditional on the neighbors, genotypes of individuals are independent and the set of genotypic configurations on the pedigree is thus a locally dependent Markov random field. The Gibbs sampler moves between genotypic configurations by selecting a single individual and updating the existing estimate of genotype with one drawn at random from the full conditional distribution of genotype, given the current estimates elsewhere on the graph and any phenotypic data. Because of the local dependencies induced by the neighborhood system, the Gibbs sampler is very easy to implement in these genetic applications. However, it was also noted at this time (Sheehan 1990; Sheehan and Thomas 1993) that although the sampler defines an irreducible Markov chain for most traits determined by a diallelic locus, the irreducibility condition necessary to ensure convergence of the sampler to the correct equilibrium distribution is no longer guaranteed for a genetic system with three or more alleles. The pedigree Gibbs sampler, as defined above, will converge to the true posterior distribution of genotypes on the pedigree given available phenotypes if it defines an irreducible chain whereby all states communicate and if each individual genotype is updated infinitely often. This follows from the ergodic theorem for finite, aperiodic, irreducible, Markov chains (Cox and Miller 1965). In many other areas of application, such as Bayesian image analysis, for example, there is considerable flexibility in the specification of the underlying Markov random field, which is usually constructed to satisfy the positivity condition (see Besag 1986), and this, in turn, trivially guarantees irreducibility of the single-site Gibbs sampler. For a pedigree with n individuals, let Si ⫽ {si}
994
C. Cannings and N. A. Sheehan
Figure 1.—An example adapted from Sheehan (2000) where observed data on a dichotomous trait [affected and normal determined by a diallelic genetic system (with alleles A and B)] cause the single-site Gibbs sampler to be reducible. Note that both mother and daughter must be homozygous for the same allele, which means that once one pairing has been assigned, the other can never be visited by updating one individual genotype at a time if heterozygotes are affected with zero probability.
be the set of possible genotypes that can be assigned to the ith pedigree member, ignoring all others. Positivity, in our applications, would mean that any assignment (s1, . . . , sn) of genotypes on the pedigree would have an associated positive probability. This is clearly not the case for discrete genetic models, which pose a complicated excluded adjacency problem. In particular, we cannot start the Gibbs sampler from a random configuration as is usually the case with other applications. The issue, therefore, is whether the legal configurations of genotypes on a pedigree (i.e., those states that agree both with the data and the genetic model) communicate if genotypes are updated one at a time. It can be shown that the single-site Gibbs sampler is generally irreducible over the legal states of a diallelic genetic system. The only exception (Sheehan 1990) occurs within the following scenario. Suppose we have a dichotomous trait (e.g., affected and normal) where both homozygotes are affected with positive probability but the heterozygote is affected with probability zero. As illustrated in Figure 1, depending on the observed individuals, this can cause the single-site Gibbs sampler to be reducible. However, as all that is required for irreducibility is a minor relaxation of the penetrance probabilities, allowing the heterozygote to have some small positive probability of having the same phenotype as either homozygote, this does not cause a problem in practice. In many cases, of course, the phenotypic observations are actually the genotypes themselves and information on individuals is either complete in that they are fully typed or missing altogether. This is the setting that we focus on heretoforth. Note that the Gibbs sampler is always irreducible for diallelic loci in this case.
Figure 2.—A genetic system with three alleles, A, B, and C, adapted from Sheehan (2000), where observations on two siblings force noncommunicating alternatives on their unobserved parents.
Once the genetic system has three or more alleles, it is easy to see how the Gibbs sampler can define a reducible Markov chain. The classic example for a three-allele system, with alleles A, B, and C, say (see Sheehan and Thomas 1993, for example), is shown in Figure 2, where unobserved parents have two typed offspring, one of whom is heterozygous AB and the other homozygous for the third allele, CC. Clearly, each parent carries the C allele and one carries an A while the other carries a B. From the symmetry of genetic inheritance, it is equally plausible that either parent can have one or the other of the genotypes AB and AC. However, once one has been assigned in a starting configuration for a single-site Gibbs sampler, the other must take the complementary type and neither will change for the duration of the run. Reducibility depends on the data. If it were possible to enumerate the noncommunicating classes of the Markov chain described by the Gibbs sampler, a scheme that would move between the different classes using the reversible-jump method of Green (1995) or the Metropolis jumping-kernel method of Lin 1996, for example, could be devised. A deterministic algorithm to find the noncommunicating classes was proposed by Lin et al. (1994), which depended on the assumption that such classes can be caused only by observations on offspring forcing noncommunicating alternatives for unobserved parents, as in the classic scenario of Figure 2. However, this assumption does not suffice: There are many other ways in which data can be assigned on a pedigree to cause the Gibbs sampler to be reducible and the algorithm does not work on several simple examples (Jensen and Sheehan 1998). In particular, partial information can filter down a pedigree from an ancestor, which, when combined with other information coming up from descendants, can create noncommunicating classes on individuals in intermediate generations. To date, there is no algorithm that will find all these classes. Moreover,
Letter to the Editor
Figure 3.—A simple pedigree where one member of each marriage is fully typed. The mother-daughter pair, labeled A?, are either both AB or both AC. Once one configuration has been assigned, the other can never be reached by a singlesite Gibbs sampler.
it would seem that a completely general algorithm would require storage of so much partial information as to make it akin to peeling (Jensen and Sheehan 1998). The practical implication of this is that it may not be possible to sample correctly from the true posterior distribution of genotype, given phenotype, using a single-site Gibbs sampler once there are three or more alleles in the genetic system, and any estimates of probabilities and likelihoods obtained in this way are hence unreliable. In animal breeding circles, especially those concerned with animals where artificial insemination is practiced, it is not uncommon to have a small number of males with large numbers of mates and offspring. In particular, full DNA information would usually be available on these males. The pedigrees resulting from such a breeding program are extremely complex and defy exact calculation of probabilities and likelihoods for any genetic system, however simple. Often, the pedigree structure is sacrificed and estimates are obtained on a simple subpedigree such as a half-sib design, using methods like least squares (Haley et al. 1994), which permit fast analysis using permutation tests (Churchill and Doerge 1994), for example, or maximum likelihood. However, it is more appealing to preserve the pedigree information and apply MCMC methods. It has been claimed in the animal genetics literature (see Hoeschele et al. 1997; Bink and Arendonk 1999, for example) that the single-site Gibbs sampler is irreducible if at least one in every parent pair is observed and many analyses have been performed on animal pedigrees with fully typed males in the belief that this is true. A counter-example to this claim involving a simple unlooped pedigree with three generations is presented in Figure 3. For a half-sib design that includes only individuals over
995
two generations and assumes that the dams are unrelated and distinct, the single-site Gibbs sampler is irreducible when the sires are fully typed, regardless of any observations on mates or offspring. However, it is not the case in general and misleading results can be obtained. The belief that the single-site Gibbs sampler defines an irreducible Markov chain on the space of genotypic configurations over a general pedigree provided one member of each spouse pairing is observed would seem to derive from the erroneous statement in Lin et al. (1994) that noncommunicating classes can arise only on unobserved parents from offspring data. Jensen and Sheehan (1998) presented several counter-examples to this algorithm, involving two-, three-, and four-generational pedigrees with and without loops. The important point is that information forcing noncommunicating alternatives on a set of individuals can come from above via the ancestors as well as from below via the descendants. In particular, as noted by Thompson (1986), the terms “above” and “below” do not necessarily define disjoint sets in a complex pedigree and the routes by which information filters through a pedigree may be interconnected and hence highly complicated. In this article, we have presented a counter-example to the animal genetics community and have attempted to clarify the reducibility issue. However, while it is true that any analysis based on a reducible sampling scheme is potentially dubious, reducibility itself is not really a problem. There are several methods for getting around reducibility (see Thompson 2001 for an overview). Most of these involve some sort of relaxation of the genetic model, which enables the sampler to move between illegal configurations. These include importance sampling with weights of zero and one (Sheehan and Thomas 1993), the Metropolis-coupled samplers of Geyer (1991), the heated companion chains of Lin et al. (1993), and the annealing-type samplers of Geyer and Thompson (1995), to name but a few. The corresponding statespace of the Markov chain could be vastly augmented from that defined by the single-site Gibbs sampler and good mixing of the chain becomes the important issue (Gilks et al. 1993). Using meiosis indicators or descent graphs (Sobel and Lange 1996) rather than genotypes greatly improves mixing of MCMC samplers on pedigrees (Thompson 1994, 2000b). In addition, there are all kinds of joint or block updating schemes such as the blocking Gibbs method of Jensen et al. (1995), which updates large blocks of the pedigree conditional on the values at the remaining variables, the locus-by-locus samplers of Heath (1997) and Sillanpa¨a¨ and Arjas (1999), the meiosis-by-meiosis sampler of Thompson and Heath (2000), or the combination samplers that alternate between chromosome peeling and pedigree peeling (Hurme et al. 2000; Thomas et al. 2000; Thompson 2000a), all of which require that the pedigree can be peeled for at least one locus. For large complex animal
996
C. Cannings and N. A. Sheehan
pedigrees, however, it may not be possible to peel even for a single locus and a relaxed sampler may be required in addition to blocking. The ESIP sampler of Ferna´ndez et al. (2001) samples genotypes over the entire pedigree at a single locus by using a sample obtained from a modified pedigree obtained by iterative peeling as the proposal in a Metropolis-Hastings updating step. However, the main problem in animal genetics applications is that the pedigrees are often enormous and these alternative MCMC methods have not been tested sufficiently on big problems (Hoeschele et al. 1997). Nonetheless, they at least have the potential to yield estimates from the correct distribution and provide a wealth of attractive alternatives to the single-site Gibbs sampler. We acknowledge Rohan Fernando for bringing this problem to our attention. In addition, Nuala Sheehan acknowledges support from the Wellcome Trust Biomedical Research Collaboration Grant 056266/Z/ 98/Z and the TVW Telethon Institute for Child Health Research, Perth, Western Australia.
LITERATURE CITED Besag, J., 1986 On the statistical analysis of dirty pictures. J. R. Stat. Soc. Ser. B 48: 259–302. Bink, M. C. A. M., and J. A. M. V. Arendonk, 1999 Detection of quantitative trait loci in outbred populations with incomplete marker data. Genetics 151: 409–420. Cannings, C., E. A. Thompson and M. H. Skolnick, 1978 Probability functions on complex pedigrees. Adv. Appl. Probab. 10: 26–61. Churchill, G. A., and R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971. Cox, D. R., and H. D. Miller, 1965 Stochastic Processes. Methuen & Co., London. Elston, R. C., and J. Stewart, 1971 A general model for the genetic analysis of pedigree data. Hum. Hered. 21: 523–542. Ferna´ndez, S. A., R. L. Fernando, B. Guldbrandtsen, L. R. Totir and A. L. Carriquiry, 2001 Sampling genotypes in large pedigrees with loops. Genet. Sel. Evol. 33: 337–367. Geman, S., and D. Geman, 1984 Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. 45: 721–741. Geyer, C. J., 1991 Markov chain Monte Carlo maximum likelihood, pp. 156–163 in Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface, edited by E. M. Keramidas and S. M. Kaufman. Interface Foundation of North America, Fairfax Station, VA. Geyer, C. J., and E. A. Thompson, 1995 Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90 (431): 909–920. Gilks, W. R., D. Clayton, D. J. Spiegelhalter, N. G. Best, A. J. McNeil et al., 1993 Modelling complexity: applications of gibbs sampling to medicine. J. R. Stat. Soc. Ser. B 55: 39–52. Green, P. J., 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711– 732. Haley, C. S., S. A. Knott and J. M. Elsen, 1994 Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics 136: 1195–1207. Heath, S. C., 1997 Markov chain Monte Carlo segregation and linkage analysis for oliogenic models. Am. J. Hum. Genet. 61: 748–760. Hoeschele, I., P. Uimari, F. Grignola, Q. Zhang and K. Gage,
1997 Advances in statistical methods to map quantitative trait loci in outbred populations. Genetics 147: 1445–1457. Hurme, P., M. J. Sillanpa¨a¨, E. Arjas, T. Repo and O. Savolainen, 2000 Genetic basis of climatic adaptation in Scots Pine by Bayesian quantitative trait locus analysis. Genetics 156: 1309–1322. Jensen, C. S., and N. Sheehan, 1998 Problems with determination of noncommunicating classes for Monte Carlo Markov chain applications in pedigree analysis. Biometrics 54: 416–425. Jensen, C. S., U. Kjaerulff and A. Kong, 1995 Blocking Gibbs sampling in very large probabilistic expert systems. Int. J. Hum. Comput. Stud. 42: 647–666. Jensen, F. V., 1996 An Introduction to Bayesian Networks. University College Press, London. Lauritzen, S. L., and D. J. Spiegelhalter, 1988 Local computations with probabilities on graphical structures and their applications to expert systems. J. R. Stat. Soc. Ser. B 50: 157–224. Lin, S., 1996 Multipoint linkage analysis via Metropolis jumping kernels. Biometrics 52: 1417–1427. Lin, S., E. Thompson and E. Wijsman, 1993 Achieving irreducibility of the Markov chain Monte Carlo method applied to pedigree data. IMA J. Math. Appl. Med. Biol. 10: 1–17. Lin, S., E. Thompson and E. Wijsman, 1994 Finding noncommunicating sets for Markov chain Monte Carlo estimations on pedigrees. Am. J. Hum. Genet. 54: 695–704. Sheehan, N., 1992 Sampling genotypes on complex pedigrees with phenotypic constraints: the origin of the B allele among the Polar Eskimos. IMA J. Math. Appl. Med. Biol. 9: 1–18. Sheehan, N., and A. Thomas, 1993 On the irreducibility of a Markov chain defined on a space of genotype configurations by a sampling scheme. Biometrics 49: 163–175. Sheehan, N. A., 1990 Genetic restoration on complex pedigrees. Ph.D. Thesis, University of Washington, Seattle. Sheehan, N. A., 2000 On the application of Markov chain Monte Carlo methods to genetic analyses on complex pedigrees. Int. Stat. Rev. 68: 83–110. Sheehan, N. A., B. Guldbrandtsen, M. S. Lund and D. A. Sorensen, 2002 Bayesian MCMC mapping of quantitative trait loci in a half-sib design: a graphical model perspective. Int. Stat. Rev. 70: 241–267. Sillanpa¨a¨, M. J., and E. Arjas, 1999 Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151: 1605–1619. Sobel, E., and K. Lange, 1996 Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am. J. Hum. Genet. 58: 1323–1337. Thomas, A., A. Gutin, V. Abkevich and A. Bansal, 2000 Multilocus linkage analysis by blocked Gibbs sampling. Stat. Comput. 10: 259–269. Thompson, E. A., 1986 Pedigree Analysis in Human Genetics. Johns Hopkins University Press, Baltimore. Thompson, E. A., 1994 Monte Carlo likelihood in genetic mapping. Stat. Sci. 9 (3): 355–366. Thompson, E. A., 2000a MCMC estimation of multi-locus genome sharing and multipoint gene location scores. Int. Stat. Rev. 68 (1): 53–73. Thompson, E. A., 2000b Statistical Inference from Genetic Data on Pedigrees, NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 6. Institute of Mathematical Statistics and the American Statistical Association, Beachwood, OH. Thompson, E. A., 2001 Monte Carlo methods on genetic structures, pp. 176–218 in Complex Stochastic Systems, edited by O. E. Barndorff-Nielsen, D. R. Cox and C. Kluppelberg. Chapman & Hall, London. Thompson, E. A., and S. C. Heath, 2000 Estimation of conditional multilocus gene identity among relatives, pp. 95–113 in Statistics in Molecular Biology and Genetics, IMS Lecture Notes, edited by F. Seiller-Moiseiwitsch. Institute of Mathematical Statistics, American Mathematical Society, Providence, RI. Wang, T., R. L. Fernando, C. Stricker and R. C. Elston, 1996 An approximation to the likelihood for a pedigree with loops. Theor. Appl. Genet. 93: 1299–1309. Communicating editor: N. Takahata