
Determining Inheritance Distributions via Stochastic ...

Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20

Determining Inheritance Distributions via Stochastic Penetrances
Ola Hössjer

To cite this article: Ola Hössjer (2003) Determining Inheritance Distributions via Stochastic Penetrances, Journal of the American Statistical Association, 98:464, 1035-1051, DOI: 10.1198/016214503000000990

To link to this article: http://dx.doi.org/10.1198/016214503000000990

Published online: 31 Dec 2011.


Determining Inheritance Distributions via Stochastic Penetrances
Ola HÖSSJER

The aim of linkage analysis is to map the position of a gene contributing to an inheritable disease. The statistical model contains the disease allele frequency and penetrance parameters. Here I investigate the inheritance distribution, that is, the conditional distribution of the inheritance vector given phenotypes at the disease locus. Based on this, the likelihood and likelihood score function of Whittemore can be defined. As a result, a general semiparametric methodology of choosing score functions in linkage analysis is proposed. The proposed approach is valid for arbitrary pedigrees, and I treat quantitative, dichotomous (binary), and other phenotypes in a unified framework. The resulting score functions can be easily incorporated into existing software for multipoint linkage analysis. I use the fact that the inheritance distribution depends on unknown founder alleles. These are treated as hidden data and give rise to "stochastic penetrance factors." Certain uncorrelated unit variance random variables that are functions of the founder alleles are introduced. I show that the moment-generating function and moments of these play crucial roles in choosing likelihoods and likelihood score functions. Lower-/higher-order moments are more important when the genetic effect is weak/strong, and this corresponds to simultaneous identical by descent (IBD) sharing of few/many individuals. For inbred pedigrees and nonadditive models, the likelihood score function is dominated by individuals homozygous by descent at the disease locus. For outbred pedigrees, the local score function involves pairwise IBD sharing. Relations to existing score functions of nonparametric linkage ($S_{\text{pairs}}$, $S_{\text{all}}$, $S_{\text{robdom}}$) and quantitative trait loci (QTL) are highlighted.
KEY WORDS: Founder alleles; Identical by descent sharing; Inheritance distribution; Linkage analysis; Local score functions; Semiparametric linkage analysis; Stochastic penetrances.

1. INTRODUCTION

The purpose of statistical linkage analysis is to locate the position along the chromosomes of an unknown gene contributing to a certain disease. Disease-related quantities (phenotypes) and parts of DNA (so-called "genetic markers") are collected for a number of families (pedigrees). The phenotype can be binary (affection status) or quantitative (e.g., insulin concentration, body mass index) and involve various covariates (e.g., sex, age). Segregation of DNA is governed by Mendelian laws and the occurrence of so-called "crossovers"; the latter are switching points between grandpaternal and grandmaternal DNA transmission during meiosis, that is, formation of egg or sperm cells. DNA and phenotype segregation are associated with one another at the disease gene. The strength of this association depends on the strength of the genetic component of the disease. For loci on the same chromosome as the disease gene, the association gradually decreases with the genetic distance from the disease gene because of crossovers. The linkage score function is a stochastic process that at each DNA location measures the degree of association between phenotype segregation and DNA segregation at that position. Regions with high linkage score are declared interesting in further fine mapping of the disease gene. A genetic model specifies both the frequency of the possible expressions (alleles) of the disease gene and the statistical relationship (penetrance) between disease alleles and phenotype. Given a certain genetic model, one can specify the conditional distribution of inheritance of disease genes given phenotypes. This distribution is important for several reasons. It summarizes the strength of association between phenotypes and disease genes, and thus reflects the strength of the genetic component of the disease.
Ola Hössjer is Professor, Department of Mathematics, Stockholm University, S-10691 Stockholm, Sweden (E-mail: [email protected]). This research was sponsored by the Swedish Research Council under contract 629-2002-6286. The author thanks Roland Perfekt for suggesting several improvements on an earlier version of this manuscript, particularly the factorization $\xi_{jk} = \xi_j \xi_k$ in (12). An associate editor and two referees further gave several helpful suggestions to improve the manuscript's presentation and readability.

It is needed for performing parametric linkage analysis; it can be used for calculating the power of parametric and nonparametric linkage tests (Feingold, Brown, and Siegmund 1993; Lander and Kruglyak 1995) and for determining the asymptotic accuracy of estimators of the disease locus (Liang, Chin, and Beaty 2001; Hössjer 2003a,b); and it generates sampling criteria for pedigrees (Risch and Zhang 1995, 1996; Liang, Huang, and Beaty 2000; Dudoit and Speed 2000). Another advantage of the conditional inheritance distribution is its independence of the sampling mechanism (ascertainment procedure). This holds under the mild assumption that pedigrees are sampled on the basis of disease phenotypes only, regardless of marker data. In contrast, the popular variance components techniques used in linkage analysis (Almasy and Blangero 1998) depend on the sampling mechanism. For each gene, an individual receives one allele from the mother and one from the father. Mendel's law of segregation states that the probability that each of these two alleles is of grandpaternal or grandmaternal origin is .5. For a whole pedigree, allele segregation can be specified at a certain locus through the inheritance vector $v$ (see, e.g., Kruglyak, Daly, Reeve-Daly, and Lander 1996), a binary vector whose length equals the number of meioses in the pedigree. I use the fact that the conditional distribution of $v$ at the disease locus given phenotypes can be written as an expectation, summing over all possible founder alleles, that is, the alleles of individuals with no ancestors in the pedigree. In this way each penetrance factor becomes stochastic, viewed as a function of unknown founder alleles. This function can be decomposed into additive and dominance components, analogously to variance components or U statistic techniques.
It turns out that the conditional distribution of the inheritance vector can be written as the moment-generating function of a certain array of uncorrelated random variables (depending only on the disease allele frequency) evaluated at a point $B$, which depends on the occurrence and simultaneous identical by descent (IBD) allele sharing of various founder alleles.


© 2003 American Statistical Association Journal of the American Statistical Association December 2003, Vol. 98, No. 464, Theory and Methods DOI 10.1198/016214503000000990


By letting the penetrance factors depend on a parameter reflecting the strength of the genetic component at the disease locus, the local likelihood-based score function of Whittemore (1996) can be evaluated in great generality. This score function depends on the behavior of the inheritance distribution for weak genetic components and involves pairwise IBD sharing of individuals for outbred pedigrees (i.e., pedigrees without loops). For inbred pedigrees (i.e., pedigrees with loops), the score function involves the dominance components of those individuals who are homozygous by descent (HBD) (i.e., both alleles originate from the same founder) at the disease locus. When the genetic component is strong, the inheritance distribution depends to a larger extent on higher-order factors of $B$, corresponding to simultaneous allele sharing of many individuals. I also derive general expressions for the inheritance distribution when the disease allele frequency tends to 0, meaning that at most one disease allele is present in the pedigree. My approach is valid for any kind of penetrance factors, particularly binary and quantitative phenotypes, with or without covariates. In the context of dichotomous phenotypes and affected pedigree members, McPeek (1999) has derived general expressions for the inheritance distribution and in particular studied which score functions are optimal for various genetic models. Several of her results can be derived as special cases in my setting with arbitrary phenotypes. The practical implication of my approach is a general semiparametric way of choosing score functions and weighting of pedigrees, valid for arbitrary kinds of phenotypes. These can be easily incorporated into existing nonparametric linkage software, such as the Genehunter program (Kruglyak et al. 1996). In Section 2 I define the genetic model, and in Section 3 I introduce stochastic penetrances and bivariate functions of founder alleles.
I use these in Sections 4 and 5 to expand the logarithm of the inheritance distribution or the inheritance distribution itself. In the latter case, I apply the results to binary phenotypes and nonparametric linkage. I consider local genetic models and apply them to Gaussian phenotypes in Section 6. In Section 7, I define the limiting score function when the disease allele frequency tends to 0. In Section 8 I discuss how to put parametric and nonparametric linkage analysis into a general framework and how to choose score functions of practical use in linkage analysis. This is accompanied by a simulation study in Section 9 and a discussion of possible extensions in Section 10. The more technical parts of the article are collected in a series of appendixes.

2. LINKAGE SCORES AND THE GENETIC MODEL

Consider a chromosome with genetic map length of $l$ Morgans. This means that on average, $l$ crossovers occur during each meiosis. Further, the position of each locus $t \in [0, l]$ on the chromosome refers to genetic map distance, meaning that on average, $t$ and $l - t$ crossovers occur to the left and right of that locus. Let $\tau$ be the unknown disease locus. Then, for each $t \in [0, l]$ we wish to test $H_0: \tau = \infty$ against $H_1(t): \tau = t$, where $\tau = \infty$ means that the disease gene is located on another chromosome. In the former case, $t$ is unlinked to $\tau$, and in the latter case the two loci are perfectly linked. Given a number of families (pedigrees) $\mathcal{P}_1, \dots, \mathcal{P}_N$ with unusually large occurrence of


the disease, linkage analysis proceeds by comparing the inheritance pattern of DNA at $t$ (deduced from marker alleles) with inheritance of phenotypes. Following Kruglyak et al. (1996), a possible test statistic for choosing between $H_0$ and $H_1(t)$ is

$$Z(t) = \sum_{i=1}^{N} \gamma_i Z_i(t) \Bigg/ \sqrt{\sum_{i=1}^{N} \gamma_i^2}, \tag{1}$$

where $\{Z_i(t)\}_{i=1}^{N}$ are the family scores at locus $t$ and $\{\gamma_i\}_{i=1}^{N}$ are weights assigned to the various pedigrees. The marker information at $t$ is perfect if the inheritance pattern can be unambiguously determined from the marker genes surrounding $t$. If the family scores are normalized to have 0 expectation and unit variance under $H_0$ under perfect marker information, it follows that $Z(t)$ also has 0 mean and unit variance under $H_0$. The whole process $\{Z(t);\ 0 \le t \le l\}$ can be used for testing $H_0$ against the composite alternative hypothesis $H_1: \tau \in [0, l]$, by comparing the test statistic

$$Z_{\max} = \sup_{0 \le t \le l} Z(t) \tag{2}$$

with a given threshold. $H_0$ is rejected when $Z_{\max}$ exceeds the threshold. (See, e.g., Feingold et al. 1993; Lander and Kruglyak 1995; and Ängquist and Hössjer 2003 for guidance in choosing thresholds to control the significance level.) If $H_0$ is rejected, then a confidence region for $\tau$ can be computed. This region consists of those loci for which $Z_{\max} - Z(t)$ is smaller than a given constant, which is chosen to control the coverage probability (see Hössjer 2003a, b for details).

Unless otherwise stated, from here on I consider a fixed pedigree $\mathcal{P}$ with $n$ individuals. Let $Y_i = (Y_{i1}, \dots, Y_{ir})$ and $x_i = (x_{i1}, \dots, x_{is})$ be the set of phenotypes and covariates of the $i$th individual, $i = 1, \dots, n$. If $f$ is the number of founders and each nonfounder has both of its parents in the pedigree, then the alleles at a certain locus for all pedigree members originate from the $2f$ founder alleles. For each nonfounder, there are two meioses that determine which parental alleles are transmitted. Hence the total number of meioses needed to describe transmission of founder alleles is $m = 2(n - f)$. The inheritance vector $v(t) = (v_1(t), \dots, v_m(t))$ at locus $t$ is a binary vector of length $m$, where $v_k(t)$ is zero or one depending on whether a grandpaternal or grandmaternal allele is transmitted at locus $t$ during the $k$th meiosis. Notice that $v(t)$ is an element of $\mathbb{Z}_2^m$, the additive vector space over the field of two elements. Let $S: \mathbb{Z}_2^m \to \mathbb{R}$ be a score function, where large values of $S(v(t)) = S(v(t); \mathcal{P}, Y)$ indicate a high degree of similarity between the inheritance vector $v(t)$ and the observed phenotypes. The family score of $\mathcal{P}$ at locus $t$ was defined by Kruglyak et al. (1996) as

$$Z(t) = \sum_{w \in \mathbb{Z}_2^m} S(w)\, P\big(v(t) = w \mid \text{marker data for } \mathcal{P}\big), \tag{3}$$

if S has been standardized to have 0 mean and unit variance under H0 . The probability distribution of the inheritance vector given the marker data can be calculated by means of the forward–backward algorithm for hidden Markov models (cf. Lander and Green 1987). Under perfect marker information, (3) reduces to Z.t/ D S.v.t//, the score function evaluated at the inheritance vector at locus t. Even under perfect


marker information, $v(t)$ is not completely known, due to uncertainty of founder phases. But because all score functions of practical interest are invariant with respect to this uncertainty, $S(v(t))$ is known even though $v(t)$ is not (Kruglyak et al. 1996, app. B). An alternative method of calculating $Z(t)$ was proposed by Kong and Cox (1997), using pointwise likelihood ratio tests for an empirical model defined by $S$ and the marker data.

Example 1 (Two score functions). With binary and scalar ($r = 1$) phenotypes, the two possible values $Y_i = 1$ and $Y_i = 0$ refer to "affected" and "unaffected." This is the setup of nonparametric linkage (NPL) analysis. If the disease is rare, including only the affected pedigree members generally involves little loss of information. Then the score function quantifies the degree of IBD allele sharing among the affected individuals, $\mathcal{P}_1 \subset \mathcal{P}$, with known disease status. Let $\mathrm{IBD}_{i_1 i_2} = \mathrm{IBD}_{i_1 i_2}(v)$ be the number of founder alleles shared IBD by $i_1$ and $i_2$, which is either 0, 1, or 2. Whittemore and Halpern (1994) introduced a score function that is a constant multiple of

$$S_{\text{pairs}}(v) = \sum_{i_1 < i_2} \mathrm{IBD}_{i_1 i_2}, \tag{4}$$

[...]

$$P(Y_i \mid G_i) = \begin{cases} P\big(Y_i \mid (00)\big) & \text{if } a_{j_i} = a_{k_i} = 0, \\ P\big(Y_i \mid (10)\big) & \text{if } a_{j_i} = 0,\ a_{k_i} = 1, \text{ or } a_{j_i} = 1,\ a_{k_i} = 0, \\ P\big(Y_i \mid (11)\big) & \text{if } a_{j_i} = a_{k_i} = 1, \end{cases} \tag{10}$$

$P(Y_i \mid G_i)$ is a symmetric bivariate function of $a_{j_i}$ and $a_{k_i}$. Introduce

$$\xi_j = (a_j - p)/\sqrt{pq} \tag{11}$$

and, if $j \neq k$,

$$\xi_{jk} = \xi_j \xi_k. \tag{12}$$

Under random mating, note that $\{a_j\}_{j=1}^{2f}$ are iid with a binomial$(1, p)$ distribution. Hence $\xi_j$ is the standardized version of $a_j$, and

$$\begin{aligned} E(\xi_j) &= E(\xi_{jk}) = 0, \\ E(\xi_j \xi_k) &= 1_{\{j = k\}}, \\ E(\xi_j \xi_{kl}) &= 0, \\ E(\xi_{jk} \xi_{lm}) &= 1_{\{(j,k) = (l,m)\}} + 1_{\{(j,k) = (m,l)\}}. \end{aligned} \tag{13}$$
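The moment identities in (13) can be checked by brute-force enumeration over founder alleles. The following is a minimal sketch, assuming iid Bernoulli($p$) founder alleles as stated above; the helper names `xi` and `expect` and the value `p = 0.2` are illustrative choices, not taken from the article.

```python
from itertools import product
from math import sqrt


def xi(a, p):
    """Standardized founder allele indicator, as in (11)."""
    return (a - p) / sqrt(p * (1.0 - p))


def expect(fn, p, n_alleles):
    """Exact expectation of fn(a_1, ..., a_n) over iid Bernoulli(p) alleles."""
    total = 0.0
    for alleles in product((0, 1), repeat=n_alleles):
        prob = 1.0
        for a in alleles:
            prob *= p if a == 1 else 1.0 - p
        total += prob * fn(*alleles)
    return total


p = 0.2  # hypothetical disease allele frequency

mean_xi = expect(lambda a: xi(a, p), p, 1)                        # E(xi_j)
var_xi = expect(lambda a: xi(a, p) ** 2, p, 1)                    # E(xi_j^2)
mean_xi_jk = expect(lambda a, b: xi(a, p) * xi(b, p), p, 2)       # E(xi_jk)
var_xi_jk = expect(lambda a, b: (xi(a, p) * xi(b, p)) ** 2, p, 2)  # E(xi_jk^2)
# E(xi_j * xi_jk) with a shared index j
cov_j_jk = expect(lambda a, b: xi(a, p) * xi(a, p) * xi(b, p), p, 2)
# E(xi_jk * xi_jl) with k != l: the index pairs differ, so (13) gives 0
cov_jk_jl = expect(
    lambda a, b, c: xi(a, p) * xi(b, p) * xi(a, p) * xi(c, p), p, 3
)

print(mean_xi, var_xi, mean_xi_jk, var_xi_jk, cov_j_jk, cov_jk_jl)
```

Enumeration is feasible here only because each expectation involves at most three alleles; it is meant as a sanity check of (13), not as part of any linkage computation.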

Appendix A describes how to expand bivariate functions of $a_j$ and $a_k$ as linear combinations of the random variables in (11) and (12). This expansion is similar to the variance components technique of Kempthorne (1957), which has been used by a number of authors to derive IBD probabilities for dichotomous phenotypes. Suarez, Rice, and Reich (1978) and Risch (1990b) considered affected sib pairs, and Feingold et al. (1993), Feingold and Siegmund (1997), and Teng and Siegmund (1997) treated other kinds of pedigrees with up to four affected members. In the following two sections I generalize these results by deriving explicit expressions for the inheritance distribution $P(v \mid Y)$ for arbitrary pedigrees and phenotypes.

4. EXPONENTIAL EXPANSION OF PENETRANCES

In this section I expand the logarithm of $P(Y_i \mid G_i)$, which depends on the unknown and stochastic $G_i$. By symmetry, $G_i$ can attain one of three values, $(00)$, $(10)$, or $(11)$. I introduce

$$\begin{aligned} \kappa_{ai} &= \sqrt{pq}\,\Big( p\big(\log P(Y_i \mid (11)) - \log P(Y_i \mid (10))\big) + q\big(\log P(Y_i \mid (10)) - \log P(Y_i \mid (00))\big) \Big), \\ \kappa_{di} &= pq\,\Big( \log P(Y_i \mid (11)) - 2 \log P(Y_i \mid (10)) + \log P(Y_i \mid (00)) \Big), \\ \kappa_{li} &= \sqrt{pq}\,\Big( \log P(Y_i \mid (11)) - \log P(Y_i \mid (00)) \Big), \end{aligned} \tag{14}$$

as the additive, dominant, and loop random fluctuations of $\log P(Y_i \mid G_i)$.
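To make the role of the three quantities in (14) concrete, the sketch below computes them for one individual and checks numerically that they reproduce $\log P(Y_i \mid G_i)$ exactly. The form of the decomposition being checked (mean plus $\kappa_{ai}(\xi_j + \xi_k) + \kappa_{di}\xi_j\xi_k$ for two distinct founder alleles, and mean plus $\kappa_{li}\xi_j$ for an individual carrying the same founder allele twice) is an assumption about how the expansion in Appendix A is applied, since the appendix is not reproduced here. The penetrance values and allele frequency are hypothetical.

```python
from math import log, sqrt


def xi(a, p):
    """Standardized founder allele indicator, as in (11)."""
    return (a - p) / sqrt(p * (1.0 - p))


def kappas(pen00, pen10, pen11, p):
    """Additive, dominant and loop fluctuations of log-penetrance, as in (14)."""
    q = 1.0 - p
    l0, l1, l2 = log(pen00), log(pen10), log(pen11)
    k_a = sqrt(p * q) * (p * (l2 - l1) + q * (l1 - l0))
    k_d = p * q * (l2 - 2.0 * l1 + l0)
    k_l = sqrt(p * q) * (l2 - l0)
    return k_a, k_d, k_l


p = 0.1                       # hypothetical disease allele frequency
q = 1.0 - p
pen = (0.05, 0.30, 0.80)      # hypothetical P(Y_i | (00)), (10), (11)
logs = [log(x) for x in pen]
k_a, k_d, k_l = kappas(*pen, p)

# Two distinct founder alleles (no loop): mean log-penetrance over genotypes.
mean_outbred = q * q * logs[0] + 2 * p * q * logs[1] + p * p * logs[2]
max_err = 0.0
for a_j in (0, 1):
    for a_k in (0, 1):
        exact = logs[a_j + a_k]
        rebuilt = (mean_outbred
                   + k_a * (xi(a_j, p) + xi(a_k, p))
                   + k_d * xi(a_j, p) * xi(a_k, p))
        max_err = max(max_err, abs(exact - rebuilt))

# Same founder allele inherited twice (HBD): genotype is (00) or (11).
mean_hbd = q * logs[0] + p * logs[2]
hbd_err = max(abs(logs[2 * a] - (mean_hbd + k_l * xi(a, p))) for a in (0, 1))

print(max_err, hbd_err, mean_hbd - mean_outbred, k_d)
```

In this check the difference between the two means equals $\kappa_{di}$, which is consistent with a dominance term appearing for homozygous-by-descent individuals.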


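Given any fixed (nonrandom) array $B = (B_{jk})_{1 \le j \le k \le 2f}$, I introduce

$$M_\xi(B) = E\Bigg( \exp\Bigg( \sum_{j=1}^{2f} B_{jj}\,\xi_j + \sum_{1 \le j < k \le 2f} B_{jk}\,\xi_{jk} \Bigg) \Bigg),$$

[...]

$$B_{jk} = B_{jk}(v) = \begin{cases} \displaystyle\sum_{i \in R_{jk}} \kappa_{di}, & j < k, \\ \displaystyle\sum_{i \in R_j} \kappa_{ai} + \sum_{i \in \bar R_j} \kappa_{li}, & j = k, \end{cases} \tag{16}$$

$$\begin{aligned} R &= R(v) = \{ i \in \mathcal{P}_{\text{known}};\ j_i \neq k_i \}, \\ \bar R &= \bar R(v) = \{ i \in \mathcal{P}_{\text{known}};\ j_i = k_i \}, \\ R_j &= R_j(v) = \{ i \in R;\ j_i = j \text{ or } k_i = j \}, \\ R_{jk} &= R_{jk}(v) = \{ i \in R;\ (j_i, k_i) = (j, k) \text{ or } (k, j) \}, \\ \bar R_j &= \bar R_j(v) = \{ i \in \bar R;\ j_i = k_i = j \}, \end{aligned} \tag{17}$$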

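and $\mathcal{P}_{\text{known}}$ is the collection of all pedigree members with known disease phenotypes. Note that $\bar R(v) = \emptyset$ for all $v$ if the pedigree contains no loops. Thus, the first factor on the right side of (15) vanishes for outbred pedigrees, and then $P(v \mid Y)$ can be expressed solely in terms of the moment-generating function of the array $\xi$.

Example 5 (Survival analysis, continued). Consider the survival analysis model of Example 4. Then

$$\log P(Y_i \mid G_i) = \begin{cases} -e^{\alpha |G_i|} \Lambda_i, & Y_i = 0, \\ \alpha |G_i| + \log \lambda_i(t_i) - e^{\alpha |G_i|} \Lambda_i, & Y_i = 1, \end{cases}$$

where $\lambda_i(t) = \lambda_0(t) \exp\big(\sum_{l=2}^{p} \beta_l x_{il}\big)$ and $\Lambda_i = \int_0^{t_i} \lambda_i(u)\, du$. Inserting this into (14) yields

$$\kappa_{ai} = -\Lambda_i \kappa_a^{I} + 1_{\{Y_i = 1\}} \kappa_a^{II},$$

[...]

$$B_{jk} = \begin{cases} \displaystyle -\kappa_a^{I} \sum_{i \in R_j} \Lambda_i + \kappa_a^{II} |R_j^1|, & j = k, \\ \displaystyle -\kappa_d^{I} \sum_{i \in R_{jk}} \Lambda_i + \kappa_d^{II} |R_{jk}^1|, & j < k, \end{cases}$$

with $R_j^1 = \{ i \in R_j;\ Y_i = 1 \}$ and $R_{jk}^1 = \{ i \in R_{jk};\ Y_i = 1 \}$.

We know that $\{\xi_j\}_{j=1}^{2f}$ are iid and, from (13), all components of $\xi = \{\xi_j\}_{j=1}^{2f} \cup \{\xi_{jk}\}_{1 \le j < k \le 2f}$ are uncorrelated. Thus it is tempting to treat $\xi$ as if it had independent components and try the approximation

$$P(v \mid Y) \mathrel{\overset{\approx}{\propto}} \exp\Bigg( \sum_{i \in \bar R} \kappa_{di} \Bigg) \prod_{j=1}^{2f} M_{\xi_j}(B_{jj}) \cdot \prod_{1 \le j < k \le 2f} M_{\xi_{jk}}(B_{jk})$$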
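For a small number of founder alleles, the moment-generating function $M_\xi(B)$ can be evaluated exactly by enumeration and compared with the product of marginal moment-generating functions used in the independence approximation. The sketch below assumes iid Bernoulli($p$) founder alleles; the array `B`, the value of `p`, and the function names are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import exp, sqrt


def xi(a, p):
    """Standardized founder allele indicator, as in (11)."""
    return (a - p) / sqrt(p * (1.0 - p))


def mgf_exact(B, p):
    """M_xi(B) by summing over all founder allele configurations.

    B is an n x n nested list; only entries with j <= k are used, with
    B[j][j] multiplying xi_j and B[j][k] (j < k) multiplying xi_j * xi_k.
    """
    n = len(B)
    total = 0.0
    for alleles in product((0, 1), repeat=n):
        prob = 1.0
        for a in alleles:
            prob *= p if a == 1 else 1.0 - p
        x = [xi(a, p) for a in alleles]
        exponent = sum(B[j][j] * x[j] for j in range(n))
        exponent += sum(B[j][k] * x[j] * x[k]
                        for j in range(n) for k in range(j + 1, n))
        total += prob * exp(exponent)
    return total


def mgf_product(B, p):
    """Product of marginal MGFs, treating all components of xi as independent."""
    n = len(B)
    value = 1.0
    for j in range(n):
        value *= mgf_exact([[B[j][j]]], p)                      # M_{xi_j}(B_jj)
        for k in range(j + 1, n):
            value *= mgf_exact([[0.0, B[j][k]], [0.0, 0.0]], p)  # M_{xi_jk}(B_jk)
    return value


p = 0.1  # hypothetical disease allele frequency
# Hypothetical upper-triangular B for 2f = 4 founder alleles.
B = [[0.3, 0.2, 0.0, 0.0],
     [0.0, 0.1, 0.0, 0.4],
     [0.0, 0.0, -0.2, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
B_diag = [[B[j][k] if j == k else 0.0 for k in range(4)] for j in range(4)]
B_zero = [[0.0] * 4 for _ in range(4)]

exact = mgf_exact(B, p)
approx = mgf_product(B, p)
print(exact, approx)
```

When `B` has only diagonal entries the two functions agree, because the $\xi_j$ themselves are independent; any discrepancy comes from the products $\xi_{jk}$, which are uncorrelated with the other components but not independent of them. The enumeration costs $2^{2f}$ terms, so it is only a check for small pedigrees.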

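[...]

$$\tilde T_{i_1 i_2} = \begin{cases} \cdots + \tilde\kappa_{d i_1} \tilde\kappa_{d i_2} 1_{\{\mathrm{IBD}_{i_1 i_2} = 2\}}, & i_1, i_2 \in R, \\ \tilde\kappa_{a i_1} \tilde\kappa_{l i_2}\, \mathrm{IBD}_{i_1 i_2}, & i_1 \in R,\ i_2 \in \bar R, \\ \tilde\kappa_{d i_1} \tilde\kappa_{d i_2} + \tilde\kappa_{l i_1} \tilde\kappa_{l i_2}\, \mathrm{IBD}_{i_1 i_2}/2, & i_1, i_2 \in \bar R. \end{cases} \tag{24}$$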


When $|Q| \ge 3$, $\tilde T_Q$ has quite a complicated expression when $Q \cap \bar R \neq \emptyset$. I thus restrict the investigation to outbred pedigrees, $\bar R = \emptyset$. Given any array $b = (b_{jk};\ 1 \le j \le k \le 2f)$ of natural numbers, I define the $b$th central moment of $\xi$,

$$\nu_b = E(\xi^b) = E\Bigg( \prod_{j=1}^{2f} \xi_j^{b_{jj}} \cdot \prod_{1 \le j < k \le 2f} \xi_{jk}^{b_{jk}} \Bigg).$$

[...]

$$b_{jk}(u) = \begin{cases} \cdots, & j = k, \\ \big| \{ i \in Q;\ u_i = 2 \text{ and } (j_i, k_i) = (j, k) \text{ or } (k, j) \} \big|, & j < k. \end{cases}$$

For additive models ($\tilde\kappa_{di} \equiv 0$), all terms with $Q_2(u) \neq \emptyset$ will vanish in (26). Hence, only one allele is picked from each individual. Notice further that (26) also holds for $|Q| = 1, 2$ when $\bar R = \emptyset$, yielding $\tilde T_i = 0$ and $\tilde T_{i_1 i_2}$ equal to the first line of (24). I gave an alternative definition of $\tilde T_Q$ in terms of IBD-sharing quantities in earlier work (Hössjer 2001). I can now state the following result.

Proposition 2. The conditional distribution of the inheritance vector can be expanded as

$$P(v \mid Y) \propto 1 + \sum_{l=1}^{|\mathcal{P}_{\text{known}}|} \tilde S_l(v), \tag{27}$$

where

$$\tilde S_l(v) = \sum_{\substack{Q \subset \mathcal{P}_{\text{known}} \\ |Q| = l}} \tilde T_Q \tag{28}$$

involves the simultaneous allele sharing of $l$ individuals, $l = 1, \dots, |\mathcal{P}_{\text{known}}|$.

Example 6 (Binary phenotypes, continued). As a particular case of Example 2 without covariates, I consider binary phenotypes

$$P(Y_i = 1 \mid G_i) = f_{|G_i|}, \tag{29}$$

where $f_0$, $f_1$, and $f_2$ are the penetrance probabilities of the disease [i.e., $f_i = 1/(1 + \exp(-\alpha_i))$ in Example 2]. Based on this binary model, Feingold et al. (1993) and Teng and Siegmund (1997) have established expansions of the kind given in (27) for additive models and outbred pedigrees with up to four affected members. The prevalence and the additive and dominant genetic variance components are

$$\begin{aligned} K_p &= P(Y_i = 1) = q^2 f_0 + 2pq f_1 + p^2 f_2, \\ \tilde\sigma_a^2 &= 2pq \big( p(f_2 - f_1) + q(f_1 - f_0) \big)^2, \\ \tilde\sigma_d^2 &= p^2 q^2 (f_2 - 2 f_1 + f_0)^2. \end{aligned} \tag{30}$$
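The quantities in (30) are straightforward to compute. The sketch below evaluates them for hypothetical penetrances and additionally checks that the two components add up to the variance of the penetrance $f_{|G_i|}$ over genotypes drawn under random mating; that check is an added sanity test based on the usual variance decomposition, not a statement taken from the text. Function names and numerical values are illustrative assumptions.

```python
def binary_components(f0, f1, f2, p):
    """Prevalence and additive/dominance variance components, as in (30)."""
    q = 1.0 - p
    prevalence = q * q * f0 + 2.0 * p * q * f1 + p * p * f2
    var_additive = 2.0 * p * q * (p * (f2 - f1) + q * (f1 - f0)) ** 2
    var_dominance = (p * q) ** 2 * (f2 - 2.0 * f1 + f0) ** 2
    return prevalence, var_additive, var_dominance


def penetrance_variance(f0, f1, f2, p):
    """Variance of f_{|G|} when the genotype follows Hardy-Weinberg proportions."""
    q = 1.0 - p
    k = q * q * f0 + 2.0 * p * q * f1 + p * p * f2
    return (q * q * (f0 - k) ** 2
            + 2.0 * p * q * (f1 - k) ** 2
            + p * p * (f2 - k) ** 2)


p = 0.05                      # hypothetical disease allele frequency
f = (0.01, 0.30, 0.90)        # hypothetical penetrances f0, f1, f2
K_p, var_a, var_d = binary_components(*f, p)
total = penetrance_variance(*f, p)

# With f1 midway between f0 and f2, the dominance component is zero.
_, _, var_d_additive = binary_components(0.1, 0.3, 0.5, 0.2)

print(K_p, var_a, var_d, total)
```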

Then, with $R = K_p/(1 - K_p)$ the prevalence odds ratio, as shown in Appendix D,

$$\tilde S_1(v) \propto \sum_{i \in \mathcal{P}_1} \mathrm{HBD}_i - R \sum_{i \in \mathcal{P}_0} \mathrm{HBD}_i, \tag{31}$$

where $\mathcal{P}_0$ and $\mathcal{P}_1$ are the subsets of individuals in $\mathcal{P}_{\text{known}}$ with $Y_i = 0$ and $Y_i = 1$. For small prevalences, $\tilde S_1$ approximately equals $S_{\#\text{aff HBD}} = \sum_{i \in \mathcal{P}_1} \mathrm{HBD}_i$. This is the number of affected individuals homozygous by descent, a score function introduced by McPeek (1999). For outbred pedigrees,

$$\tilde S_2(v) \propto .5(1 - c) \cdot \Bigl( \sum_{i_1 < \cdots} \mathrm{IBD}_{i_1 i_2} - R \sum \mathrm{IBD}_{i_1 i_2}\ \cdots$$

[...]

[...] $> 0$), higher-order cumulants of $\xi$ and larger exponents of $B'_{jk}(0)$ will dominate. This corresponds to simultaneous allele sharing of more than two individuals. In principle, $T_Q$ can be defined for subsets $Q \subset \mathcal{P}_{\text{known}}$ with more than two individuals, although the complexity of the expressions quickly increases with $|Q|$.

Example 7 (Gaussian phenotypes). The penetrance parameters in Example 3 are $\psi = (m_0, m_1, m_2, \beta, \sigma^2)$, where $\beta = (\beta_1, \dots, \beta_s)^T$ is the vector of regression parameters. Let $m_0^*$, $m_1^*$, and $m_2^*$ be the mean parameters of a fixed reference model. When $\beta = 0$, the reference model has mean phenotype $E(Y_i) = q^2 m_0^* + 2pq\, m_1^* + p^2 m_2^* =: m$ and total variance $V(Y_i) = \sigma_g^2 + \sigma^2$, where $\sigma_g^2 = q^2 (m_0^* - m)^2 + 2pq (m_1^* - m)^2 + p^2 (m_2^* - m)^2$ is the genetic variance. This can be split into additive and dominant variance components as $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$, where

$$\begin{aligned} \sigma_a^2 &= 2pq \big( p(m_2^* - m_1^*) + q(m_1^* - m_0^*) \big)^2, \\ \sigma_d^2 &= (pq)^2 (m_2^* - 2 m_1^* + m_0^*)^2. \end{aligned} \tag{41}$$

Introduce a one-parameter trajectory of penetrance parameters in a direction toward the reference model,

$$\psi_\varepsilon = \big( m + \varepsilon\sigma (m_0^* - m)/\sigma_g,\ m + \varepsilon\sigma (m_1^* - m)/\sigma_g,\ m + \varepsilon\sigma (m_2^* - m)/\sigma_g,\ \beta,\ \sigma^2 \big),$$

for $\varepsilon \ge 0$. Note that $E(Y_i) = m$ independently of $\varepsilon$, whereas the heritability (i.e., the ratio between genetic and total variance) is $\varepsilon^2/(\varepsilon^2 + 1)$ and hence grows with $\varepsilon$.
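The two claims about the trajectory $\psi_\varepsilon$ follow from a short calculation, sketched here for $\beta = 0$. The genotype weights $w_0 = q^2$, $w_1 = 2pq$, $w_2 = p^2$ and the shorthand $m_k(\varepsilon)$ for the genotype means under $\psi_\varepsilon$ are notation introduced only for this sketch.

```latex
% Genotype means along the trajectory, k = 0, 1, 2:
%   m_k(\varepsilon) = m + \varepsilon \sigma (m_k^* - m) / \sigma_g .
\begin{align*}
E(Y_i)
  &= \sum_{k=0}^{2} w_k\, m_k(\varepsilon)
   = m + \frac{\varepsilon\sigma}{\sigma_g} \sum_{k=0}^{2} w_k (m_k^* - m)
   = m,
   && \text{since } \sum_k w_k = 1,\ \sum_k w_k m_k^* = m, \\
\text{genetic variance}
  &= \sum_{k=0}^{2} w_k \big( m_k(\varepsilon) - m \big)^2
   = \frac{\varepsilon^2\sigma^2}{\sigma_g^2} \sum_{k=0}^{2} w_k (m_k^* - m)^2
   = \varepsilon^2 \sigma^2, \\
\text{heritability}
  &= \frac{\varepsilon^2\sigma^2}{\varepsilon^2\sigma^2 + \sigma^2}
   = \frac{\varepsilon^2}{\varepsilon^2 + 1}.
\end{align*}
```

The residual variance $\sigma^2$ is held fixed along the trajectory, so $\varepsilon$ alone controls the strength of the genetic component.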


Let $r_i = \big( Y_i - m - \sum_{l=1}^{s} x_{il} \beta_l \big)/\sigma$ be the $i$th standardized residual. It is proved in Appendix E that the likelihood score function for an outbred pedigree is

$$S = S_2 - E_0(S_2) = 2 \sum_{i_1 < \cdots} r_{i_1} r_{i_2} \big( C_{i_1 i_2} - E_0(C_{i_1 i_2}) \big), \tag{42}$$

[...]
