
Behavior Genetics, Vol. 34, No. 1, January 2004 (© 2004)

Mapping Multiple Quantitative Trait Loci for Ordinal Traits

Nengjun Yi,1,2,5 Shizhong Xu,4 Varghese George,1,2 and David B. Allison1,2,3

Received 23 Jan. 2003; Final 3 July 2003

Many complex traits in humans and other organisms show ordinal phenotypic variation but do not follow a simple Mendelian pattern of inheritance. These ordinal traits are presumably determined by many factors, including genetic and environmental components. Several statistical approaches to mapping quantitative trait loci (QTL) for such traits have been developed based on a single-QTL model. However, statistical methods for mapping multiple QTL for ordinal traits are not as well studied as those for continuous traits. In this paper, we propose a Bayesian method implemented via the Markov chain Monte Carlo (MCMC) algorithm to map multiple QTL for ordinal traits in experimental crosses. We model the ordinal traits under the multiple threshold model, which assumes a latent continuous variable underlying the ordinal phenotypes. The ordinal phenotype and the latent continuous variable are linked through some fixed but unknown thresholds. We adopt a standardized threshold model, which has several attractive features. An efficient sampling scheme is developed to jointly generate the threshold values and the values of the latent variable. With the simulated latent variable, the posterior distributions of other unknowns, for example, the number, locations, genetic effects, and genotypes of QTL, can be computed using existing algorithms for normally distributed traits. We thereby provide a unified approach to mapping multiple QTL for continuous, binary, and ordinal traits. The utility and flexibility of the method are demonstrated using simulated data.

KEY WORDS: Bayesian analysis; Markov chain Monte Carlo; quantitative trait loci; reversible jump; multiple threshold model.

1 Department of Biostatistics.
2 Section on Statistical Genetics.
3 Clinical Nutrition Research Center, University of Alabama at Birmingham, Birmingham, Alabama.
4 Department of Botany and Plant Sciences, University of California, Riverside, California.
5 To whom correspondence should be addressed at Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama 35294-0022. Tel: 205-934-4924. Fax: 205-975-2540. e-mail: [email protected]

0001-8244/04/0100-0003/0 © 2004 Plenum Publishing Corporation

INTRODUCTION

Many human diseases show a binary or ordinal phenotype, including type II diabetes, schizophrenia, and many congenital abnormalities, such as spina bifida. In animal and plant breeding, many traits, such as resistance to disease and degree of calving difficulty, are categorically scored. Although the phenotypes of these characters are discrete, their inheritance is determined by many factors, including multiple genes (QTL) and environmental components. In the past several years, several statistical methods for identifying single QTL for binary or ordinal traits have been developed in line crosses (Hackett and Weller, 1995; Rao and Xu, 1998; Rebai, 1997; Visscher et al., 1996; Xu and Atchley, 1996; Yi and Xu, 1999a, b) and human families (Duggirala et al., 1997; Williams et al., 1999). All these approaches were developed based on maximum likelihood or simple least squares methods and are not directly extendable to analyzing multiple QTL. It has been shown both theoretically and empirically that multiple-QTL methods can improve power in detecting QTL and eliminate biases in estimates of QTL locations and genetic effects that can be introduced by using a single-QTL model (e.g., Haley and Knott, 1992; Martinez and Curnow, 1992). For continuous traits, several statistical methods have been developed to detect multiple QTL and estimate their locations and effects simultaneously, including multiple interval mapping (Kao et al., 1999; Zeng et al., 2000) and Bayesian methodology with the reversible jump MCMC (Heath, 1997; Satagopan and Yandell, 1996; Sillanpää and Arjas, 1998; Stephens and Fisch, 1998).

In quantitative genetics, ordinal traits are usually explained by the concept of the threshold model (Falconer and Mackay, 1996; Lynch and Walsh, 1998; Wright, 1934), which assumes a latent continuous variable (called the liability) underlying a categorical trait. The categorical phenotype and the continuous liability are linked through some fixed but unknown thresholds. One can treat the liability as an unobservable quantitative trait. Genes controlling complex categorical traits can be treated as QTL and handled using a QTL mapping approach. Under the threshold model, Bayesian statistics and reversible jump MCMC algorithms have recently been used to search for multiple QTL and estimate their locations and effects for complex binary traits in experimental crosses (Yi and Xu, 2000a; Yi and Xu, 2002). However, statistical methods for mapping multiple QTL for ordinal traits are lacking. Statistical methods for ordinal data analysis are more challenging than those for binary data because we have to deal with not only the unobserved liability but also a set of unknown threshold values (Albert and Chib, 1993; Harville and Mee, 1984; Sorensen et al., 1995).

In this paper, we propose a Bayesian method and MCMC algorithm to map multiple QTL for ordinal traits in experimental crosses. We first introduce a reparameterized threshold model, which has several attractive features for analyzing ordinal data. A joint sampling scheme is then developed to generate the liability and the threshold values simultaneously. This sampling step, combined with the algorithms for normally distributed traits, enables us to infer the number, locations, and genetic effects of QTL simultaneously. Finally, we outline several future research directions for analyzing ordinal traits.

THE MULTIPLE THRESHOLD MODEL

For ordinal traits, we observe an ordinal response $w_i$ on the $i$th individual, where $w_i$ takes one of $J$ ordered categories, $1, \ldots, J$, for $i = 1, 2, \ldots, n$. Suppose that the ordinal phenotype is the expression of an underlying continuous variable (liability). Let $y_i^*$ represent the value of the liability associated with the $i$th individual. The relationship between the unobservable liability $y_i^*$ and the observed phenotype $w_i$ is

$$w_i = j \iff t_{j-1}^* \le y_i^* < t_j^*,$$

where $j \in \{1, 2, \ldots, J\}$, $t_0^* = -\infty$, $t_J^* = +\infty$, and $t_1^*, \ldots, t_{J-1}^*$ are unknown threshold values that divide the real line into $J$ intervals. Thus, when the realized value of $y_i^*$ belongs to the $j$th interval, the ordinal response $w_i$ is $j$. We only describe our mapping method for an F2 design. However, the method can be applied to other experimental designs, such as backcrosses, recombinant inbred lines, and four-way crosses.

Under the threshold model, the liability is treated as a usual quantitative character. If no epistasis is assumed, the liability follows the linear model

$$y_i^* = \mu^* + \sum_{q=1}^{l} (u_{iq} a_q^* + v_{iq} d_q^*) + e_i^*, \tag{1}$$

where $\mu^*$ is the overall mean; $l$ is the number of QTL affecting the liability on the genome; $a_q^*$ and $d_q^*$ are the additive and dominance effects of the $q$th QTL, respectively; and $u_{iq}$ and $v_{iq}$ are the indicator variables for the genotype of the $q$th QTL for the $i$th individual, which are defined as

$$u_{iq} = \begin{cases} +1 & \text{for } Q_1Q_1 \\ 0 & \text{for } Q_1Q_2 \\ -1 & \text{for } Q_2Q_2 \end{cases} \quad \text{and} \quad v_{iq} = \begin{cases} +\tfrac{1}{2} & \text{for } Q_1Q_2 \\ -\tfrac{1}{2} & \text{for } Q_1Q_1 \text{ or } Q_2Q_2, \end{cases}$$

where $Q_1Q_1$, $Q_1Q_2$, and $Q_2Q_2$ denote the three genotypes at the $q$th QTL, and $e_i^*$ is the residual error assumed to follow $N(0, \sigma^{*2})$.

Note that the inequality $t_{j-1}^* \le y_i^* < t_j^*$ can be re-expressed as

$$(t_{j-1}^* - t_1^*)/\sigma^* \le (y_i^* - t_1^*)/\sigma^* < (t_j^* - t_1^*)/\sigma^*.$$

Therefore we can take $(y_i^* - t_1^*)/\sigma^*$ as the underlying variable, and model (1) can be rewritten as

$$\frac{y_i^* - t_1^*}{\sigma^*} = \frac{\mu^* - t_1^*}{\sigma^*} + \sum_{q=1}^{l} \left( u_{iq} \frac{a_q^*}{\sigma^*} + v_{iq} \frac{d_q^*}{\sigma^*} \right) + \frac{e_i^*}{\sigma^*}. \tag{2}$$

Thus the threshold model can be reformulated in terms of a reparameterized threshold model whose underlying continuous variables correspond to $(y_i^* - t_1^*)/\sigma^*$, $i = 1, \ldots, n$, whose threshold values correspond to $(t_j^* - t_1^*)/\sigma^*$, $j = 1, \ldots, J$, whose overall mean, additive effects, and dominance effects correspond to $(\mu^* - t_1^*)/\sigma^*$, $a_q^*/\sigma^*$, and $d_q^*/\sigma^*$, $q = 1, \ldots, l$, respectively, and whose residual errors correspond to $e_i^*/\sigma^*$. In the latter threshold model, the residual variance equals 1 and the first threshold value equals 0. For binary traits, where $J = 2$, there are therefore no unknown threshold values. For ordinal traits, we further consider the following reparameterization:

$$\sigma = \frac{\sigma^*}{t_{J-1}^* - t_1^*}, \quad t_j = \sigma \frac{t_j^* - t_1^*}{\sigma^*}, \quad \mu = \sigma \frac{\mu^* - t_1^*}{\sigma^*}, \quad a_q = \sigma \frac{a_q^*}{\sigma^*}, \quad d_q = \sigma \frac{d_q^*}{\sigma^*}, \quad e_i = \sigma \frac{e_i^*}{\sigma^*}, \quad \text{and} \quad y_i = \sigma \frac{y_i^* - t_1^*}{\sigma^*}, \tag{3}$$

for all $i$, $j$, and $q$. With reparameterization (3), the threshold model (1) becomes

$$y_i = \mu + \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q) + e_i. \tag{4}$$
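Reparameterization (3) can be checked numerically. The sketch below is illustrative only: the function name `standardize` and its argument layout are assumptions made for this example, and the input values are the simulation settings reported later in the paper for data set I (overall mean 0.5, residual variance 1, thresholds −1, 0.1, 0.7, 1.5, and additive effects 0.64 and 0.54). It assumes at least two thresholds, that is, J ≥ 3.

```python
def standardize(mu_star, sigma_star, thresholds_star, effects_star):
    """Apply reparameterization (3) to original-scale quantities.

    thresholds_star holds t_1*, ..., t_{J-1}* in increasing order (J >= 3).
    effects_star is any list of additive or dominance effects.
    """
    t_first, t_last = thresholds_star[0], thresholds_star[-1]
    sigma = sigma_star / (t_last - t_first)   # new residual standard deviation
    scale = sigma / sigma_star                # common multiplier in (3)
    mu = scale * (mu_star - t_first)
    thresholds = [scale * (t - t_first) for t in thresholds_star]
    effects = [scale * e for e in effects_star]
    return mu, sigma, thresholds, effects


# Hypothetical call using the data set I settings quoted in the lead-in.
mu, sigma, thresholds, effects = standardize(
    0.5, 1.0, [-1.0, 0.1, 0.7, 1.5], [0.64, 0.54]
)
print(mu, sigma ** 2, thresholds, effects)
```

Under these inputs the first and last standardized thresholds come out as 0 and 1, and the residual variance becomes 0.16, which agrees with the first row of Table III.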

Under model (4), $y_i$ is the liability of the $i$th individual, $\mu$ is the overall mean, $a_q$ and $d_q$ are the additive and dominance effects of the $q$th QTL for the liability $y_i$, respectively, and $e_i$ is the residual error distributed as $N(0, \sigma^2)$. The relationship between the liability $y_i$ and the observed phenotype $w_i$ is

$$w_i = j \quad \text{if } t_{j-1} \le y_i < t_j,$$

where the reparameterized thresholds are $-\infty = t_0 < t_1 = 0 < t_2 < \cdots < t_{J-2} < t_{J-1} = 1 < t_J = +\infty$. Hereafter, we refer to model (4) and model (2) as the standardized threshold models for ordinal traits and binary traits, respectively. The standardized threshold models have several attractive features (see Discussion). Our statistical method for mapping multiple QTL is developed based on the standardized threshold models.

JOINT POSTERIOR DISTRIBUTION AND PRIOR DISTRIBUTIONS

In QTL studies, we observe the ordinal responses $w = \{w_i\}_{i=1}^{n}$ and a set of marker genotypes $M = \{M_{jk}\}_{j=1,k=1}^{n,K}$, where $n$ is the number of individuals in the mapping population and $K$ is the number of markers. For an F2 population, $M_{jk}$ takes one of three values denoting three different genotypes. Map positions of markers are assumed to be known, and Haldane's mapping function is used to convert genetic distances into recombination fractions. Our goal is to make inference about the number of QTL $l$, their locations $\lambda = \{\lambda_q\}_{q=1}^{l}$, and their additive effects $a = \{a_q\}_{q=1}^{l}$ and dominance effects $d = \{d_q\}_{q=1}^{l}$, where $\lambda_q$ is the location of the $q$th QTL represented by the distance of the QTL from one end of the corresponding chromosome. We denote the vector of all model parameters by $\theta = (a, d, \mu, \sigma^2)$. The genotypes of putative QTL are usually unobserved, and thus the coefficients of model (4), $u = \{u_{iq}\}$ and $v = \{v_{iq}\}$, are missing values. Note that the $v_{iq}$ are determined by the $u_{iq}$ and thus can be suppressed in the list of unknowns. We treat the values of the liability as missing values and generate them from their posterior distributions. With the generated values, the threshold model is then the same as the linear model for normally distributed traits. The dependencies among the observed data $\{w, M\}$ and the unknowns $(l, \lambda, \theta, u, t, y)$ are described by a directed acyclic graph, with the explanation of variables given in Fig. 1.

The purpose of the Bayesian analysis is to infer the distribution of unknown parameters conditional on the observables, called the posterior distribution. The posterior distribution contains all information about the unknowns and thus can be used to make statistical inference about the parameters of interest. For the multiple threshold model, the joint posterior distribution of all unknowns, given the observed data and prior information for unknowns, can be written as

$$p(l, \lambda, \theta, u, t, y \mid w, M) \propto p(w \mid y, t)\, p(y \mid l, \theta, u)\, p(u \mid l, \lambda, M)\, p(l, \lambda, \theta, t). \tag{5}$$
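The genotype term in (5) depends on recombination fractions between markers and putative QTL, which the text obtains from Haldane's mapping function. As a minimal sketch, the standard form of that function and its inverse can be written as follows; the function names are chosen for this example and are not taken from the paper.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance given in centimorgans."""
    d = distance_cm / 100.0                 # convert cM to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_distance(r):
    """Inverse map: distance in cM for a recombination fraction 0 <= r < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * r)


# Adjacent markers in the simulation study are 5 cM apart.
r_5cm = haldane_recombination(5.0)
print(round(r_5cm, 6))
```

For short intervals the recombination fraction is close to the distance in Morgans, and it approaches 0.5 as the distance grows, which is why distant loci behave as unlinked.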

The first term in this equation is

$$p(w \mid y, t) = \prod_{i=1}^{n} p(w_i \mid y_i, t) = \prod_{i=1}^{n} \left[ \sum_{j=1}^{J} 1(t_{j-1} \le y_i < t_j)\, 1(w_i = j) \right],$$

where $1(X \in A)$ is an indicator function, taking a value of 1 if $X \in A$ is true and 0 otherwise. The second term is the conditional distribution of the liability given all unknowns and has the following form

$$p(y \mid l, \theta, u) = \prod_{i=1}^{n} p(y_i \mid l, \theta, u) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{\sum_{i=1}^{n} \left[ y_i - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q) \right]^2}{2\sigma^2} \right\}.$$

The third term, $p(u \mid l, \lambda, M)$, is the conditional distribution of the genotypes of the putative QTL.

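The first two terms of the joint posterior (5) are simple to evaluate on the log scale. The following sketch is only an illustration of those two formulas: the function names, the nested-list data layout, and the toy inputs are assumptions for this example, not part of the original method description. The threshold vector is stored in full, from $t_0 = -\infty$ to $t_J = +\infty$.

```python
import math


def log_p_w_given_y_t(w, y, t):
    """log p(w | y, t): 0 if every y_i lies in the interval of its category,
    otherwise -inf. t = [t_0, ..., t_J] with t_0 = -inf and t_J = +inf."""
    for w_i, y_i in zip(w, y):
        if not (t[w_i - 1] <= y_i < t[w_i]):
            return -math.inf
    return 0.0


def log_p_y(y, u, v, a, d, mu, sigma2):
    """log p(y | l, theta, u): Gaussian log density of the liabilities.
    u[i][q] and v[i][q] are the genotype indicators of individual i at QTL q."""
    n = len(y)
    sse = 0.0
    for i in range(n):
        mean_i = mu + sum(u[i][q] * a[q] + v[i][q] * d[q] for q in range(len(a)))
        sse += (y[i] - mean_i) ** 2
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)


# Toy inputs (illustrative only): two individuals, one QTL, five categories.
t = [-math.inf, 0.0, 0.44, 0.68, 1.0, math.inf]
u = [[1], [0]]            # Q1Q1 and Q1Q2
v = [[-0.5], [0.5]]
y = [0.65, 0.55]
consistent = log_p_w_given_y_t([3, 3], y, t)
inconsistent = log_p_w_given_y_t([3, 2], y, t)
log_density = log_p_y(y, u, v, a=[0.2], d=[0.1], mu=0.5, sigma2=0.16)
```

Because the first term is an indicator, any liability that falls outside the interval implied by its observed category gives the whole posterior zero mass, which is what forces the liabilities to be drawn from truncated distributions.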

Fig. 1. Directed acyclic graph for the Bayesian QTL mapping model. Boxes indicate observed data ($w$ and $M$) or predetermined priors for unknowns. Ellipses indicate unknown parameters ($l$, $\lambda$, $a$, $d$, $\mu$, $\sigma^2$, and $t$) or missing values ($u$, $v$, and $y$). Parameter definitions and prior distributions are discussed in more detail in the text.

Assuming that there is at most one QTL on any marker interval, $p(u \mid l, \lambda, M)$ can be factorized into the following products

$$p(u \mid l, \lambda, M) = \prod_{i=1}^{n} \prod_{q=1}^{l} p\left(u_{iq} \mid \lambda_q, m_{iq}^{L}, m_{iq}^{R}\right),$$

where $m_{iq}^{L}$ and $m_{iq}^{R}$ represent the left and the right flanking marker genotypes, respectively.

To implement Bayesian analysis, we need to specify the prior distributions on $l$, $\lambda$, $\theta$, and $t$. Assuming prior conditional independence of the parameters, we can factorize the joint prior distribution into the following products

$$p(l, \lambda, \theta, t) = p(l)\, p(\lambda \mid l)\, p(\mu)\, p(\sigma^2) \prod_{q=1}^{l} \left[ p(a_q)\, p(d_q) \right] p(t).$$

The prior distribution for $l$ is chosen to be a uniform distribution between 0 and a prespecified integer $l_{\max}$. A common choice for the prior of $\lambda$, when no information regarding the locations is available, is uniform over the entire genome. The priors for the overall mean $\mu$ and the QTL effects $a_q$, $d_q$ ($q = 1, 2, \ldots, l$) are assumed to be independently normal, that is, $\mu \sim N(\eta_0, \tau_0^2)$ and $a_q, d_q \sim N(\eta, \tau^2)$, with prespecified prior means $\eta_0$ and $\eta$ and variances $\tau_0^2$ and $\tau^2$. The prior for $\sigma^2$ is assumed to be a scaled inverted chi-square distribution with known hyperparameter values $\nu_0$ and $\sigma_0^2$, so that $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$. Finally, we take a uniform prior on $t = (t_2, \ldots, t_{J-2})$, that is, $p(t) \propto 1$ for $0 < t_2 < \cdots < t_{J-2} < 1$.

MARKOV CHAIN MONTE CARLO ALGORITHM

The joint posterior distribution above is analytically intractable, and thus the Markov chain Monte Carlo (MCMC) approach is required to obtain observations from it. We use the reversible jump MCMC algorithm of Green (1995) to perform the posterior computation. To sample from the posterior distribution, we need to generate $l$, $\lambda$, $\theta$, $u$, $t$, and $y$ from their respective conditional distributions. Conditional on $l$, $u$, $\lambda$, and $y$, model (4) is a conventional linear model, and thus $\theta$ can be sampled using a Gibbs sampler (Gelman et al., 1995; Yi and Xu, 2000a; Yi and Xu, 2002). The QTL locations strongly depend on the QTL genotypes, and hence we adopt a Metropolis-Hastings algorithm to jointly update the locations and the genotypes of QTL (e.g., Uimari and Sillanpää, 2001; Xu and Yi, 2000; Yi and Xu, 2002). The dimension of the parameter space is determined by the number of QTL $l$, and thereby updating $l$ requires a reversible jump step (Yi and Xu 2000a, 2002). For the multiple threshold model used


here, we need an algorithm to generate $y$ and $t$. The technical detail of the computational implementation is given as follows.

Because of the strong correlation between the liability and the threshold values, it is desirable to jointly generate $y$ and $t$. To draw $t$ and $y$ jointly from the conditional distribution

$$p(t, y \mid w, l, \lambda, \theta, u) = p(t \mid w, l, \theta, u)\, p(y \mid w, l, \theta, u, t),$$

we first draw the threshold values $t$ from $p(t \mid w, l, \theta, u)$, and then draw the liability values $y$ from $p(y \mid w, l, \theta, u, t)$. Given $l$, $\theta$, $u$, and $t$, the responses $w_1, w_2, \ldots, w_n$ are independent. Therefore the conditional distribution of $t$ is

$$\begin{aligned} p(t \mid w, l, \theta, u) &\propto p(w \mid t, l, \theta, u) = \prod_{i=1}^{n} p(w_i \mid t, l, \theta, u) \\ &= \prod_{i=1}^{n} \sum_{j=1}^{J} 1(w_i = j)\, p(t_{j-1} \le y_i < t_j \mid t, l, \theta, u) \\ &= \prod_{i=1}^{n} \sum_{j=1}^{J} 1(w_i = j) \left[ \Phi\left( \frac{t_j - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right) - \Phi\left( \frac{t_{j-1} - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right) \right], \end{aligned} \tag{6}$$

where $1(w_i = j)$ is an indicator function, equal to 1 if $w_i = j$ and 0 otherwise, and $\Phi(\cdot)$ is the standard normal distribution function. This distribution has a nonstandard form. Therefore the Metropolis-Hastings algorithm is used to generate samples from it. To implement a Metropolis-Hastings algorithm, we first sample new threshold values $t_2^*, t_3^*, \ldots, t_{J-2}^*$ uniformly from the intervals $[\max(0, t_2 - d), \min(t_3, t_2 + d)]$, $[\max(t_2^*, t_3 - d), \min(t_4, t_3 + d)]$, $\ldots$, $[\max(t_{J-3}^*, t_{J-2} - d), \min(1, t_{J-2} + d)]$, respectively, where $d$ is a predetermined tuning parameter. The proposal $t^* = (t_2^*, t_3^*, \ldots, t_{J-2}^*)$ is then accepted with probability $\min\{1, r\}$, where

$$r = \frac{p(w \mid t^*, l, \theta, u)}{p(w \mid t, l, \theta, u)}, \tag{7}$$

where $p(w \mid t^*, l, \theta, u)$ and $p(w \mid t, l, \theta, u)$ are calculated by (6).

Note that $p(y \mid w, l, \theta, u, t) = \prod_{i=1}^{n} p(y_i \mid w_i, l, \theta, u, t)$. Therefore the liability values can be generated individual by individual. Given $w_i$, $l$, $\theta$, $u$, and $t$, the conditional posterior of $y_i$ is a truncated normal distribution, that is,

$$p(y_i \mid w_i, l, \theta, u, t) = \frac{\varphi\left( \mu + \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q),\ \sigma^2 \right)}{\Phi\left( \dfrac{t_j - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right) - \Phi\left( \dfrac{t_{j-1} - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right)}, \quad t_{j-1} \le y_i < t_j, \quad \text{if } w_i = j, \tag{8}$$

where $\varphi(x, \sigma^2)$ stands for the normal density with mean $x$ and variance $\sigma^2$. We adopt the inverse transformation method to sample from the doubly truncated normal distribution (8) (see Devroye, 1986). With this method, we first simulate $U$ uniformly from the interval $[0, 1]$. Then the draw from the truncated normal (8) is

$$y_i = \mu + \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q) + \sigma\, \Phi^{-1}\left[ p_1 + U (p_2 - p_1) \right],$$

where $\Phi^{-1}$ is the inverse c.d.f. of the standard normal distribution, and

$$p_1 = \Phi\left( \frac{t_j - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right), \qquad p_2 = \Phi\left( \frac{t_{j-1} - \mu - \sum_{q=1}^{l} (u_{iq} a_q + v_{iq} d_q)}{\sigma} \right).$$

SIMULATION STUDIES

The applicability of the proposed method was demonstrated by analyzing two sets of simulated data. The experimental populations were F2 populations of 400 individuals each. One chromosome of length 200 cM was simulated, and 41 codominant markers were placed on the simulated chromosome with a distance of 5 cM between consecutive markers. To evaluate the ability of our method to handle missing marker genotypes, we randomly deleted 10% of marker genotypes from the complete simulated sets. The data sets with missing marker genotypes were analyzed. For each data set, we simulated five QTL to control a normally distributed liability, which controlled the expression of an observed discrete phenotype. The locations and additive and dominance effects of the simulated QTL are given in Table I. The overall mean and the residual variance were set at $\mu = 0.5$ and $\sigma_e^2 = 1$, respectively. Given


Table I. Locations, Effects, and Heritabilities of Simulated QTL

QTL                 1       2       3       4       5
Location (cM)       24      42      116     137     163
Additive (a_q)      0.64    0.54    0.00    0.54    −0.64
Dominance (d_q)     0.00    0.54    0.90    −0.54   0.00
Heritability        0.10    0.10    0.10    0.10    0.10

this setup, each QTL explains 10% of the variance in liability (Table I). For each individual, the observed ordinal response took one of five ordered categories in data set I and one of four ordered categories in data set II. The observed proportion of each category and the threshold values are given in Table II. We adopted the reparameterization technique to transform the threshold model into the standardized threshold model. For comparison, the first three categories were combined into one category, resulting in two additional data sets with three and two categories for the two simulated data sets, respectively. Therefore, for each F2 population, we analyzed two data sets: the one with the original categories (analysis I) and the one with the combined categories (analysis II). In the standardized threshold model, the overall mean, residual variance, additive and dominance effects of QTL, and threshold values are given in Table III.

For all analyses, the MCMC chains were started with no QTL in the model. The MCMC sampler was run for $5 \times 10^5$ cycles after discarding the first 2000 cycles for the burn-in period. The chain was thinned (one iteration saved in every 50 cycles) to reduce serial correlation in the stored samples, so that the total number of samples kept for the posterior analysis was $2 \times 10^4$ (Gelman et al., 1995). The stored sample was used to infer the parameters of interest. The prior for the overall mean was $N(0, 10)$. We took $p(\sigma^2) \propto 1/\sigma^2$ as the prior for the residual variance, which is uniform on $\log \sigma$ and corresponds to an inverted chi-square distribution with the hyperparameter value $\nu_0 = 0$ (Gelman et al., 1995). The priors for all additive and dominance effects were chosen to be $N(0, 1.5)$. The prior for the number of QTL was Uniform(0, 10). Finally, the priors for $\lambda$ and $t$ were taken to be uniform as described earlier.

Plots of the change in the number of QTL against the number of the iterations for all analyses were drawn

using the posterior sample of the number of QTL (not shown here). These plots showed that the distribution of the number of QTL appeared to reach its stationary state quickly and that the MCMC algorithm mixed well over the number of QTL, changing frequently but remaining centered around the posterior mode. In all the runs, the number of QTL never reached 10. The estimated marginal posterior probability distributions for the number of QTL are given in Table IV. The marginal posterior probability, $p(l \mid w, M)$, was obtained by counting the number of samples in which the number of QTL is $l$, divided by the total number of samples. For all analyses, the modes of the marginal posterior distributions for the number of QTL were five, which coincides with the simulated value, and the posterior means were approximately equal to the true value. In analysis I for both data sets, the posterior probabilities that the number of QTL equals the true value were much higher than those of other values. Therefore the number of QTL was estimated accurately using our method. The posterior probabilities of $l = 5$ in analysis II are lower than those in analysis I for both data sets, indicating that combining some categories reduces the accuracy of QTL number estimation.

Inferences for the locations, additive effects, dominance effects, and heritabilities were obtained conditional upon the estimated value of the number of QTL, that is, $l = 5$. The posterior distributions of the QTL locations were depicted by plotting the frequency of hits by the QTL in a short interval (1 cM) against the genome location of the interval. An alternative way to summarize the locations is the QTL intensity function, in which all posterior samples are used to infer the locations (Sillanpää and Arjas, 1998). For all four analyses, these two methods gave similar results (not shown here). The posterior distributions of the locations of five QTL are depicted in Figs. 2a to 5a for the four analyses, respectively. From these posterior distributions, we can obtain the probability that a given chromosomal region contains a QTL. In Fig. 2a, for example, the probability that the region from 18 to 27 cM contains a QTL was approximately 0.96. Each posterior distribution has five obvious peaks around the true locations of the five simulated QTL, demonstrating that the estimated positions of the

Table II. Threshold Values and Proportion of Each Category

Category                       1            2            3            4           5
Data I    Proportion           0.13         0.23         0.18         0.21        0.26
          Threshold value      t1 = −1      t2 = 0.1     t3 = 0.7     t4 = 1.5
Data II   Proportion           0.068        0.127        0.113        0.693
          Threshold value      t1 = −1.5    t2 = −0.5    t3 = 0.1

Table III. Additive and Dominance Effects of QTL, Overall Mean, and Residual Variance in the Reparameterized Threshold Model

                                          QTL 1    QTL 2    QTL 3    QTL 4    QTL 5    Overall mean    Residual variance
Data I    Analysis I    Additive (a_q)    0.25     0.22     0.00     0.22     −0.25    0.60            0.16
                        Dominance (d_q)   0.00     0.22     0.36     −0.22    0.00
          Analysis II   Additive (a_q)    0.80     0.67     0.00     0.67     −0.80    −0.25           1.56
                        Dominance (d_q)   0.00     0.67     1.13     −0.67    0.00
Data II   Analysis I    Additive (a_q)    0.40     0.34     0.00     0.34     −0.40    1.25            0.38
                        Dominance (d_q)   0.00     0.34     0.56     −0.34    0.00
          Analysis II   Additive (a_q)    0.64     0.54     0.00     0.54     −0.64    0.40            1.00
                        Dominance (d_q)   0.00     0.54     0.90     −0.54    0.00

identified QTL are close to the true locations. The modes of these posterior distributions are given in Table V. Comparing the shapes of the profiles of the locations, we can see that the variances of the locations in analysis II are larger than those in analysis I, and thus analysis II is inferior to analysis I; this is expected because combining some categories does lose some information.

For each effect and heritability, we calculated the average value for each short interval (1 cM). We then plotted the average effect and heritability against the chromosomal positions, forming a profile for each effect and heritability (see Figs. 2b to 5b). The estimates of QTL effects and heritability are reliable only in chromosomal regions with sufficiently high posterior density of the positions of QTL. The estimates of the additive effects, dominance effects, and heritabilities at the modes of the locations of QTL are shown in Table V. It can be seen that in most cases the estimates of the effects and heritabilities of QTL are reliable. The overall means (the standard errors) were estimated as 0.637 (0.001), −0.112 (0.029), 1.285 (0.003), and 0.439 (0.008) for the four analyses, respectively, and the residual variances (the standard errors) were 0.213 (0.007), 1.866 (0.154), and 0.395 (0.003) for the first three analyses, respectively (note that in the fourth analysis, the analysis of binary data, the residual variance is assumed to be 1). Obviously, the estimates of

these two parameters in analysis I are more accurate than the corresponding estimates in analysis II. In analysis II for both data sets, there are no unknown threshold values (see models 2 and 4). In analysis I for data I, the estimates of the two unknown threshold values (the standard errors) were 0.447 (0.001) and 0.679 (0.001). In analysis I for data II, the estimate of the unknown threshold value (the standard error) was 0.643 (0.001).

DISCUSSION

Genetic studies often yield phenotypic data that are recorded according to a well-defined ordinal scale. Because of their discrete nature, statistical analysis for such data is usually conducted using quite different methods from those for continuously distributed traits. In this study, we adopted the probit model to analyze ordinal traits under the Bayesian framework. The main point of the method is that by introducing an underlying variable (liability) into the problem, the probit model on the ordinal trait is linked to the normal linear model on the continuous liability and a set of unknown threshold values. Therefore we are able to turn the problem of discrete traits into a missing value problem in continuous traits. The missing values, however, are easily augmented with the MCMC algorithm. With the generated values of the liability, the MCMC algorithms for the number, locations, and genetic effects of

Table IV. Estimates of the Posterior Distribution of the QTL Number and Its Expectation

                         Estimated distribution, for l =
                         0    1      2      3      4      5      6      7      8      9      10    Expectation
Data I    Analysis I     0    0      0.012  0.010  0.166  0.760  0.049  0.002  0      0      0     4.931
          Analysis II    0    0      0.005  0.017  0.147  0.509  0.248  0.063  0.010  0.002  0     5.215
Data II   Analysis I     0    0      0      0      0.049  0.807  0.131  0.012  0.001  0      0     5.106
          Analysis II    0    0.001  0.026  0.042  0.083  0.565  0.232  0.045  0.007  0      0     5.098


Fig. 2. (a) Posterior distribution of QTL locations and (b) profiles of additive, dominance effects and heritabilities of QTL for analysis I of data I.

QTL are the same as those for normally distributed traits. The proposed Bayesian method estimates the number, positions, and genetic effects of QTL simultaneously, and thus avoids the problems that can arise

from misspecification of the number of QTL. Inferences about particular parameters of interest are obtained conditional on the observed data but not on particular values of any of the other unknowns. Therefore our


Fig. 3. (a) Posterior distribution of QTL locations and (b) profiles of additive, dominance effects and heritabilities of QTL for analysis II of data I.

Bayesian method can provide more robust inferences than non-Bayesian methods. In addition to point estimates, posterior confidence intervals and variance estimates for any parameter of interest can be obtained from the marginal posterior distribution. In this way, we can avoid the difficult problems concerning the critical value for testing multiple QTL hypotheses.
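Summaries of this kind are computed directly from the stored MCMC draws. The sketch below shows one simple way to estimate the posterior distribution of the number of QTL by counting, and an equal-tailed interval from order statistics; the function names and the short sample list are illustrative assumptions, not results from the paper.

```python
from collections import Counter


def posterior_of_count(l_samples):
    """Estimate p(l | data) as the fraction of stored draws with each value."""
    counts = Counter(l_samples)
    n = len(l_samples)
    return {l: counts[l] / n for l in sorted(counts)}


def equal_tailed_interval(samples, level=0.95):
    """Equal-tailed posterior interval using a simple order-statistic rule."""
    ordered = sorted(samples)
    n = len(ordered)
    lo = int((1.0 - level) / 2.0 * n)
    hi = int((1.0 + level) / 2.0 * n) - 1
    return ordered[lo], ordered[hi]


# Illustrative draws only; a real run would use the thinned MCMC output.
l_samples = [5, 5, 4, 5, 6, 5, 5, 4, 5, 5]
probs = posterior_of_count(l_samples)
post_mean = sum(l * p for l, p in probs.items())
interval = equal_tailed_interval(list(range(100)))
```

The same two helpers apply to any scalar unknown, for example the draws of an additive effect or of a threshold, since every marginal summary is a function of the stored samples.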

Finally, the Bayesian method can incorporate biologically meaningful prior information, which can improve statistical power (Gelman et al., 1995). The idea of our method is similar to that of Yi and Xu (2000a) for mapping multiple QTL for binary traits. Compared with mapping multiple QTL for binary traits, however, the MCMC algorithm for ordinal traits is


Fig. 4. (a) Posterior distribution of QTL locations and (b) profiles of additive, dominance effects and heritabilities of QTL for analysis I of data II.

more challenging because we face an additional sampling problem, that is, generating threshold values. To improve the efficiency of the MCMC algorithm, we adopted a reparameterized threshold model and a joint

sampling scheme to draw the liability and the threshold values. Because of the strong correlation between the liability and the threshold values, the joint sampling scheme could achieve faster convergence and better


Fig. 5. (a) Posterior distribution of QTL locations and (b) profiles of additive, dominance effects and heritabilities of QTL for analysis II of data II.

mixing behavior than the primary MCMC algorithm considered in Albert and Chib (1993) (Chen and Dey, 2000). The reparameterized threshold model has several attractive features. First, the model has only J − 3

unknown threshold values. When J = 3, therefore, there are no unknown thresholds. Second, the number of the threshold values is reduced by 1 at the expense of adding a model parameter, that is, the residual

Table V. Modes of the Posterior Distributions of Locations, Estimated Additive and Dominance Effects, and Heritabilities

                                            QTL 1            QTL 2           QTL 3            QTL 4              QTL 5
Data I    Analysis I    Mode of location    22 (24)          41 (42)         115 (116)        136 (137)          163 (163)
                        Additive (a_q)      0.320 (0.25)     0.203 (0.22)    0.004 (0.00)     0.106 (0.22)       −0.197 (−0.25)
                        Dominance (d_q)     0.042 (0.00)     0.244 (0.22)    0.319 (0.36)     −0.320 (−0.22)     0.041 (0.00)
                        Heritability        0.144 (0.10)     0.101 (0.10)    0.083 (0.10)     0.085 (0.10)       0.077 (0.10)
          Analysis II   Mode of location    22 (24)          41 (42)         114 (116)        138 (137)          164 (163)
                        Additive (a_q)      0.702 (0.80)     0.635 (0.67)    0.030 (0.00)     0.111 (0.67)       −0.566 (−0.80)
                        Dominance (d_q)     −0.122 (0.00)    0.765 (0.67)    0.893 (1.13)     −0.756 (−0.67)     −0.016 (0.00)
                        Heritability        0.092 (0.10)     0.122 (0.10)    0.072 (0.10)     0.058 (0.10)       0.059 (0.10)
Data II   Analysis I    Mode of location    23 (24)          41 (42)         116 (116)        136 (137)          162 (163)
                        Additive (a_q)      0.337 (0.40)     0.423 (0.34)    −0.012 (0.00)    0.449 (0.34)       −0.458 (−0.40)
                        Dominance (d_q)     −0.035 (0.00)    0.315 (0.34)    0.553 (0.56)     −0.314 (−0.34)     0.025 (0.00)
                        Heritability        0.071 (0.10)     0.137 (0.10)    0.092 (0.10)     0.146 (0.10)       0.123 (0.10)
          Analysis II   Mode of location    21 (24)          42 (42)         117 (116)        136 (137)          162 (163)
                        Additive (a_q)      0.517 (0.64)     0.703 (0.54)    0.016 (0.00)     0.533 (0.54)       −0.531 (−0.64)
                        Dominance (d_q)     0.193 (0.00)     0.381 (0.54)    0.727 (0.90)     −0.433 (−0.54)     0.044 (0.00)
                        Heritability        0.085 (0.10)     0.158 (0.10)    0.077 (0.10)     0.104 (0.10)       0.078 (0.10)

Note: The true values are given in parentheses.

variance σ². However, the posterior distribution of σ² is a standard distribution, that is, an inverse χ² distribution, and thus is easily sampled. Finally, all unknown threshold values are between 0 and 1, that is, 0 ≤ t_j < 1 for j = 2, 3, . . . , J − 2, and thus it may be easier to generate these thresholds by using the MCMC algorithm.

The proposed approach may be especially attractive for multiple ordinal traits or multiple continuous and ordinal traits. Correlated ordinal and continuous traits are encountered in many QTL studies. Joint analysis of multivariate traits can usually improve statistical power in the detection of QTL and can provide formal procedures to investigate genetic mechanisms such as pleiotropy and close linkage (Jiang and Zeng, 1995; Williams et al., 1999). Because calculating the likelihood for correlated ordinal and continuous traits is difficult, statistical analysis based on maximum likelihood methods can be intractable. Recently, however, several Bayesian statistical methods for analyzing correlated ordinal data have been developed (Chen and Dey, 2000; Chib and Greenberg, 1998). These statistical methods will be incorporated into our mapping procedure.

There are several possible future research projects for mapping multiple QTL for ordinal traits. In this study, we have assumed that the underlying liability is a normally distributed variable, which makes the threshold model statistically equivalent to the probit model. Because the liability is a hypothetical variable, the interpretation of categorical data with a normal liability can be delicate. The normal assumption for the liability may be robust to departures from normality under most situations (Tan et al., 1999). To fit a wider range of data, however, Albert and Chib (1993) and Chen and Dey (2000) proposed using t distributions to model the underlying liability. Future research may be warranted to incorporate t distributions into our procedure and to compare these approaches.

Recently, there has been accumulating evidence that complex interactions among multiple genes play an important role in the genetic control and evolution of complex traits. It is therefore important to include epistatic effects in the multiple threshold model and to develop MCMC algorithms to identify epistatic QTL for ordinal traits. Yi and Xu (2002) developed a reversible jump MCMC algorithm to map epistatic QTL for complex binary traits. With the algorithms proposed herein for generating the liability and the threshold values, the method of Yi and Xu (2002) could easily be extended to ordinal traits.

With respect to human genetic studies, the threshold model based on the variance component approach has been used to map QTL for binary traits using maximum likelihood methods (Duggirala et al., 1997; Williams et al., 1999) and within a Bayesian framework (Yi and Xu, 2000b). The methods developed herein for ordinal traits will be extended to human genetic research in the future.

ACKNOWLEDGMENTS

This work was supported by NIH Grants R01ES009912, P41RR006009, R01DK054298, and P30DK56336 to D. B. A.
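Two features of the standardized threshold model noted in the Discussion can be illustrated with a short sketch. The Python code below is not the authors' implementation. It assumes that the two outer finite thresholds are fixed at 0 and 1 (as the bounds on the free thresholds suggest) and that σ² has a scaled inverse-χ² prior with hypothetical hyperparameters `nu0` and `s0_sq`. The function names `liability_bounds` and `draw_residual_variance` are introduced here only for illustration. The first function maps an ordinal category to the interval that constrains its liability; the second performs one Gibbs draw of σ² given the current residuals.

```python
import math
import random


def liability_bounds(y, inner_thresholds):
    """Return the (lower, upper) liability interval for ordinal category y in 1..J.

    Assumes t_0 = -inf, t_1 = 0, t_{J-1} = 1 and t_J = +inf; inner_thresholds
    holds the J - 3 free values t_2 < ... < t_{J-2}, each between 0 and 1.
    """
    t = [-math.inf, 0.0, *inner_thresholds, 1.0, math.inf]
    return t[y - 1], t[y]


def draw_residual_variance(residuals, nu0, s0_sq, rng):
    """One Gibbs draw of sigma^2 from a scaled inverse-chi-square full conditional.

    nu0 and s0_sq are hypothetical prior degrees of freedom and scale.
    """
    nu_n = nu0 + len(residuals)
    scale_sum = nu0 * s0_sq + sum(e * e for e in residuals)
    # A chi-square variate with nu_n degrees of freedom is Gamma(nu_n / 2, scale 2).
    chi2_draw = rng.gammavariate(nu_n / 2.0, 2.0)
    return scale_sum / chi2_draw


rng = random.Random(2004)
residuals = [rng.gauss(0.0, 0.5) for _ in range(200)]  # simulated residuals
sigma_sq = draw_residual_variance(residuals, nu0=1.0, s0_sq=1.0, rng=rng)

print(liability_bounds(2, []))          # J = 3, no free thresholds: (0.0, 1.0)
print(liability_bounds(3, [0.3, 0.6]))  # J = 5, two free thresholds: (0.3, 0.6)
```

With J = 3 the list of free thresholds is empty, which matches the statement that no thresholds need to be sampled in that case; the only extra unknown is σ², and it is drawn in a single line.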

REFERENCES

Albert, J. H., and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Assoc. 88:669–679.
Chen, M. H., and Dey, D. K. (2000). Bayesian analysis for correlated ordinal data models. In Dey, D. K., Ghosh, S. K., and Mallick, B. K. (eds.), Generalized Linear Models: A Bayesian Perspective. New York: Marcel Dekker.
Chib, S., and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika 85:347–361.
Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: Springer Verlag.
Duggirala, R., Williams, J. T., Williams-Blangero, S., and Blangero, J. (1997). A variance component approach to dichotomous trait linkage analysis using a threshold model. Genet. Epidemiol. 14:987–992.
Falconer, D. S., and Mackay, T. F. C. (1996). Introduction to Quantitative Genetics (4th ed.). London: Longman.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). Bayesian Data Analysis. London: Chapman & Hall.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.
Hackett, C. A., and Weller, J. I. (1995). Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics 51:1252–1263.
Haley, C. S., and Knott, S. A. (1992). A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:315–324.
Harville, D. A., and Mee, R. W. (1984). A mixed-model procedure for analyzing ordered categorical data. Biometrics 40:393–408.
Heath, S. C. (1997). Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. Am. J. Hum. Genet. 61:748–760.
Jiang, C., and Zeng, Z.-B. (1995). Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140:1111–1127.
Kao, C. H., Zeng, Z.-B., and Teasdale, R. D. (1999). Multiple interval mapping for quantitative trait loci. Genetics 152:1203–1216.
Lynch, M., and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sunderland, MA: Sinauer Associates.
Martinez, O., and Curnow, R. N. (1992). Estimating the location and the sizes of the effects of quantitative trait loci using flanking markers. Theor. Appl. Genet. 85:480–488.
Rao, S., and Xu, S. (1998). Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity 81:214–224.
Rebai, A. (1997). Comparison of methods for regression interval mapping in QTL analysis with non-normal traits. Genet. Res. 69:69–74.
Satagopan, J. M., and Yandell, B. S. (1996). Estimating the number of quantitative trait loci via Bayesian model determination. Special Contributed Paper Session on Genetic Analysis of Quantitative Traits and Complex Diseases. Biometric Section, Joint Statistical Meeting, Chicago, IL.
Sillanpää, M. J., and Arjas, E. (1998). Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148:1373–1388.
Sorensen, D. A., Andersen, S., Gianola, D., and Korsgaard, I. (1995). Bayesian inference in threshold models using Gibbs sampling. Genet. Sel. Evol. 27:229–249.
Stephens, D. A., and Fisch, R. D. (1998). Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics 54:1334–1347.
Tan, M., Qu, Y., and Rao, J. S. (1999). Robustness of the latent variable model for correlated binary data. Biometrics 55:258–263.
Uimari, P., and Sillanpää, M. J. (2001). Bayesian oligogenic analysis of quantitative and qualitative traits in general pedigrees. Genet. Epidemiol. 21:224–242.
Visscher, P. M., Haley, C. S., and Knott, S. A. (1996). Mapping QTL for binary traits in backcross and F2 populations. Genet. Res. 68:55–63.
Williams, J. T., Van Eerdewegh, P., Almasy, L., and Blangero, J. (1999). Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results. Am. J. Hum. Genet. 65:1134–1147.
Wright, S. (1934). An analysis of variability in number of digits in an inbred strain of guinea pigs. Genetics 19:506–536.
Xu, S., and Atchley, W. R. (1996). Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics 143:1417–1424.
Xu, S., and Yi, N. (2000). Mixed model analysis of quantitative trait loci. Proc. Natl. Acad. Sci. USA 97:14542–14547.
Yi, N., and Xu, S. (1999a). Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity 82:668–676.
Yi, N., and Xu, S. (1999b). A random model approach to mapping quantitative trait loci for complex binary traits in outbred populations. Genetics 153:1029–1040.
Yi, N., and Xu, S. (2000a). Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155:1391–1403.
Yi, N., and Xu, S. (2000b). Bayesian mapping of quantitative trait loci under the identity-by-descent-based variance component model. Genetics 156:411–422.
Yi, N., and Xu, S. (2002). Mapping quantitative trait loci with epistatic effects. Genet. Res. 79:185–198.
Zeng, Z.-B., Kao, C., and Basten, C. J. (2000). Estimating the genetic architecture of quantitative traits. Genet. Res. 74:279–289.

Edited by Pak Sham