
Acta Genetica Sinica, October 2006, 33 (10):861–869

ISSN 0379-4172

Shrinkage Estimation Method for Mapping Multiple Quantitative Trait Loci

ZHANG Yuan-Ming

Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement/Chinese National Center for Soybean Improvement, Nanjing Agricultural University, Nanjing 210095, China

Abstract: This article reviews shrinkage estimation methods for multiple-marker analysis and for mapping multiple quantitative trait loci (QTL). For multiple-marker analysis, Xu (Genetics, 2003, 163:789-801) developed a Bayesian shrinkage estimation (BSE) method. The key to the success of this method is to allow each marker effect to have its own variance parameter, which in turn has its own prior distribution so that the variance can be estimated from the data. Under this hierarchical model, a large number of markers can be handled although most of them may have negligible effects. Under an epistatic genetic model, however, the running time is very long. To overcome this problem, a novel method that incorporates the idea described above into maximum likelihood, known as the penalized likelihood method, was proposed. A simulation study showed that this method can handle a model in which the number of effects is ten times larger than the sample size. For multiple QTL analysis, two modified versions of the BSE method were introduced: one is the fixed-interval method and the other is the variable-interval method. The former deals with markers of intermediate density, and the latter can handle markers of extremely high density as well as models with epistatic effects. For the detection of epistatic effects, the penalized likelihood method and the variable-interval approach of the BSE method are available.

Key words: Bayesian analysis; epistasis; multiple QTL model; quantitative trait locus; shrinkage estimation

A quantitative trait is defined as a trait whose value varies across individuals in degree rather than in category, and its phenotype is determined by multiple Mendelian loci and environmental factors. These loci are quantitative trait loci (QTL). In modern quantitative genetics, phenotypic values, pedigree, and marker information are used to infer the number of QTL and to estimate the positions and the effects of QTL, which is referred to as QTL mapping. A major advance in QTL mapping was the introduction of interval mapping (IM) for populations derived from a cross between inbred lines by Lander and Botstein [1]. This enables the location and the effect of a QTL to be estimated. Since then, many methods have been developed, including regression-based analysis, maximum likelihood (ML) analysis, and Bayesian analysis. The use of ML has been advocated for the detection of QTL through linkage with molecular markers, and this approach is very effective and feasible [2], e.g., composite-interval mapping [3-5] (CIM) and multiple-interval mapping [6,7] (MIM). The regression-based method is similar to the ML method but is more tractable and versatile in certain circumstances because it retains many beneficial features of regression analysis [2]. Recently, Knott [2] reviewed the regression-based methods of mapping QTL in structured outbred populations. Compared with the two methods described above, Bayesian analysis is more difficult because the number of QTL is treated as a variable. The reversible-jump Markov chain Monte Carlo algorithm can therefore be used to obtain the posterior distributions of all the parameters. Once the chain has converged, the estimates of the number, the positions, and the effects of QTL can be obtained.

Received: 2006-04-28; Accepted: 2006-07-12
This work was supported by the National Natural Sciences Foundation of China (No. 30470998), Natural Science Foundation of Jiangsu Province (No. BK2005087), NCET (No. NCET-05-0489), PCSIRT and the Talent Foundation of NAU.
① Corresponding author. E-mail: [email protected]; Tel: +86-25-8439 9091; Fax: +86-25-8439 9091

The main shortcoming of the Bayesian analysis is slow convergence, and the major difficulty lies in the jump associated with the number of QTL.

For complex traits controlled by multiple QTL, it is necessary to use a multiple QTL model that estimates all the QTL effects simultaneously. Such a model provides increased sensitivity and better separation of linked QTL. It also provides an opportunity to detect interaction between QTL, known as epistasis [8,9]. Therefore, Zeng [3,4] and Jansen [5] independently proposed the idea of combining IM with multiple-regression analysis to map multiple QTL, which is called CIM. The key problem of CIM is how to choose suitable markers as cofactors, because too many nuisance markers will decrease the power of QTL detection [10]. In addition, CIM is still implemented using a single QTL model because only one QTL is tested at a time. It is not a true multiple QTL analysis. A true multiple QTL model should include all the QTL in a single model.

Recently, several methods for the multiple QTL model have been developed. Kao and Zeng [6] and Kao et al. [7] developed a multiple-interval mapping method. Broman and Speed [10] proposed a multiple-regression model to tackle the problem of multiple QTL analysis from the least squares point of view. Meanwhile, Bayesian analysis has become an important approach for mapping multiple QTL [11-19].

Under a multiple QTL model, especially an epistatic genetic model, the number of model effects becomes huge, and most of them are small or zero. It is impossible to estimate all the effects simultaneously without resorting to special statistical techniques. Model selection is one technique that is appropriate for this purpose. Ridge regression [20,21] might be another option. However, Xu [22] found that ridge regression works only if the number of model effects is of the same order as the number of observations. Based on the idea of Meuwissen et al. [23], Xu [22] modified ridge regression by allowing the ridge factor to vary across different model effects. This turned out to be equivalent to a Bayesian analysis in which different model effects take different prior distributions. In other words, it allows small effects to have a small weight so that their inclusion has a negligible influence on the final results. It is, in fact, a Bayesian shrinkage estimation (BSE) method.

Based on the idea of Xu [22], further investigations have been carried out. In this review, various modified versions are introduced. Section 1 reviews the approaches of all-marker analysis. Section 2 describes the methods that allow QTL located between markers to be included in the model, so that the positions and the effects of QTL are estimated simultaneously. This idea was first proposed by Zhang and Xu [9,24]. Recently, this approach was described in detail by Wang et al. [25]. Section 3 deals with the methods of estimating epistatic effects [8,9]. All the methods are discussed based on the backcross (BC) design derived from a cross between two inbred lines. Extension to other mating designs is a matter of increasing the model dimension and requires very little novel technique.
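The contrast drawn in the introduction between ordinary ridge regression (one ridge factor for all effects) and ridge regression with effect-specific ridge factors can be made concrete with a small numerical sketch. The function name `generalized_ridge` and the toy data are illustrative assumptions, not taken from the cited papers; the population mean is ignored for brevity.

```python
import numpy as np

def generalized_ridge(X, y, ridge_factors):
    """Solve (X'X + diag(k)) b = X'y.

    ridge_factors may be a scalar (ordinary ridge regression) or a vector
    with one ridge factor per effect (effect-specific shrinkage).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = np.broadcast_to(np.asarray(ridge_factors, dtype=float), (X.shape[1],))
    return np.linalg.solve(X.T @ X + np.diag(k), X.T @ y)

# Toy data (hypothetical): two orthogonal +1/-1 coded markers, only the first has an effect.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = X @ np.array([2.0, 0.0])

common = generalized_ridge(X, y, 4.0)              # the same penalty for every effect
specific = generalized_ridge(X, y, [0.0, 100.0])   # penalize only the null effect
```

With a common ridge factor the large effect is shrunk together with the null effect, whereas effect-specific factors can leave the large effect essentially untouched, which is the behaviour the shrinkage methods reviewed here aim for.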

1 Multiple-Marker Analysis

Marker analysis is a kind of association study in which each marker is considered a candidate for QTL. If the marker map is sufficiently dense, most QTL are potentially detectable because of tight linkage to markers.

1.1 The Bayesian regression analysis

The Bayesian regression method for all-marker analysis proposed by Xu [22] can simultaneously evaluate the effects of all markers on the entire genome. The key point is that each marker effect is assigned a normal prior distribution with mean zero and a unique variance, and the effect-specific prior variance is further assigned a prior distribution, so that the variance can be estimated from the data. Let $y_i$ for $i = 1, \ldots, n$ be the phenotypic value of the $i$th individual in a BC population. The linear model for $y_i$ can be expressed as

$$y_i = b_0 + \sum_{j=1}^{p} x_{ij} b_j + e_i \tag{1}$$

where $b_0$ is the population mean, $p$ is the total number of markers on the entire genome, $x_{ij}$ is a dummy variable indicating the genotype of the $j$th marker for individual $i$, $b_j$ is the QTL effect associated with marker $j$, and $e_i$ is the residual error with a $N(0, \sigma^2)$ distribution. For a BC population, an individual can take only one of two genotypes, $A_1A_2$ and $A_2A_2$, at any locus. The dummy variable is defined as $x_{ij} = 1$ for $A_1A_2$ and $x_{ij} = -1$ for $A_2A_2$. The genetic effect $b_j$ is actually the difference between the genetic effects associated with $A_1A_2$ and $A_2A_2$.

In the Bayesian framework, we consider everything as a random variable, including the parameters. We classify the variables into observables and unobservables. The observables include $\mathbf{y} = \{y_i\}$ for $i = 1, \ldots, n$ and the marker information. The unobservables include $\mathbf{b} = \{b_j\}_{j=0}^{p}$ and $\mathbf{v} = \{\sigma^2, \sigma_1^2, \ldots, \sigma_p^2\}$. Each random variable has a distribution. The distribution of the observables is a function of the unobservables and is called the likelihood function, given by

$$p(\mathbf{y} \mid \mathbf{b}, \mathbf{v}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{b}, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} x_{ij} b_j \Big)^2 \right\} \tag{2}$$

The distribution of the unobservables is called the prior distribution. The purpose of Bayesian analysis is to infer the posterior distribution of the parameters given the observed data. From the joint posterior sample, one can easily obtain the desired Bayesian estimates, such as the posterior means and posterior variances. Xu [22] chose the following prior distributions: $p(\sigma^2) \propto 1/\sigma^2$, $p(b_0) \propto 1$, $p(b_j) = N(0, \sigma_j^2)$, and $p(\sigma_j^2) \propto 1/\sigma_j^2$ for $j = 1, \ldots, p$. The joint prior distribution of the unobservables, $p(\mathbf{b}, \mathbf{v})$, is the product of the prior distributions of the individual parameters. So the joint posterior distribution has the form

$$p(\mathbf{b}, \mathbf{v} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{b}, \mathbf{v})\, p(\mathbf{b}, \mathbf{v}) \tag{3}$$

In the MCMC-implemented Bayesian analysis, the unobservables are sampled from the above joint posterior distribution. Sampling is performed in the following sequence.

Step 1. Initialization: All unobservables, denoted by $Q^{(0)} = \{b_0^{(0)}, \ldots, b_p^{(0)}, \sigma^{2(0)}, \sigma_1^{2(0)}, \ldots, \sigma_p^{2(0)}\}$, are initialized. The location parameters $\mathbf{b}$ are initialized with a zero value and the scale parameters $\mathbf{v}$ are initialized with a small positive number, e.g., one.

Step 2. Updating $b_0$: The conditional posterior distribution of $b_0$ is a $N(\bar{b}_0, s_0^2)$ distribution, from which $b_0$ is sampled, where

$$\bar{b}_0 = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} b_j^{(0)} \Big)$$

and $s_0^2 = \sigma^{2(0)} / n$. The newly sampled $b_0$ is denoted by $b_0^{(1)}$, which will replace $b_0^{(0)}$ in all subsequent sampling processes.

Step 3. Updating $b_j$ for $j = 1, \ldots, p$: The conditional posterior distribution of $b_j$ is a $N(\bar{b}_j, s_j^2)$ distribution, which is used to sample a new $b_j$, where

$$\bar{b}_j = \Big( \sum_{i=1}^{n} x_{ij}^2 + \sigma^{2(0)} / \sigma_j^{2(0)} \Big)^{-1} \sum_{i=1}^{n} x_{ij} \Big( y_i - b_0^{(0)} - \sum_{k \neq j} x_{ik} b_k^{(0)} \Big) \tag{4}$$

and

$$s_j^2 = \sigma^{2(0)} \Big( \sum_{i=1}^{n} x_{ij}^2 + \sigma^{2(0)} / \sigma_j^{2(0)} \Big)^{-1} \tag{5}$$

The newly sampled $b_j$ is denoted by $b_j^{(1)}$, which will replace $b_j^{(0)}$ in all subsequent sampling processes.

Step 4. Updating $\sigma^2$: The conditional posterior distribution of $\sigma^2$ is a scaled inverted chi-square distribution. It is sampled using

$$\sigma^{2(1)} = \sum_{i=1}^{n} \Big( y_i - b_0^{(0)} - \sum_{j=1}^{p} x_{ij} b_j^{(0)} \Big)^2 \Big/ \chi_n^2 \tag{6}$$

where $\chi_n^2$ is a random number sampled from a chi-square distribution with $n$ degrees of freedom. The variance $\sigma^{2(0)}$ is immediately updated.

Step 5. Updating $\sigma_j^2$ for $j = 1, \ldots, p$: The conditional posterior distribution of $\sigma_j^2$ is also a scaled inverted chi-square distribution. The newly sampled $\sigma_j^{2(1)}$ for $j = 1, \ldots, p$ is $b_j^{2(0)} / \chi_1^2$, where $\chi_1^2$ is a random number sampled from a chi-square distribution with one degree of freedom.

Step 6. Repeat Steps 2 to 5: At this stage, we have completed one cycle of sampling for the MCMC and are ready to continue sampling for the next cycle. When the chain converges to the stationary distribution, the sampled parameters follow the joint posterior distribution. When the sample of a single parameter is viewed on its own, this univariate sample is the marginal posterior sample for that parameter.
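The sampling cycle of Steps 1 to 6 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Xu [22]: the function name `gibbs_shrinkage`, the default chain length and burn-in, and the small numerical floor on $\sigma_j^2$ (added only to avoid division by zero) are all hypothetical choices. Each update uses the most recently sampled values of the other parameters.

```python
import numpy as np

def gibbs_shrinkage(X, y, n_iter=1000, burn_in=200, seed=0, floor=1e-12):
    """Posterior means of marker effects under the hierarchical shrinkage model.

    X: n-by-p matrix coded +1/-1, y: length-n phenotype vector.
    `floor` is a numerical guard that is not part of the published algorithm.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)               # Step 1: location parameters start at zero
    sigma2, sigma2_j = 1.0, np.ones(p)     # Step 1: scale parameters start at one
    xtx = (X ** 2).sum(axis=0)
    resid = y - b0 - X @ b
    kept = np.zeros((n_iter - burn_in, p))
    for it in range(n_iter):
        # Step 2: b0 ~ N(mean of y minus marker effects, sigma2 / n)
        resid += b0
        b0 = rng.normal(resid.mean(), np.sqrt(sigma2 / n))
        resid -= b0
        # Step 3: b_j ~ N(bar_b_j, s_j^2), equations (4) and (5)
        for j in range(p):
            resid += X[:, j] * b[j]        # remove marker j from the fitted part
            denom = xtx[j] + sigma2 / sigma2_j[j]
            b[j] = rng.normal(X[:, j] @ resid / denom, np.sqrt(sigma2 / denom))
            resid -= X[:, j] * b[j]
        # Step 4: residual sum of squares divided by a chi-square(n) draw, equation (6)
        sigma2 = resid @ resid / rng.chisquare(n)
        # Step 5: sigma_j^2 = b_j^2 / chi-square(1) draw
        sigma2_j = np.maximum(b ** 2 / rng.chisquare(1, size=p), floor)
        if it >= burn_in:
            kept[it - burn_in] = b
    return kept.mean(axis=0)
```

Calling `gibbs_shrinkage(X, y)` on a simulated backcross data set with a single large marker effect would be expected to return an estimate close to the true value for that marker and values near zero for the others; convergence diagnostics, which a real analysis needs, are omitted here.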

1.2 Penalized maximum likelihood method

Zhang and Xu [8] proposed a penalized maximum likelihood (PML) method for estimating the marker effects. This method allows spurious effects to be shrunk toward zero, whereas the estimates of large QTL effects are subjected to virtually no shrinkage. The key to the success of this method lies in choosing suitable prior distributions and in incorporating into the PML method the idea of estimating the parameters of the prior distributions from the data [26]. A simulation study showed that this method can handle a model in which the number of effects is ten times larger than the sample size [8].

The penalized likelihood is similar to the posterior distribution of the parameters, with the prior distribution of the parameters serving as the penalty. The difference between the PML method and the Bayesian method is that the parameters in the prior distributions are estimated simultaneously with the parameters of interest by maximizing the penalized likelihood function. In an over-saturated model, the choice of the prior distributions of the parameters is very important. Each marker effect is assigned a normal prior distribution with a unique mean and a unique variance: $b_j \sim N(\mu_j, \sigma_j^2)$. The effect-specific prior mean and variance are further assigned their own prior distributions so that they can be estimated from the data, i.e.,

$$\mu_j \sim N(0, \sigma_j^2 / \eta) \quad \text{and} \quad p(\sigma_j^2) \propto 1 \quad (j = 1, 2, \ldots, p),$$

where $\eta$ is a constant. Therefore, all the parameters may be estimated in the following sequence.

Step 1. Set $\eta > 0$ and provide initial values for all parameters;

Step 2. Update $b_0$ using
$$b_0 = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} b_j \Big);$$

Step 3. Update $b_j$ using
$$b_j = \Big[ \sum_{i=1}^{n} x_{ij}^2 + \sigma^2 / \sigma_j^2 \Big]^{-1} \Big[ \sum_{i=1}^{n} x_{ij} \Big( y_i - b_0 - \sum_{k \neq j} x_{ik} b_k \Big) + \mu_j \sigma^2 / \sigma_j^2 \Big];$$

Step 4. Update $\sigma^2$ using
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} x_{ij} b_j \Big)^2;$$

Step 5. Update $\mu_j$ using $\mu_j = b_j / (\eta + 1)$;

Step 6. Update $\sigma_j^2$ using
$$\sigma_j^2 = \frac{1}{2} \Big[ (b_j - \mu_j)^2 + \eta \mu_j^2 \Big];$$

Step 7. Repeat Steps 2 to 6 until a certain criterion of convergence is satisfied.
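The PML iteration of Steps 1 to 7 can be sketched as follows. This is a hedged illustration rather than the code of Zhang and Xu [8]: the function name `pml_fit`, the starting values, the stopping rule (largest change in any $b_j$ below `tol`), and the small floor on $\sigma_j^2$ are assumptions made here, since the text only asks for "a certain criterion of convergence".

```python
import numpy as np

def pml_fit(X, y, eta=1.0, max_iter=500, tol=1e-8, floor=1e-12):
    """Iterate the PML updates (Steps 2 to 6) until the effects stop changing.

    X: n-by-p matrix coded +1/-1, y: length-n phenotype vector.
    `tol` and `floor` are illustrative choices, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Step 1: eta > 0 is given; starting values below are arbitrary but valid
    b = np.zeros(p)
    mu = np.zeros(p)
    sigma2, sigma2_j = float(y.var()), np.ones(p)
    xtx = (X ** 2).sum(axis=0)
    b0 = float(y.mean())
    for _ in range(max_iter):
        b_old = b.copy()
        b0 = float(np.mean(y - X @ b))                     # Step 2
        resid = y - b0 - X @ b
        for j in range(p):                                 # Step 3
            resid += X[:, j] * b[j]                        # exclude marker j
            ratio = sigma2 / sigma2_j[j]
            b[j] = (X[:, j] @ resid + mu[j] * ratio) / (xtx[j] + ratio)
            resid -= X[:, j] * b[j]
        sigma2 = float(resid @ resid) / n                  # Step 4
        mu = b / (eta + 1.0)                               # Step 5
        sigma2_j = np.maximum(0.5 * ((b - mu) ** 2 + eta * mu ** 2), floor)  # Step 6
        if np.max(np.abs(b - b_old)) < tol:                # Step 7
            break
    return b0, b, sigma2
```

In this sketch a small effect has a small $\sigma_j^2$, which makes the ratio $\sigma^2/\sigma_j^2$ large and pulls the next $b_j$ further toward zero, while a large effect keeps a large $\sigma_j^2$ and is left almost unshrunk; this is the selective shrinkage described in the text.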

2 Multiple QTL Analysis

Although multiple-marker analysis as described in section 1 is not QTL mapping because no QTL positions are estimated, the idea of Xu [22] has been used to map multiple QTL by Zhang and Xu [8, 9, 24]. Recently, Wang et al. [25] described this method in detail; it is known as the BSE method. Wu and Li [27] concluded that this method allows analytical strategies for QTL mapping to be expanded to whole-genome mapping of epistatic QTL by the use of all markers.

2.1 Fixed interval

Provided that each interval bracketed by two markers has a QTL with an unknown position, the above-mentioned Bayesian approach can be used to estimate the position and the effect of this QTL. When the marker density is high, most intervals may contain no QTL. Instead of deleting these intervals from the model, a special shrinkage approach can be used to force the estimated QTL effects of these intervals to be close to zero. This is equivalent to deleting these intervals from the model.

The notation here is the same as that described earlier in the multiple-marker analysis except that $p$ is the number of intervals (or QTL), rather than the number of markers, and $x_{ij}$ is the (unobservable) QTL genotype indicator, rather than the marker genotype. The symbol $b_j$ still denotes the genetic effect of the QTL.

The general principle has been introduced in section 1. It is similar to the Bayesian regression approach for all-marker analysis except that extra steps are needed to update the positions and genotypes of the QTL. The sampling process is performed in the following sequence.

Steps 1-5: Same as Steps 1-5 in section 1.1.

Step 6: Updating the position $\lambda_j$ and genotype $\mathbf{x}_j$ of the $j$th QTL for $j = 1, \ldots, p$. Because the position $\lambda_j$ is highly dependent on $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^T$, we jointly update $\lambda_j$ and $\mathbf{x}_j$ using a one-locus-at-a-time approach conditioned on fixed values of all other loci. First, a new value of the position of the $j$th QTL, denoted by $\lambda_j^*$, is sampled around the existing position $\lambda_j^{(0)}$ from a uniform distribution on $(\lambda_j^{(0)} - \delta, \lambda_j^{(0)} + \delta)$, where $\delta$ is a tuning parameter usually taking a value of 2 cM. The genotypes of the $j$th QTL corresponding to the new position, denoted by $\mathbf{x}_j^*$, are also sampled from the posterior probabilities of the $Q_j q_j$ and $q_j q_j$ genotypes conditioned on marker information, QTL position, and phenotypic values. If the new position is accepted, the current position is replaced by $\lambda_j^*$, and the QTL genotypes are updated according to the posterior probabilities. If the new position is rejected, the current position is kept for the next cycle; however, the genotypes of the QTL located at the old position are still updated.

Step 7: Updating missing marker genotypes. Unless a marker is located at one end of a chromosome, it is bracketed by two QTL, assuming that one QTL exists in every interval. Assume that the $k$th marker of individual $i$ is missing and the marker is bracketed by QTL $j$ and $j+1$. Given the QTL genotypes, the missing marker genotype is sampled from the following probability,

$$p(m_{ik} \mid x_{ij}^{(0)}, x_{i(j+1)}^{(0)}, \lambda_j^{(0)}, \lambda_{j+1}^{(0)}) = \frac{p(x_{ij}^{(0)}, m_{ik}, x_{i(j+1)}^{(0)} \mid \lambda_j^{(0)}, \lambda_{j+1}^{(0)})}{\sum_{h=1}^{2} p(x_{ij}^{(0)}, m_{ik} = h, x_{i(j+1)}^{(0)} \mid \lambda_j^{(0)}, \lambda_{j+1}^{(0)})} \tag{7}$$

where $p(x_{ij}^{(0)}, m_{ik}, x_{i(j+1)}^{(0)} \mid \lambda_j^{(0)}, \lambda_{j+1}^{(0)})$ is obtained from the Markov model. If the $k$th marker is at the end of a chromosome, the missing genotype $m_{ik}$ is sampled from $p(x_{ij}^{(0)}, m_{ik} \mid \lambda_j^{(0)})$, assuming that the $j$th QTL is close to marker $k$.

Step 8: Repeat Steps 2 to 7. At this point, one cycle of the MCMC has been carried out and sampling continues with the next cycle. The sampling process continues until the chain converges to the stationary distribution. The posterior samples are then collected to infer the estimates of all the parameters.
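Equation (7) in Step 7 can be illustrated for a backcross with a short sketch. The text only says that the joint probability comes from "the Markov model"; the code below assumes no crossover interference and converts map distances to recombination fractions with the Haldane map function, which is an assumption made here and not stated in the article. The function names and the +1/-1 genotype coding of the marker are likewise illustrative.

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane function, assumed here)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def transition(a, b, r):
    """P(genotype b at a locus | genotype a at a linked locus) in a backcross."""
    return 1.0 - r if a == b else r

def missing_marker_probs(x_left, x_right, d_left_cm, d_right_cm):
    """Conditional probabilities of a missing marker genotype, as in equation (7).

    x_left, x_right: genotypes (+1 or -1) of the flanking QTL j and j+1.
    d_left_cm, d_right_cm: distances from the marker to each flanking QTL.
    """
    r1, r2 = haldane_r(d_left_cm), haldane_r(d_right_cm)
    # Joint probability of (QTL j, marker, QTL j+1) along the Markov chain;
    # 0.5 is the backcross probability of either genotype at the first locus.
    joint = {m: 0.5 * transition(x_left, m, r1) * transition(m, x_right, r2)
             for m in (1, -1)}
    total = sum(joint.values())          # denominator of equation (7)
    return {m: pj / total for m, pj in joint.items()}

def sample_missing_marker(x_left, x_right, d_left_cm, d_right_cm, rng=random):
    """Draw one missing marker genotype from the conditional distribution."""
    probs = missing_marker_probs(x_left, x_right, d_left_cm, d_right_cm)
    return rng.choices([1, -1], weights=[probs[1], probs[-1]])[0]
```

When both flanking QTL carry the same genotype and lie close to the marker, the missing genotype is almost certainly that same genotype; when the flanking genotypes differ and the marker sits midway between them, the two marker genotypes are equally likely.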

2.2 Variable interval

When the marker density is very high, the fixed-interval approach encounters several problems. One problem is the high collinearity of the QTL genotypes, which may generate unstable estimates of the QTL effects. The other problem is that the model dimension becomes so high that the estimation may take an unreasonably long time to complete, especially when interaction effects are considered. We introduced a modified version of this method to make the number of QTL substantially smaller than the number of intervals. The new method is called the variable-interval approach.

In the fixed-interval approach, the number of QTL is the same as the number of intervals. For a chromosome of $p$ markers, the number of QTL, denoted by $q$, is $p - 1$. In the variable-interval approach, $q$ is less than $p - 1$. We may set the average interval size of a QTL to be $s$, say $s = 40$ cM. For example, if a chromosome is 120 cM long, we set the number of QTL to be $120/40 = 3$. In the variable-interval approach, the positions of the three QTL are disjoint and vary along the chromosome. We developed a mechanism to sample QTL positions. Some intervals may contain more markers and some may contain fewer, but on average an interval should contain at least two markers.

The detailed steps of the sampling process are identical to those of the fixed-interval approach except that the boundaries of the intervals are redefined in each cycle of the sampling. Let $\lambda_j$ be the position of QTL $j$. The interval defining this QTL should be $(\lambda_{j-1}, \lambda_{j+1})$. Therefore, $\lambda_j$ should be sampled within this interval. Once $\lambda_j$ is sampled, it is treated as the lower boundary for sampling $\lambda_{j+1}$. As a result, the intervals vary from one cycle to another. When the Markov chain converges to the stationary distribution, the intervals actually containing QTL are shorter than those containing no QTL.

3 Detection of Epistasis

Epistatic effects are defined as the interaction effects between different QTL. An epistatic genetic model should include potential pair-wise interaction effects of all the loci; therefore, the model is oversaturated. This is the main problem in detecting epistasis. Fortunately, two approaches may be used to overcome this problem. One is the penalized likelihood method for all-marker analysis and the other is the variable-interval approach of Bayesian shrinkage estimation for mapping multiple QTL.

3.1

For all-marker analysis, the epistatic genetic model is

$$y_i = b_0 + \sum_{j=1}^{p} x_{ij} b_j + \sum_{s=1}^{p-1} \sum_{t=s+1}^{p} x_{is} x_{it} b_{st} + \varepsilon_i \tag{8}$$

where the interaction terms are indexed by $s = 1, \ldots, p-1$ and $t = s+1, \ldots, p$. As described in Zhang and Xu [8], model (8) is simplified to

$$y_i = a_0 + \sum_{j=1}^{q} z_{ij} a_j + \varepsilon_i \tag{9}$$

where $a_0 = b_0$; $a_j = b_j$ and $z_{ij} = x_{ij}$ ($j = 1, \ldots, p$); $q = p(p+1)/2$; $a_j = b_{rs}$ and $z_{ij} = x_{ir} x_{is}$ ($r = 1, \ldots, p-1$; $s = r+1, \ldots, p$; $j = p+1, \ldots, q$). Obviously, models (9) and (1) are similar. Therefore, the PML method may be carried out in the same manner.

3.2

The multiple-QTL epistatic model for a BC population is similar to model (8) except that the definitions of the symbols are different from those in section 3.1; however, they are similar to those described in section 2.2, i.e., $b_{st}$ is the interaction effect between the $s$th and $t$th QTL for $s = 1, \ldots, p-1$ and $t = s+1, \ldots, p$; and the value of $p$ is much smaller than the number of marker intervals in the entire genome.

The variable-interval approach implemented with the MCMC algorithm is identical to that described earlier for the main effects, with the exception given below. The likelihood function becomes

$$p(\mathbf{y} \mid \mathbf{b}, \mathbf{v}) \propto (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} x_{ij} b_j - \sum_{s=1}^{p-1} \sum_{t=s+1}^{p} x_{is} x_{it} b_{st} \Big)^2 \right\}$$
