Genetics: Published Articles Ahead of Print, published on April 28, 2006 as 10.1534/genetics.105.049510
Genomic Assisted Prediction of Genetic Value with Semi-parametric Procedures

Daniel Gianola*,†,‡,1, Rohan L. Fernando§ and Alessandra Stella†

*Department of Animal Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706
†Parco Tecnologico Padano, 26900 Lodi, Italy
‡Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, N-1432 Ås, Norway
§Department of Animal Science, Iowa State University, Ames, Iowa 50011

March 5, 2006
Running head: Semi-parametric prediction of genetic value

Key words: quantitative genetics, marker assisted selection, single nucleotide polymorphisms, nonparametric procedures, kernel regression, multivariate analysis, Bayesian methods
1Corresponding author: Daniel Gianola, Department of Animal Sciences, 1675 Observatory Dr., Madison, WI 53706. Phone: (608) 265-2054. Fax: (608) 262-5157. E-mail: [email protected]
ABSTRACT

Semi-parametric procedures for prediction of total genetic value for quantitative traits, which make use of phenotypic and genomic data simultaneously, are presented. The methods focus on the treatment of massive information provided by, e.g., single-nucleotide polymorphisms. It is argued that standard parametric methods for quantitative genetic analysis cannot handle the multiplicity of potential interactions arising in models with, e.g., hundreds of thousands of markers, and that most of the assumptions required for an orthogonal decomposition of variance are violated in artificial and natural populations. This makes non-parametric procedures attractive. Kernel regression and reproducing kernel Hilbert spaces regression procedures are embedded into standard mixed effects linear models, retaining additive genetic effects under multivariate normality for operational reasons. Inferential procedures are presented, and some extensions are suggested. An example is presented, illustrating the potential of the methodology. Implementations can be carried out after modification of standard software developed by animal breeders for likelihood-based or Bayesian analysis.
Massive quantities of genomic data are now available, with potential for enhancing accuracy of prediction of genetic value of, e.g., candidates for selection in animal and plant breeding programs, or for molecular classification of disease status in subjects (GOLUB et al. 1999). For instance, WONG et al. (2004) reported a genetic variation map of the chicken genome containing 2.8 million single-nucleotide polymorphisms (SNPs), and demonstrated how the information can be used for targeting specific genomic regions. Likewise, HAYES et al. (2004) found 2,507 putative SNPs in the salmon genome that could be valuable for marker assisted selection in this species. The use of molecular markers as aids in genetic selection programs has been discussed extensively. Important early papers are SOLLER and BECKMANN (1982) and FERNANDO and GROSSMAN (1989), with the latter focusing on best linear unbiased prediction of genetic value when marker information is used. Most of the literature on marker assisted selection deals with the problem of locating one or a few quantitative trait loci (QTL) using flanking markers. However, in the light of current knowledge about genomics, the widely used single-QTL search approach is naive, since there
is evidence of abundant QTLs affecting complex traits, as discussed, e.g., by DEKKERS and HOSPITAL (2002). This would support the infinitesimal model of FISHER (1918) as a sensible statistical specification for many quantitative traits, with complications being the accommodation of non-additivity and of feedbacks (GIANOLA and SORENSEN 2004). DEKKERS and HOSPITAL (2002) observe that existing statistical methods for marker assisted selection do not deal well with the complexity posed by quantitative traits. Some difficulties are: specification of "statistical significance" thresholds for multiple testing, strong dependence of inferences on the model chosen (e.g., number of QTLs fitted, distributional forms), inadequate handling of non-additivity, and ambiguous interpretation of effects in multiple-marker analysis, due to collinearity. Here, we discuss how large-scale molecular information, such as that conveyed by SNPs, can be employed for marker-assisted prediction of genetic value for quantitative traits in the sense of, e.g., MEUWISSEN et al. (2001), GIANOLA et al. (2003) and XU (2003). The focus is on inference of genetic value, rather than detection of quantitative trait loci. A main challenge is that of positing a functional form relating phenotypes to SNP genotypes (viewed as thousands of possibly highly co-linear covariates), to polygenic additive genetic values, and to other nuisance effects, such as sex or age of an individual, simultaneously. Standard quantitative genetics theory gives a mechanistic basis to the mixed effects linear model, treated either from classical (SORENSEN and KENNEDY 1983; HENDERSON 1984) or Bayesian (GIANOLA and FERNANDO 1986) perspectives. MEUWISSEN et al. (2001) and GIANOLA et al. (2003) exploit this connection and suggest highly parametric structures for modelling relationships between phenotypes and effects of hundreds or thousands of molecular markers.
A first concern is the strength of their assumptions (e.g., linearity, multivariate normality, proportion of segregating loci, spatial within-chromosome effects); it is unknown if their procedures are robust. Secondly, colinearity between SNP or marker genotypes is bound to exist, because of the sheer massiveness of molecular data plus co-segregation of alleles. While adverse effects of colinearity can be tempered when marker effects are treated as random variables, statistical redundancy is undesirable (LINDLEY and SMITH 1972). The genome seems to be much more highly interactive than what standard quantitative genetic models can accommodate (e.g., D'HAESELLEER et al. 2000). In theory, genetic variance can be partitioned into orthogonal additive, dominance, additive × additive, additive × dominance, dominance × dominance, etc., components, only under highly idealized conditions. These include linkage equilibrium, absence of natural or artificial selection, and no inbreeding or assortative mating (COCKERHAM 1954; KEMPTHORNE 1954). Arguably, these conditions are violated in nature and in breeding programs. Actually, marker assisted selection exploits the existence of linkage disequilibrium, and even chance creates disequilibrium. Further, estimation of non-additive components of variance is notoriously difficult, even under standard assumptions (CHANG 1988). Therefore, it is doubtful whether standard quantitative genetic approaches can model fine-structure relationships between genotypes and phenotypes adequately, unless either departures from assumptions have mild effects, or statistical constructs based on multivariate normality turn out to be more robust than what is expected on theoretical grounds. These considerations suggest that a nonparametric treatment of the data could be valuable. On the other hand, application of the additive genetic model in selective breeding of livestock has produced remarkable dividends, as shown in DEKKERS and HOSPITAL (2002). Hence, a combination of nonparametric modelling of effects of molecular variables (e.g., SNPs) with features of the additive polygenic mode of inheritance is appealing. Our objective is to present semi-parametric methods for prediction of genetic value for complex traits that make use of phenotypic and genomic data simultaneously. The article is organized as follows. KERNEL REGRESSION ON SNP MARKERS introduces nonparametric regression, kernel functions and smoothing parameters, and proposes a non-parametric approximation to additive genetic value. Next, SEMI-PARAMETRIC KERNEL MIXED MODEL combines features of kernel regression with the mixed effects linear model, and describes classical and Bayesian implementations.
The REPRODUCING KERNEL HILBERT SPACES MIXED MODEL section uses established calculus of variations results, and hybridizes the mixed effects linear model with a regression on kernel basis functions. Estimation procedures are presented, the problem of incomplete genotyping is addressed, and a simulated example is given, to illustrate feasibility and potential. The paper concludes with a discussion and with suggestions for additional research.

KERNEL REGRESSION ON SNP MARKERS

The regression function: Consider a stylized situation in which each of a series of individuals possesses a measurement for some quantitative trait
denoted as $y$, as well as information on a possibly massive number of genomic variables, such as SNP "genotypes", represented by a vector $\mathbf{x}$. In the main, $\mathbf{x}$ is treated as a continuously valued vector of covariates, even though SNP genotypes are discrete (coding is done via dummy variates). Also, $\mathbf{x}$ could represent gene expression measurements from microarray experiments; here, it would be legitimate to regard this vector as continuous. Although gene expression measurements are typically regarded as response variables, there are contexts in which this type of information could be used in an explanatory role (MALLICK et al. 2005). Let the relationship between $y$ and $\mathbf{x}$ be represented as

$$y_i = g(\mathbf{x}_i) + e_i, \qquad i = 1, 2, \dots, n \tag{1}$$
where $y_i$ is a measurement, such as plant height or body weight, taken on individual $i$; $\mathbf{x}_i$ is a $p \times 1$ vector of dummy SNP or microsatellite covariates observed on $i$, and $g(\cdot)$ is some unknown function relating these "genotypes" to phenotypes. Define $g(\mathbf{x}_i) = E(y_i \mid \mathbf{x}_i)$ as the conditional expectation function, that is, the mean phenotypic value of an infinite number of individuals, all possessing the $p$-dimensional genotype $\mathbf{x}_i$; $e_i \sim (0, \sigma_e^2)$ is a random residual, distributed independently of $\mathbf{x}_i$ and with variance $\sigma_e^2$. The conditional expectation function is

$$g(\mathbf{x}) = \frac{\int y\, p(\mathbf{x}, y)\, dy}{p(\mathbf{x})}. \tag{2}$$
Following SILVERMAN (1986), consider a non-parametric kernel estimator of the $p$-dimensional density of the covariates:

$$\hat{p}(\mathbf{x}) = \frac{1}{nh^p} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right), \tag{3}$$
where $K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)$ is a kernel function and $h$ is a window width or smoothing parameter. In (3), $\mathbf{x}$ is the value ("focal point") at which the density is evaluated and $\mathbf{x}_i$ $(i = 1, 2, \dots, n)$ is the observed $p$-dimensional SNP genotype of individual $i$ in the sample. Hence, (3) estimates population densities (or frequencies). If $\hat{p}(\mathbf{x})$ is to behave as a multivariate probability density function, then it must be true that the kernel function is positive and that the condition

$$\int_{-\infty}^{\infty} \hat{p}(\mathbf{x})\, d\mathbf{x} = \frac{1}{nh^p} \sum_{i=1}^{n} \int_{-\infty}^{\infty} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) d\mathbf{x} = 1$$

is satisfied. This implies that

$$\int_{-\infty}^{\infty} \frac{1}{h^p}\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) d\mathbf{x} = 1.$$
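As a concrete illustration, the estimator (3) with a Gaussian kernel can be sketched in a few lines of Python; the data and function name below are our own toy constructions, not part of the paper:

```python
import numpy as np

def kde(x, X, h):
    """Kernel density estimator (3) at focal point x, using a p-dimensional
    Gaussian kernel K and a single window width h; X is the n x p matrix
    of observed covariate vectors."""
    n, p = X.shape
    sq = np.sum((X - x) ** 2, axis=1) / h ** 2      # ((x_i - x)/h)'((x_i - x)/h)
    k = np.exp(-0.5 * sq) / (2 * np.pi) ** (p / 2)  # K((x_i - x)/h)
    return k.sum() / (n * h ** p)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))       # 500 individuals, p = 2 covariates
print(kde(np.zeros(2), X, h=0.5))   # density estimate at the focal point 0
```

The estimate is largest where observed covariate vectors concentrate, and $h$ controls how far each observation spreads its mass.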
Similarly, and assuming that a single $h$ parameter suffices, one can estimate the joint density of phenotype and genotypes at point $(y, \mathbf{x})$ as

$$\hat{p}(\mathbf{x}, y) = \frac{1}{nh^{p+1}} \sum_{i=1}^{n} K\!\left(\frac{y_i - y}{h}\right) K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right),$$
where $K\!\left(\frac{y_i - y}{h}\right)$ is also a kernel function; again, $y_i$ is the observed sample value of variable $y$ in individual $i$. The numerator of (2) can be estimated as

$$\int y\, \hat{p}(\mathbf{x}, y)\, dy = \frac{1}{nh^{p+1}} \sum_{i=1}^{n} \int y\, K\!\left(\frac{y_i - y}{h}\right) dy\; K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{1}{nh^p} \sum_{i=1}^{n} \left[\frac{1}{h} \int y\, K\!\left(\frac{y_i - y}{h}\right) dy\right] K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right). \tag{4}$$

In (4), let $z = \frac{y - y_i}{h}$, so that $dy = h\, dz$ and

$$\frac{1}{h} \int y\, K\!\left(\frac{y_i - y}{h}\right) dy = y_i \int K(z)\, dz + h E(z).$$

The kernel function is typically a proper probability density function chosen such that $\int K(z)\, dz = 1$ and $E(z) = \int z K(z)\, dz = 0$. If so, the preceding expression is equal to $y_i$, so that (4) becomes

$$\int y\, \hat{p}(\mathbf{x}, y)\, dy = \frac{1}{nh^p} \sum_{i=1}^{n} y_i\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right). \tag{5}$$
Returning to (2), one can form the non-parametric estimator

$$\hat{E}(y \mid \mathbf{x}) = \hat{g}(\mathbf{x}) = \frac{\int y\, \hat{p}(\mathbf{x}, y)\, dy}{\hat{p}(\mathbf{x})},$$

which, upon replacing the numerator and denominator by (5) and (3), respectively, takes the form

$$\hat{g}(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x})\, y_i, \tag{6}$$

where

$$w_i(\mathbf{x}) = \frac{K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}$$
is a weight that depends on the kernel function and window width chosen, and on the $\mathbf{x}_i$ (i.e., genotypes) observed in the sample. The linear combination of the observations (6) is called the Nadaraya-Watson estimator of the regression function (NADARAYA 1964; WATSON 1964). As seen in (6), the fitted value at coordinate $\mathbf{x}$ is a weighted average of all data points, with the value of the weight depending on the "proximity" of $\mathbf{x}_i$ to $\mathbf{x}$ and on the value of the smoothing parameter $h$. For instance, the kernel function may have the Gaussian form

$$K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{1}{(2\pi)^{\frac{p}{2}}} \exp\!\left[-\frac{1}{2} \left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)' \left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)\right].$$
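A minimal sketch of the Nadaraya-Watson estimator (6) with the Gaussian kernel above; the toy data and the function name are ours, introduced only for illustration:

```python
import numpy as np

def nw_fit(x, X, y, h):
    """Nadaraya-Watson fit (6) at focal point x: a weighted average of the
    phenotypes y, with Gaussian-kernel weights w_i(x) that sum to one."""
    k = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
    w = k / k.sum()
    return w @ y

# toy data: 200 individuals, 3 dummy SNP covariates coded 0/1/2, and a
# phenotype with an interaction that a purely additive model would miss
rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(200, 3)).astype(float)
y = X[:, 0] - X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)
print(nw_fit(X[0], X, y, h=0.3))   # fitted value at the first genotype
```

As $h$ grows the weights flatten and the fit approaches the overall mean of $y$; as $h$ shrinks, only individuals with genotypes identical to the focal point contribute.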
This kernel has a maximum value of $(2\pi)^{-\frac{p}{2}}$ when $\mathbf{x} = \mathbf{x}_i$, and tails off to 0 as the distance between $\mathbf{x}$ and $\mathbf{x}_i$ increases. The values of $w_i(\mathbf{x})$ decrease more abruptly as $h \to 0$. This is the reason why this type of estimator is called "local", in the sense that observations with $\mathbf{x}_i$ coordinates closer to the focal point $\mathbf{x}$ are weighted more strongly in the computation of the fitted value $\hat{E}(y \mid \mathbf{x})$.

A specification that is more restrictive than (1) is the additive regression model (HASTIE and TIBSHIRANI 1990; FOX 2005)

$$g(\mathbf{x}_i) = \sum_{j=1}^{p} E(y_i \mid x_{ij}) = \sum_{j=1}^{p} g_j(x_{ij}), \tag{7}$$
where $x_{ij}$ is the genotype for SNP $j$ in individual $i$. In this model each of the "partial regression" functions is two-dimensional, thus allowing exploration of effects of individual SNPs on phenotypes, albeit at the expense of ignoring, e.g., epistatic interactions between genotypes. A preliminary examination of relationships can be done via calculation of Nadaraya-Watson estimators of each of the $g_j(\cdot)$ as

$$\hat{g}_j(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_{ij} - x}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x_{ij} - x}{h}\right)}, \qquad j = 1, 2, \dots, p,$$
where the kernels are unidimensional. Naturally, one may wish to account for interactions between SNPs, so this type of analysis would be merely exploratory. A specification that is intermediate between (7) and (1) could include sums of single, pairwise, tripletwise, etc., SNP regression functions.

Impact of window width: The scatter plot shown in Figure 1, from CHU and MARRON (1991), consists of data on log-income and age of 205 people. The solid line in the top panel is a moving weighted average of the points, with weights proportional to the curve at the bottom of the panel. CHU and MARRON (1991) regard the dip in average income in the middle ages as "unexpected". The bottom panel gives results from three different smooths at window widths of 1, 3 or 9. When $h = 1$, the curve displays considerable sampling variation or roughness. On the other hand, when $h = 9$, local features disappear, because points that are far apart receive considerable weight in the fitting procedure. If the dip is not an artifact, the oversmoothing results in a "bias". Hence, $h$ must be gauged carefully. MARRON (1988) and CHU and MARRON (1991) discuss data-driven procedures for assessing $h$. SILVERMAN (1986) gives a discussion in the context of density estimation, whereas MALLICK et al. (2005) consider Hilbert spaces kernel regression, with $h$ treated as an unknown parameter. A conceptually simple and intuitively appealing procedure is cross-validation (e.g., SCHUCANY 2004). For instance, in the "leave-one-out" procedure, the datum for case $i$, that is $(y_i, \mathbf{x}_i)$, is deleted, and a fit is carried out based on the other $n - 1$ cases. Then, the prediction $\hat{g}_{-i}(\mathbf{x}_i \mid h)$ of $y_i$ is formed, where the notation $-i$ indicates that all data other than that for case $i$ are used for estimating the regression function. This process is repeated for all $n$ data
points. Subsequently, the cross-validation criterion

$$CV(h) = \frac{1}{n} \sum_{i=1}^{n} \left[y_i - \hat{g}_{-i}(\mathbf{x}_i \mid h)\right]^2$$

is formed.
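Leave-one-out cross-validation over a grid of window widths can be sketched as follows; the one-covariate example is hypothetical, not data from the paper:

```python
import numpy as np

def loo_cv(X, y, h):
    """Cross-validation criterion CV(h): each case i is predicted by a
    Nadaraya-Watson fit based on the remaining n - 1 cases."""
    n = len(y)
    sq_err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        k = np.exp(-0.5 * np.sum((X[keep] - X[i]) ** 2, axis=1) / h ** 2)
        sq_err += (y[i] - k @ y[keep] / k.sum()) ** 2
    return sq_err / n

rng = np.random.default_rng(3)
X = rng.uniform(size=(100, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=100)
grid = [0.02, 0.05, 0.1, 0.3, 1.0]
h_cv = min(grid, key=lambda h: loo_cv(X, y, h))   # minimizer of CV(h) on the grid
print(h_cv)
```

Too large an $h$ oversmooths (large bias), too small an $h$ undersmooths (large variance); the criterion balances the two.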
A cross-validation estimate of $h$ is the minimizer of $CV(h)$; this is found by carrying out the computations over a grid of $h$ values. HART and LEE (2005), however, present evidence of large variability of the "leave-one-out" estimates of $h$. HASTIE et al. (2001) discuss alternatives based on leaving out 10-20% of the sample values. RUPPERT et al. (2003) present procedures for simple calculation of cross-validation statistics. GULISIJA et al. (2006) used another non-parametric procedure, LOESS, to study the relationship between performance and inbreeding in Jersey cows. There is some relationship between LOESS and kernel regression. In LOESS, the number of points contributing to a focal fitted value is fixed (contrary to kernel regression, where the actual number depends on the gentleness of the kernel chosen) and governed by a spanning parameter. This parameter (ranging between 0 and 1) is equivalent to $h$ and dictates the fraction of all data points that contribute towards a fitted value. Figure 2 gives a LOESS fit for protein yield (actually, residuals from a parametric model) and 100 bootstrap replicates, illustrating uncertainty about the regression surface. Without getting into details, note that yield decreases gently at low values of inbreeding, followed by a faster linear decline, and then by an increase. Irrespective of the variability (due to the fact that few animals were either non-inbred or highly inbred), neither the change of rate at low inbreeding, nor the increase in yield at high consanguinity, is predicted by standard quantitative genetics theory. This is another illustration of how "irregularities" can be discovered non-parametrically, which would remain hidden otherwise.

Estimation of Linear Approximation: If the kernel function is a probability density function, the non-parametric density estimator (3) will be a density function as well, retaining differentiability properties of the kernel (SILVERMAN 1986).
Consider $E(y \mid \mathbf{x}) = g(\mathbf{x})$, and suppose one is interested in inferring a linear approximation to $g(\mathbf{x})$ near some fixed point $\mathbf{x}^*$, such as the mean value of the covariates; this leads to a non-parametric counterpart of additive genetic value. The linear function is

$$E^{appr}(y \mid \mathbf{x}) = g(\mathbf{x}^*) + \dot{g}(\mathbf{x}^*)' (\mathbf{x} - \mathbf{x}^*) \tag{8}$$

where

$$\dot{g}(\mathbf{x}^*) = \left.\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x} = \mathbf{x}^*}$$

is the gradient of $g(\mathbf{x})$ at $\mathbf{x} = \mathbf{x}^*$. This suggests the pseudo-model

$$y_i \approx g(\mathbf{x}^*) + \dot{g}(\mathbf{x}^*)' (\mathbf{x}_i - \mathbf{x}^*) + e_i, \qquad i = 1, 2, \dots, n \tag{9}$$

and the approximate variance decomposition

$$Var(y) \approx \dot{g}(\mathbf{x}^*)'\, Var(\mathbf{x})\, \dot{g}(\mathbf{x}^*) + \sigma_e^2. \tag{10}$$
The first term in the expression above can be interpreted as the variance contributed by the linear effect of $\mathbf{x}$ in the neighborhood of $\mathbf{x}^*$. Further, if $\mathbf{x}$ is a vector of genotypes for molecular markers, this would be, roughly, the additive variance "due to" markers. A non-parametric estimator of (8) is given by

$$\hat{E}^{appr}(y \mid \mathbf{x}) = \hat{g}(\mathbf{x}^*) + \hat{\dot{g}}(\mathbf{x}^*)' (\mathbf{x} - \mathbf{x}^*), \tag{11}$$

where

$$\hat{g}(\mathbf{x}^*) = \frac{\sum_{i=1}^{n} y_i\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}^*}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}^*}{h}\right)} = \sum_{i=1}^{n} w_i(\mathbf{x}^*)\, y_i, \tag{12}$$
and $\hat{\dot{g}}(\mathbf{x}^*)$ is an estimate of the vector of first derivatives of $g(\mathbf{x})$ with respect to $\mathbf{x}$, evaluated at $\mathbf{x}^*$. The estimate could be the gradient of (6):

$$\hat{\dot{g}}(\mathbf{x}^*) = \frac{\partial}{\partial \mathbf{x}} \left[\frac{\sum_{i=1}^{n} y_i\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}\right]_{\mathbf{x} = \mathbf{x}^*} = \sum_{i=1}^{n} y_i \left[\mathbf{d}_i(\mathbf{x}^*) - w_i(\mathbf{x}^*)\, \mathbf{r}(\mathbf{x}^*)\right], \tag{13}$$

where

$$\mathbf{d}_i(\mathbf{x}^*) = \left.\frac{\dot{K}\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}\right|_{\mathbf{x} = \mathbf{x}^*}, \qquad \dot{K}\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \left.\frac{\partial}{\partial \mathbf{x}} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)\right|_{\mathbf{x} = \mathbf{x}^*}$$

and

$$\mathbf{r}(\mathbf{x}^*) = \left.\frac{\sum_{i=1}^{n} \dot{K}\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}\right|_{\mathbf{x} = \mathbf{x}^*} = \sum_{i=1}^{n} \mathbf{d}_i(\mathbf{x}^*)$$

are vectors whose form depends on the kernel function used. Hence, (11) becomes

$$\hat{E}^{appr}(y \mid \mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x}^*)\, y_i + \left\{\sum_{i=1}^{n} y_i \left[\mathbf{d}_i(\mathbf{x}^*) - w_i(\mathbf{x}^*)\, \mathbf{r}(\mathbf{x}^*)\right]\right\}' (\mathbf{x} - \mathbf{x}^*). \tag{14}$$
Likewise, an estimator of the variance contributed to (10) by the linear effect of $\mathbf{x}$ (the "non-parametric additive genetic variance" from the SNPs) is given by $\hat{\dot{g}}'(\mathbf{x}^*)\, \widehat{Var}(\mathbf{x})\, \hat{\dot{g}}(\mathbf{x}^*)$, where $\widehat{Var}(\mathbf{x})$ is some estimate of the covariance matrix of $\mathbf{x}$. Finally, the relative contribution to variance made by the linear effect of $\mathbf{x}$ can be assessed as

$$\hat{\lambda} = \frac{\hat{\dot{g}}'(\mathbf{x}^*)\, \widehat{Var}(\mathbf{x})\, \hat{\dot{g}}(\mathbf{x}^*)}{\hat{\dot{g}}'(\mathbf{x}^*)\, \widehat{Var}(\mathbf{x})\, \hat{\dot{g}}(\mathbf{x}^*) + \hat{\sigma}_e^2}, \tag{15}$$

where $\hat{\sigma}_e^2$ is an estimate of $\sigma_e^2$, such as

$$\hat{\sigma}_e^2 = \frac{1}{n} \sum_{i=1}^{n} \left[y_i - \hat{g}(\mathbf{x}^*) - \hat{\dot{g}}(\mathbf{x}^*)' (\mathbf{x}_i - \mathbf{x}^*)\right]^2. \tag{16}$$

Gaussian kernel: A $p$-dimensional Gaussian kernel with a single band width parameter $h$ has the form

$$K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{1}{(2\pi)^{\frac{p}{2}}} \exp\!\left[-\frac{1}{2h^2} (\mathbf{x}_i - \mathbf{x})' (\mathbf{x}_i - \mathbf{x})\right],$$

so that the estimator of the density at $\mathbf{x}$ is the finite mixture of Gaussians

$$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^p}\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi h^2)^{\frac{p}{2}}} \exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right],$$
where, for $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{ip})'$ and focal point $\mathbf{x} = (x_1, x_2, \dots, x_p)'$,

$$S_i(\mathbf{x}) = h^2 \left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)' \left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \sum_{k=1}^{p} (x_{ik} - x_k)^2.$$
The weights entering into estimator (6) are then

$$w_i(\mathbf{x}) = \frac{\exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right]}{\sum_{i=1}^{n} \exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right]}.$$

The fitted surface is given by

$$\hat{E}(y \mid \mathbf{x}) = \frac{\sum_{i=1}^{n} \exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right] y_i}{\sum_{i=1}^{n} \exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right]} = \frac{\mathbf{v}'(\mathbf{x})\, \mathbf{y}}{\mathbf{v}'(\mathbf{x})\, \mathbf{1}}, \tag{17}$$

where $\mathbf{y}$ is the $n \times 1$ vector of data points, $\mathbf{1}$ is an $n \times 1$ vector of ones, and $\mathbf{v}(\mathbf{x})$ is an $n \times 1$ vector with typical element

$$v_i(\mathbf{x}) = \exp\!\left[-\frac{S_i(\mathbf{x})}{2h^2}\right], \qquad i = 1, 2, \dots, n.$$
Consider next the linear approximation in (14). Using (17), write

$$\hat{E}^{appr}_{Gauss}(y \mid \mathbf{x}) = \hat{g}(\mathbf{x}^*) + \hat{\dot{g}}(\mathbf{x}^*)' (\mathbf{x} - \mathbf{x}^*) = \frac{\mathbf{v}'(\mathbf{x}^*)\, \mathbf{y}}{\mathbf{v}'(\mathbf{x}^*)\, \mathbf{1}} + \hat{\dot{g}}(\mathbf{x}^*)' (\mathbf{x} - \mathbf{x}^*).$$

Now, for a Gaussian kernel,

$$\dot{K}\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{\partial}{\partial \mathbf{x}} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right) = \frac{(\mathbf{x}_i - \mathbf{x})}{h^2}\, K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right).$$

Further,

$$\mathbf{d}_i(\mathbf{x}) = \frac{\dot{K}\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)} = \frac{(\mathbf{x}_i - \mathbf{x})}{h^2}\, w_i(\mathbf{x})$$

and

$$\mathbf{r}(\mathbf{x}) = \sum_{i=1}^{n} \mathbf{d}_i(\mathbf{x}) = \sum_{i=1}^{n} \frac{(\mathbf{x}_i - \mathbf{x})}{h^2}\, w_i(\mathbf{x}).$$

Hence,

$$\hat{\dot{g}}(\mathbf{x}^*) = \sum_{i=1}^{n} y_i \left[\mathbf{d}_i(\mathbf{x}^*) - w_i(\mathbf{x}^*)\, \mathbf{r}(\mathbf{x}^*)\right] = \sum_{i=1}^{n} y_i \left[\frac{(\mathbf{x}_i - \mathbf{x}^*)\, w_i(\mathbf{x}^*)}{h^2} - w_i(\mathbf{x}^*) \sum_{i=1}^{n} \frac{(\mathbf{x}_i - \mathbf{x}^*)\, w_i(\mathbf{x}^*)}{h^2}\right] = \frac{\sum_{i=1}^{n} w_i(\mathbf{x}^*)\, y_i\, (\mathbf{x}_i - \mathbf{x}^*) - \hat{g}(\mathbf{x}^*) \sum_{i=1}^{n} w_i(\mathbf{x}^*)\, (\mathbf{x}_i - \mathbf{x}^*)}{h^2},$$

and

$$\hat{E}^{appr}(y \mid \mathbf{x}) = \frac{\mathbf{v}'(\mathbf{x}^*)\, \mathbf{y}}{\mathbf{v}'(\mathbf{x}^*)\, \mathbf{1}} + \left[\frac{\sum_{i=1}^{n} w_i(\mathbf{x}^*)\, y_i\, (\mathbf{x}_i - \mathbf{x}^*) - \hat{g}(\mathbf{x}^*) \sum_{i=1}^{n} w_i(\mathbf{x}^*)\, (\mathbf{x}_i - \mathbf{x}^*)}{h^2}\right]' (\mathbf{x} - \mathbf{x}^*).$$
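The closed-form gradient just derived can be checked numerically. The sketch below (a toy example of ours, not from the paper) computes the Gaussian-kernel fit and its gradient at a focal point; the result can be verified against a finite-difference derivative of the fitted surface:

```python
import numpy as np

def nw_fit(x, X, y, h):
    """Nadaraya-Watson fitted value (17) with a Gaussian kernel."""
    k = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
    return (k / k.sum()) @ y

def nw_gradient(xs, X, y, h):
    """Gradient of the fitted surface at xs, using the closed form above:
    [ sum_i w_i y_i (x_i - xs) - ghat(xs) * sum_i w_i (x_i - xs) ] / h^2."""
    k = np.exp(-0.5 * np.sum((X - xs) ** 2, axis=1) / h ** 2)
    w = k / k.sum()
    ghat = w @ y
    return ((w * y) @ (X - xs) - ghat * (w @ (X - xs))) / h ** 2

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)
xs = np.zeros(2)                       # focal point x*
print(nw_gradient(xs, X, y, h=0.5))    # estimated slope vector at x*
```

This gradient is the ingredient of the "non-parametric additive genetic value" in (11) and of the variance ratio in (15).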
Kernels for discrete covariates: For a biallelic SNP, there are three possible genotypes at each "locus", as in stylized Mendelian situations. In a standard (parametric) analysis of variance representation, incidence situations (or additive and dominance effects at each of the loci) are described via 2 dummy binary variables per locus, and all corresponding epistatic interactions can be assessed from effects of cross-products of these variables. This leads to a highly parameterized structure and to formidable model selection problems. Consider now the non-parametric approach. For an $\mathbf{x}$ vector with $p$ coordinates, its statistical distribution is given by the probabilities of each of the $3^p$ outcomes. With SNPs, $p$ can be very large (possibly much larger than $n$), so it is hopeless to estimate the probability distribution of genotypes accurately from observed relative frequencies, and smoothing is required (SILVERMAN 1986). Kernel estimation extends as follows: for binary covariates, the number of disagreements between a focal $\mathbf{x}$ and the observed $\mathbf{x}_i$ in subject $i$ is given by

$$d(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}_i - \mathbf{x})' (\mathbf{x}_i - \mathbf{x}),$$
where $d(\cdot)$ takes values between 0 and $p$. For illustration, suppose that one has "genotype" AABbccDd as focal point and that individual $i$ in the sample has been genotyped as aaBbccDD. The vectors of binary covariates for the focal and observed cases (save for an intercept) are given in the matrix:

Genotype  AA  Aa  aa  BB  Bb  bb  CC  Cc  cc  DD  Dd  dd
Observed   0   0   1   0   1   0   0   0   1   1   0   0
Focal      1   0   0   0   1   0   0   0   1   0   1   0
Agree      N   Y   N   Y   Y   Y   Y   Y   Y   N   N   Y
where N (Y) stands for disagreement (agreement). Then, $d(\mathbf{x}, \mathbf{x}_i) = 4$, which is twice the number of disagreements in genotypes, because there are only two "degrees of freedom" per locus. In practice, one should work with a representation of incidences that is free of redundancies. For binary covariates, SILVERMAN (1986) suggests the "binomial" kernel

$$K(\mathbf{x}, \mathbf{x}_i; h) = h^{p - d(\mathbf{x}, \mathbf{x}_i)} (1 - h)^{d(\mathbf{x}, \mathbf{x}_i)},$$

with $\frac{1}{2} \leq h \leq 1$; alternative forms of the kernel function are discussed by AITCHISON and AITKEN (1976) and RACINE and LI (2004). It follows that the kernel estimate of the probability of observing the focal value $\mathbf{x}$ is

$$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} h^{p - d(\mathbf{x}, \mathbf{x}_i)} (1 - h)^{d(\mathbf{x}, \mathbf{x}_i)}. \tag{18}$$
If $h = 1$, the estimate is just the proportion of cases for which $\mathbf{x}_i = \mathbf{x}$; if $h = \frac{1}{2}$, every focal point gets an estimate equal to $\left(\frac{1}{2}\right)^p$, irrespective of the observed values $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$. The non-parametric estimator of the regression function is

$$\hat{g}(\mathbf{x}) = \frac{\sum_{i=1}^{n} h^{p - d(\mathbf{x}, \mathbf{x}_i)} (1 - h)^{d(\mathbf{x}, \mathbf{x}_i)}\, y_i}{\sum_{i=1}^{n} h^{p - d(\mathbf{x}, \mathbf{x}_i)} (1 - h)^{d(\mathbf{x}, \mathbf{x}_i)}}.$$
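A sketch of this discrete-covariate estimator with the "binomial" kernel; the binary dummy data and the function name are our own illustrative assumptions:

```python
import numpy as np

def binomial_kernel_fit(x, X, y, h):
    """Regression estimate with Silverman's "binomial" kernel for binary
    dummies: K(x, x_i; h) = h^(p - d) (1 - h)^d, where d = d(x, x_i) is
    the number of disagreements between focal x and observed x_i, and
    1/2 <= h <= 1."""
    p = X.shape[1]
    d = np.sum((X - x) ** 2, axis=1)    # d(x, x_i) = (x_i - x)'(x_i - x)
    k = h ** (p - d) * (1 - h) ** d
    return k @ y / k.sum()

rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(50, 6)).astype(float)   # 6 binary dummies
y = X[:, 0] + rng.normal(scale=0.1, size=50)
print(binomial_kernel_fit(X[0], X, y, h=0.9))
```

At $h = 1$ the fit reduces to the mean phenotype of exact genotype matches; at $h = \frac{1}{2}$ all weights are equal and the fit is the overall mean, mirroring the behaviour of $\hat{p}(\mathbf{x})$ in (18).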
Since a discrete distribution does not possess derivatives, the additive genetic value must be defined in the classical sense (e.g., FALCONER 1960), that is, by defining contrasts between expected values of individuals having appropriate genotypes. Here, one can use either the vectorial representation $g(\mathbf{x})$, or the concept of the additive regression model in (7). Hereinafter, it is assumed that the distribution of $\mathbf{x}$ is continuous, and a continuous kernel function is employed throughout, as an approximation.

SEMI-PARAMETRIC KERNEL MIXED MODEL
General considerations: Consider now a situation for which there might be an operational or mechanistic basis for specifying at least part of a model. For instance, suppose $y$ is a measure on some quantitative trait, such as milk production of a cow. Animal breeders have exploited to advantage the infinitesimal model of quantitative genetics (FISHER 1918). Vectorial representations of this model are given by QUAAS and POLLAK (1980), and applications to natural populations are discussed by KRUUK (2004). In this section, we combine features of the infinitesimal model with a nonparametric treatment of genomic data, and present semiparametric implementations.

Statistical specification: Model (1) is expanded as

$$y_i = \mathbf{w}_i' \boldsymbol{\beta} + \mathbf{z}_i' \mathbf{u} + g(\mathbf{x}_i) + e_i, \qquad i = 1, 2, \dots, n \tag{19}$$

where $\boldsymbol{\beta}$ is a vector of nuisance location effects and $\mathbf{u}$ is a $q \times 1$ vector containing additive genetic effects of $q$ individuals (these effects are assumed to be independent of those of the markers), some of which may lack a phenotypic record; $\mathbf{w}_i'$ and $\mathbf{z}_i'$ are known non-stochastic incidence vectors. As before, $g(\mathbf{x}_i)$ is some unknown function of the SNP data. It will be assumed that $\mathbf{u} \sim N(\mathbf{0}, \mathbf{A}\sigma_u^2)$, where $\sigma_u^2$ is the "unmarked" additive genetic variance and $\mathbf{A}$ is the additive relationship matrix, whose entries are twice the coefficients of co-ancestry between individuals. Let $\mathbf{e} = \{e_i\}$ be the $n \times 1$ vector of residuals, and assume that $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, where $\sigma_e^2$ is the residual variance. Note that the model implies that

$$y_i - g(\mathbf{x}_i) = \mathbf{w}_i' \boldsymbol{\beta} + \mathbf{z}_i' \mathbf{u} + e_i \tag{20}$$

and

$$y_i - \mathbf{w}_i' \boldsymbol{\beta} - \mathbf{z}_i' \mathbf{u} = g(\mathbf{x}_i) + e_i. \tag{21}$$
The preceding means that: 1) the offset $y_i - g(\mathbf{x}_i)$ follows a standard mixed effects model, and 2) if $\boldsymbol{\beta}$ and $\mathbf{u}$ were known, one could use (6) to estimate $g(\mathbf{x}_i)$ employing $y_i - \mathbf{w}_i' \boldsymbol{\beta} - \mathbf{z}_i' \mathbf{u}$ as "observations". This suggests two strategies for analysis of data, discussed below.

Strategy 1 - Mixed model analysis: This follows from representation (20). First, estimate $g(\mathbf{x}_i)$, for $i = 1, 2, \dots, n$, via $\hat{g}(\mathbf{x}_i)$, as in (6) or, if a Gaussian kernel is adopted, as in (17). Then, carry out a mixed model analysis using the "corrected" data vector and pseudo-model

$$\mathbf{y}^* = \{y_i - \hat{g}(\mathbf{x}_i)\} = \mathbf{W} \boldsymbol{\beta} + \mathbf{Z} \mathbf{u} + \mathbf{e},$$

where $\mathbf{W} = \{\mathbf{w}_i'\}$ and $\mathbf{Z} = \{\mathbf{z}_i'\}$ are incidence matrices of appropriate order. The pseudo-model ignores uncertainty about $g(\mathbf{x})$, since $\hat{g}(\mathbf{x}_i)$ is treated as if it were the true regression (on SNPs) surface. Under the standard multivariate normality assumptions of the infinitesimal model, one can estimate the variance components $\sigma_u^2$ and $\sigma_e^2$ from $\mathbf{y}^*$ via restricted maximum likelihood (REML, PATTERSON and THOMPSON 1971), and form empirical best linear unbiased estimators and predictors of $\boldsymbol{\beta}$ and $\mathbf{u}$, respectively, by solving the Henderson mixed model equations (HENDERSON 1973)

$$\begin{bmatrix} \mathbf{W}'\mathbf{W} & \mathbf{W}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1} \dfrac{\sigma_e^2}{\sigma_u^2} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{W}'\mathbf{y}^* \\ \mathbf{Z}'\mathbf{y}^* \end{bmatrix}. \tag{22}$$
The ratio $\frac{\sigma_e^2}{\sigma_u^2}$ is evaluated at REML estimates of the variance components. Solving system (22) is a standard problem in animal breeding even for very large $q$, since $\mathbf{A}^{-1}$ is easy to compute. The two-stage procedure could be iterated several times, i.e., use the solutions to (22) to obtain a new estimate of $g(\mathbf{x})$ using $y_i - \mathbf{w}_i' \hat{\boldsymbol{\beta}} - \mathbf{z}_i' \hat{\mathbf{u}}$ as "data", then update the pseudo-data $\mathbf{y}^*$, etc. The "total" additive genetic effect of an individual possessing a vector of SNP covariates with focal value $\mathbf{x}$ can be defined as the sum of the additive genetic effect of the SNPs, in the sense of (8), plus the polygenic effect $u_i$, that is,

$$T_i(\mathbf{x}) = g(\mathbf{x}^*) + \dot{g}(\mathbf{x}^*)' (\mathbf{x} - \mathbf{x}^*) + u_i.$$

An empirical predictor of $T_i$ can be formed by adding (14) to the $i$th component of $\hat{\mathbf{u}}$ in the mixed model equations. It is not obvious how a measure of uncertainty about $T_i(\mathbf{x})$ can be constructed using this procedure.
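For a small numerical illustration, system (22) can be assembled and solved directly. The dense toy example below is our own; real animal-breeding applications exploit the sparsity of $\mathbf{A}^{-1}$ rather than forming it by dense inversion:

```python
import numpy as np

def mme_solve(W, Z, A, ystar, ve, vu):
    """Build and solve Henderson's mixed model equations (22), returning
    (betahat, uhat) for a given variance ratio lambda = ve / vu."""
    lam = ve / vu
    lhs = np.block([[W.T @ W, W.T @ Z],
                    [Z.T @ W, Z.T @ Z + lam * np.linalg.inv(A)]])
    rhs = np.concatenate([W.T @ ystar, Z.T @ ystar])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:W.shape[1]], sol[W.shape[1]:]

# toy pseudo-data y*: 8 records, a single fixed mean, unrelated animals
rng = np.random.default_rng(6)
n = 8
W, Z, A = np.ones((n, 1)), np.eye(n), np.eye(n)
ystar = 3.0 + rng.normal(size=n) + rng.normal(scale=0.5, size=n)
bhat, uhat = mme_solve(W, Z, A, ystar, ve=0.25, vu=1.0)
print(bhat, uhat)   # BLUE of the mean and BLUPs of the genetic effects
```

With $\mathbf{A} = \mathbf{I}$ and $\mathbf{Z} = \mathbf{I}$, the equations shrink each record toward the estimated mean: $\hat{u}_i = (y_i^* - \hat{\beta})/(1 + \lambda)$.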
A Bayesian approach can be used instead, using the corrected data $y_i^* = y_i - \hat{g}(\mathbf{x}_i)$ as observations. Under standard assumptions made for the prior and the likelihood (e.g., WANG et al. 1993, 1994; SORENSEN and GIANOLA 2002), one can draw samples $j = 1, 2, \dots, m$ from the pseudo posterior distribution $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2 \mid \mathbf{y}^*]$ via Gibbs sampling, and then form "semi-parametric" draws of the total genetic value as

$$T_i^{(j)}(\mathbf{x}) = \sum_{k=1}^{n} w_k(\mathbf{x}^*)\, y_k^* + \left\{\sum_{k=1}^{n} y_k^* \left[\mathbf{d}_k(\mathbf{x}^*) - w_k(\mathbf{x}^*)\, \mathbf{r}(\mathbf{x}^*)\right]\right\}' (\mathbf{x} - \mathbf{x}^*) + u_i^{(j)},$$
for $j = 1, 2, \dots, m$. These $T_i^{(j)}(\mathbf{x})$ can be construed as draws from a semi-parametric pseudo posterior distribution; one can readily calculate the mean, median, percentiles, and variance of this distribution and produce posterior summaries, leading to an approximation to uncertainty about total genetic value. Irrespective of whether classical or Bayesian viewpoints are adopted, this approach ignores the error of estimation of $g(\mathbf{x})$, as noted earlier.

Strategy 2 - Random $g(\cdot)$ function: The estimate of $g(\mathbf{x})$ can be improved as follows. Consider (21) and suppose, temporarily, that $\boldsymbol{\beta}$ and $\mathbf{u}$ are known. Then, form the offset $y_i - \mathbf{w}_i' \boldsymbol{\beta} - \mathbf{z}_i' \mathbf{u}$ and the alternative Nadaraya-Watson estimator of $g(\mathbf{x}_i)$:

$$\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, \mathbf{y}, h) = \hat{E}(y_i - \mathbf{w}_i' \boldsymbol{\beta} - \mathbf{z}_i' \mathbf{u} \mid \mathbf{x}) = \sum_{k=1}^{n} w_k(\mathbf{x}) \left(y_k - \mathbf{w}_k' \boldsymbol{\beta} - \mathbf{z}_k' \mathbf{u}\right). \tag{23}$$
The problem is that the location vectors $\boldsymbol{\beta}$ and $\mathbf{u}$ are not observable, so they must be inferred somehow. Regard now $\boldsymbol{\beta}, \mathbf{u}$ as unknown quantities possessing some prior distribution, which is modified via Bayesian learning into the pseudo posterior process $[\boldsymbol{\beta}, \mathbf{u} \mid \mathbf{y}^*]$, after observation of the pseudo-data $\mathbf{y}^*$ and of the $n \times p$ matrix of observed SNP covariates $\mathbf{X}$. Then, for some band width $h$, $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, \mathbf{y}, h)$ would also be a random variable possessing some pseudo posterior distribution. Let $\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}$ $(j = 1, 2, \dots, m)$ be a draw from the pseudo posterior process $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2 \mid \mathbf{y}^*, h]$, as in the preceding section. Further, regard $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, \mathbf{y}, h)$ as a draw from the pseudo posterior distribution of $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, \mathbf{y}, h)$, given $\mathbf{y}^*$. Subsequently, estimate features of the pseudo posterior distribution of $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, h)$ by ergodic averaging. For example, an estimate of its posterior expectation would be

$$\hat{E}_{\boldsymbol{\beta}, \mathbf{u} \mid \mathbf{y}^*, h} \left[\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, h)\right] = \frac{1}{m} \sum_{j=1}^{m} \hat{g}(\mathbf{x} \mid \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h),$$
where the samples $\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}$ are obtained via a Markov chain Monte Carlo (MCMC) procedure from the pseudo posterior distribution $[\boldsymbol{\beta}, \mathbf{u}, \sigma_u^2, \sigma_e^2 \mid \mathbf{y}^*, h]$. The MCMC procedure would consist of sequential, iterative sampling from all conditional pseudo posterior distributions, as follows.

Sample $\boldsymbol{\beta}$ from $[\boldsymbol{\beta} \mid \mathrm{ELSE}]$, where ELSE denotes all other parameters, $\mathbf{y}^*$, $\mathbf{X}$ and $h$. Using standard results, this conditional distribution has the form

$$\boldsymbol{\beta} \mid \mathrm{ELSE} \sim N\!\left(\tilde{\boldsymbol{\beta}}, \mathbf{V}_{\beta}\right),$$

where

$$\tilde{\boldsymbol{\beta}} = (\mathbf{W}'\mathbf{W})^{-1} \mathbf{W}' (\mathbf{y}^* - \mathbf{Z}\mathbf{u}) \qquad \text{and} \qquad \mathbf{V}_{\beta} = (\mathbf{W}'\mathbf{W})^{-1} \sigma_e^2.$$
Sample $\mathbf{u}$ from $[\mathbf{u} \mid \mathrm{ELSE}]$. Using similar standard results, the distribution to sample from is

$$\mathbf{u} \mid \mathrm{ELSE} \sim N\!\left(\tilde{\mathbf{u}}, \mathbf{V}_u\right),$$

where

$$\tilde{\mathbf{u}} = \left(\mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1} \frac{\sigma_e^2}{\sigma_u^2}\right)^{-1} \mathbf{Z}' (\mathbf{y}^* - \mathbf{W}\boldsymbol{\beta}) \qquad \text{and} \qquad \mathbf{V}_u = \left(\mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1} \frac{\sigma_e^2}{\sigma_u^2}\right)^{-1} \sigma_e^2.$$
Sample the two variance components from the scaled inverse chi-square distributions

$$\sigma_u^2 \mid \mathrm{ELSE} \sim \frac{\mathbf{u}' \mathbf{A}^{-1} \mathbf{u} + \nu_u S_u^2}{\chi^2_{q + \nu_u}}$$

and

$$\sigma_e^2 \mid \mathrm{ELSE} \sim \frac{(\mathbf{y}^* - \mathbf{W}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u})' (\mathbf{y}^* - \mathbf{W}\boldsymbol{\beta} - \mathbf{Z}\mathbf{u}) + \nu_e S_e^2}{\chi^2_{n + \nu_e}}.$$
Above, $\nu_u, S_u^2$ and $\nu_e, S_e^2$ are known hyper-parameters of independent scaled inverse chi-square priors assigned to $\sigma_u^2$ and $\sigma_e^2$, respectively. Form draws from the pseudo-posterior distribution of $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}, \mathbf{u}, \mathbf{y}, h)$ as $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h)$.
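The sampling scheme above can be sketched as a small Gibbs sampler. The dimensions and hyper-parameter values below are arbitrary toy choices; a real implementation would exploit the sparsity of $\mathbf{A}^{-1}$ and sample $\mathbf{u}$ blockwise or elementwise:

```python
import numpy as np

def gibbs(W, Z, A, ystar, nu_u=4.0, Su2=1.0, nu_e=4.0, Se2=1.0,
          n_iter=2000, seed=0):
    """Gibbs sampler for [beta, u, sigma_u^2, sigma_e^2 | y*] using the
    conditionals given above; scaled inverse chi-square draws are formed
    as scale / chi-square."""
    rng = np.random.default_rng(seed)
    n, q = Z.shape
    Ainv = np.linalg.inv(A)
    WtWinv = np.linalg.inv(W.T @ W)
    u = np.zeros(q)
    vu = ve = 1.0
    kept = []
    for it in range(n_iter):
        # [beta | ELSE] = N((W'W)^-1 W'(y* - Zu), (W'W)^-1 sigma_e^2)
        beta = rng.multivariate_normal(WtWinv @ W.T @ (ystar - Z @ u),
                                       WtWinv * ve)
        # [u | ELSE] = N(C^-1 Z'(y* - W beta), C^-1 sigma_e^2),
        # with C = Z'Z + A^-1 (sigma_e^2 / sigma_u^2)
        Cinv = np.linalg.inv(Z.T @ Z + Ainv * (ve / vu))
        u = rng.multivariate_normal(Cinv @ Z.T @ (ystar - W @ beta),
                                    Cinv * ve)
        # scaled inverse chi-square updates of the variance components
        vu = (u @ Ainv @ u + nu_u * Su2) / rng.chisquare(q + nu_u)
        e = ystar - W @ beta - Z @ u
        ve = (e @ e + nu_e * Se2) / rng.chisquare(n + nu_e)
        if it >= n_iter // 2:           # discard the first half as burn-in
            kept.append((beta.copy(), u.copy(), vu, ve))
    return kept

rng0 = np.random.default_rng(7)
n = 30
W, Z, A = np.ones((n, 1)), np.eye(n), np.eye(n)
ystar = 5.0 + rng0.normal(size=n)
draws = gibbs(W, Z, A, ystar)
print(np.mean([d[0][0] for d in draws]))   # posterior mean of the fixed mean
```

Each retained draw $(\boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)})$ can then be plugged into (23) to produce a draw of $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta}^{(j)}, \mathbf{u}^{(j)}, h)$.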
As usual, early draws in the MCMC procedure are discarded as burn-in and, subsequently, $m$ samples are collected to infer features of interest. Note that the procedure termed Strategy 1 gives $\hat{g}(\mathbf{x})$ as estimate of the relationship between phenotypes and SNP data. In Strategy 2, one can use $\hat{g}(\mathbf{x} \mid \boldsymbol{\beta} = \bar{\boldsymbol{\beta}}, \mathbf{u} = \bar{\mathbf{u}}, h)$ instead, where $\bar{\boldsymbol{\beta}}$ and $\bar{\mathbf{u}}$ are means of the pseudo posterior distributions $[\boldsymbol{\beta} \mid \mathbf{y}^*, h]$ and $[\mathbf{u} \mid \mathbf{y}^*, h]$, respectively. Under Strategy 2, a point predictor of total additive genetic value could be

$$\bar{T}_i(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x}^*) \left(y_i - \mathbf{w}_i' \bar{\boldsymbol{\beta}} - \mathbf{z}_i' \bar{\mathbf{u}}\right) + \left\{\sum_{i=1}^{n} \left(y_i - \mathbf{w}_i' \bar{\boldsymbol{\beta}} - \mathbf{z}_i' \bar{\mathbf{u}}\right) \left[\mathbf{d}_i(\mathbf{x}^*) - w_i(\mathbf{x}^*)\, \mathbf{r}(\mathbf{x}^*)\right]\right\}' (\mathbf{x} - \mathbf{x}^*) + \bar{u}_i,$$
where $\bar{u}_i$ is the posterior mean of $u_i$.

REPRODUCING KERNEL HILBERT SPACES MIXED MODEL

General: What follows is motivated by developments in MALLICK et al. (2005) for classification of tumors using microarray data. The underlying theory is outside the scope of the paper. Only essentials are given here, and foundations are in WAHBA (1990, 1999). Using the structure of (19), consider the penalized sum of squares
$$
SS[g(\mathbf{x}); h] = \sum_{i=1}^{n}\left[y_i - \mathbf{w}_i'\beta - \mathbf{z}_i'\mathbf{u} - g(\mathbf{x}_i)\right]^2 + h\,\|g(\mathbf{x})\|, \tag{24}
$$
where, as before, $h$ is a smoothing parameter (possibly unknown) and $\|g(\mathbf{x})\|$ is some norm or "stabilizer". For instance, in smoothing splines, $\|g(\mathbf{x})\|$ is a function of the second derivatives of $g(\mathbf{x})$ integrated between end-points that comprise the data. The second term in (24) acts as a penalty: if the unknown function $g(\mathbf{x})$ is rough, in the sense of having slopes that change rapidly, the penalty increases. The main problem here is that of finding the function $g(\mathbf{x})$ that minimizes (24). Since $SS[g(\mathbf{x}); h]$ is a functional on $g(\mathbf{x})$, this is a variational or calculus-of-variations problem over a space of smooth curves. The solution was given by KIMELDORF and WAHBA (1971) and WAHBA (1999), and the minimizer admits the representation
$$
g(\cdot) = \alpha_0 + \sum_{j=1}^{n}\alpha_j K(\cdot, \mathbf{x}_j),
$$
where $K(\cdot, \cdot)$ is called a reproducing kernel. A possible choice for the kernel (MALLICK et al. 2005) is the single-smoothing-parameter Gaussian function
$$
K_h(\mathbf{x}, \mathbf{x}_j) = \exp\left[-\frac{(\mathbf{x} - \mathbf{x}_j)'(\mathbf{x} - \mathbf{x}_j)}{h}\right].
$$
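Numerically, this Gaussian kernel, and the full $n \times n$ matrix of kernel evaluations among sample points that the mixed model representation below relies on, can be computed in a vectorized way. A minimal sketch (numpy assumed; function names are ours):

```python
import numpy as np

def gaussian_kernel(x, x_j, h):
    """Single-smoothing-parameter Gaussian kernel K_h(x, x_j)."""
    d = x - x_j
    return np.exp(-(d @ d) / h)

def kernel_matrix(X, h):
    """n x n matrix with entries K_h(x_i, x_j) for the rows of X,
    i.e. exp{-(x_i - x_j)'(x_i - x_j)/h}."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / h)
```

The matrix is symmetric with unit diagonal, since $K_h(\mathbf{x}, \mathbf{x}) = \exp(0) = 1$; larger $h$ flattens the kernel, smaller $h$ localizes it around each data point.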
Mixed model representation: We embed these results into (19), leading to the specification
$$
y_i = \mathbf{w}_i'\beta + \mathbf{z}_i'\mathbf{u} + \sum_{j=1}^{n}\exp\left[-\frac{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)}{h}\right]\alpha_j + e_i, \tag{25}
$$
with the intercept parameter $\alpha_0$ included as part of $\beta$. Note that there are as many regressions $\alpha_j$ as there are data points. However, the roughness penalty in the variational problem leads to a reduction in the effective number of parameters in reproducing kernel Hilbert spaces regression, as occurs in smoothing splines (FOX 2002). Define the $1 \times n$ row vector
$$
\mathbf{t}_i'(h) = \left\{\exp\left[-\frac{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)}{h}\right]\right\}, \quad j = 1, 2, \ldots, n,
$$
the $n \times 1$ column vector $\alpha = \{\alpha_j\}$, $j = 1, 2, \ldots, n$, and the $n \times n$ matrix
$$
\mathbf{T}(h) = \begin{bmatrix} \mathbf{t}_1'(h) \\ \mathbf{t}_2'(h) \\ \vdots \\ \mathbf{t}_n'(h) \end{bmatrix}.
$$
Then (25) can be written in matrix form as
$$
\mathbf{y} = \mathbf{W}\beta + \mathbf{Z}\mathbf{u} + \mathbf{T}(h)\alpha + \mathbf{e}.
$$
Suppose, further, that the $\alpha_j$ coefficients are exchangeable according to the distribution $\alpha_j \sim N(0, \sigma_\alpha^2)$. Hence, for a given smoothing parameter $h$, we are in the setting of a mixed effects linear model. Given $h$, $\sigma_u^2$, $\sigma_e^2$ and $\sigma_\alpha^2$ (at a given $h$, the three variance components may be estimated by, e.g., REML), one can obtain predictions of the polygenic breeding values $\mathbf{u}$ and of the coefficients $\alpha$ from the solutions to the system
$$
\begin{bmatrix}
\mathbf{W}'\mathbf{W} & \mathbf{W}'\mathbf{Z} & \mathbf{W}'\mathbf{T}(h) \\
\mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\dfrac{\sigma_e^2}{\sigma_u^2} & \mathbf{Z}'\mathbf{T}(h) \\
\mathbf{T}'(h)\mathbf{W} & \mathbf{T}'(h)\mathbf{Z} & \mathbf{T}'(h)\mathbf{T}(h) + \mathbf{I}\dfrac{\sigma_e^2}{\sigma_\alpha^2}
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{u}} \\ \hat{\alpha} \end{bmatrix}
=
\begin{bmatrix} \mathbf{W}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{T}'(h)\mathbf{y} \end{bmatrix}. \tag{26}
$$
The "total" additive genetic value of an individual possessing a vector of SNP covariates $\mathbf{x}_i$ can be defined as
$$
T_i = u_i + \left\{\left.\frac{\partial}{\partial\mathbf{x}}\sum_{j=1}^{n}\exp\left[-\frac{(\mathbf{x} - \mathbf{x}_j)'(\mathbf{x} - \mathbf{x}_j)}{h}\right]\alpha_j\right|_{\mathbf{x}=\mathbf{x}^*}\right\}'(\mathbf{x}_i - \mathbf{x}^*)
= u_i + \left\{\sum_{j=1}^{n}\exp\left[-\frac{(\mathbf{x}^* - \mathbf{x}_j)'(\mathbf{x}^* - \mathbf{x}_j)}{h}\right]\delta_j\right\}'(\mathbf{x}_i - \mathbf{x}^*), \tag{27}
$$
where
$$
\delta_j = -\frac{2(\mathbf{x}^* - \mathbf{x}_j)\alpha_j}{h}.
$$
Since $T_i$ is a linear function of the unknown $u_i$ and $\alpha_j$ effects, its empirical best linear unbiased predictor, assuming known $h$, can be obtained by replacing these effects by the corresponding solutions from (26).

Incomplete genotyping: At least in animal breeding, it is not feasible to have all individuals genotyped for SNPs. On the other hand, the number of animals with phenotypic information available is typically in the order of hundreds of thousands, and genotyping is selective, e.g., young bulls that are candidates for progeny testing in dairy cattle production. Animals lacking molecular data are not a random sample from the population, and ignoring this issue may lead to biased inferences. Unless missingness of molecular data is ignorable, in the sense of, e.g., RUBIN (1976), HENDERSON (1975), GIANOLA and FERNANDO (1986) or IM et al. (1989), the procedures given below require modelling of the missing data process, which is difficult and may lack robustness. Here, it is assumed that missingness is ignorable, enabling use of likelihood-based or Bayesian procedures as if selection had not taken place (SORENSEN et al. 2001). Two ad hoc procedures are discussed, and an alternative approach, suitable for kernel regression, is presented in the CONCLUSION section. Let the vector of phenotypic data be partitioned as $\mathbf{y} = [\mathbf{y}_1', \mathbf{y}_2']'$, where $\mathbf{y}_1$ ($n_1 \times 1$) consists of records of individuals lacking SNP data, whereas $\mathbf{y}_2$ ($n_2 \times 1$) includes phenotypic data of genotyped individuals. Often, it will be the case that $n_1 > p \gg n_2$. We adopt the model
$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \end{bmatrix}\beta
+
\begin{bmatrix} \mathbf{Z}_1 \\ \mathbf{Z}_2 \end{bmatrix}\mathbf{u}
+
\begin{bmatrix} \mathbf{0} \\ \mathbf{T}_2(h) \end{bmatrix}\alpha
+
\begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \end{bmatrix}. \tag{28}
$$
For the sake of flexibility, assume that $\mathbf{e}_1 \sim N(\mathbf{0}, \mathbf{I}_{n_1}\sigma_{e_1}^2)$ and $\mathbf{e}_2 \sim N(\mathbf{0}, \mathbf{I}_{n_2}\sigma_{e_2}^2)$ are mutually independent but heteroscedastic vectors. In short, the key assumption made here is that the random effect $\alpha$ affects $\mathbf{y}_2$ but not $\mathbf{y}_1$ or, equivalently, that it gets absorbed into $\mathbf{e}_1$. With this representation, the mixed model equations take the form
$$
\begin{bmatrix}
\displaystyle\sum_{i=1}^{2}\frac{\mathbf{W}_i'\mathbf{W}_i}{\sigma_{e_i}^2} & \displaystyle\sum_{i=1}^{2}\frac{\mathbf{W}_i'\mathbf{Z}_i}{\sigma_{e_i}^2} & \dfrac{\mathbf{W}_2'\mathbf{T}_2(h)}{\sigma_{e_2}^2} \\[1.2em]
\displaystyle\sum_{i=1}^{2}\frac{\mathbf{Z}_i'\mathbf{W}_i}{\sigma_{e_i}^2} & \displaystyle\sum_{i=1}^{2}\frac{\mathbf{Z}_i'\mathbf{Z}_i}{\sigma_{e_i}^2} + \mathbf{A}^{-1}\frac{1}{\sigma_u^2} & \dfrac{\mathbf{Z}_2'\mathbf{T}_2(h)}{\sigma_{e_2}^2} \\[1.2em]
\dfrac{\mathbf{T}_2'(h)\mathbf{W}_2}{\sigma_{e_2}^2} & \dfrac{\mathbf{T}_2'(h)\mathbf{Z}_2}{\sigma_{e_2}^2} & \dfrac{\mathbf{T}_2'(h)\mathbf{T}_2(h)}{\sigma_{e_2}^2} + \mathbf{I}\dfrac{1}{\sigma_\alpha^2}
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{\mathbf{u}} \\ \hat{\alpha} \end{bmatrix}
=
\begin{bmatrix}
\displaystyle\sum_{i=1}^{2}\frac{\mathbf{W}_i'\mathbf{y}_i}{\sigma_{e_i}^2} \\[1.2em]
\displaystyle\sum_{i=1}^{2}\frac{\mathbf{Z}_i'\mathbf{y}_i}{\sigma_{e_i}^2} \\[1.2em]
\dfrac{\mathbf{T}_2'(h)\mathbf{y}_2}{\sigma_{e_2}^2}
\end{bmatrix}. \tag{29}
$$
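A system of this kind is small enough, at the scale of the illustrative example below, to assemble and solve directly. The following sketch does this for the simpler complete-data system (26); the data are simulated and the function name is ours (numpy assumed):

```python
import numpy as np

def rkhs_mme_solve(y, W, Z, A, T, s2e, s2u, s2a):
    """Assemble and solve mixed-model equations of the form (26) for the
    fixed effects b, polygenic values u and RKHS coefficients alpha."""
    Ainv = np.linalg.inv(A)
    lam_u, lam_a = s2e / s2u, s2e / s2a   # variance ratios on the diagonals
    lhs = np.block([
        [W.T @ W, W.T @ Z, W.T @ T],
        [Z.T @ W, Z.T @ Z + Ainv * lam_u, Z.T @ T],
        [T.T @ W, T.T @ Z, T.T @ T + np.eye(T.shape[1]) * lam_a],
    ])
    rhs = np.concatenate([W.T @ y, Z.T @ y, T.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p, q = W.shape[1], Z.shape[1]
    return sol[:p], sol[p:p + q], sol[p + q:]
```

For the incomplete-genotyping system (29), the same assembly applies with each block weighted by the corresponding residual variance and with $\mathbf{T}_2(h)$ entering only through the genotyped subset.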
If SNP data are missing completely at random and $h$, $\sigma_u^2$, the residual variances and $\sigma_\alpha^2$ are treated as known, then $\hat{\beta}$ is an unbiased estimator of $\beta$, and $\hat{\mathbf{u}}$ and $\hat{\alpha}$ are unbiased predictors of $\mathbf{u}$ and $\alpha$, respectively. They are not "best", in the sense of having minimum variance or minimum prediction error variance, because the smooth function $g(\mathbf{x})$ of the SNP markers is missing in the model for individuals that are not genotyped (HENDERSON 1974). An alternative consists of writing the bivariate model
$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \end{bmatrix}\beta
+
\begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_2 \end{bmatrix}
\begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}
+
\begin{bmatrix} \mathbf{0} \\ \mathbf{T}_2(h) \end{bmatrix}\alpha
+
\begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \end{bmatrix},
$$
and then assigning to the polygenic component the multivariate normal distribution
$$
\begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}
\sim N\left(
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix},
\begin{bmatrix} \mathbf{A}\sigma_{u_1}^2 & \mathbf{A}\sigma_{u_{12}} \\ \mathbf{A}\sigma_{u_{12}} & \mathbf{A}\sigma_{u_2}^2 \end{bmatrix}
\right).
$$
Here, $\sigma_{u_1}^2$ and $\sigma_{u_2}^2$ are additive genetic variances in individuals without and with molecular information, respectively, and $\sigma_{u_{12}}$ is their additive genetic covariance. Computations would be those appropriate for a two-trait linear model analysis (HENDERSON 1984; SORENSEN and GIANOLA 2002).

Bayesian analysis: To illustrate, consider the first of the two options in the preceding section of the paper. Suppose a kernel has been chosen but that the value of $h$ is uncertain, so that the model unknowns are
$$
\theta = \left[\beta', \mathbf{u}', \alpha', \sigma_u^2, \sigma_\alpha^2, \sigma_{e_1}^2, \sigma_{e_2}^2, h\right]'.
$$
Let the prior density have the form
$$
p(\theta \mid H) = p(\beta)\, N(\mathbf{u} \mid \mathbf{0}, \mathbf{A}\sigma_u^2)\, N(\alpha \mid \mathbf{0}, \mathbf{I}\sigma_\alpha^2)\, p(\sigma_u^2 \mid \nu_u, S_u^2)\, p(\sigma_\alpha^2 \mid \nu_\alpha, S_\alpha^2)\, p(\sigma_{e_1}^2 \mid \nu_e, S_e^2)\, p(\sigma_{e_2}^2 \mid \nu_e, S_e^2)\, p(h \mid h_{\min}, h_{\max}), \tag{30}
$$
where $H$ denotes the set of all known hyper-parameters (whose values are fixed a priori) and $N(\cdot \mid \cdot, \cdot)$ indicates a multivariate normal distribution with appropriate mean vector and covariance matrix. The four variance components $\sigma_u^2, \sigma_\alpha^2, \sigma_{e_1}^2, \sigma_{e_2}^2$ are assigned independent scaled inverse chi-square prior distributions with degrees of freedom $\nu$ and scale parameters $S^2$, with appropriate subscripts. Assign an improper prior distribution to each of the elements of $\beta$ and, as in MALLICK et al. (2005), adopt a uniform prior for $h$, with lower and upper boundaries $h_{\min}$ and $h_{\max}$, respectively.
Given the parameters, observations are assumed to be conditionally independent, and the distribution adopted for the sampling model is
$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \Bigg|\, \theta
\sim N\left(
\begin{bmatrix} \mathbf{W}_1\beta + \mathbf{Z}_1\mathbf{u} \\ \mathbf{W}_2\beta + \mathbf{Z}_2\mathbf{u} + \mathbf{T}_2(h)\alpha \end{bmatrix},
\begin{bmatrix} \mathbf{I}_{n_1}\sigma_{e_1}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{n_2}\sigma_{e_2}^2 \end{bmatrix}
\right).
$$
Given $h$, one is again in the setting of the Bayesian analysis of a mixed linear model, and Markov chain Monte Carlo procedures for this situation are well known. Under standard conjugate prior parameterizations, all conditional posterior distributions are known, save for that of $h$. Hence, one can construct a Gibbs-Metropolis sampling scheme in which conditional distributions are used for drawing $\beta, \mathbf{u}, \alpha, \sigma_u^2, \sigma_\alpha^2, \sigma_{e_1}^2, \sigma_{e_2}^2$, and a Metropolis update is employed for $h$. Location effects are drawn from a multivariate normal distribution with mean vector
$$
\tilde{\beta} = \left(\sum_{i=1}^{2}\frac{\mathbf{W}_i'\mathbf{W}_i}{\sigma_{e_i}^2}\right)^{-1}\left[\sum_{i=1}^{2}\frac{\mathbf{W}_i'(\mathbf{y}_i - \mathbf{Z}_i\mathbf{u})}{\sigma_{e_i}^2} - \frac{\mathbf{W}_2'\mathbf{T}_2(h)\alpha}{\sigma_{e_2}^2}\right] \tag{31}
$$
and covariance matrix
$$
\mathbf{V}_\beta = \left(\sum_{i=1}^{2}\frac{\mathbf{W}_i'\mathbf{W}_i}{\sigma_{e_i}^2}\right)^{-1}, \tag{32}
$$
respectively. Likewise, the additive genetic effects can be sampled from a normal distribution centered at
$$
\tilde{\mathbf{u}} = \left(\sum_{i=1}^{2}\frac{\mathbf{Z}_i'\mathbf{Z}_i}{\sigma_{e_i}^2} + \mathbf{A}^{-1}\frac{1}{\sigma_u^2}\right)^{-1}\left[\sum_{i=1}^{2}\frac{\mathbf{Z}_i'(\mathbf{y}_i - \mathbf{W}_i\beta)}{\sigma_{e_i}^2} - \frac{\mathbf{Z}_2'\mathbf{T}_2(h)\alpha}{\sigma_{e_2}^2}\right] \tag{33}
$$
and with covariance matrix
$$
\mathbf{V}_u = \left(\sum_{i=1}^{2}\frac{\mathbf{Z}_i'\mathbf{Z}_i}{\sigma_{e_i}^2} + \mathbf{A}^{-1}\frac{1}{\sigma_u^2}\right)^{-1}. \tag{34}
$$
The conditional posterior distribution of the coefficients $\alpha$ is multivariate normal as well, with mean vector
$$
\tilde{\alpha} = \left[\mathbf{T}_2'(h)\mathbf{T}_2(h)\frac{1}{\sigma_{e_2}^2} + \mathbf{I}\frac{1}{\sigma_\alpha^2}\right]^{-1}\frac{1}{\sigma_{e_2}^2}\,\mathbf{T}_2'(h)\left(\mathbf{y}_2 - \mathbf{W}_2\beta - \mathbf{Z}_2\mathbf{u}\right) \tag{35}
$$
and variance-covariance matrix
$$
\mathbf{V}_\alpha = \left[\mathbf{T}_2'(h)\mathbf{T}_2(h)\frac{1}{\sigma_{e_2}^2} + \mathbf{I}\frac{1}{\sigma_\alpha^2}\right]^{-1}. \tag{36}
$$
All four variance components have scaled inverse chi-square conditional posterior distributions, and are conditionally independent. The conditional posterior distributions to sample from are
$$
\sigma_u^2 \mid \mathrm{ELSE} \sim \left(\mathbf{u}'\mathbf{A}^{-1}\mathbf{u} + \nu_u S_u^2\right)\chi^{-2}_{q+\nu_u}, \tag{37}
$$
$$
\sigma_\alpha^2 \mid \mathrm{ELSE} \sim \left(\alpha'\alpha + \nu_\alpha S_\alpha^2\right)\chi^{-2}_{n_2+\nu_\alpha}, \tag{38}
$$
$$
\sigma_{e_1}^2 \mid \mathrm{ELSE} \sim \left[(\mathbf{y}_1 - \mathbf{W}_1\beta - \mathbf{Z}_1\mathbf{u})'(\mathbf{y}_1 - \mathbf{W}_1\beta - \mathbf{Z}_1\mathbf{u}) + \nu_e S_e^2\right]\chi^{-2}_{n_1+\nu_e}, \tag{39}
$$
and
$$
\sigma_{e_2}^2 \mid \mathrm{ELSE} \sim \left[(\mathbf{y}_2 - \mathbf{W}_2\beta - \mathbf{Z}_2\mathbf{u} - \mathbf{T}_2(h)\alpha)'(\mathbf{y}_2 - \mathbf{W}_2\beta - \mathbf{Z}_2\mathbf{u} - \mathbf{T}_2(h)\alpha) + \nu_e S_e^2\right]\chi^{-2}_{n_2+\nu_e}. \tag{40}
$$
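All four draws in (37)-(40) share one mechanical form, a scale divided by a chi-square deviate. A one-line sketch (numpy assumed; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(11)

def scaled_inv_chi2(scale_sum, df, size=None):
    """Draw from (scale_sum) * inv-chi2(df), the common form of (37)-(40);
    e.g. scale_sum = u'A^{-1}u + nu_u*S2_u and df = q + nu_u for sigma2_u."""
    return scale_sum / rng.chisquare(df, size=size)
```

The mean of such a draw is `scale_sum / (df - 2)` for `df > 2`, which is a quick sanity check on any implementation.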
The most difficult parameter to sample is $h$. Its conditional posterior density can be represented as
$$
p(h \mid \mathrm{ELSE}) = \frac{F(h)}{\displaystyle\int_{h_{\min}}^{h_{\max}} F(h)\, dh}, \tag{41}
$$
where
$$
F(h) = \exp\left\{-\frac{1}{2\sigma_{e_2}^2}\sum_{i=1}^{n_2}\left(y_i - \mathbf{w}_i'\beta - \mathbf{z}_i'\mathbf{u} - \sum_{j=1}^{n_2}\exp\left[-\frac{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)}{h}\right]\alpha_j\right)^2\right\}.
$$
Density (41) is not in a recognizable form. However, a Metropolis algorithm (METROPOLIS et al. 1953) can be tailored for obtaining samples from the distribution $[h \mid \mathrm{ELSE}]$. Suppose the Markov chain is at state $h^{[t]}$. A proposal value $h^*$ is drawn from some symmetric candidate-generating distribution, and accepted with probability
$$
\min\left\{\frac{p(h^* \mid \mathrm{ELSE})}{p(h^{[t]} \mid \mathrm{ELSE})},\, 1\right\}.
$$
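Because the normalizing integral in (41) cancels in the Metropolis ratio, only $F(h)$ needs to be evaluated. A minimal sketch of one such update, with a symmetric random-walk proposal on $(h_{\min}, h_{\max})$ (names and the step size are ours, not from the paper; `y_adj` stands for the phenotypes already adjusted for $\beta$ and $\mathbf{u}$):

```python
import numpy as np

rng = np.random.default_rng(7)

def log_F(h, y_adj, X, alpha, s2e2):
    """log of F(h) in (41): Gaussian-kernel fit to adjusted phenotypes."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    resid = y_adj - np.exp(-d2 / h) @ alpha
    return -0.5 * (resid @ resid) / s2e2

def metropolis_h(h_curr, y_adj, X, alpha, s2e2, h_min, h_max, step=0.5):
    """One random-walk Metropolis update for the bandwidth h."""
    h_prop = h_curr + rng.uniform(-step, step)
    if not (h_min < h_prop < h_max):
        return h_curr                      # outside the uniform prior support
    log_ratio = (log_F(h_prop, y_adj, X, alpha, s2e2)
                 - log_F(h_curr, y_adj, X, alpha, s2e2))
    return h_prop if np.log(rng.uniform()) < log_ratio else h_curr
```

Working on the log scale avoids underflow of $F(h)$, which is an extremely small number whenever $n_2$ is large.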
If the proposal is accepted, then set $h^{[t+1]} = h^*$; otherwise, stay at $h^{[t]}$. If a Hastings update is employed instead, an adaptive proposal distribution could be, e.g., a finite mixture of the prior density and of a scaled inverse chi-square density $f$ with mean value $h^{[t]}$ and variance $(h_{\mathrm{largest}} - h_{\mathrm{smallest}})^2/12$:
$$
m(h) = w\,\frac{1}{h_{\max} - h_{\min}} + (1 - w)f,
$$
where $h_{\mathrm{largest}}$ and $h_{\mathrm{smallest}}$ are the largest and smallest values of $h$, respectively, accepted up to time $t$. The weight $w \in (0, 1)$ is assessed such that a reasonable acceptance rate for the Metropolis-Hastings algorithm is attained.

Illustrative example: Phenotypic and genotypic values were simulated for a sample of $N$ unrelated individuals, for each of two situations. The trait was determined either by five bi-allelic QTL, having additive gene action, or by five pairs of bi-allelic QTL, having additive $\times$ additive gene action. Under non-additive gene action, the additive genetic variance was null, so all genetic variance was of the additive $\times$ additive type. Heritability (ratio between genetic and phenotypic variance) in both settings was set to 0.5. Genotypes were simulated for a total of 100 bi-allelic markers, including the "true" QTL; all loci were simulated to be in gametic-phase equilibrium. Since all individuals were unrelated and all loci were in gametic-phase equilibrium, only the QTL genotypes and the trait phenotypes provided information on the genotypic value. In most real applications, the location of the QTL will not be known, and so many loci that are not QTL will be included in the analysis. A reproducing kernel Hilbert space (RKHS) mixed model was used to predict genotypic values, given phenotypic values and genotypes at the QTL and at all other loci. The model included a fixed intercept $\alpha_0$ and a random RKHS regression coefficient $\alpha_i$ for each subject; additive effects were omitted from the model, as individuals were unrelated. The genetic value $g_i$ of a subject was predicted as $\hat{g}_i = \hat{\alpha}_0 + \mathbf{t}_i'(h)\hat{\alpha}$, where $\hat{\alpha}_0$ and $\hat{\alpha}$ were obtained by solving (26) for this model, using a Gaussian kernel for varying values of $h$. The mean squared error of prediction of genetic value (MSEP) was calculated as
$$
\mathrm{MSEP}(h, \sigma_e^2/\sigma_\alpha^2) = \sum_i (g_i - \hat{g}_i)^2, \tag{42}
$$
and a grid search was used to determine the values of $h$ and $\sigma_e^2/\sigma_\alpha^2$ that minimized (42). To evaluate the performance of $\hat{g}_i$ as a predictor, another sample of 1000 individuals ("PRED") was simulated, including genotypes, genotypic values and phenotypes. The genotypic values of the subjects in "PRED" were predicted, given their genotypes, using $\hat{g}_i$. Genotypic values were also predicted using a multiple linear regression (MLR) mixed model with a fixed intercept and random regression coefficients on the linear effects of genotypes. Results for the RKHS mixed model are in Table 1, and those for the MLR mixed model are in Table 2, where "accuracy" is the correlation between true and predicted genetic values. When gene action was strictly additive, the two methods (each fitting $k = 100$ loci) had the same accuracy, indicating that RKHS performed well even when the parametric assumptions were valid. On the other hand, when inheritance was purely additive $\times$ additive, the parametric MLR was clearly outperformed by RKHS, irrespective of the number of loci fitted. An exception occurred at $k = 100$ and $N = 1000$; here, the two methods were completely inaccurate. However, when $N$ was increased from 1000 to 5000, the accuracy of RKHS jumped to 0.85 (Table 1), whereas MLR remained inaccurate (Table 2). Note that the accuracy of RKHS decreased (with $N$ held at 1000) when $k$ increased. We attribute this behavior to the use of a Gaussian kernel when, in fact, the covariates are discrete.
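The grid search over $(h, \sigma_e^2/\sigma_\alpha^2)$ described here can be sketched compactly. This is an illustration only, with simulated data and names of our choosing; the intercept-plus-kernel model matches the example's $\hat{g}_i = \hat{\alpha}_0 + \mathbf{t}_i'(h)\hat{\alpha}$ (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

def msep_grid_search(y, X, g_true, h_grid, lam_grid):
    """Grid search over bandwidth h and ratio lambda = s2e/s2alpha that
    minimize the MSEP criterion (42), sum_i (g_i - g_hat_i)^2, for the
    intercept-plus-kernel model g_hat = b0 + T(h) alpha."""
    n = len(y)
    best = (np.inf, None, None)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    for h in h_grid:
        T = np.exp(-d2 / h)
        M = np.hstack([np.ones((n, 1)), T])       # [1 | T(h)]
        for lam in lam_grid:
            # ridge-type solution: penalize alpha but not the intercept
            P = np.zeros((n + 1, n + 1)); P[1:, 1:] = np.eye(n) * lam
            sol = np.linalg.solve(M.T @ M + P, M.T @ y)
            msep = ((g_true - M @ sol) ** 2).sum()
            if msep < best[0]:
                best = (msep, h, lam)
    return best
```

In practice $g_i$ is unknown, so (42) would be evaluated on simulated data (as here) or replaced by a cross-validation criterion on phenotypes.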
CONCLUSION
This article discusses approaches for prediction of genetic value using markers for the entire genome, such as SNPs, and phenotypic measurements for complex traits. In particular, theory for non-parametric and semi-parametric procedures, i.e., kernel regression and reproducing kernel Hilbert spaces regression, is developed. The methods consist of a combination of features of the classical additive genetic model of quantitative genetics with an unknown function of SNP genotypes, which is inferred non-parametrically. Mixed model and Bayesian implementations are presented. The procedures can be computed using software developed by animal breeders for likelihood-based and Bayesian analysis, after modifications.

Except for the parametric part of the model, that is, the standard normality assumption for additive genetic values and model residuals, the procedures attempt to circumvent potential difficulties posed by violation of assumptions required for an orthogonal decomposition of genetic variance stemming from SNP genotypes (COCKERHAM 1954; KEMPTHORNE 1954). Our expectation is that the non-parametric function of marker genotypes, $g(\mathbf{x})$, captures all possible forms of interaction, but without explicit modelling. The procedures should be particularly useful for highly dimensional regression, including the situation in which the number of SNP variables ($p$) amply exceeds the number of data points ($n$). Instead of performing a selection of a few "significant" markers based on some ad hoc method, information on all molecular polymorphisms is employed, irrespective of the degree of colinearity. This is because the procedures should be insensitive to difficulties caused by colinearity, given the forms of the estimators, e.g., (6) or (23). It is assumed that the assignment of genotypes to individuals is unambiguous, i.e., $\mathbf{x}$ gives the genotypes for SNP markers free of error. Our methods share the spirit of those of MEUWISSEN et al. (2001), GIANOLA et al. (2003), XU et al. (2003), TER BRAAK et al. (2005), YI et al. (2003), WANG et al. (2005) and ZHANG and XU (2005), but without making strong assumptions about the form of the marker-phenotype relationship, which is assumed linear by all these authors, and without invoking parametric distributions for pertinent effects. A hypothetical example was presented, illustrating the potential and computational feasibility of at least the RKHS procedure; a standard additive genetic model was outperformed by RKHS when additive $\times$ additive gene action was simulated.

Comparisons between parametric and non-parametric procedures are needed. It is unlikely that computer simulations would shed much light in this respect. First, there is a huge number of simulation scenarios that can be envisaged, resulting from limitless combinations of parameter values, numbers of markers, marker effects, residual distributions, etc.
Secondly, simulations tend to have "local" value only; that is, conclusions are tentative only within the limited experimental region explored, and are heavily dependent on the state of nature assumed. Thirdly, end-points and gauges of a simulation tend to be arbitrary. For instance, should frequentist procedures be assessed on Bayesian grounds and vice versa? We believe that studies based on predictive cross-validation for a range of traits and species are perhaps more fruitful. These studies will be conducted once adequate and reliable phenotypic-SNP data sets become more widely available.

There are at least two difficulties with the proposed methodology. As noted above, it is assumed that the SNP genotypes are constructed without error, which is seldom the case. In order to solve this problem, one would need to build an error-in-variables model, but at the expense of introducing additional assumptions. A second difficulty is posed by the fact that many individuals will lack SNP information, at least in animal breeding. Earlier, we presented approximate procedures based on the assumption that missingness of SNP data is ignorable, such that the effect of $g(\mathbf{x})$ can be absorbed into a residual that has a different variance from that peculiar to individuals possessing SNP data, or into the additive genetic value in a two-trait implementation. A more appropriate treatment of missing data requires imputation of genotypes for individuals lacking SNP information. If the SNP data are missing completely at random or at random, the solutions to system (29), after augmentation with the missing $\mathbf{T}_1(h)$, would give the means of the posterior distributions of $\beta$, $\mathbf{u}$ and $\alpha$, conditionally on the variance components and on $h$ (GIANOLA and FERNANDO 1986; SORENSEN et al. 2001). However, the sampling procedure should address the constraint that SNP genotypes of related individuals must be more alike than those of unrelated subjects. In our treatment, and for operational reasons, we adopted the simplifying assumption that SNP genotypes are independently distributed. This may be anticonservative, and could lead to some bias if SNP data are not missing completely at random.

It is unknown to what extent our procedures are robust with respect to the choice of kernel function. SILVERMAN (1986) discusses several options and, in the context of univariate density estimation, concludes that different kernels differ little in mean squared error. Also, an inadequate specification of the smoothing parameter $h$ may affect inference adversely. In this respect, the procedures discussed in REPRODUCING KERNEL HILBERT SPACES MIXED MODEL provide for automatic specification of the bandwidth parameter. Here, one could either make an analysis conditionally on the "most likely" value of $h$ or, alternatively, average over its posterior distribution, to take uncertainty into account fully.
Here, we have focused on using a continuous kernel, primarily to exploit differentiability properties. However, the vector of markers, $\mathbf{x}$, has a discrete distribution, and careful investigation is needed in order to assess the adequacy of such an approximation.

A procedure to accommodate missing genotypes in kernel regression is as follows. Replace $g(\mathbf{x}_i)$ in (19) by
$$
g(\mathbf{x}_i) = g(\hat{\mathbf{x}}_i) + \left[g(\mathbf{x}_i) - g(\hat{\mathbf{x}}_i)\right] = g(\hat{\mathbf{x}}_i) + q_i, \tag{43}
$$
where $\hat{\mathbf{x}}_i$ is the conditional expectation of $\mathbf{x}_i$ given all the observed genotypes in the pedigree, and $q_i = g(\mathbf{x}_i) - g(\hat{\mathbf{x}}_i)$. The conditional expectation in (43) is a function of the conditional probabilities of the missing genotypes given the observed genotypes. Much research has been devoted to computing such probabilities from complex pedigrees using either approximations (VAN ARENDONK et al. 1989; FERNANDO et al. 1993; KERR and KINGHORN 1996) or MCMC samplers (SHEEHAN and THOMAS 1993; JENSEN et al. 1995; JENSEN and KONG 1999; FERNANDEZ et al. 2002; STRICKER et al. 2002). When genotypes for individual $i$ are observed, $\hat{\mathbf{x}}_i = \mathbf{x}_i$, and $q_i$ is null. When genotypes for $i$ are missing, $g(\hat{\mathbf{x}}_i)$ is an approximation to the conditional expectation of $g(\mathbf{x}_i)$ given the observed genotypes, and $q_i$ is a random variable with approximately null expectation. Now, the genotypic value of an individual can be written as
$$
g(\mathbf{x}_i) + u_i = g(\hat{\mathbf{x}}_i) + q_i + u_i, \tag{44}
$$
where, from the linear approximation given by (9), the variance of $q_i$ is
$$
\mathrm{Var}(q_i) = \dot{g}(\mathbf{x}^*)'\,\mathbf{V}_i\,\dot{g}(\mathbf{x}^*), \tag{45}
$$
with $\mathbf{V}_i$ being the conditional covariance matrix of $\mathbf{x}_i$ given the observed genotypes of relatives. The $j$th diagonal element of $\mathbf{V}_i$ is a function of the conditional probabilities of the missing genotypes at locus $j$ given the observed genotypes, and the $jk$th off-diagonal element is a function of the conditional probabilities of the missing genotypes at loci $j$ and $k$ given the observed genotypes. An estimate of $\mathrm{Var}(q_i)$ can be obtained by replacing $\dot{g}(\mathbf{x}^*)$ by its estimate in (45).

In the simplest approach to modelling the covariances of the $q_i$, the random effects $u_i$ and $q_i$ are combined as $a_i = u_i + q_i$. Now, the model for $y_i$ is written as
$$
y_i = \mathbf{w}_i'\beta + g(\hat{\mathbf{x}}_i) + \mathbf{z}_i'\mathbf{a} + e_i. \tag{46}
$$
An approximation to the covariance matrix $\Sigma_a$ of $\mathbf{a}$ can be obtained by using the usual tabular algorithm that is based only on pedigree information, after setting the $i$th diagonal element of $\Sigma_a$ to $\mathrm{Var}(u_i) + \mathrm{Var}(q_i)$. The inverse of this approximate $\Sigma_a$ is sparse and can be computed efficiently (HENDERSON 1976).
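The tabular algorithm referred to here is the standard recursive construction of the additive relationship matrix from a pedigree. A compact sketch (numpy assumed; the parent-coding convention, with `-1` for an unknown parent and parents listed before offspring, is ours):

```python
import numpy as np

def tabular_relationship(sires, dams):
    """Additive relationship matrix A by the tabular method.
    sires[i], dams[i] give parent indices of individual i (-1 if unknown);
    parents must precede their offspring in the ordering."""
    n = len(sires)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sires[i], dams[i]
        # diagonal: 1 + half the relationship between the parents (inbreeding)
        a_sd = A[s, d] if s >= 0 and d >= 0 else 0.0
        A[i, i] = 1.0 + 0.5 * a_sd
        # off-diagonals: average of relationships with the two parents
        for j in range(i):
            a = 0.0
            if s >= 0:
                a += 0.5 * A[j, s]
            if d >= 0:
                a += 0.5 * A[j, d]
            A[i, j] = A[j, i] = a
    return A
```

The diagonal produced this way would then be overwritten with $\mathrm{Var}(u_i) + \mathrm{Var}(q_i)$, as described in the text, before inversion.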
Although $q_i$ is nonnull only for individuals with missing genotypes, as discussed below, the observed genotypes on relatives can provide information on the segregation of alleles in individuals with missing genotypes. Thus, observed genotypes can provide information on covariances between the $q_i$. Let $q_{ij}$ denote the additive genotypic value at locus $j$, which can be written as
$$
q_{ij} = v_{ij}^m + v_{ij}^p,
$$
where $v_{ij}^m$ and $v_{ij}^p$ are the gametic values of haplotypes $h_{ij}^m$ and $h_{ij}^p$. When genotype information is available at a locus, it is more convenient to work with the gametic values $v_{ij}^m$ and $v_{ij}^p$ rather than the genotypic values $q_{ij}$. For an individual with missing genotypes, the variance of $v_{ij}^x$ can be written as
$$
\mathrm{Var}(v_{ij}^x) = \sum_{h_{ij}^x \neq h_{ij}^{x'}} \Pr(h_{ij}^x)\Pr(h_{ij}^{x'})\left(\alpha_{h_{ij}^x} - \alpha_{h_{ij}^{x'}}\right)^2, \tag{47}
$$
for $x = m$ or $p$, where $\alpha_{h_{ij}^m}$ is the additive effect of haplotype $h_{ij}^m$. These additive effects can be estimated from the linear function given by (8) as follows. Let $\mathbf{x}_{h_j}$ denote the value of $\mathbf{x}$ with all elements set to their mean values, except at locus $j$, where one of the haplotypes is set to $h_j$. Then,
$$
\hat{\alpha}_{h_j} = \hat{g}(\mathbf{x}^*)'(\mathbf{x}_{h_j} - \mathbf{x}^*).
$$
The covariance between $v_{ij}^x$ and $v_{i'j}^m$, for example, can be written as
$$
\mathrm{Cov}(v_{ij}^x, v_{i'j}^m) = \mathrm{Cov}(v_{ij}^x, v_{dj}^m)\Pr(h_{i'j}^m \leftarrow h_{dj}^m) + \mathrm{Cov}(v_{ij}^x, v_{dj}^p)\Pr(h_{i'j}^m \leftarrow h_{dj}^p), \tag{48}
$$
where $\Pr(h_{i'j}^m \leftarrow h_{dj}^m)$ is the probability that the maternal haplotype of $i'$ is inherited from the maternal haplotype of its dam $d$. Suppose genotype information is available on the ancestors of $d$ and on the descendants of $i'$. Then, the segregation probabilities in (48) may be different from 0.5, and thus the observed genotypes will contribute information for genetic covariances at this locus. However, even in this situation, the inverse of the gametic covariance matrix $\mathbf{G}_j$ for locus $j$ that is constructed using (47) and (48) is sparse and can be computed efficiently (FERNANDO and GROSSMAN 1989).

For an improved approximation to the modelling of covariances between the $q_i$, the model for $y_i$ is written as
$$
y_i = \mathbf{w}_i'\beta + g(\hat{\mathbf{x}}_i) + \sum_{j \in L} \mathbf{k}_j'\mathbf{v}_j + \mathbf{z}_i'\mathbf{a} + e_i, \tag{49}
$$
where $L$ is the set of $k$ loci with the largest effects on the trait. The covariance matrix for the vector $\mathbf{v}_j$ of gametic values can be computed using (47) and (48). The usual tabular method based on pedigree information is used to compute $\Sigma_a$, after setting the $i$th diagonal element to
$$
\mathrm{Var}(a_i) - \sum_{j \in L}\left[\mathrm{Var}(v_{ij}^m) + \mathrm{Var}(v_{ij}^p)\right]. \tag{50}
$$
Our models extend directly to binary or ordered categorical responses when a threshold-liability model holds (WRIGHT 1934; FALCONER 1965; GIANOLA 1982; GIANOLA and FOULLEY 1983). Here, the $g(\mathbf{x})$ function would be viewed as affecting a latent variable; MALLICK et al. (2005) used this idea in analysis of gene expression measurements. Extension to multivariate responses is less straightforward. It is conceivable that each trait may require a different function of SNP genotypes. It is not obvious how this problem should be dealt with without making strong parametric assumptions.

In conclusion, we believe this paper gives a first description of non-parametric and semi-parametric procedures that may be suitable for prediction of genetic value using dense marker data. However, considerable research is required for tuning, extending and validating some of the ideas presented here.
Miguel A. Toro, Bani Mallick and two anonymous reviewers are thanked for useful comments. Research was completed while the senior author was visiting Parco Tecnologico Padano, Lodi, Italy. Support by the Wisconsin Agriculture Experiment Station, and by grants NRICGP/USDA 2003-35205-12833, NSF DEB-0089742 and NSF DMS-044371, is acknowledged.
LITERATURE CITED
471 472 473 474
477 478 479 480 481 482 483 484 485 486
AITCHISON, J., and C. G. G. AITKEN, 1976 Multivariate binary discrimination by the kernel method. Biometrika 63: 413-420. CHANG, H. L. A., 1988 Studies on estimation of genetic variances under nonadditive gene action. Ph. D. Thesis, University of Illinois at UrbanaChampaign. CHU, C. K., and J. S. MARRON, 1991 Choosing a kernel regression estimator. Statistical Science 6: 404-436. COCKERHAM, C. C., 1954 An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39: 859-882. 33
487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524
DEKKERS, J. C. M., and HOSPITAL, F., 2002 The use of molecular genetics in the improvement of agricultural populations. Nat. Rev. Genet. 3: 22-32. D’HAESELEER, P., S. LIANG, and R. SOMOGYI, 2000 Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16: 707-726. FALCONER, D. S., 1960 Introduction to Quantitative Genetics. Longman, London. FALCONER, D. S., 1965 The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics 29: 51–76. FERNÁNDEZ, S. A., R. L FERNANDO, B. GULBRANDTSEN, C. STRICKER, M. SCHELLING, and A. L. CARRIQUIRY, 2002 Irreducibility and e¢ ciency of ESIP to sample marker genotypes in large pedigrees with loops. Genet. Sel. Evol. 34: 537–555. FERNANDO, R. L. and M. GROSSMAN, 1989 Marker assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 21: 467-477. FERNANDO, R. L., C. STRICKER, and R. C. ELSTON, 1993 An ef…cient algorithm to compute the posterior genotypic distribution for every member of a pedigree without loops. Theor. Appl. Genet. 87: 89-93. FISHER, R. A., 1918 The correlation between relatives on the supposition of Mendelian inheritance. Trans. Roy. Soc. Edinburgh 52: 399-433. FOX, J., 2002 An R and S-PLUS companion to Applied Regression. Sage, Thousand Oaks. FOX, J., 2005 Introduction to Nonparametric Regression. Lecture Notes. http://socserv.mcmaster.ca/jfox/Courses/Oxford. GIANOLA, D., 1982 Theory and analysis of threshold characters. J. Anim. Sci. 54: 1079-1096. GIANOLA, D. and R. L. FERNANDO, 1986 Bayesian methods in animal breeding theory. J. Anim. Sci. 63: 217-244. GIANOLA D., and J. L. FOULLEY, 1983 Sire evaluation for ordered categorical data with a threshold model. Genet. Sel. Evol. 15: 201 -244. GIANOLA, D., M. PEREZ-ENCISO, and M. A. TORO, 2003 On markerassisted prediction of genetic value: beyond the ridge. Genetics 163: 347365. 
GIANOLA,D., and SORENSEN, D., 2004 Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. 34
525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562
Genetics 167: 1407-1424. GOLUB, T. R., D. SLONIM D., P. TAMAYO, C. HUARD, M. GASENBEEK, J. MESIRO, H. COLLER, M. LOH, J. DOWNING, M. CALIGIURI, C. BLOOMFIELD, and E. LANDER, 1999 Molecular classi…cation of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537. GULISIJA, D., D. GIANOLA, and K. A. Weigel, 2006 Nonparametric analysis of the relationship between performance and inbreeding in Jersey cows. J. Dairy. Sci. (submitted). HART J. D., and C. L. LEE, 2005 Robustness of one-sided cross-validation to autocorrelation. Journal of Multivariate Analysis 92: 77-96. HASTIE T. J., and R. J. TIBSHIRANI, 1990 Generalized Additive Models. Chapman and Hall, London. HASTIE, T., R. TIBSHIRANI, and J. FRIEDMAN, 2001 The Elements of Statistical Learning. Springer, New York. HAYES, B., J. LAERDAHL, D. LIEN, A. ADZHUBEI, and B. H?YHEIM, 2004 Large scale discovery of single nucleotide polymorphism (SNP) markers in Atlantic Salmon (Salmo salar). AKVAFORSK, Institute of Aquaculture Research. www.mabit.no/pdf/hayes.pdf HENDERSON, C. R., 1973 Sire evaluation and genetic trends, pp. 10-41 in Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. Jay L. Lush. American Society of Animal Science and American Dairy Science Association. Champaign, Illinois. HENDERSON, C. R., 1974 General ‡exibility of linear model techniques for sire evaluation. J. Dairy Sci. 57: 963-972. HENDERSON, C. R., 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics 31: 423-447. HENDERSON, C. R., 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32: 69-83. HENDERSON, C. R. 1984. Applications of linear models in animal breeding. University of Guelph, Guelph. IM, S., R. L. FERNANDO, and D. GIANOLA, 1989 Likelihood inferences in animal breeding under selection: A missing data theory viewpoint. Genet. Sel. Evol. 21: 399-414. JENSEN, C. 
S., and A. KONG, 1999 Blocking Gibbs sampling for linkage analysis in large pedigrees with many loops. Am. J. Hum. Genet. 65: 885901. 35
563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600
JENSEN, C. S., A. KONG, and U. KJAERULFF, 1995 Blocking Gibbs sampling in very large probabilistic expert systems. Int. J. Hum. Comp. Stud. 42: 647-666.
KEMPTHORNE, O., 1954 The correlation between relatives in a random mating population. Proc. R. Soc. London B 143: 103-113.
KERR, R. J., and B. P. KINGHORN, 1996 An efficient algorithm for segregation analysis in large populations. J. Anim. Breed. Genet. 113: 457-469.
KIMELDORF, G., and G. WAHBA, 1971 Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33: 82-95.
KRUUK, L. E. B., 2004 Estimating genetic parameters in natural populations using the 'animal model'. Philosophical Transactions of the Royal Society of London B 359: 873-890.
LINDLEY, D. V., and A. F. M. SMITH, 1972 Bayes estimates for the linear model. J. Roy. Statist. Soc. B 34: 1-41.
MALLICK, B. K., D. GHOSH, and M. GHOSH, 2005 Bayesian classification of tumours by using gene expression data. Journal of the Royal Statistical Society B 67: 219-234.
MARRON, J. S., 1988 Automatic smoothing parameter selection: a survey. Empirical Economics 13: 187-208.
METROPOLIS, N., A. W. ROSENBLUTH, M. N. ROSENBLUTH, A. H. TELLER, and E. TELLER, 1953 Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087-1091.
MEUWISSEN, T. H. E., B. J. HAYES, and M. E. GODDARD, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819-1829.
NADARAYA, E. A., 1964 On estimating regression. Theory of Probability and its Applications 9: 141-142.
PATTERSON, H. D., and R. THOMPSON, 1971 Recovery of inter-block information when block sizes are unequal. Biometrika 58: 545-554.
QUAAS, R. L., and E. J. POLLAK, 1980 Mixed model methodology for farm and ranch beef cattle testing programs. J. Anim. Sci. 51: 1277-1287.
RACINE, J., and Q. LI, 2004 Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics 119: 99-130.
RUBIN, D. B., 1976 Inference and missing data. Biometrika 63: 581-592.
RUPPERT, D., M. P. WAND, and R. J. CARROLL, 2003 Semiparametric Regression. Cambridge University Press, Cambridge.
SCHUCANY, W. R., 2004 Kernel smoothers: an overview of curve estimators for the first graduate course in nonparametric statistics. Statistical Science 19: 663-675.
SHEEHAN, N., and A. THOMAS, 1993 On the irreducibility of a Markov chain defined on a space of genotype configurations by a sampling scheme. Biometrics 49: 163-175.
SILVERMAN, B. W., 1986 Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
SOLLER, M., and J. BECKMANN, 1982 Restriction fragment length polymorphisms and genetic improvement. Proc. 2nd World Congr. Genet. Appl. Livestock Prod. 6: 396-404.
SORENSEN, D., R. L. FERNANDO, and D. GIANOLA, 2001 Inferring the trajectory of genetic variance in the course of artificial selection. Gen. Res. 77: 83-94.
SORENSEN, D., and D. GIANOLA, 2002 Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York.
SORENSEN, D. A., and B. W. KENNEDY, 1983 The use of the relationship matrix to account for genetic drift variance in the analysis of genetic experiments. Theor. Appl. Genet. 66: 217-220.
STRICKER, C., M. SCHELLING, F. DU, I. HOESCHELE, S. A. FERNANDEZ, and R. L. FERNANDO, 2002 A comparison of efficient genotype samplers for complex pedigrees and multiple linked loci. In: 7th World Congress of Genetics Applied to Livestock Production. CD-ROM Communication No. 21-12.
TER BRAAK, C. J. F., M. BOER, and M. BINK, 2005 Extending Xu's (2003) Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics 170: 1435-1438.
VAN ARENDONK, J. A. M., C. SMITH, and B. W. KENNEDY, 1989 Method to estimate genotype probabilities at individual loci in farm livestock. Theor. Appl. Genet. 78: 735-740.
WAHBA, G., 1990 Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia.
WAHBA, G., 1999 Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV, pp. 68-88 in Advances in Kernel Methods, edited by B. SCHÖLKOPF, C. BURGES and A. SMOLA. MIT Press, Cambridge.
WANG, C. S., J. J. RUTLEDGE, and D. GIANOLA, 1993 Marginal inferences about variance components in a mixed linear model using Gibbs sampling. Genet. Sel. Evol. 25: 41-62.
WANG, C. S., J. J. RUTLEDGE, and D. GIANOLA, 1994 Bayesian analysis of mixed linear models via Gibbs sampling with an application to litter size in Iberian pigs. Genet. Sel. Evol. 26: 91-115.
WANG, H., Y. M. ZHANG, X. LI, G. MASINDE, S. MOHAN, D. J. BAYLINK, and S. XU, 2005 Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 170: 465-480.
WATSON, G. S., 1964 Smooth regression analysis. Sankhyā A 26: 359-372.
WONG et al., 2004 A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature 432: 717-722.
WRIGHT, S., 1934 The results of crosses between inbred strains of guinea pigs, differing in number of digits. Genetics 19: 537-551.
XU, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789-801.
YI, N., V. GEORGE, and D. B. ALLISON, 2003 Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics 164: 1129-1138.
ZHANG, Y., and S. XU, 2005 A penalized maximum-likelihood method for estimating epistatic effects of QTL. Heredity 95: 96-104.
TABLE 1

Accuracy of predicting genotypic values using the RKHS mixed model. N is the sample size, k is the number of loci fitted in the model, including the QTL, h is the bandwidth, and σ²_e/σ² is the ratio between environmental and "RKHS" variance.

Gene action     N     k    σ²_e/σ²   h        Accuracy
Additive        1000  100  8920      0.0005   0.95
Non-additive    1000  10   47        0.01     0.96
Non-additive    1000  25   54        0.01     0.81
Non-additive    1000  50   98        0.01     0.50
Non-additive    1000  100  436       0.0005   0.00
Non-additive    5000  100  300       0.0005   0.85
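The "accuracy" reported in the table is the correlation between predicted and true genotypic values. The following is a minimal sketch of that computation using a Nadaraya-Watson kernel smoother with a Gaussian kernel over marker genotypes, one of the nonparametric devices underlying the kernel regression machinery; the simulated toy data, the `nw_predict` helper, and the chosen bandwidth are illustrative assumptions, not the paper's simulation design or its full RKHS mixed model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (illustrative only): N individuals, k biallelic loci coded 0/1/2,
# purely additive genotypic values plus environmental noise.
N, k = 200, 10
X = rng.integers(0, 3, size=(N, k)).astype(float)
beta = rng.normal(size=k)
g = X @ beta                            # true genotypic value
y = g + rng.normal(scale=1.0, size=N)   # phenotype = genotype + environment

def nw_predict(X_train, y_train, X_new, h):
    """Nadaraya-Watson regression: kernel-weighted average of phenotypes,
    with a Gaussian kernel of bandwidth h on marker-genotype distances."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * h ** 2))
    return (w @ y_train) / w.sum(axis=1)

g_hat = nw_predict(X, y, X, h=2.0)

# "Accuracy" in the sense of Tables 1-2: corr(predicted, true genotypic value)
accuracy = np.corrcoef(g_hat, g)[0, 1]
print(round(accuracy, 2))
```

Shrinking h makes the smoother interpolate the phenotypes (fitting noise); enlarging it flattens the fit toward the phenotypic mean, which is the bandwidth trade-off the table's h column reflects.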
TABLE 2

Accuracy of predicting genotypic values using the MLR mixed model. N is the sample size, k is the number of loci fitted in the model, including the QTL, and σ²_e/σ² is the ratio between the environmental variance and that "due to" linear regression.

Gene action     N     k    σ²_e/σ²   Accuracy
Additive        1000  100  32        0.95
Non-additive    1000  10   208       0.00
Non-additive    1000  25   748       0.18
Non-additive    1000  50   760       0.00
Non-additive    1000  100  328       0.00
Non-additive    5000  100  3400      0.00
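The pattern in the table (high accuracy under additive gene action only) can be reproduced in miniature by regressing phenotypes linearly on marker codes. The sketch below uses ordinary least squares as a stand-in for the marker-based linear regression (MLR) component; the toy simulation and variable names are assumptions for illustration, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (illustrative only): additive gene action, so a linear
# regression on marker codes can capture the genotypic values.
N, k = 200, 10
X = rng.integers(0, 3, size=(N, k)).astype(float)
beta = rng.normal(size=k)
g = X @ beta                            # true genotypic value
y = g + rng.normal(scale=1.0, size=N)   # phenotype = genotype + environment

# Ordinary least-squares regression of phenotype on intercept + marker codes
Z = np.column_stack([np.ones(N), X])
b_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
g_hat = Z @ b_hat

# "Accuracy" as corr(predicted, true genotypic value)
accuracy = np.corrcoef(g_hat, g)[0, 1]
print(round(accuracy, 2))
```

Replacing the additive map `X @ beta` with an interaction of loci (e.g. a product of marker codes) would make the same linear fit collapse, mirroring the near-zero accuracies in the non-additive rows.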
FIGURE 1.-Impact of window width on the regression function of log-income on age (CHU and MARRON 1991).
FIGURE 2.-LOESS curves of protein yield deviations (EBLUP) in Jersey cows against their inbreeding coefficients (F, %). The thick curve is the fitted regression (second-degree local polynomial, spanning parameter = 0.9); the other curves are 100 bootstrap replicates.