Aquaculture 420–421 (2014) S8–S14
Implementation and accuracy of genomic selection Jeremy F. Taylor ⁎ Division of Animal Sciences, University of Missouri, Columbia, MO 65211 USA
Article info
Article history: Received 1 August 2012; Accepted 17 February 2013; Available online 24 February 2013
Keywords: Genomic selection; Genomic estimated breeding values; Single nucleotide polymorphisms; Genomic relationship matrix; Accuracy
Abstract
Genomic selection is emerging as a powerful tool for the estimation of breeding values in plant and animal breeding. While many analytical approaches have been proposed for the joint estimation of high-density single nucleotide polymorphism (SNP) effects, within the framework of best linear unbiased estimation, genomic selection is equivalent to the prediction of breeding values for individuals with no phenotypes, for which the theoretical solution was first published in 1974. Genomic selection simply replaces the pedigree-derived numerator relationship matrix with the marker-derived realized genomic relationship matrix, an approach first proposed in 1997. The advance facilitated by the availability of high-density SNP genotypes is the ability to precisely estimate realized relationship coefficients among individuals regardless of the availability of pedigree information or the history of selection that has been applied to the population. However, genomic relationship coefficients are usually estimated assuming the independence of SNP genotypes, thus ignoring the effects of linkage disequilibrium, and the utilized SNPs are invariably ascertained to be common variants within the species' genome, which leads to the overestimation of relationship coefficients. The accuracy of the produced genomic estimated breeding values (GEBV) is often evaluated using variously formed validation populations incorporating individuals with genotypes and phenotypes that were not used for the estimation of SNP effects in the training population. However, GEBV accuracies are shown here to be a function of the accuracy of training population GEBV and the magnitudes of genomic relationships between individuals in the training and validation populations.
Consequently, genomic selection is ideally suited to populations in which highly accurate GEBV are available for training population individuals and whose marker-selected progeny go on to produce phenotypes and reenter the training population which then becomes dynamic. Conversely, genomic selection is not well suited to the identification of elite individuals within families that have not historically contributed to breeding programs, to static training populations, or to training and implementation in distantly related populations. Thus, the implementation of genomic selection for costly or difficult to measure phenotypes such as feed efficiency or disease resistance will require the periodic regeneration of phenotyped populations for the retraining of GEBV prediction equations or the identification of the causal variants which underlie variation in these traits. The exponentially reducing cost of whole genome resequencing may soon allow the identification of at least the large effect variants. © 2013 Elsevier B.V. Open access under CC BY-NC-ND license.
⁎ Tel.: +1 573 884 4946; fax: +1 573 882 6827. 0044-8486 © 2013 Elsevier B.V. Open access under CC BY-NC-ND license. http://dx.doi.org/10.1016/j.aquaculture.2013.02.017
1. Introduction
Genomic selection (GS) was first proposed by Meuwissen et al. (2001) as a method for the prediction of breeding values of individuals without phenotypes but that had been genotyped with a high-density marker panel. The approach is based upon the simultaneous estimation of allele substitution effects (ASE) for each of the markers using linear or non-linear Bayesian models applied to phenotypes or estimated breeding values (EBV) available on genotyped individuals comprising a training population, the determination of the accuracy of the derived prediction equations in an independent validation population and application of the prediction equations to generate genomic estimated breeding values (GEBV) in selection
candidates within an implementation population. The term training population arises from the idea that some form of model is “trained” on genotypes and phenotypes to produce estimates of ASE and GEBV. The purpose of the validation step is to use phenotypes available on an independent set of genotyped individuals to those used in the training population to produce an estimate of the accuracy of the GEBV that will be generated for the selection candidates. Consequently, the individuals sampled to form the validation population should be representative of the selection candidates in the sense that the accuracies of GEBV produced for the validation population should reflect the accuracies of GEBV produced for the selection candidates in the implementation population. Fig. 1 shows the purpose of each of the populations and illustrates the difference between static and dynamic training populations. It also demonstrates the limited utility of a validation population for the estimation of GEBV accuracy when the training population is static, since the relatedness of individuals
J.F. Taylor / Aquaculture 420–421 (2014) S8–S14
Fig. 1. A model for the implementation of genomic selection indicating the purposes to which the individuals within the training, validation and implementation populations are used. A static training population does not involve the incorporation of new animals; however, if phenotypes are collected on selection candidates after their selection they may be added to the training population and allele substitution effects reestimated at each cycle of selection.
in the implementation population to individuals in the training population decreases with each cycle of selection. In December 2007, the BovineSNP50 (Illumina, San Diego, CA; Matukumalli et al., 2009) assay comprising 54,001 bovine single nucleotide polymorphisms (SNP) became available and by 2009 GS had been implemented within the U.S. dairy industry (VanRaden et al., 2009). By this time, it was also recognized that ASE estimated from random effects models could be used for genome wide association studies (GWAS) for the detection of quantitative trait loci (QTL) (Cole et al., 2009; Kang et al., 2010; Yang et al., 2011) and that the use of the genomic relationship matrix provides protection against the effects of pedigree-based stratification in the estimation of ASE (Kang et al., 2010). However, when SNP ASE are jointly estimated as random effects, the approach provides an estimate of the proportion of variation jointly explained by the markers rather than providing individual tests of marker effects. Tests of individual marker effects can be produced by individually including each of the markers in the model as a fixed effect (e.g., Kang et al., 2010; Schulman et al., 2011); however, this overestimates the effect of each marker which cannot then be combined to produce GEBV. More recently, Illumina has released the 778 K SNP BovineHD and Affymetrix has released the 640 K SNP Axiom BOS 1 (Affymetrix, Santa Clara, CA) assays which have allowed an increased precision for fine-mapping QTL, but surprisingly, only a small increase in the accuracy of estimates of genomic relationship coefficients and GEBV (Erbe et al., 2012). The small increase in GEBV accuracies that occurs with a 15-fold increase in the number of genotyped SNP suggests that many traits possess an underlying genetic architecture involving rare QTL variants that are not detected by the common SNP included in the commercial genotyping assays. 
By definition, this means that the genomic relationship matrices estimated using these assays (VanRaden, 2008) must produce biased estimates of the realized genomic identity by descent among individuals. Recent work has focused on theoretical (Goddard, 2009) and empirical evaluations (Hayes et al., 2009; Luan et al., 2009; Su et al.,
2010; VanRaden et al., 2009) of the accuracy of estimated GEBV and Habier et al. (2010) have empirically shown that accuracies decrease as the number of generations which separate training and validation datasets increases. However, results which formally express the accuracy of GEBV in selection candidates in terms of their relatedness to training population individuals and the accuracy of training population GEBV have yet to be presented. In this paper, results are presented for the implementation of GS under the framework of best linear unbiased prediction (BLUP) including the equivalence of models for the estimation of GEBV and ASE in training populations and for simultaneous training and estimation of GEBV among selection candidates. Also presented are the sampling variances and covariances from which the accuracy of estimates of ASE and GEBV in selection candidates may be derived.
2. Equivalent models for training in genomic selection
2.1. The genomic relationship matrix
Consider biallelic autosomal loci i = 1,…,L with alleles $A_i$ and $B_i$ which have population frequencies $F(A_i) = p_i$ and $F(B_i) = q_i = 1 - p_i$. Assume that genotypes $A_iA_i$, $A_iB_i$ and $B_iB_i$ have genotypic values $a_i$, $d_i$ and $-a_i$, respectively, and that genotype frequencies are in Hardy-Weinberg equilibrium. Under panmixia, the breeding value $u_i$ associated with each genotype at the ith locus is (e.g., p. 121 in Falconer, 1960):

$$u_i = \begin{cases} 2q_i\alpha_i & \text{for genotype } A_iA_i \\ (q_i - p_i)\alpha_i & \text{for genotype } A_iB_i \\ -2p_i\alpha_i & \text{for genotype } B_iB_i \end{cases} \qquad (1)$$

where $\alpha_i = a_i + d_i(q_i - p_i)$ is the ASE at the ith locus.
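The coding in Eq. (1) can be checked numerically. The following is a minimal sketch, not part of the original derivation; the function names and the values of a, d and p are hypothetical. It computes the ASE and the three genotype breeding values at one locus and confirms that, under Hardy-Weinberg proportions, the breeding values have mean zero and variance $2p_iq_i\alpha_i^2$.

```python
def allele_substitution_effect(a, d, p):
    """ASE from Eq. (1): alpha = a + d(q - p), with q = 1 - p."""
    q = 1.0 - p
    return a + d * (q - p)


def breeding_values(a, d, p):
    """Breeding value of each genotype at one biallelic locus (Eq. 1)."""
    q = 1.0 - p
    alpha = allele_substitution_effect(a, d, p)
    return {"AA": 2 * q * alpha, "AB": (q - p) * alpha, "BB": -2 * p * alpha}


def hwe_frequencies(p):
    """Genotype frequencies under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return {"AA": p * p, "AB": 2 * p * q, "BB": q * q}


# Hypothetical locus: a = 1.0, d = 0.5, frequency of allele A is 0.3.
p = 0.3
alpha = allele_substitution_effect(1.0, 0.5, p)
bv = breeding_values(1.0, 0.5, p)
freq = hwe_frequencies(p)

mean_bv = sum(freq[g] * bv[g] for g in bv)        # expected to be 0
var_bv = sum(freq[g] * bv[g] ** 2 for g in bv)    # expected 2 p q alpha^2
```

The zero mean is what later allows the expectation of the genomic breeding values to be taken as zero.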
Because the markers are in incomplete linkage disequilibrium (LD) with the variants that are causal for their effects on phenotype, the breeding value of an individual can be written as:

$$u = \sum_{i=1}^{L} u_i + \varepsilon = u^* + \varepsilon \qquad (2)$$

where $u^*$ is the genomic breeding value which is dependent on the utilized marker set and $\varepsilon$ is the unexplained component of breeding value due to the incomplete LD between SNP and the set of all causal variants underlying additive genetic variation in the trait. From Eqs. (1) and (2) we may write $u^* = m'\alpha$ where $m'$ is a row vector for which the ith element is $2q_i$, $(q_i - p_i)$, or $-2p_i$ if the individual's genotype at the ith locus is $A_iA_i$, $A_iB_i$, or $B_iB_i$, respectively and $\alpha$ is the vector of SNP ASE. Finally, for a collection of q individuals, the vector $u^*$ of genomic breeding values can be written as:

$$u^* = M\alpha \qquad (3)$$

and breeding values may be partitioned as $u = u^* + \varepsilon$ where M is a q × L matrix containing the vectors $m'$ corresponding to each individual as its rows (VanRaden, 2008) and $\varepsilon$ is a vector of residual breeding values. The parameterization in Eq. (1) causes $E[\alpha] = 0$ and therefore $E[u^*] = 0$ from Eq. (3). Suppose that $\mathrm{Var}(\alpha) = D\sigma_M^2$ where $\sigma_M^2$ is the component of variance associated with ASE and D is a matrix which is diagonal in the absence of LD among the SNP (which is clearly not true) and may be parameterized to allow the ASE to be drawn from distributions with different variances corresponding to, e.g., loci of large or minor effect. When the number of loci far exceeds the number of observations on genotyped individuals, L (or more) unique elements in D are not estimable and frequently the infinite allele model is assumed for which D = I. From Eq. (2):

$$\begin{aligned} \mathrm{Var}(u) &= \mathrm{Var}(u^*) + \mathrm{Var}(\varepsilon) \\ \sigma_A^2 &= \sum_{i=1}^{L} \mathrm{Var}(u_i) + \mathrm{Var}(\varepsilon) \\ &= 2\sum_{i=1}^{L} p_i(1 - p_i)\,\sigma_M^2 + \mathrm{Var}(\varepsilon) \\ &= \sigma_G^2 + \left(\sigma_A^2 - \sigma_G^2\right) \end{aligned}$$

where $\sigma_G^2 = \phi\sigma_M^2 = c\sigma_A^2$ is the amount of additive genetic variance ($\sigma_A^2$) explained by the markers for $0 \le c = \sigma_G^2/\sigma_A^2 \le 1$ and $\phi = 2\sum_{i=1}^{L} p_i(1 - p_i)$ (VanRaden, 2008). From Eq. (3) we have:

$$\mathrm{Var}(u^*) = \mathrm{Var}(M\alpha) = MDM'\sigma_M^2 = G\sigma_G^2 \qquad (4)$$

From Eq. (4), the genomic relationship matrix G is derived as:

$$G = \phi^{-1}MDM' \qquad (5)$$

and from Eq. (5):

$$I = \phi^{-1}G^{-1}MDM'. \qquad (6)$$

This formulation makes it very clear that G is an estimate of the realized genomic relationship matrix among individuals which depends on the sampling scheme for assayed SNP (only random sampling would lead to the expectation that the SNP and QTL allele frequency distributions would be similar), the number of assayed SNP (the larger the sample, the greater the likelihood of finding a SNP in strong LD with each QTL), and the assumption that there is no LD between SNPs (the ASE are assumed uncorrelated). Yang and Tempelman (2012) have shown that accounting for the LD among loci in BayesA and BayesB analyses resulted in increases in GEBV prediction accuracies of up to 3.6%, similar in magnitude to those achieved by Erbe et al. (2012) when the number of genotyped SNP was increased from 39,745 to 624,213. Finally, we now define $\mathrm{Var}(u) = \Phi\sigma_A^2$ where $\Phi$ is the realized identity by descent among the individuals in u. Clearly, $\Phi = \lim_{L \to S} cG$ for S the genome size.

2.2. Equivalent models
Consider now that observations are available for the q animals that have been genotyped for L loci. The mixed linear model relating observations to fixed effects, $\beta$, and genomic breeding values, $u^*$, may be written as:

$$y = X\beta + Zu^* + e \qquad (7)$$

where X and Z are matrices relating observations in y to levels of $\beta$ and $u^*$ for which:

$$E\begin{bmatrix} u^* \\ e \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{and} \quad \mathrm{Var}\begin{bmatrix} u^* \\ e \end{bmatrix} = \begin{bmatrix} G\sigma_G^2 & 0 \\ 0 & R\sigma_E^2 \end{bmatrix}.$$

Because $u = u^* + \varepsilon$ the residual in Eq. (7) is $e = Z\varepsilon + \rho$ where $\rho$ contains the nonadditive genetic and environmental components of phenotype for which $\mathrm{Var}(\rho) = I\sigma_E^2$. $\mathrm{Cov}(u^*, u) = \mathrm{Var}(u^*)$ because $\mathrm{Cov}(u^*, \varepsilon) = 0$ which leads to $\mathrm{Var}(e) = Z(\Phi - cG)Z'\sigma_A^2 + I\sigma_E^2 \to I\sigma_E^2$ as $L \to S$ and $cG \to \Phi$. The mixed model equations corresponding to model (7) are (Henderson, 1973):

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1}\lambda_G \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u}^* \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix} \qquad (8)$$
where $\lambda_G = \sigma_E^2/\sigma_G^2$. When the genomic relationship matrix G is used in place of its expected value A, the numerator relationship matrix, the solutions to Eq. (8) are commonly referred to as genomic best linear unbiased predictions (GBLUP) of genomic breeding values (i.e., GEBV). It has previously been shown (Nejati-Javaremi et al., 1997) that GBLUP of $u^*$ provides estimates that have smaller prediction error variances than BLUP of u employing A. This requires that $\Phi$ is well estimated by G, but also reflects that the realized identity by descent between individuals separated by more than a single meiosis can deviate significantly from expectation. Furthermore, while the estimation of G is influenced by characteristics of the genotyping assay and assumptions regarding LD, the computation of A requires the assumption that the expected value of Mendelian sampling effects (the average value of gametes inherited from parents) is 0 (Quaas et al., 1984). This assumption is violated if selection is practiced within families and data are recorded only for those progeny which inherited more desirable combinations of parental alleles than the average. From Eq. (7) we may also write the equivalent model (Garrick, 2007):

$$y = X\beta + ZM\alpha + e$$

for which the mixed model equations become:

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}ZM \\ M'Z'R^{-1}X & M'Z'R^{-1}ZM + D^{-1}\lambda_M \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ M'Z'R^{-1}y \end{bmatrix} \qquad (9)$$
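The equivalence of the two sets of mixed model equations can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the author's implementation: genotypes and phenotypes are simulated, the allele frequencies are treated as known, Z, R and D are identity matrices, the only fixed effect is an overall mean, and the variance ratio `lam_G` is an arbitrary value. The ratio for the marker model is taken as $\lambda_M = \phi\lambda_G$, which follows from $\sigma_G^2 = \phi\sigma_M^2$ in Section 2.1. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 20, 200                           # hypothetical numbers of animals and SNP
p = rng.uniform(0.1, 0.9, L)             # assumed known frequencies of allele A
geno = rng.binomial(2, p, size=(n, L))   # copies of allele A (0, 1 or 2)
M = geno - 2 * p                         # rows m': 2q, (q - p) or -2p as in Eq. (1)
phi = 2 * np.sum(p * (1 - p))
G = M @ M.T / phi                        # Eq. (5) with D = I

y = 5.0 + rng.normal(size=n)             # simulated records, one per animal
X = np.ones((n, 1))                      # overall mean as the only fixed effect
lam_G = 1.5                              # arbitrary sigma_E^2 / sigma_G^2
lam_M = phi * lam_G                      # sigma_E^2 / sigma_M^2

# Eq. (8) with Z = I and R = I: solve for the mean and the GEBV.
lhs8 = np.block([[X.T @ X, X.T],
                 [X, np.eye(n) + lam_G * np.linalg.inv(G)]])
rhs8 = np.concatenate([X.T @ y, y])
sol8 = np.linalg.solve(lhs8, rhs8)
beta_g, u_hat = sol8[0], sol8[1:]

# Eq. (9) with Z = I, R = I and D = I: solve for the mean and the ASE.
lhs9 = np.block([[X.T @ X, X.T @ M],
                 [M.T @ X, M.T @ M + lam_M * np.eye(L)]])
rhs9 = np.concatenate([X.T @ y, M.T @ y])
sol9 = np.linalg.solve(lhs9, rhs9)
beta_m, alpha_hat = sol9[0], sol9[1:]

# Under the equivalent model the GEBV equal M times the estimated ASE.
max_diff = np.max(np.abs(M @ alpha_hat - u_hat))
```

Eq. (8) has n + 1 unknowns whereas Eq. (9) has L + 1, so in this setting with L much larger than n the first system is the smaller of the two.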
In Eq. (9), $\lambda_M = \sigma_E^2/\sigma_M^2 = \phi\lambda_G$. When Z, R and D are all assumed to be identity matrices (which assumes that $cG \to \Phi$ and that there is no LD among loci) Eq. (9) simplifies to:

$$\begin{bmatrix} X'X & X'M \\ M'X & M'M + I\lambda_M \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \end{bmatrix} = \begin{bmatrix} X'y \\ M'y \end{bmatrix}$$

which has a dense coefficient matrix but may be very simply formed from the individual genotypes and iteratively solved. More generally, pre-multiplying the second set of equations in Eq. (9) by $\phi^{-1}G^{-1}MD$, using result (6) and factoring the M from the second column of the coefficient matrix into the solutions yields:

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1}\lambda_G \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ M\hat{\alpha} \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}$$

and therefore:

$$\hat{u}^* = M\hat{\alpha}.$$

This result is particularly useful since pre-multiplying both sides by $\phi^{-1}DM'G^{-1}$ leads to:

$$\hat{\alpha} = \phi^{-1}DM'G^{-1}\hat{u}^* \qquad (10)$$

because $\phi^{-1}DM'G^{-1}M = I$ from Eq. (6). Result (10) arises not unexpectedly from the regression of ASE on genomic breeding values $u^*$. Together results (8) and (10) are useful when G is invertible. While there is no simple algorithm to generate $G^{-1}$ as is available for $A^{-1}$ (Henderson, 1975, 1976; Quaas et al., 1984) these results allow the estimation of $\sigma_G^2$ (and therefore $\sigma_M^2$) from the smaller of the two sets of mixed model equations using, e.g., restricted maximum likelihood estimation and at convergence of the variance component estimation, estimates of ASE can simply be produced from Eq. (10) for use in GS or GWAS.

3. An equivalent model for training and validation or implementation of genomic selection
Suppose that $u^*$ is partitioned into two groups of individuals; those with observations that will be used to estimate $\alpha$ ($u_T^*$, the training population) and those individuals whose data will be used to validate ($u_V^*$, the validation population) the GEBV. The results generated for individuals in $u_V^*$ also apply to candidates for selection in the implementation population. This leads to a partitioning of vectors and matrices as follows:

$$u^* = \begin{bmatrix} u_T^* \\ u_V^* \end{bmatrix},\quad M = \begin{bmatrix} M_T \\ M_V \end{bmatrix},\quad G = \begin{bmatrix} G_{TT} & G_{TV} \\ G_{VT} & G_{VV} \end{bmatrix} = \phi^{-1}\begin{bmatrix} M_TDM_T' & M_TDM_V' \\ M_VDM_T' & M_VDM_V' \end{bmatrix},\quad \text{and}\quad G^{-1} = \begin{bmatrix} G^{TT} & G^{TV} \\ G^{VT} & G^{VV} \end{bmatrix}.$$

If we were to fit model (7) to the animals in $u_T^*$ with data y the mixed model equations become:

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G_{TT}^{-1}\lambda_G \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u}_T^* \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix} \qquad (11)$$

leading to the ASE being estimated as:

$$\hat{\alpha} = \phi^{-1}DM_T'G_{TT}^{-1}\hat{u}_T^*$$

from which the GBLUP of $u_V^*$ (i.e., the GEBV for individuals in the validation population) is:

$$\hat{u}_V^* = M_V\hat{\alpha} = \phi^{-1}M_VDM_T'G_{TT}^{-1}\hat{u}_T^* = G_{VT}G_{TT}^{-1}\hat{u}_T^* \qquad (12)$$

which arises from the regression of $u_V^*$ on $u_T^*$ and is equivalent to fitting Henderson's model (Henderson, 1974) to estimate the breeding values of individuals without observations:

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z & 0 \\ Z'R^{-1}X & Z'R^{-1}Z + G^{TT}\lambda_G & G^{TV}\lambda_G \\ 0 & G^{VT}\lambda_G & G^{VV}\lambda_G \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u}_T^* \\ \hat{u}_V^* \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \\ 0 \end{bmatrix} \qquad (13)$$
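The agreement between the two-step route (train on records, then regress on the training GEBV) and the joint equations for animals without observations can be checked on simulated data. This is a rough sketch under assumptions: genotypes and records are simulated, Z, R and D are identity matrices, the only fixed effect is a mean, and `lam_G` is arbitrary. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_t, n_v, L = 12, 5, 150                 # hypothetical training, validation, SNP counts
p = rng.uniform(0.1, 0.9, L)             # assumed known allele frequencies
M = rng.binomial(2, p, size=(n_t + n_v, L)) - 2 * p
G = M @ M.T / (2 * np.sum(p * (1 - p)))  # genomic relationship matrix, D = I

G_TT, G_VT = G[:n_t, :n_t], G[n_t:, :n_t]
Gi = np.linalg.inv(G)                    # partitioned inverse of the full G
Gi_TT, Gi_TV = Gi[:n_t, :n_t], Gi[:n_t, n_t:]
Gi_VT, Gi_VV = Gi[n_t:, :n_t], Gi[n_t:, n_t:]

y = 10.0 + rng.normal(size=n_t)          # records on training animals only
X = np.ones((n_t, 1))
lam_G = 2.0                              # arbitrary sigma_E^2 / sigma_G^2

# Step 1: training-only mixed model equations (Z = I, R = I).
lhs_t = np.block([[X.T @ X, X.T],
                  [X, np.eye(n_t) + lam_G * np.linalg.inv(G_TT)]])
u_T = np.linalg.solve(lhs_t, np.concatenate([X.T @ y, y]))[1:]

# Step 2: regression of validation on training GEBV, G_VT G_TT^{-1} u_T.
u_V_reg = G_VT @ np.linalg.solve(G_TT, u_T)

# Joint equations for animals with and without observations.
zeros_bv = np.zeros((1, n_v))
lhs_j = np.block([[X.T @ X, X.T, zeros_bv],
                  [X, np.eye(n_t) + lam_G * Gi_TT, lam_G * Gi_TV],
                  [zeros_bv.T, lam_G * Gi_VT, lam_G * Gi_VV]])
rhs_j = np.concatenate([X.T @ y, y, np.zeros(n_v)])
sol_j = np.linalg.solve(lhs_j, rhs_j)
u_T_joint, u_V_joint = sol_j[1:1 + n_t], sol_j[1 + n_t:]
```

In this sketch both routes return the same GEBV for the training animals and for the animals without records, up to floating-point error.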
This equivalence holds because the last row of Eq. (13) yields:

$$\hat{u}_V^* = -\left[G^{VV}\right]^{-1}G^{VT}\hat{u}_T^*$$

and $G^{VV}G_{VT} + G^{VT}G_{TT} = 0$ from multiplying the partitioned G and $G^{-1}$ matrices above. Result (13) was recognized by Goddard (2009) and was implemented by Rolf et al. (2010).

4. Inclusion of X-linked loci
As much as 5% of the DNA within mammalian genomes is contained in the X chromosome and GWAS and GS applications may include X-linked markers if we assume that random X inactivation leads to heterozygous females with chimeric tissues containing cells that individually express either the $A_i$ or $B_i$ alleles. In this case, we may parameterize the genotypic values of $A_iA_i$, $A_iB_i$ and $B_iB_i$ females as $a_i$, $d_i$ and $-a_i$ and of $A_iY$ and $B_iY$ males as $a_i$ and $-a_i$, respectively. Genotype frequencies in females follow Hardy-Weinberg proportions, but are equivalent to allele frequencies in males. Under this model, the breeding value $u_i$ associated with each genotype at the ith X-linked locus (regardless of progeny gender) is:

$$u_i = \begin{cases} q_i\alpha_i^* & \text{for female genotype } A_iA_i \\ \tfrac{1}{2}(q_i - p_i)\alpha_i^* & \text{for female genotype } A_iB_i \\ -p_i\alpha_i^* & \text{for female genotype } B_iB_i \\ q_i\alpha_i & \text{for male genotype } A_iY \\ -p_i\alpha_i & \text{for male genotype } B_iY \end{cases} \qquad (14)$$

for $\alpha_i^* = 3a_i + d_i(q_i - p_i) = 2a_i + \alpha_i$. Thus, sex-linked genotypes can trivially be included in analyses where all individuals are males using the locus-specific coefficients of $\alpha_i$ in Eqs. (1) and (14) without the need to modify D. In all-female analyses, the locus-specific coefficients of $\alpha_i$ in Eq. (1) and $\alpha_i^*$ in Eq. (14) may be used; however, the variance of X-linked ASE is greater than for autosomal loci and this should ideally be reflected in the diagonals of D. In mixed gender analyses, separate X-linked ASE should be estimated for each gender. This is not only germane for the incorporation of X-linked loci into GWAS, but indicates that separate X-linked ASE must be estimated to allow GEBV to be separately produced for males and females (VanRaden et al., 2009) if X-linked loci contribute significantly to trait variation.
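A small numerical check of the X-linked coding may be useful. The sketch below assumes the reading of Eq. (14) in which female genotypes carry the coefficients q, (q - p)/2 and -p on α* and male genotypes carry q and -p on α; the function name and parameter values are hypothetical. It confirms that the breeding values average to zero within each sex when females are in Hardy-Weinberg proportions and male genotype frequencies equal the allele frequencies.

```python
def x_linked_breeding_values(a, d, p):
    """Breeding values at one X-linked locus under the assumed coding."""
    q = 1.0 - p
    alpha = a + d * (q - p)        # ASE as for an autosomal locus (males)
    alpha_star = 2 * a + alpha     # equals 3a + d(q - p) (females)
    female = {"AA": q * alpha_star,
              "AB": 0.5 * (q - p) * alpha_star,
              "BB": -p * alpha_star}
    male = {"AY": q * alpha, "BY": -p * alpha}
    return female, male


# Hypothetical locus: a = 1.0, d = 0.2, frequency of allele A is 0.4.
p = 0.4
q = 1.0 - p
female, male = x_linked_breeding_values(1.0, 0.2, p)

# Females: Hardy-Weinberg genotype frequencies.
female_mean = p * p * female["AA"] + 2 * p * q * female["AB"] + q * q * female["BB"]
# Males: genotype frequencies equal allele frequencies.
male_mean = p * male["AY"] + q * male["BY"]
```

Both means are zero up to rounding, so X-linked columns coded this way do not shift the expected genomic breeding value.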
5. Analysis of deregressed breeding values
In animal breeding applications, deregressed EBV (Garrick et al., 2009) may be used in place of phenotypes in which case the mixed linear model (7) for the ith deregressed EBV can be written as:

$$\frac{\hat{u}_i}{r_i^2} = \mu + u_i^* + e_i \qquad (15)$$

where $\hat{u}_i$ is the parent average corrected EBV for the ith individual originally estimated in an analysis incorporating A, and $r_i^2 = \mathrm{Var}(\hat{u}_i)/\mathrm{Var}(u_i)$ is the squared accuracy of $\hat{u}_i$ (Garrick et al., 2009). Here, $u_i^*$ represents the genomic breeding value of the ith individual and $e_i$ is the model residual which again includes $\varepsilon_i$, the additive genetic effects not captured in $u_i^*$ by the LD between SNP and the set of trait QTL. The variance of the dependent variable is:

$$\mathrm{Var}\left(\frac{\hat{u}_i}{r_i^2}\right) = \frac{\mathrm{Var}(\hat{u}_i)}{r_i^4} = \frac{\mathrm{Var}(u_i)}{r_i^2} = \frac{(1 + F_i)\sigma_A^2}{r_i^2}$$

where $F_i$ is the inbreeding coefficient of the ith individual. The variance of the right hand side of Eq. (15) is:

$$\mathrm{Var}(u_i^* + e_i) = G_{ii}c\sigma_A^2 + \mathrm{Var}(e_i)$$

where $G_{ii}$ is the ith diagonal of G. Therefore:

$$\mathrm{Var}(e_i) = \left[\frac{(1 + F_i)}{r_i^2} - cG_{ii}\right]\sigma_A^2$$

which does not involve the heritability of the trait (Garrick et al., 2009). Because $\sigma_A^2$ can be scaled from the mixed model equations in Eq. (8), the parameter c which is the proportion of additive genetic variance explained by the markers can be estimated. In this process, c is bounded above by the requirement that $\mathrm{Var}(e_i) > 0$ and consequently $c < \frac{(1 + F_i)}{G_{ii}r_i^2}$ and $c \le 1$. This can be an issue when the accuracies of EBV are imprecisely estimated or are rounded for reporting.

6. Accuracies of GBLUP estimates of GEBV
In practice, the adequacy with which cG estimates $\Phi$ is unknown and therefore R is unknown. We therefore have little choice but to fit the mixed model equations with R = I which assumes that $\Phi$ is well estimated by cG. Let $C = \begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}$ be the inverse of the coefficient matrix in Eq. (11) with R = I; then it can be shown (e.g., Henderson, 1973) that:

$$\begin{aligned} \mathrm{Var}(\hat{u}_T^*) &= \left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\left[c^{-1}G_{TT}^{-1}\Phi_{TT}\left\{I + \left(c\Phi_{TT}^{-1} - G_{TT}^{-1}\right)C_{22}\lambda_G\right\}\right] \to \Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2 \ \text{ as } cG_{TT} \to \Phi_{TT} \\ \mathrm{Cov}(\hat{u}_T^*, u_T) &= \left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]c^{-1}G_{TT}^{-1}\Phi_{TT} \to \Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2 \\ \mathrm{Var}(\hat{u}_T^* - u_T) &= C_{22}\sigma_E^2 + C_{22}G_{TT}^{-1}\left[\Phi_{TT} - cG_{TT}\right]G_{TT}^{-1}C_{22}\lambda_G^2\sigma_A^2 \to C_{22}\sigma_E^2. \end{aligned}$$

Of somewhat greater interest, Eq. (12) leads to:

$$\begin{aligned} \mathrm{Var}(\hat{u}_V^*) &= G_{VT}G_{TT}^{-1}\left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\left[c^{-1}G_{TT}^{-1}\Phi_{TT}\left\{I + \left(c\Phi_{TT}^{-1} - G_{TT}^{-1}\right)C_{22}\lambda_G\right\}\right]G_{TT}^{-1}G_{TV} \\ &\to \Phi_{VT}\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}\Phi_{TV} \\ \mathrm{Cov}(\hat{u}_V^*, u_V) &= G_{VT}G_{TT}^{-1}\left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]c^{-1}G_{TT}^{-1}\Phi_{TV} \\ &\to \Phi_{VT}\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}\Phi_{TV} \\ \mathrm{Var}(\hat{u}_V^* - u_V) &\to \Phi_{VV}\sigma_A^2 - \Phi_{VT}\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}\Phi_{TV}. \end{aligned}$$

Finally from Eq. (10), the accuracies of estimates of ASE may be produced from:

$$\begin{aligned} \mathrm{Var}(\hat{\alpha}) &= \phi^{-2}DM_T'G_{TT}^{-1}\left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\left[c^{-1}G_{TT}^{-1}\Phi_{TT}\left\{I + \left(c\Phi_{TT}^{-1} - G_{TT}^{-1}\right)C_{22}\lambda_G\right\}\right]G_{TT}^{-1}M_TD \\ &\to \phi^{-2}DM_T'\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}M_TD \ \text{ as } cG_{TT} \to \Phi_{TT} \\ \mathrm{Cov}(\hat{\alpha}, \alpha) &= \phi^{-2}DM_T'G_{TT}^{-1}\left[cG_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]G_{TT}^{-1}M_TD \\ &\to \phi^{-2}DM_T'\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}M_TD \\ \mathrm{Var}(\hat{\alpha} - \alpha) &\to D\sigma_M^2 - \phi^{-2}DM_T'\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}M_TD. \end{aligned}$$
Clearly these results are complex and the matrix $\Phi$ is unknown, which precludes their calculation except in the limiting case when $cG \to \Phi$. For this situation, the accuracy of $\hat{u}_T^*$ which is defined to be the correlation between $\hat{u}_T^*$ and $u_T$ can be obtained from the diagonals of $\mathrm{Var}(\hat{u}_T^*) = \mathrm{Cov}(\hat{u}_T^*, u_T) = \Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2$ and $\mathrm{Var}(u_T) = \Phi_{TT}\sigma_A^2$. For the ith individual the accuracy of $\hat{u}_{T_i}^*$ is approximately:

$$r_{T_i} = \sqrt{1 - \lambda_A\frac{C_{22_{ii}}}{\Phi_{ii}}}.$$
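The limiting-case accuracies can be computed directly once a relationship matrix is available. The sketch below is an illustration under assumptions, not a reproduction of any analysis in the paper: a simulated G stands in for $\Phi$ (that is, c = 1 and $cG = \Phi$), $\lambda_A$ is taken to be $\sigma_E^2/\sigma_A^2$, which is the ratio implied by the diagonals of the variances above, each training animal has one record, and the only fixed effect is a mean. The accuracy for animals without records uses the diagonals of the limiting form of $\mathrm{Var}(\hat{u}_V^*)$ given earlier in this section. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_t, n_v, L = 15, 4, 300                 # hypothetical training, validation, SNP counts
p = rng.uniform(0.1, 0.9, L)             # assumed known allele frequencies
M = rng.binomial(2, p, size=(n_t + n_v, L)) - 2 * p
Phi = M @ M.T / (2 * np.sum(p * (1 - p)))  # stand-in for Phi (limiting case, c = 1)
Phi_TT = Phi[:n_t, :n_t]
Phi_TV = Phi[:n_t, n_t:]
Phi_VV = Phi[n_t:, n_t:]

lam_A = 1.0                              # assumed sigma_E^2 / sigma_A^2 (heritability 0.5)
X = np.ones((n_t, 1))

# Inverse of the training coefficient matrix with Z = I and R = I.
lhs = np.block([[X.T @ X, X.T],
                [X, np.eye(n_t) + lam_A * np.linalg.inv(Phi_TT)]])
C22 = np.linalg.inv(lhs)[1:, 1:]         # block belonging to the training GEBV

# Training accuracy: sqrt(1 - lam_A * C22_ii / Phi_ii).
r_T = np.sqrt(1.0 - lam_A * np.diag(C22) / np.diag(Phi_TT))

# Accuracy for animals without records, from
# Phi_VT Phi_TT^{-1} [Phi_TT - lam_A C22] Phi_TT^{-1} Phi_TV (in units of sigma_A^2).
B = Phi_TV.T @ np.linalg.inv(Phi_TT)
var_hat_V = B @ (Phi_TT - lam_A * C22) @ B.T
r_V = np.sqrt(np.diag(var_hat_V) / np.diag(Phi_VV))
```

Because the simulated animals are unrelated, the off-diagonal relationships are small and `r_V` comes out far below `r_T`.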
Likewise, the accuracy of GEBV produced for animals in the validation (or implementation) populations, $\hat{u}_V^*$, can be obtained from the diagonals of $\mathrm{Var}(\hat{u}_V^*) = \mathrm{Cov}(\hat{u}_V^*, u_V) = \Phi_{VT}\Phi_{TT}^{-1}\left[\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2\right]\Phi_{TT}^{-1}\Phi_{TV}$ and $\mathrm{Var}(u_V) = \Phi_{VV}\sigma_A^2$. These clearly depend on only two factors: 1) the amount of information available on individuals in the training population contained in $\Phi_{TT}\sigma_A^2 - C_{22}\sigma_E^2$, and 2) the extent of relatedness between individuals in the training and validation (or implementation) populations contained in $\Phi_{VT}$.
7. Discussion
The GBLUP estimation of $\alpha$ for the purpose of GWAS is now becoming more commonplace (McClure et al., 2012; Yang et al., 2011) in human and agricultural applications. In plant and animal breeding, the GBLUP estimates of $\alpha$ can be used to estimate genomic breeding values of individuals in future generations from high-throughput SNP genotypes produced at birth, or even from embryos, and the accuracies of these GEBV are equivalent for males and females if the effects of sex-linked loci are small. Statistically, this is equivalent to estimating the extent of identity by descent between these individuals and the individuals in the training population which were used to estimate $\alpha$ and finding the regression of selection candidate breeding values on the breeding values of individuals in the training population. This result was first published by Henderson (1974) and is facilitated for GEBV by replacing the numerator relationship matrix with the genomic
relationship matrix in Henderson's mixed model equations as first suggested by Nejati-Javaremi et al. (1997). This is a useful result, because pedigree relationship information between individuals in the training and implementation populations need not be available and the genomic relationships more accurately reflect the realized identity by descent between individuals than do numerator relationship coefficients. Moreover, it makes the process of validating GEBV redundant because the accuracies of GEBV can be directly estimated using results presented here and from direct inversion of the coefficient matrix in (13) (e.g., Rolf et al., 2010). However, it does seem paradoxical that no matter how accurately the SNP ASE are estimated in the training population (the correlation between estimated and true ASE → 1 as $C_{22} \to 0$ and $cG \to \Phi$ based upon the results in Section 6), the accuracies of GEBV

$$r_{V_i} = \sqrt{\frac{\left[\Phi_{VT}\Phi_{TT}^{-1}\Phi_{TV}\right]_{ii}}{\Phi_{VV_{ii}}}}$$

for individuals in the validation or implementation populations diminish as the number of meioses separating individuals in these populations increases (causing the elements of $\Phi_{VT}$ to decrease in magnitude). In the case of non-overlapping generations, the loss of accuracy of GEBV may rapidly reduce the utility of GS when training populations are static. However, most livestock populations, particularly those which employ artificial insemination, have overlapping generations which produce complex patterns of relationship among individuals in the training and validation populations (i.e., the rows of $\Phi_{VT}$ may be dense). This may reduce the extent to which the accuracies of GEBV decrease over time; however, this requires further investigation to enable the optimization of GS programs for traits that are not routinely measured. Results presented here for the accuracies of GEBV are based on the properties of linear estimation of ASE and genomic breeding values using GBLUP and do not apply to nonlinear (e.g., Bayesian) estimation approaches. However, it is probably reasonable to assume that the accuracies of Bayesian estimates of genomic breeding values are also similarly dependent on the accuracies of training population GEBV and the extent of relatedness between individuals in the training and validation populations. Saatchi et al. (2011) observed that in BayesCπ analyses the accuracies of GEBV decreased when validation populations were formed by minimizing the pedigree relatedness between training and validation populations relative to forming these populations at random from a set of 3570 registered Angus animals. A priority for future research appears to be the identification of analytical approaches that are robust and insensitive to the magnitude of relationship between training and validation population individuals.
Finally, a number of results in this paper are based upon the assumption that $\Phi$ is well estimated by G. This issue was addressed by Rolf et al. (2010) using the BovineSNP50 assay scored in Angus cattle where an asymptotic relationship was found for the correlations between $G_{ij}$ values estimated from reduced sets of markers and the complete set of 41,028 SNP. However, this should be further examined using whole genome resequencing data which captures the entire spectrum of variation to determine the impact of rarer variants not present in the commercial genotyping assays on the estimation of $\Phi$.
8. Conclusions
It is apparent that GS will best work for populations in which the training population is dynamic and selection candidates go on to produce phenotypes and can subsequently be incorporated into the training population (Fig. 1). On the other hand, GS may not produce GEBV with high accuracy for individuals and families that have not historically been used to generate the elite parents which commonly form the training population. GS is also unlikely to perform well when training occurs in one population and the estimated ASE are used to produce GEBV in a reproductively isolated population (Hayes et al., 2009; McClure et al., 2012; Toosi et al., 2010). Furthermore, the value of GS for the improvement of difficult to measure phenotypes such as disease resistance and feed efficiency using static experimental training populations will diminish as generations advance from the training population individuals. The extent of this diminution requires investigation since several large, publicly funded agricultural research projects are currently based upon this strategy and the optimum timing and design of retraining experiments will be required to maximize the public value of these projects.
Acknowledgments
This work was supported by the Agriculture and Food Research Initiative competitive grant nos. 2009-65205-05635, 2011-68004-30367 and 2011-68004-30214 from the USDA National Institute of Food and Agriculture Animal Genome Program.