Genetics: Early Online, published on November 9, 2016 as 10.1534/genetics.116.193243
Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium

Pascal Schopp*,1, Dominik Müller*,1, Frank Technow*, Albrecht E. Melchinger*

September 29, 2016

*Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, 70599 Stuttgart, Germany
1 These authors contributed equally to this work

Copyright 2016.

Running Head: Genomic prediction in synthetics

Key Words: genomic prediction, synthetic populations, GBLUP, genetic relationships, linkage disequilibrium

Corresponding Author:
A.E. Melchinger
Institute of Plant Breeding, Seed Sciences and Population Genetics
University of Hohenheim
Fruwirthstr. 21,
Stuttgart 70599, GERMANY
[email protected]
Tel.: 0049 0711 459-22334
Fax.: 0049 0711 459-22343

ABSTRACT
Synthetics play an important role in quantitative genetic research and plant breeding, but few studies have investigated the application of genomic prediction (GP) to these populations. Synthetics are generated by intermating a small number of parents ($N_P$) and thereby possess unique genetic properties, which make them especially suited for systematic investigations of factors contributing to the accuracy of GP. We generated synthetics in silico from $N_P$ = 2 to 32 maize (Zea mays L.) lines taken from an ancestral population with either short- or long-range linkage disequilibrium (LD). In eight scenarios differing in relatedness of the training and prediction sets and in the types of data used to calculate the relationship matrix (QTL, SNPs, tag markers, pedigree), we investigated the prediction accuracy of GBLUP and analyzed contributions from pedigree relationships captured by SNP markers as well as from co-segregation and ancestral LD between QTL and SNPs. The effects of training set size $N_{TS}$ and marker density were also studied. Sampling few parents ($2 \le N_P < 8$) generates substantial sample LD that carries over into synthetics through co-segregation of alleles at linked loci. For fixed $N_{TS}$, $N_P$ influences prediction accuracy most strongly. If the training and prediction set are related, using $N_P < 8$ parents yields high prediction accuracy regardless of ancestral LD because SNPs capture pedigree relationships and Mendelian sampling through co-segregation. As $N_P$ increases, ancestral LD contributes more information, while other factors contribute less due to lower frequencies of closely related individuals. For unrelated prediction sets, only ancestral LD contributes information and accuracies were poor and highly variable for $N_P \le 4$ due to large sample LD. For large $N_P$, achieving moderate accuracy requires large $N_{TS}$, long-range ancestral LD and high marker density. Our approach for analyzing prediction accuracy in synthetics provides new insights into the prospects of GP for many types of source populations encountered in plant breeding.

INTRODUCTION
Synthetic populations, known as synthetics, have played an important role in quantitative-genetic research on gene action in complex heterotic traits and comparison of selection methods (cf. Hallauer et al. 2010). In many crops, synthetics also serve as cultivars in agricultural production or as source population for recurrent selection programs (cf. Bradshaw 2016). Synthetics are usually created by crossing a small number of parents ($N_P$) and subsequently cross-pollinating the F1 individuals for one or several generations (Falconer and Mackay 1996). A prominent example is the “Iowa Stiff Stalk Synthetic” (BSSS) generated from 16 parents of maize, from which numerous successful elite inbred lines such as B73 have been derived (Hagdorn et al. 2003). Further examples of synthetics include composite crosses (Suneson 1956) and multi-parental advanced inter-cross (MAGIC, see Table S1 for list of abbreviations) populations (Cavanagh et al. 2008) advocated for breeding purposes in crops (Bandillo et al. 2013). Importantly, two-way and four-way crosses, widely employed as source material in recycling breeding (Mikel and Dudley 2006), can be viewed as special cases of synthetics when $N_P$ = 2 and 4, respectively.

Genomic prediction (GP), proposed by Meuwissen et al. (2001), led to a paradigm shift in animal breeding during the past decade (Hayes et al. 2009a; de Koning 2016) and has also been widely adopted in plant breeding (Lin et al. 2014). In cattle breeding, GP is predominantly applied within closed breeds, and training sets (TS) commonly encompass thousands of individuals. By comparison, in plant breeding the TS sizes are much smaller (e.g., hundreds of individuals or fewer) and populations are usually structured into multiple segregating families or subpopulations. Numerous studies addressed the implementation of GP in structured plant breeding populations (cf. Lorenzana and Bernardo 2009; Albrecht et al. 2011; Lehermeier et al. 2014; Technow and Totir 2015), but systematic investigations on the prospects of GP in synthetics are lacking so far, although they were proposed as particularly suitable source material for recurrent genomic selection (Windhausen et al. 2012; Gorjanc et al. 2016).

Genomic best linear unbiased prediction (GBLUP), a modification of the traditional pedigree BLUP devised by Henderson (1984), is a widely used method to implement GP in animal and plant breeding (Mackay et al. 2015). Here, the pedigree relationship matrix is replaced by a marker-derived genomic relationship matrix to estimate actual relationships at QTL (Hayes et al. 2009c). The success of this approach depends on three sources of information, namely (i) pedigree relationships captured by markers, (ii) co-segregation of QTL and markers and (iii) population-wide linkage disequilibrium between QTL and markers (Habier et al. 2007, 2013; Wientjes et al. 2013).

In classical quantitative genetics, pedigree relationships between individuals are calculated as twice the probability of identity-by-descent (IBD) of alleles at a locus, conditional on their pedigree (Wright 1922; Falconer and Mackay 1996). However, actual IBD relationships at QTL deviate from pedigree relationships (which correspond to expected IBD relationships) due to Mendelian sampling (Hill and Weir 2011). In GP, pedigree relationships are captured best with a large number of stochastically independent markers (Habier et al. 2007), whereas capturing the Mendelian sampling term requires co-segregation of QTL and markers (Hayes et al. 2009c; Habier et al. 2013).

In pedigree analysis, the founders of the pedigree are by definition assumed to be unrelated (i.e., IBD equal to zero), but in reality, there usually exist latent similarities at QTL contributing to variation in identity-by-state (IBS) relationships among these individuals. Markers enable capturing these IBS relationships if they are in population-wide LD with the QTL in an ancestral population of founders. Thus, ancestral LD between QTL and markers also provides information between individuals that are unrelated by pedigree to the TS (Wientjes et al. 2013; Habier et al. 2013). Ancestral LD generally results from various population-historic processes like mutation, drift and selection (Flint-Garcia et al. 2003) and varies within species primarily due to different bottlenecks imposed by artificial selection or population admixture (Hill 1981; Hartl and Clark 2007). The influence of different levels of ancestral LD on prediction accuracy (PA) in synthetics and related types of populations has so far received little attention.

The contributions of the three sources of information to PA were demonstrated in theory and simulations by Habier et al. (2013) using half-sib families in cattle breeding and multiple biparental (full-sib) families in maize breeding, where both of these examples consisted of numerous families derived from a large number of parents. However, it is unclear whether these results generalize to other breeding situations, in particular those involving only few parents. In such situations, diverse relationship patterns are generated, new statistical associations between loci arise due to sampling, and ancestral LD might be only partially present in the progeny. These factors are expected to profoundly affect the contributions of the three sources of information to PA and thus, affect the application of GP on related and unrelated genotypes. Synthetics represent an ideal framework for examining the influence of these factors on PA, because the number of parents used for producing them can be varied over a wide range. Here, we simulated two ancestral populations differing substantially in their LD and analyzed synthetics generated from different numbers of parents under eight scenarios that enabled dissecting the factors contributing to PA.

The objectives of this study were to (i) examine how PA in synthetics depends on the number of parents and LD in the ancestral population, (ii) assess the importance of the three sources of information for PA and how they are influenced by training set size and marker density, and (iii) analyze the relationship of LD between QTL and markers among the ancestral population, parents, and the synthetics generated from them. Finally, we discuss how our approach provides a general framework for analyzing the factors influencing PA and we draw inferences on the prospects of GP in other scenarios encountered in breeding.

METHODS
Genome properties and genetic map: We used maize (Zea mays L.) as a model species in our study. Physical map positions of the 56K Illumina maize SNP BeadChip were used to account for the markedly reduced recombination rate and lower marker density in the centromere regions (McMullen et al. 2009). These positions were converted into genetic map positions required for simulating meiosis events (File S1). In total, we obtained 37,286 SNPs distributed over the 10 chromosomes of length 276, 200, 193, 188, 221, 171, 203, 173, 151 and 137 cM (1913 cM in total), corresponding to an average marker density of 24.4 SNPs cM$^{-1}$. All subsequent meiosis events were simulated using the count-location model without crossover interference, where the number of chiasmata was drawn from a Poisson distribution with parameter $\lambda$ equal to the chromosome length in Morgan, and where crossover positions were sampled from a uniform distribution across the chromosome.

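The count-location model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation code used in the study: the function names (`sample_poisson`, `crossover_positions`, `make_gamete`) are hypothetical, the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler, and locus positions are assumed to be sorted in ascending order.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def crossover_positions(length_cm, rng):
    """Sorted crossover positions (cM) for one meiosis on one chromosome."""
    # The Poisson parameter is the chromosome length in Morgan (cM / 100).
    n_crossovers = sample_poisson(length_cm / 100.0, rng)
    # No interference: positions are independent and uniform along the chromosome.
    return sorted(rng.uniform(0.0, length_cm) for _ in range(n_crossovers))


def make_gamete(hap_a, hap_b, positions_cm, length_cm, rng):
    """Recombine two parental haplotypes; positions_cm must be sorted ascending."""
    breaks = crossover_positions(length_cm, rng)
    strand = rng.randrange(2)  # random starting strand
    gamete, next_break = [], 0
    for pos, a, b in zip(positions_cm, hap_a, hap_b):
        # Switch strands for every crossover located before this locus.
        while next_break < len(breaks) and breaks[next_break] < pos:
            strand = 1 - strand
            next_break += 1
        gamete.append(a if strand == 0 else b)
    return gamete


if __name__ == "__main__":
    rng = random.Random(42)
    loci = [10.0, 60.0, 120.0, 200.0, 270.0]  # hypothetical positions on a 276 cM chromosome
    print(make_gamete([0] * 5, [1] * 5, loci, 276.0, rng))
```

For a chromosome of 276 cM, the expected number of crossovers per meiosis under this model is 2.76.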
Simulation of ancestral populations: Two ancestral populations that differed substantially in their level and decay of LD (LDA, Figure 1) were simulated with the software QMSim (Sargolzaei and Schenkel 2009). Ancestral population LR displayed extensive long-range LDA, whereas SR displayed only short-range LDA. The simulation of LR was carried out by closely following Habier et al. (2013) and involved the following steps (Figure S1): First, we generated an initial population of 1,500 diploid individuals by sampling alleles at each (biallelic) locus independently from a Bernoulli distribution with probability 0.5. Second, 5,000 loci were randomly sampled from all SNPs and henceforth interpreted as QTL; all remaining loci were considered as SNP markers. Third, these individuals were randomly mated for 3,000 generations using a constant population size of 1,500 and a mutation rate of $2.5 \times 10^{-5}$. Fourth, a severe bottleneck was introduced by reducing population size to 30 randomly chosen individuals, followed by 15 more generations of random mating to generate extensive long-range LDA. Fifth, we conducted three more generations of random mating with a population size of 10,000 individuals to eliminate close pedigree relationships in the ancestral population LR. To produce SR, we randomly mated LR for 100 more generations at a population size of 10,000 individuals to remove long-range LDA. Thus, LR and SR strongly differed in their LDA structure, but only marginally in their allele frequencies (Table S2). In the last generation of both SR and LR, a single gamete was randomly sampled per individual and treated as a completely homozygous doubled haploid line. These 10,000 lines represented the final ancestral population used for production of the synthetics. All lines were considered unrelated when calculating pedigree relationships among their progeny.

Simulation of synthetics: We generated synthetics differing in $N_P$ by sampling $N_P \in \{2, 3, 4, 6, 8, 12, 16, 24, 32\}$ parent lines from the same ancestral population. From these parents, we produced all possible $\binom{N_P}{2}$ combinations of single crosses (Syn-1 generation, Figure S1), where the number of Syn-1 progenies per cross was chosen to obtain at least 1,000 individuals in total. For production of the Syn-2 generation, the Syn-1 individuals were intermated at random, allowing also for selfing. Finally, a single doubled haploid line was derived from each of the 1,000 individuals of the Syn-2 generation to obtain the genotypes of the final synthetic. This approach was chosen to avoid additional full-sib relationships among doubled haploid lines that arise when deriving them from the same Syn-2 individual.

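As a small worked example of the Syn-1 crossing design, the sketch below counts the single crosses for each value of the number of parents and derives a number of progenies per cross. The text only states that at least 1,000 Syn-1 individuals were obtained in total; using the smallest equal number of progenies per cross that reaches this total is an assumption made here for illustration, and `syn1_design` is a hypothetical helper name.

```python
import math


def syn1_design(n_parents, min_total=1000):
    """Return (number of single crosses, progenies per cross, total Syn-1 individuals)."""
    n_crosses = math.comb(n_parents, 2)  # all pairwise crosses among the parents
    # Assumption: equal family sizes, smallest size reaching the minimum total.
    per_cross = math.ceil(min_total / n_crosses)
    return n_crosses, per_cross, n_crosses * per_cross


if __name__ == "__main__":
    for n_p in (2, 3, 4, 6, 8, 12, 16, 24, 32):
        print(n_p, syn1_design(n_p))
```

Under this assumption, two parents give a single cross with 1,000 progenies, whereas 32 parents give 496 crosses with 3 progenies each.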
Genetic model: For simulating the polygenic target trait, we sampled a subset of 1,000 of the 5,000 QTL in each simulation replicate. Following Meuwissen et al. (2001), the corresponding QTL effects were drawn from a gamma distribution with scale and shape parameter 0.4 and 1.66, respectively. Signs of QTL effects were sampled from a Bernoulli distribution with probability parameter 0.5. The vector $\boldsymbol{u}$ of true breeding values for all individuals in the synthetic was calculated as $\boldsymbol{u} = \boldsymbol{W}\boldsymbol{a}$, where $\boldsymbol{W}$ is the matrix of genotypic scores at QTL coded {2,0} depending on whether an individual was homozygous for the 1 or 0 allele, respectively, that were adjusted for twice the frequency of the 1 allele in the ancestral population (cf. Figure S1), and $\boldsymbol{a}$ is the vector of QTL effects. The corresponding vector $\boldsymbol{y}$ of phenotypes was obtained as $\boldsymbol{y} = \boldsymbol{u} + \boldsymbol{e}$ (Goddard et al. 2011; de los Campos et al. 2013; Habier et al. 2013), i.e., assuming a null mean and adding a vector of independent normally distributed environmental noise variables $\boldsymbol{e}$, where variance $\sigma_e^2$ was chosen to be identical for the two ancestral populations and all choices of $N_P$, assuming that environmental effects affect phenotypes independently of additive-genetic variance $\sigma_u^2$ in the synthetic. The value of $\sigma_e^2$ was therefore set equal to the additive-genetic variance $\sigma_{AP}^2$ in ancestral population LR averaged across 1,000 simulation replicates. The heritability $h^2$ of the target trait was then on average equal to 0.5 for LR and SR due to nearly identical allele frequencies at QTL, but lower in the synthetics, because $\sigma_u^2 < \sigma_{AP}^2$ (Table S2). We restricted our simulations to a single level of heritability, because preliminary analyses showed that changing $h^2$ resulted in a fairly constant shift of PA.

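The genetic model can be illustrated with a minimal Python sketch. It is not the authors' code: `simulate_trait` and its arguments are hypothetical names, the gamma parameters are passed exactly as stated in the text (scale 0.4, shape 1.66), and the caller is assumed to supply the environmental variance, which in the study was set to the additive-genetic variance of ancestral population LR.

```python
import random


def simulate_trait(qtl_genotypes, anc_freq, sigma2_e, rng):
    """Simulate QTL effects a, breeding values u = W a and phenotypes y = u + e.

    qtl_genotypes: one row per doubled haploid line, entries coded 2 or 0.
    anc_freq: frequency of the 1 allele at each QTL in the ancestral population.
    sigma2_e: environmental variance (assumed to be supplied by the caller).
    """
    # random.gammavariate(alpha, beta) takes shape first, then scale.
    effects = [
        rng.gammavariate(1.66, 0.4) * rng.choice((-1.0, 1.0))  # random sign, probability 0.5
        for _ in anc_freq
    ]
    # W holds genotype scores adjusted for twice the ancestral allele frequency.
    u = [
        sum((x - 2.0 * p) * a for x, p, a in zip(row, anc_freq, effects))
        for row in qtl_genotypes
    ]
    # Independent normal environmental noise with variance sigma2_e.
    y = [u_i + rng.gauss(0.0, sigma2_e ** 0.5) for u_i in u]
    return effects, u, y


if __name__ == "__main__":
    rng = random.Random(7)
    genotypes = [[2, 0, 2], [0, 0, 2], [2, 2, 0]]  # toy data: 3 lines, 3 QTL
    freqs = [0.5, 0.3, 0.8]
    effects, u, y = simulate_trait(genotypes, freqs, sigma2_e=1.0, rng=rng)
    print(effects, u, y)
```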
Analysis of the sources of information exploited in genomic prediction: We conceived eight scenarios to evaluate to what extent the three sources of information contribute to PA in synthetics (Figure 2), when actual relationships at QTL are estimated by marker-derived genomic relationships. The scenarios can be differentiated by three factors.

First, individuals in the TS and prediction set (PS) were either related (“Re”-scenarios) or unrelated (“Un”-scenarios), depending on whether the parents of the TS ($P_{TS}$) and of the PS ($P_{PS}$) were identical (i.e., $P_{TS} = P_{PS}$) or disjoint (i.e., $P_{TS} \cap P_{PS} = \emptyset$). For the “Re”-scenarios, we sampled individuals for the TS and PS from the same synthetic, whereas for the “Un”-scenarios, individuals were sampled from two different synthetics produced from disjoint sets $P_{TS}$ and $P_{PS}$, each of size $N_P$. Both sets of parents always originated from the same ancestral population.

Second, pairs of QTL and SNPs were either in LD (“LDA”-scenarios) as found in the ancestral population, or in linkage equilibrium (“LEA”-scenarios). To achieve the latter, we permuted complete QTL haplotypes among the $N_P$ parents (for “Un”-scenarios separately in each set $P_{TS}$ and $P_{PS}$), while keeping their SNP haplotypes unchanged (i.e., conserving their LDA). This procedure eliminates any systematic association between QTL and SNP alleles originating from the ancestral population, but maintains allele frequencies and polymorphic states at QTL, as well as LDA between them. In contrast to previous approaches (cf. Habier et al. 2013), this approach avoids influencing PA by altering actual relationships at QTL. Importantly, after removal of LDA, there is still LD between QTL and SNPs in the parents, but this LD is purely due to the limited sample size and thus subsequently referred to as sample LD.

Third, four different types of data were used to calculate the relationship matrix $\boldsymbol{K}$ used in BLUP: (i) For the “SNP”-scenarios, we used SNP genotypes to calculate the marker-derived genomic relationship matrix $\boldsymbol{K} \triangleq \boldsymbol{G} = (g_{ij})$ as $g_{ij} = \sum_m (x_{im} - 2p_m)(x_{jm} - 2p_m) \,/\, 2\sum_m p_m(1 - p_m)$ (Habier et al. 2007; VanRaden 2008), where $x_{im}$ is the genotype of the $i$-th individual at the $m$-th locus coded {2,0} depending on whether this individual was homozygous for the 1 or 0 allele, respectively, and $p_m$ is the frequency of the 1 allele at the $m$-th SNP marker in the ancestral population. (ii) For the “QTL”-scenarios, the QTL genotypes were used to calculate the actual relationship matrix $\boldsymbol{K} \triangleq \boldsymbol{Q} = (q_{ij})$ using the same formula. (iii) For the “Ped”-scenario, pedigree records were used to calculate the pedigree relationship matrix $\boldsymbol{K} \triangleq \boldsymbol{A} = (f_{ij})$ with elements $f_{ij}$ being equal to expected IBD relationships (i.e., twice the coefficient of co-ancestry). (iv) For the “Tag”-scenario, tag markers labeling the origin of QTL alleles at each locus from the $N_P$ parents were used to calculate the actual IBD relationship matrix $\boldsymbol{K} \triangleq \boldsymbol{T} = (\tau_{ij})$ with elements $\tau_{ij}$ being equal to twice the proportion of identical tag marker alleles between each pair of individuals. Tag markers label each QTL allele, regardless of its state, uniquely with a number $\in \{1, \ldots, N_P\}$ in the parents and thus, they allow tracking the segregation process during intermating and identifying the parental origin of each QTL allele in the synthetic.

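The second and third factors lend themselves to a short illustration. The sketch below, written under simplifying assumptions and with hypothetical function names, shows (a) the permutation of complete QTL haplotypes among parents used to create the “LEA”-scenarios, (b) the genomic relationship formula used for $\boldsymbol{G}$ and $\boldsymbol{Q}$, and (c) the tag-marker relationship as twice the proportion of identical tag alleles. Genotypes are plain Python lists, one row per doubled haploid line.

```python
import random


def permute_qtl_haplotypes(qtl_haps, snp_haps, rng):
    """Shuffle whole QTL haplotypes among parents; SNP haplotypes stay in place."""
    order = list(range(len(qtl_haps)))
    rng.shuffle(order)
    # Each parent receives the complete QTL haplotype of a randomly chosen parent,
    # so allele frequencies and LD among QTL are preserved.
    return [qtl_haps[i] for i in order], snp_haps


def relationship_matrix(genotypes, anc_freq):
    """g_ij = sum_m (x_im - 2 p_m)(x_jm - 2 p_m) / (2 sum_m p_m (1 - p_m))."""
    denom = 2.0 * sum(p * (1.0 - p) for p in anc_freq)
    centered = [[x - 2.0 * p for x, p in zip(row, anc_freq)] for row in genotypes]
    n = len(centered)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / denom for j in range(n)]
        for i in range(n)
    ]


def tag_relationship(tags_i, tags_j):
    """Twice the proportion of loci at which two lines carry the same parental tag."""
    same = sum(1 for a, b in zip(tags_i, tags_j) if a == b)
    return 2.0 * same / len(tags_i)


if __name__ == "__main__":
    rng = random.Random(3)
    qtl = [[2, 0, 2], [0, 0, 2], [2, 2, 0]]  # toy QTL haplotypes of three parents
    snp = [[0, 2], [2, 2], [0, 0]]           # toy SNP haplotypes of the same parents
    print(permute_qtl_haplotypes(qtl, snp, rng))
    print(relationship_matrix(snp, [0.5, 0.5]))
    print(tag_relationship([1, 2, 1], [1, 1, 1]))
```

Passing QTL genotypes instead of SNP genotypes to `relationship_matrix` gives the actual relationship matrix in the same way.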
Scenario Re-LDA-SNP reflects the situation mostly encountered in practical applications of GP and used information from pedigree relationships among individuals in the TS and PS captured by SNPs, deviations from pedigree relationships due to (i) Mendelian sampling at QTL captured by co-segregation of QTL and SNPs and (ii) ancestral LD between QTL and SNPs. Scenario Re-LDA-Ped used only pedigree relationships, but ignored deviations due to Mendelian sampling, whereas Re-LDA-Tag accounted for both pedigree relationships and Mendelian sampling. Both scenarios ignored actual relationships among parents by assuming unrelated founders, and thus, did not account for alleles that are IBS but not IBD in the synthetic. Scenario Re-LEA-SNP was artificial, with the goal of determining the influence of ancestral LD on PA in scenario Re-LDA-SNP. Scenario Re-LDA-QTL was employed to determine for the “Re”-scenarios the maximum PA achievable with GBLUP (cf. de los Campos et al. 2013), when assuming that each QTL explains an equal proportion of the additive-genetic variance. The purpose was thus to quantify the reduction in PA for all other “Re”-scenarios when using a different data type to estimate actual relationships.

Scenarios Un-LDA-SNP and Un-LDA-QTL (“Un”-scenarios) represent the conceptual counterparts to Re-LDA-SNP and Re-LDA-QTL (Figure 2). Un-LDA-SNP reflects the practical situation of predicting the genetic merit of individuals unrelated to the TS, whereas Un-LDA-QTL provides the corresponding upper bound of PA. For both scenarios, alleles in the TS and PS had IBD probability equal to zero and, thus, the only remaining source of information contributing to PA in Un-LDA-SNP was ancestral LD between QTL and SNPs to track actual relationships among parents. Scenario Un-LEA-SNP was employed as negative-control scenario to validate the simulation designs. As expected, PA for Un-LEA-SNP fluctuated around zero for all investigated settings (results not shown), confirming that there are only three sources of information contributing to PA when using $\boldsymbol{K} \triangleq \boldsymbol{G}$.

Analysis of linkage disequilibrium and linkage phase similarity: We calculated LD as the squared correlation coefficient ($r^2$, Hill and Robertson 1968) between all pairs of QTL and SNPs in (i) each ancestral population (LDA), (ii) the set of $N_P$ parents sampled from the ancestral population, and (iii) the synthetic generated from the parents. Furthermore, we computed the linkage phase similarity of QTL-SNP pairs in the TS and PS. Here, we adopted a similar approach as de Roos et al. (2008), but replaced the correlation by the cosine similarity

$$\text{Linkage phase similarity} = \frac{\sum_{i=1}^{n} r_i^{TS}\, r_i^{PS}}{\sqrt{\sum_{i=1}^{n} \left(r_i^{TS}\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(r_i^{PS}\right)^2}}, \qquad (1)$$

where $i$ refers to the index of the QTL-SNP pair and $n$ is the number of pairs for which linkage phase similarity is calculated. The reason was to account not only for the ranking but also for the absolute size of the $r$ statistics in the two data sets (see File S2 for details). Linkage phase similarity was calculated for all QTL-SNP pairs falling into consecutive bins of 0.5 cM width. LD was first averaged within each bin and subsequently, both LD and linkage phase similarity statistics were averaged across chromosomes and simulation replicates.

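Equation (1) is the cosine similarity of the two vectors of signed correlations. A minimal sketch follows; the function name and the handling of all-zero input (returning NaN) are choices made here for illustration and are not taken from the study.

```python
import math


def linkage_phase_similarity(r_ts, r_ps):
    """Cosine similarity between signed QTL-SNP correlations in the TS and the PS."""
    numerator = sum(a * b for a, b in zip(r_ts, r_ps))
    denominator = math.sqrt(sum(a * a for a in r_ts)) * math.sqrt(sum(b * b for b in r_ps))
    # Undefined when one data set has no LD at all in the bin.
    return numerator / denominator if denominator > 0.0 else float("nan")


if __name__ == "__main__":
    r_training = [0.8, -0.5, 0.1]    # toy r values for three QTL-SNP pairs in one bin
    r_prediction = [0.6, -0.4, -0.2]
    print(linkage_phase_similarity(r_training, r_prediction))
```

Unlike the Pearson correlation, the values are not centered, so pairs with large absolute $r$ in both sets dominate the statistic.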
Genomic prediction: The statistical model used for predicting breeding values can be written as

$$\boldsymbol{y} = \boldsymbol{1}\mu + \boldsymbol{Z}\boldsymbol{u} + \boldsymbol{\varepsilon}, \qquad (2)$$

where $\boldsymbol{Z}$ is the incidence matrix linking phenotypes with breeding values, $\boldsymbol{u}$ is the vector of random breeding values with mean zero and variance-covariance matrix $\mathrm{var}(\boldsymbol{u}) = \boldsymbol{K}\sigma_u^2$, where $\boldsymbol{K}$ is a relationship matrix, calculated from different data types as described above, and $\sigma_u^2$ is the additive-genetic variance in the synthetic. Residuals $\boldsymbol{\varepsilon}$ are random with mean zero and $\mathrm{var}(\boldsymbol{\varepsilon}) = \boldsymbol{I}\sigma_\varepsilon^2$, where $\boldsymbol{I}$ is an identity matrix and $\sigma_\varepsilon^2$ is the residual variance. Estimates of variance components $\sigma_u^2$ and $\sigma_\varepsilon^2$ were obtained by restricted maximum likelihood and estimated breeding values $\hat{\boldsymbol{u}}$ were predicted using the mixed.solve function from R-package rrBLUP (Endelman 2011). PA was always calculated as the correlation between $\boldsymbol{u}$ and $\hat{\boldsymbol{u}}$ for the PS in each simulation replicate.

Following previous studies (Goddard et al. 2011; de los Campos et al. 2013), we also investigated how well estimated relationships $k_{ij}$ (i.e., $g_{ij}$, $f_{ij}$, $\tau_{ij}$) between individuals $i$ and $j$ in the TS and PS reflect the corresponding actual relationships $q_{ij}$ at QTL. We therefore calculated the coefficient of determination $R_{k,q}^2$ of the regression of $k_{ij}$ on $q_{ij}$ in each simulation replicate and all scenarios (except for Re-LDA-QTL and Un-LDA-QTL, where $R_{k,q}^2 = 1.0$).

In order to assess the effect of TS size on PA, we sampled $N_{TS}$ = 125, 250, 500 or 750 individuals from the 1,000 lines of the synthetic, where 250 was used as default when another factor (e.g., marker density) was varied. For the PS, we always sampled $N_{PS}$ = 100 individuals from (i) the remaining individuals that were not part of the TS in the “Re”-scenarios or (ii) the second synthetic in the “Un”-scenarios. For all “SNP”-scenarios, the effect of marker density on PA was evaluated for two values of 5 and 0.25 SNPs cM$^{-1}$, the former being used as default. The number of randomly sampled SNPs per chromosome in each simulation replicate was proportional to the respective chromosome length. The two marker densities of 5 and 0.25 SNPs cM$^{-1}$ resulted in an average genetic map distance between each QTL and its closest SNP of 0.18 cM and 2.02 cM, respectively (Figure 1).

All reported results are arithmetic means over 1,000 simulation replicates, which were stochastically independent conditional on the ancestral populations. A simulation replicate comprises (i) random sampling of 1,000 QTL from the 5,000 initial QTL and sampling of QTL effects, (ii) sampling of the parents from the ancestral population and, in the case of the “LEA”-scenarios, additionally permuting QTL haplotypes, (iii) creation of synthetics from each set of parents, (iv) sampling of the individuals for the TS and PS, (v) sampling of the noise vector $\boldsymbol{e}$ and calculation of the breeding and phenotypic values, and (vi) training of the prediction equation and calculation of estimated breeding values as well as PA in the PS (Figure S1). All computations were carried out in the R statistical environment (R Core Team 2012).

RESULTS
Linkage disequilibrium in the ancestral populations: For ancestral population LR, LDA showed a steep decline extending to a genetic map distance $\Delta$ = 0.5 cM and approached an asymptote of about 0.08 for $\Delta$ > 1 cM (Figure 1), reflecting the presence of long-range LDA. By comparison, LDA in ancestral population SR started at slightly smaller values for closely linked loci and showed a similar decline for $\Delta$ < 1 cM. It levelled off at about $\Delta$ = 2 cM, where it almost reached its asymptotic value of zero due to absence of long-range LDA resulting from the 100 additional generations of random mating.

Linkage disequilibrium in the parents and the synthetics: Figure 3A shows the distribution of LD between QTL-SNP pairs in the parents, measured as $r^2$, as a function of $\Delta$. LD in the parents takes on only a limited number of values in the interval [0,1], because only a finite number of genotype configurations is possible for two biallelic loci, which depends exclusively on $N_P$. For $N_P$ = 2, all LD values are equal to 1. For $N_P$ = 3 and 4, possible LD values are $\{\tfrac{1}{4}, 1\}$ and $\{0, \tfrac{1}{9}, \tfrac{1}{3}, 1\}$, respectively, whereas for $N_P$ = 16, more than 100 values are possible, resulting in a nearly continuous distribution of LD values in the parents. Under LEA (i.e., ancestral linkage equilibrium due to permutation of QTL haplotypes), the frequency of LD values in the parents was thus almost independent of $\Delta$, except for some small residual deviations due to similarity of ancestral allele frequencies at closely linked loci (see File S4 for details). Under LEA, the high frequencies of pairs of loci in high LD for $N_P$ = 3 and 4 demonstrate the magnitude of sample LD (Figure 3A, left column). If ancestral LD was additionally present, large parental LD values occurred more frequently for tightly linked loci ($\Delta$ < 1 cM) for both ancestral populations. Under short-range LDA in SR, the frequencies of high parental LD values were almost identical to those found under LEA for $\Delta$ > 1 cM, regardless of $N_P$. Conversely, under long-range LDA in LR, the frequency of high parental LD values was considerably elevated also for $\Delta$ > 1 cM. Altogether, the distribution of LD values in the parents was much more strongly influenced by $N_P$ than by ancestral LD. The proportion of QTL-SNP pairs in high LD diminished as $N_P$ increased, but grew when shifting from short- to long-range LDA (Figure 3A, SR vs. LR).

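The finite sets of $r^2$ values quoted above for small numbers of parents can be checked by brute force. The sketch below enumerates all allele configurations of two biallelic loci among the homozygous parents and collects the resulting $r^2$ values as exact fractions; pairs in which either locus is monomorphic are skipped because $r^2$ is undefined there. The function name is hypothetical, and the full enumeration is only practical for small numbers of parents.

```python
from fractions import Fraction
from itertools import product


def possible_r2(n_parents):
    """All r^2 values attainable for two polymorphic biallelic loci among n inbred parents."""
    values = set()
    configurations = list(product((0, 1), repeat=n_parents))
    for locus_a in configurations:
        p_a = Fraction(sum(locus_a), n_parents)
        if p_a in (0, 1):
            continue  # monomorphic locus: r^2 undefined
        for locus_b in configurations:
            p_b = Fraction(sum(locus_b), n_parents)
            if p_b in (0, 1):
                continue
            # Frequency of parents carrying the 1 allele at both loci.
            p_ab = Fraction(sum(a * b for a, b in zip(locus_a, locus_b)), n_parents)
            d = p_ab - p_a * p_b
            values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return sorted(values)


if __name__ == "__main__":
    for n_p in (2, 3, 4):
        print(n_p, [str(v) for v in possible_r2(n_p)])
```

For two, three and four parents this reproduces the sets {1}, {1/4, 1} and {0, 1/9, 1/3, 1}.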
Figure 3B shows the average LD between QTL-SNP pairs in synthetics as a function of $\Delta$ and $N_P$. The level of the LD curve dropped rapidly as $N_P$ increased from 2 to 8 and approached the curve of ancestral LD. Under LEA, LD in synthetics was still substantial for $N_P$ = 4 due to sample LD, yet successively approached zero as $N_P$ was increased further. For $N_P$ > 2, the presence of ancestral LD resulted in elevated LD in the synthetics, where the increment was large between tightly linked QTL-SNP pairs ($\Delta$ < 1 cM) for both ancestral populations and moderate between loosely linked loci ($\Delta$ > 1 cM) for LR.

Linkage phase similarity between training and prediction set: For scenario Re-LDA-SNP ($P_{TS} = P_{PS}$), linkage phase similarity between TS and PS exceeded 0.8 up to $\Delta$ = 20 cM, regardless of the ancestral population (Figure 4). By comparison, values were much lower for Un-LDA-SNP ($P_{TS} \cap P_{PS} = \emptyset$). Increasing $N_P$ reduced linkage phase similarity only marginally for Re-LDA-SNP even for $\Delta$ = 20 cM, but resulted in a substantial increase for Un-LDA-SNP. The higher ancestral LD in LR resulted only in a minor increase in linkage phase similarity in Re-LDA-SNP, but in a large increase for Un-LDA-SNP, irrespective of $N_P$. Since permuting QTL haplotypes eliminated ancestral LD in scenario Re-LEA-SNP, linkage phase similarity was identical for SR and LR and showed similar results as Re-LDA-SNP for SR (results not shown).

Influence of ancestral linkage disequilibrium and number of parents on prediction accuracy: With an increasing number of parents $N_P$, PA declined for all “Re”-scenarios (except Re-LDA-Ped), but increased for all “Un”-scenarios (Figure 5), where the strongest changes occurred between $N_P$ = 2 and 8 for all scenarios. The highest PA was always achieved by scenario Re-LDA-QTL, closely followed by Re-LDA-SNP for small $N_P$, with an increasing difference for larger $N_P$. PA increased when shifting from low (SR) to high (LR) ancestral LD for scenario Re-LDA-SNP, but decreased for Re-LEA-SNP. For scenario Re-LDA-Tag, PA was always intermediate between Re-LDA-SNP and Re-LEA-SNP. For Re-LDA-Ped, PA concavely increased from $N_P$ = 2 up to its maximum value of 0.4 for $N_P$ = 8, followed by a minor decrease. Re-LDA-Ped and Re-LEA-SNP approached identical PA for large $N_P$ under long-range LDA in LR, whereas Re-LEA-SNP retained superior PA under short-range LDA in SR. For Un-LDA-QTL, PA strongly increased for both ancestral populations, especially from $N_P$ = 2 to 8, followed by a moderate increase. For Un-LDA-SNP, the overall level of PA was much lower, but followed a similarly increasing curve as Un-LDA-QTL for long-range LDA, whereas under short-range LDA, PA was almost consistently < 0.2 without sizeable increase for all values of $N_P$.

Influence of training set size and marker density on prediction accuracy: Increasing TS size ($N_{TS}$) from 125 to 750 individuals was overall most beneficial for all “Re”-scenarios, except for Re-LDA-Ped (Figure S3). Conversely, Re-LDA-Ped, as well as Un-LDA-SNP under short-range LDA, showed only a minor increase in PA for larger $N_{TS}$. However, for Un-LDA-SNP under long-range LDA and for Un-LDA-QTL under both short- and long-range LDA, the increase in PA along with $N_{TS}$ was notable, especially for $N_P$ > 8.

Reducing the marker density from 5 SNPs cM-1 to 0.25 SNPs cM-1 resulted in a substantial
380
reduction of PA for all “SNP”-scenarios (Figure S4). This reduction was reinforced for scenarios utilizing
381
ancestral LD (Re-LDA-SNP and Un-LDA-SNP), especially in the presence of long-range LDA and for large
382
values of 𝑁𝑃 .
DISCUSSION
In plant breeding, GP has been applied to various types of populations such as single or multiple biparental families or diversity panels of inbred lines. These materials differ fundamentally in their pedigree structure, the number of founder individuals involved in their development, as well as the LD in the ancestral population from which they were taken. Synthetics are especially suited for systematically assessing the influence of these factors on prediction accuracy, because the variable number of parents used for generating synthetics leads to (i) different pedigree relationships among individuals and (ii) a trade-off between ancestral LD and sample LD arising in the parents. Thus, our approach provides new insights into how these factors influence the ability of molecular markers to capture actual relationships at causal loci, which determines the accuracy in various applications of GP.
Influence of the number of parents and ancestral LD on actual relationships at causal loci and prediction accuracy: The accuracy of GP relies on (i) the distribution of actual relationships 𝑞𝑖𝑗 at causal loci (QTL) between individuals in the TS and PS and (ii) the quality of the approximation of 𝑞𝑖𝑗 by marker-derived genomic relationships 𝑔𝑖𝑗 (Goddard et al. 2011; Habier et al. 2013). We first investigated PA using the actual relationship matrix 𝑸, which provides an upper bound of PA given fixed values for 𝑁𝑃, 𝑁𝑇𝑆 and ℎ² (de los Campos et al. 2013). Subsequently, we estimated 𝑸 by the marker-derived genomic relationship matrix 𝑮 and inferred how the three sources of information (pedigree relationships captured by SNPs, co-segregation, and ancestral LD) contributed to PA.
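As a minimal sketch of how a marker-derived relationship matrix such as 𝑮 can be computed, the following Python function builds a VanRaden-type genomic relationship matrix from genotypes coded 0/1/2. The genotype coding, the function name `genomic_relationship`, and the scaling by 2Σp(1 − p) are assumptions made for illustration and may differ from the estimator used in this study. Applying the same function to QTL genotypes instead of SNP genotypes would give an analogue of 𝑸.

```python
def genomic_relationship(genotypes):
    """Return a VanRaden-type relationship matrix from 0/1/2 genotype codes.

    genotypes: list of rows, one row per individual, one column per locus.
    (Illustrative assumption; not necessarily the estimator of the study.)
    """
    n = len(genotypes)
    n_loci = len(genotypes[0])
    # Frequency of the counted allele at each locus.
    freqs = [sum(row[k] for row in genotypes) / (2.0 * n) for k in range(n_loci)]
    # Scaling factor: twice the sum of p(1 - p) over loci.
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all loci are monomorphic; relationships are undefined")
    # Centre genotypes by twice the allele frequency.
    z = [[row[k] - 2.0 * freqs[k] for k in range(n_loci)] for row in genotypes]
    return [[sum(z[i][k] * z[j][k] for k in range(n_loci)) / scale
             for j in range(n)] for i in range(n)]


# Two fully inbred individuals carrying opposite alleles at both loci.
G = genomic_relationship([[0, 2], [2, 0]])
print(G)
```

With only two individuals and two loci the matrix is of course uninformative; the example is meant only to show the centring and scaling steps.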
Actual relationships 𝑞𝑖𝑗 between two individuals 𝑖 and 𝑗 can be decomposed into
𝑞𝑖𝑗 = 𝑓𝑖𝑗 + 𝑚𝑖𝑗 + 𝜉𝑖𝑗 ,    (3)
where 𝑓𝑖𝑗 is their expected IBD relationship at QTL, 𝑚𝑖𝑗 = 𝜏𝑖𝑗 − 𝑓𝑖𝑗 is the deviation of the actual IBD relationship 𝜏𝑖𝑗 from the expected IBD relationship due to Mendelian sampling, and 𝜉𝑖𝑗 = 𝑞𝑖𝑗 − 𝜏𝑖𝑗 is the deviation of the actual (IBS) relationship from the actual IBD relationship. Whereas 𝑓𝑖𝑗 and 𝑚𝑖𝑗 provide information solely with respect to the parents (i.e., the founders of the pedigree), 𝜉𝑖𝑗 accounts also for actual relationships among the parents (Powell et al. 2010; Vela-Avitúa et al. 2015).
If TS and PS are related (“Re”-scenarios), the distribution of 𝑓𝑖𝑗 depends on 𝑁𝑃 (Figure 6A) and on the mating scheme employed for production of the synthetic (Figure S1). For small 𝑁𝑃, this distribution is dominated by full-sib and half-sib relationships, whereas distantly related and unrelated individuals dominate for larger 𝑁𝑃. The closer the pedigree relationships between individuals, the longer are the chromosome segments they inherit from common ancestors and the larger is the conditional variance in actual IBD relationships, i.e., var(𝑚𝑖𝑗|𝑓𝑖𝑗) (Figure 6B; Hill and Weir 2011; Goddard et al. 2011). In other words, var(𝑚𝑖𝑗|𝑓𝑖𝑗) is inversely proportional to the number of independently segregating chromosome segments and, hence, the length and number of chromosomes must be taken into account when transferring our results to other species. For example, in bread wheat (2n = 42), var(𝑚𝑖𝑗|𝑓𝑖𝑗), and consequently the PA attributable to the Mendelian sampling term, is expected to be smaller than in maize (2n = 20).
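The shift from full-sib and half-sib pairs towards unrelated pairs with increasing 𝑁𝑃 can be illustrated with a toy enumeration. The sketch below is not the mating scheme of Figure S1: it assumes unrelated inbred parents, all possible single crosses among them, equally sized families, and two individuals drawn independently at random from the resulting F1-derived families. It also ignores the additional random-mating generation used for the synthetics. The function name `pair_class_frequencies` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def pair_class_frequencies(n_parents):
    """Frequencies of full-sib, half-sib and unrelated pairs among F1 families.

    Toy assumptions: unrelated inbred parents, all possible single crosses
    (half diallel), equally sized families, and two individuals drawn
    independently at random.
    """
    crosses = list(combinations(range(n_parents), 2))
    counts = {"full-sib": 0, "half-sib": 0, "unrelated": 0}
    for cross_a in crosses:
        for cross_b in crosses:
            # Number of parents the two families have in common.
            shared = len(set(cross_a) & set(cross_b))
            label = {2: "full-sib", 1: "half-sib", 0: "unrelated"}[shared]
            counts[label] += 1
    total = len(crosses) ** 2
    return {label: Fraction(c, total) for label, c in counts.items()}


for n_parents in (2, 4, 8, 16):
    print(n_parents, pair_class_frequencies(n_parents))
```

Under these assumptions, all pairs are full-sibs for two parents, whereas unrelated pairs make up more than half of all pairs for eight or more parents.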
The contribution of 𝜉𝑖𝑗 to 𝑞𝑖𝑗 depends on the level of ancestral LD. Elevated LDA increases var(𝑞𝑖𝑗) in the ancestral population (Figure S2, LR vs. SR) and in turn increases the variation in similarity of haplotypes among parents sampled therefrom (Habier et al. 2013). Consequently, var(𝑞𝑖𝑗|𝑓𝑖𝑗) in synthetics increases with ancestral LD (Figures 6B and S2), on top of the variance var(𝑚𝑖𝑗|𝑓𝑖𝑗) caused by Mendelian sampling. Assuming known actual relationships and fixed TS size, PA therefore decreases if (i) 𝑁𝑃 increases or (ii) ancestral LD decreases (Figure 5). This is because both factors reduce the absolute frequency of close actual relationships between TS and PS (Figure S5). If actual relationships among the parents were not accounted for, the decline in PA was reinforced as 𝑁𝑃 increased (Figure 5, scenario Re-LDA-Tag). The reason for this follows from the decomposition (Eq. 3): the larger 𝑁𝑃, the more frequent are pairs of individuals with small or zero pedigree relationship (Figure 6A) and the more important it is to account for actual relationships among parents. Conversely, PA was consistently higher for small 𝑁𝑃 due to strong pedigree relationships and Mendelian sampling, despite the accompanying negative effects on PA of reduced heritability in the TS and reduced additive-genetic variance in the PS (Table S2).
Restricting predictive information to pedigree relationships (scenario Re-LDA-Ped) resulted in only moderate PA (Figure 5). The exception was 𝑁𝑃 = 2, where all individuals in the TS and PS were full-sibs (Figure 6A). This resulted in identical estimated breeding values by pedigree BLUP, so that PA could not be calculated (indicated as PA = 0 in Figure 5). For 𝑁𝑃 > 2, there was variation in pedigree relationships in synthetics (Figure 6A) and thus, PA > 0. Further research is warranted on the importance of variation in pedigree relationships for GP in the presence of Mendelian sampling and ancestral LD, e.g., by considering mating schemes such as MAGIC, which reduce or even entirely avoid variation in pedigree relationships.
If the TS and PS are unrelated (“Un”-scenarios), only 𝜉𝑖𝑗 contributes to variation in 𝑞𝑖𝑗, because 𝑓𝑖𝑗 and 𝑚𝑖𝑗 are equal to zero. Moreover, if 𝑁𝑃 is small, QTL in the TS and PS can (i) be fixed for different alleles (Table S2) or (ii) differ in their LD structure due to sample LD. This limits the occurrence of close actual relationships between TS and PS (Figure S5, Un-LDA-QTL) and reduces the upper bounds of PA compared with the corresponding “Re”-scenarios (Figure 5, Un-LDA-QTL vs. Re-LDA-QTL). As 𝑁𝑃 increases, allele frequencies and LD between loci converge towards those in the ancestral population in both “Re”- and “Un”-scenarios (Table S2). In turn, the closest actual relationships between TS and PS converge as well (Figure S5), resulting ultimately in similar PA for Re-LDA-QTL and Un-LDA-QTL (Figure 5). In conclusion, the difference in predicting related and unrelated genotypes vanishes as 𝑁𝑃 increases for a given TS size, because it is then primarily ancestral information that drives the accuracy of GP.
Sample LD and co-segregation as crucial factors for prediction accuracy in synthetics: LD in the parents represents a combination of LD carrying over from the ancestral population and LD generated anew due to limited 𝑁𝑃. The latter LD, herein referred to as sample LD, results from a bottleneck in population size similar to that used in our simulations for generating long-range LD in the ancestral population (cf. Figure S1), but can be much stronger if 𝑁𝑃 is small (e.g., 4). Co-segregation is defined as the co-inheritance of alleles at linked loci on the same gamete and thus describes the process that prevents parental LD between them from being rapidly eroded by recombination (Figure S6). Together, sample LD and co-segregation result in high LD in synthetics, which for small 𝑁𝑃 exceeds by far the level of ancestral LD (Figure 3B, see File S3 for details). The crucial property of sample LD, however, is that it is specific to a set of parents and thus provides predictive information only for their descendants. Hence, using co-segregation as “source of information” in GP relies on the presence of pedigree relationships (Habier et al. 2013). Conversely, the fraction of parental LD that stems from ancestral LD is a commonality among all descendants of the ancestral population, irrespective of pedigree relationships. The particularly small number of parents used in synthetics makes sample LD and co-segregation crucial factors contributing to PA, a situation that differs greatly from previously investigated scenarios (e.g., Habier et al. 2007, 2013; Wientjes et al. 2013). Hence, knowledge of how ancestral LD and sample LD contribute to parental LD, depending on 𝑁𝑃, is essential for evaluating the applicability of training data to prediction of both related and unrelated genotypes.
The influence of sample LD on parental LD and PA in the “Re”-scenarios is illustrated best by considering different values of 𝑁𝑃. For 𝑁𝑃 = 2, sample LD in the parents is maximized, because all pairs of polymorphic loci are in complete LD (r² = 1.0), irrespective of ancestral LD, linkage or genetic map distance. Co-segregation of linked QTL and SNPs during intermating largely conserves LD, even for loosely linked loci (Figure S6), so that LD in synthetics remained at high levels (Figure 3B). Therefore, replacing 𝑸 with 𝑮 resulted in merely a marginal reduction of PA (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). Previous studies claimed that PA in biparental populations is the maximum obtainable for given TS size (Riedelsheimer et al. 2013; Lehermeier et al. 2014), despite absence of variation in pedigree relationships. Our results demonstrate that this is exclusively attributable to the efficient utilization of sample LD via co-segregation. For 𝑁𝑃 = 3 and 4, LD can take two and four discrete values, respectively (see Results). Thus, sample LD still takes up a large share of parental LD (Figure 3A). However, the occurrence of different LD values (in contrast to 𝑁𝑃 = 2) introduces a dependency on ancestral LD: the frequency of loci with high parental LD increases in the presence of ancestral LD compared with ancestral linkage equilibrium. This difference carried over during intermating and resulted in increased LD in the synthetics, especially under long-range ancestral LD (Figure 3B, LR). However, the increment in PA was only marginal (Figure 5, Re-LDA-SNP vs. Re-LEA-SNP) owing to the overriding contribution of sample LD to parental LD for 𝑁𝑃 = 3 and 4. Nevertheless, the reduction in sample LD for 𝑁𝑃 = 3 or 4, compared with 𝑁𝑃 = 2, impaired co-segregation information and reinforced the decline in PA when relying on markers (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). For 𝑁𝑃 ≥ 16, sample LD became negligible (Figure 3A) so that parental LD hardly differed from ancestral LD. This led to (i) a reinforced reduction in PA when using markers rather than known QTL genotypes (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially for short-range ancestral LD, and (ii) convergence of PA of GBLUP and pedigree BLUP in the absence of ancestral LD (Figure 5, Re-LEA-SNP vs. Re-LDA-Ped). The reason for the latter is that under marginal contribution of co-segregation, PA stems primarily from capturing pedigree relationships by SNPs.
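The discrete LD values mentioned for small 𝑁𝑃 can be checked by brute-force enumeration. The following sketch assumes fully homozygous parents, biallelic loci, and equal weight for each parental haplotype. It lists every allele configuration that is polymorphic among the parents and collects the distinct r² values over all pairs of such configurations. The function name `distinct_r2_values` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def distinct_r2_values(n_parents):
    """Distinct r^2 values between polymorphic biallelic loci among inbred parents.

    Each locus is described by the 0/1 allele carried by each homozygous parent;
    every parent contributes one haplotype with equal weight.
    """
    # All allele configurations that are polymorphic among the parents.
    loci = [cfg for cfg in product((0, 1), repeat=n_parents)
            if 0 < sum(cfg) < n_parents]
    values = set()
    for a in loci:
        for b in loci:
            p_a = Fraction(sum(a), n_parents)
            p_b = Fraction(sum(b), n_parents)
            p_ab = Fraction(sum(x * y for x, y in zip(a, b)), n_parents)
            d = p_ab - p_a * p_b  # gametic disequilibrium coefficient
            values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return sorted(values)


for n_parents in (2, 3, 4):
    print(n_parents, [str(v) for v in distinct_r2_values(n_parents)])
```

Under these assumptions the enumeration yields a single value (r² = 1) for two parents, two values (1/4 and 1) for three parents, and four values (0, 1/9, 1/3 and 1) for four parents.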
For the “Un”-scenarios, sample LD is manifested independently in 𝑃𝑇𝑆 and 𝑃𝑃𝑆, which results in different co-segregation “patterns” in TS and PS that cannot reliably be exploited in GP. Therefore, the ancestral LD that is common to both sets of parents, measured by linkage phase similarity in the synthetics (Figure 4), provides the only source of information connecting the TS and PS. This constraint resulted in a much larger drop in PA when replacing 𝑸 with 𝑮 in the “Un”-scenarios (Figure 5, Un-LDA-QTL vs. Un-LDA-SNP) compared with the corresponding “Re”-scenarios (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially under short-range ancestral LD. This decline in PA when predicting the genetic merit of unrelated instead of related genotypes corroborates previous findings on GP across populations in both animal and plant breeding (Hayes et al. 2009b; de Roos et al. 2009; Technow et al. 2013; Riedelsheimer et al. 2013; Albrecht et al. 2014; Heslot and Jannink 2015).
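To make the notion of linkage phase similarity concrete, the sketch below computes the correlation of signed LD coefficients r between two sets of haplotypes. This is one common way to quantify persistence of linkage phase and is used here as an assumption; the measure underlying Figure 4 may be defined differently. The function names and the tiny haplotype sets are hypothetical.

```python
import math


def signed_r(haplotypes, i, j):
    """Signed LD correlation r between loci i and j from 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))
    return (p_ij - p_i * p_j) / denom


def phase_similarity(haps_a, haps_b, locus_pairs):
    """Pearson correlation of signed r values between two populations."""
    r_a = [signed_r(haps_a, i, j) for i, j in locus_pairs]
    r_b = [signed_r(haps_b, i, j) for i, j in locus_pairs]
    mean_a = sum(r_a) / len(r_a)
    mean_b = sum(r_b) / len(r_b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(r_a, r_b))
    var_a = sum((x - mean_a) ** 2 for x in r_a)
    var_b = sum((y - mean_b) ** 2 for y in r_b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical haplotypes: loci 0 and 1 are in complete LD, locus 2 is independent.
haps_ts = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
# Same structure, but with the linkage phase between loci 0 and 1 reversed.
haps_ps_reversed = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
pairs = [(0, 1), (0, 2), (1, 2)]

print(phase_similarity(haps_ts, haps_ts, pairs))
print(phase_similarity(haps_ts, haps_ps_reversed, pairs))
```

Identical phase patterns give a similarity near 1, whereas a reversed phase at the only pair in LD gives a similarity near −1. The latter is the situation in which marker effects estimated in the TS would point in the wrong direction in the PS.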
Variation in linkage phase similarity between TS and PS caused by sample LD affects GP of unrelated genotypes in an unpredictable manner: while identical and reversed QTL-SNP linkage phases manifested by sample LD cancel out on average, individual TS-PS combinations can show above- or below-average linkage phase similarity and, thus, similarity of co-segregation “patterns”. This translates into large variation of PA among different TS-PS combinations. Additional simulations using unequal 𝑁𝑃 to derive the TS and PS showed that variation in PA was even higher when using small 𝑁𝑃 to generate the PS than when using small 𝑁𝑃 to generate the TS (Figure S7). A possible explanation might be that regardless of the TS composition, small 𝑁𝑃 for the PS drastically reduces the frequency of polymorphic loci (Table S2) and thereby increases the variation in linkage phase similarity with the TS for the remaining loci, which in turn increases the variability of prediction. Considering the practical relevance of such prediction scenarios, further research is needed to investigate this finding in greater detail.
Influence of LD on capturing pedigree relationships: The ability to capture pedigree relationships by SNPs increases with the effective number of independently segregating SNPs in the model (Habier et al. 2007). Higher LD between SNPs reduces this number and thus reduces the contribution of pedigree relationships captured by SNPs to PA. Scenario Re-LEA-SNP demonstrates this for large values of 𝑁𝑃, where LD between QTL and SNPs in synthetics was small (Figure 3B) and hence, PA mainly relied on capturing pedigree relationships. In line with this reasoning, PA decreased from SR to LR (Figure 5, Re-LEA-SNP) as well as when marker density was reduced from 5 to only 0.25 SNPs cM⁻¹ (Figure S4, Re-LEA-SNP), because low marker density, like increased LD, reduces the number of independently segregating SNPs.
In GBLUP, the consequences of an imprecise estimation of pedigree relationships by SNPs due to strong LD are limited, because the loss in PA compared with pedigree BLUP is mostly overcompensated for by capturing either co-segregation (Figure 5; small 𝑁𝑃, Re-LEA-SNP vs. Re-LDA-Ped) or long-range ancestral LD (Figure 5; large 𝑁𝑃, Re-LEA-SNP vs. Re-LDA-Ped). An exception is the combination of large 𝑁𝑃 and short-range ancestral LD, where the comparatively small contribution of ancestral LD to PA does not necessarily compensate for that loss, so that GBLUP might not provide the desired advantage over pedigree BLUP. Alternative models employing variable selection (e.g., BayesB), which capitalize more on LD than on pedigree relationships (Habier et al. 2007; Zhong et al. 2009; Jannink et al. 2010), might help to improve the prospects of GP in such cases.
Influence of training set size on prediction accuracy: In this study, we varied training set size 𝑁𝑇𝑆 for given values of 𝑁𝑃, because resources devoted to the TS differ between breeding programs and do not necessarily depend on 𝑁𝑃. Under fixed 𝑁𝑃, the absolute frequency of individuals with close actual relationship between TS and PS increases with 𝑁𝑇𝑆 (Figure S5), which led to similar benefits in PA for all “Re”-scenarios (Figure S3, except Re-LDA-Ped). However, the general decline of PA in these scenarios with increasing 𝑁𝑃 was only slightly attenuated even when using 750 instead of 125 individuals in the TS. This is because the need for larger TS size increases rapidly as pedigree relationships with the PS decrease (Habier et al. 2010), which in turn shifts the distribution of actual relationships toward lower values (Figures 6B and S2). Thus, 𝑁𝑇𝑆 must generally be increased along with 𝑁𝑃 to counteract as much as possible the expected decline in PA.
According to Habier et al. (2013), altering TS size affects the contributions of the three sources of information to PA, but this inference is based on the assumption that TS size was increased by adding new families to the TS (unrelated to the initially included families), which is comparable to increasing 𝑁𝑃 in our study. De los Campos et al. (2013) showed that the estimation of actual relationships by SNPs is sufficiently characterized by 𝑅²𝑘,𝑞 (Figure S8) and thus, largely independent of 𝑁𝑇𝑆, apart from estimation error. In synthetics, the distribution of actual relationships 𝑞𝑖𝑗 is defined by 𝑁𝑃 and LDA (Figures 6 and S2). Thus, increasing 𝑁𝑇𝑆 increases the chances for each individual in the PS to have several individuals with close actual relationships 𝑞𝑖𝑗 in the TS, which was previously found to be crucial for achieving high PA (Jannink et al. 2010; Clark et al. 2012). Therefore, the contributions to PA from co-segregation and ancestral LD increase with 𝑁𝑇𝑆, because they are required to capture deviations from pedigree relationships. Conversely, using small 𝑁𝑇𝑆 will tend to hamper the occurrence of high 𝑞𝑖𝑗 values and hence increase the reliance on pedigree relationships.
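The argument that a larger TS raises the chance of finding several close relatives for each PS individual can be illustrated with a simple binomial calculation. The sketch below is a toy model: it assumes that each TS individual independently exceeds some relationship threshold with a fixed probability `p_close`, which is a hypothetical quantity standing in for the upper tail of the distribution of 𝑞𝑖𝑗. In real synthetics, relationships within the TS are not independent.

```python
from math import comb


def prob_at_least_k_close(n_ts, p_close, k):
    """P(at least k of n_ts training individuals are 'close' to a PS individual).

    Toy assumption: each training individual independently exceeds the chosen
    relationship threshold with probability p_close (binomial model).
    """
    below = sum(comb(n_ts, x) * p_close ** x * (1 - p_close) ** (n_ts - x)
                for x in range(k))
    return 1.0 - below


# Hypothetical tail probability of 1% and a requirement of three close relatives.
for n_ts in (125, 250, 500, 750):
    print(n_ts, round(prob_at_least_k_close(n_ts, 0.01, 3), 3))
```

Under this model the probability rises steeply between 125 and 750 training individuals, whereas a smaller `p_close`, as expected for larger 𝑁𝑃 or unrelated TS and PS, requires a correspondingly larger TS for the same probability.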
If TS and PS are unrelated, the absolute frequency of close actual relationships is low, even if 𝑁𝑇𝑆 is large (Figure S5). Additionally, actual relationships are rather poorly estimated by SNPs when relying solely on ancestral LD (Figure S8, Un-LDA-SNP). Consequently, huge 𝑁𝑇𝑆 (>> 750) and high marker density would be required to substantially elevate PA, especially if there is only short-range ancestral LD (cf. de los Campos et al. 2013).
Influence of marker density on prediction accuracy: High marker density is especially important if LD between QTL and SNPs extends only to short map distances (Solberg et al. 2008; Zhong et al. 2009; Hickey et al. 2014). This applies in our study if either sample LD was negligible (Figure S4; large 𝑁𝑃, Re-LDA-SNP vs. Re-LEA-SNP) or if TS and PS were unrelated (Figure S4, Un-LDA-SNP), so that PA relied heavily on ancestral LD. Our results also show that in the latter case, using high marker density strongly improved PA for both ancestral populations, implying that capturing LD between tightly linked loci (∆ < 1 cM) is beneficial even if long-range ancestral LD prevails. With low marker density, capturing only the “long-range part” of ancestral LD (Figure 1, LR) still provided moderate PA (Figure S4, LR), but PA dropped below 0.1 for short-range ancestral LD (Figure S4, SR). This was likely because most SNPs were no longer in LD with QTL and thus contributed mostly noise to the prediction equation. These results are in agreement with former studies (de los Campos et al. 2013; Habier et al. 2013; Hickey et al. 2014; Lorenz and Smith 2015) reporting that under insufficient marker density, adding individuals unrelated to the PS to the TS can even decrease PA.
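A back-of-the-envelope calculation shows why the lowest marker density cannot capture LD between tightly linked loci. The sketch assumes evenly spaced SNPs, which is a simplification and not necessarily how SNP positions were assigned in the simulations. The function name `max_qtl_snp_distance` is hypothetical.

```python
def max_qtl_snp_distance(snps_per_cm):
    """Largest distance (cM) from a QTL to its nearest SNP under even spacing."""
    spacing = 1.0 / snps_per_cm  # distance between adjacent SNPs in cM
    # The worst case is a QTL located midway between two adjacent SNPs.
    return spacing / 2.0


for density in (5.0, 1.0, 0.25):
    print(density, "SNPs per cM ->", max_qtl_snp_distance(density), "cM")
```

Under even spacing, a QTL is at most 0.1 cM away from the nearest SNP at 5 SNPs per cM, but up to 2 cM away at 0.25 SNPs per cM, which is beyond the range of tight linkage (∆ < 1 cM) considered above.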
In summary, the required marker density for 𝑁𝑃 ≥ 16 should be chosen in accordance with the extent of ancestral LD. In this case, high density is mandatory if TS and PS are unrelated, whereas moderate PA can still be achieved under low marker density if TS and PS are related, because pedigree relationships contribute to PA. For small 𝑁𝑃, extensive LD in synthetics (due to sample LD and co-segregation) lowers the requirements on marker density. Although co-segregation is captured optimally if SNPs and QTL are as tightly linked as possible, medium marker density (≥ 1 SNP cM⁻¹, depending on 𝑁𝑃) is likely sufficient to reach PA near the optimum.
Expected impact of ancestral LD on GP in synthetics: In GP of genetic predisposition in humans or breeding values of bulls, the availability of several thousand training individuals, in conjunction with high marker densities, allows for efficient use of rather low levels of ancestral LD, as usually observed in these species (de Roos et al. 2008; Goddard and Hayes 2009; de los Campos et al. 2013). We showed that short-range ancestral LD is generally less valuable in plant breeding, where TS usually comprise only hundreds of individuals or fewer. Ancestral LD can differ substantially among crops and among different germplasm within crops (Flint-Garcia et al. 2003). Usually, low levels of ancestral LD are found in diversity panels that encompass lines from different breeding programs and/or geographic origins as well as in materials largely unselected by breeders, such as landraces or gene bank accessions (Hyten et al. 2007; Delourme et al. 2013; Romay et al. 2013). Recently, Gorjanc et al. (2016) proposed GP for recurrent selection of synthetics generated from doubled haploid lines derived from landraces. In the light of our findings, such an approach generally requires large TS size and high marker density to outperform pedigree BLUP, unless one chooses small 𝑁𝑃 to ensure satisfactory PA due to co-segregation.
In contrast, extensive long-range ancestral LD is usually found in elite breeding germplasm of major crops such as maize (Windhausen et al. 2012; Unterseer et al. 2014), wheat (Maccaferri et al. 2005), barley (Zhong et al. 2009), soybean (Hyten et al. 2007) or sugar beet (Würschum et al. 2013). If synthetics are derived from such germplasm, ancestral LD is expected to contribute substantially to PA, as shown by our results. However, LD determined from biallelic SNPs might overestimate ancestral LD between QTL-SNP pairs, because their allele frequencies can deviate due to ascertainment bias in discarding SNPs with low minor allele frequencies for the construction of SNP arrays (Ganal et al. 2011; Goddard et al. 2011). Such an overestimation would impair the advantage of GP approaches over pedigree BLUP.
Implications for other scenarios relevant in plant breeding: Research on GP in plant breeding has so far focused primarily on the use of single (e.g., Lorenzana and Bernardo 2009; Riedelsheimer et al. 2013) and multiple segregating biparental families (BF) (e.g., Heffner et al. 2011; Albrecht et al. 2011; Schulz-Streeck et al. 2012; Habier et al. 2013; Lehermeier et al. 2014). For 𝑁𝑃 = 2, our scenarios Re-LDA-SNP and Un-LDA-SNP correspond exactly to GP within and between BF derived from unrelated parents. In practice, breeders mostly derive lines directly from F1 crosses (Mikel and Dudley 2006), whereas we applied a further generation of intermating (Figure S1). This additional meiosis slightly reduces LD in synthetics (see File S3 for details) and in turn, PA (results not shown). While GP within BF generally works well, predicting an unrelated BF can be risky and unreliable (Riedelsheimer et al. 2013) as underlined by our results for scenario Un-LDA-SNP (Figure S7, 𝑁𝑃 = 2). Similar uncertainties might be encountered if new lines from an untested BF are predicted based on pre-existing data from multiple BF (Heffner et al. 2011), diversity panels (Würschum et al. 2013) or populations of experimental hybrids (Massman et al. 2013) to obtain predicted breeding values prior to partially phenotyping the new cross (Figure S7, 𝑁𝑃 > 2 in TS and 𝑁𝑃 = 2 in PS). The risk of such approaches is likely attenuated in advanced breeding cycles, where putatively “unrelated” BF usually share more recent common ancestors than a TS comprising truly unrelated material, as would be the case in an “ideal” diversity panel. However, Hickey et al. (2014) showed that if two BF share only a grand-parent as their most recent common ancestor, PA was not substantially higher than for unrelated BF. This underpins the need for close relatives in the TS (e.g., full-sibs or half-sibs) to warrant high and robust PA across different prediction targets. Accordingly, previous studies on GP in diversity panels concluded that the observed medium to high PAs were partially attributable to latent groups of related germplasm (e.g., Rincent et al. 2012; Schopp et al. 2015).
If a BF is too small for training the prediction equation, multiple BF can alternatively be pooled together (Heffner et al. 2011; Technow and Totir 2015). Such a combined TS can be constructed by sampling lines from each BF to predict the remainder in each BF (“within”) or by using some BF to predict other BF (“across”) (cf. Albrecht et al. 2011). Our scenarios Re-LDA-SNP and Un-LDA-SNP are similar to these “within” and “across” situations for 𝑁𝑃 > 2 but, besides the additional meiosis discussed above, show another important difference to F1-derived multiple BF: generating synthetics by random mating of the Syn-1 generation breaks up the clear pedigree structure in full-sib, half-sib and unrelated families (Figure S9). This reduces both the mean and variance of pedigree relationships, which in turn reduces PA (results not shown). As discussed above, capturing pedigree relationships plays a major role in GP of both synthetics and multiple BF if TS and PS are related, especially if 𝑁𝑃 is large. This is because in both situations, co-segregation is barely used to obtain “accuracy within families” (cf. Habier et al. 2013). In practical breeding programs using multiple BF, the situation might be slightly different if some parents are overrepresented compared with others and introduce predominant linkage phase patterns that can be exploited in GP. Moreover, one has the opportunity to improve information from co-segregation by (i) clustering related BF into the TS to reflect the co-segregation pattern of the PS or (ii) explicit modelling of co-segregation (cf. Habier et al. 2013) or family-specific effects using hierarchical models (Technow and Totir 2015). However, neither of these strategies is easily accessible in synthetics, unless one replaces random by controlled mating in order to keep track of pedigree relationships. Since ancestral LD persists well over generations (Habier et al. 2007), its contribution to PA is expected to be only marginally affected by additional intermating generations. Thus, ancestral LD can generally be considered of great importance for GP of material related or unrelated to the TS, particularly if 𝑁𝑃 is large.
In the present study, we considered only the two extremes of relatedness between the TS and PS, in which their parents were either identical or entirely different. Further research is warranted for situations in which parents partially overlap among families, which occurs frequently in practice, e.g., when proven inbred lines contribute to multiple crosses in subsequent breeding cycles. Moreover, we focused here exclusively on PA, but the genetic gain from genomic selection, which is of ultimate interest to breeders, depends additionally on the genetic variance in the population. Since both parameters are influenced by the choice of 𝑁𝑃, the potential of recurrent genomic selection in synthetics needs to be examined for different values of 𝑁𝑃 and different levels of ancestral LD, ideally across multiple selection cycles.
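The joint dependence of genetic gain on accuracy and genetic variance can be sketched with the classical expression for the expected response to selection, R = i · r · σA. The sketch assumes that PA can be read as the correlation r between predicted and true breeding values, that σA is the additive-genetic standard deviation in the selection candidates, and that i is the selection intensity. All numerical values are hypothetical and are not results of this study.

```python
def expected_gain_per_cycle(selection_intensity, prediction_accuracy, sigma_a):
    """Expected response to one cycle of selection: i * r * sigma_A."""
    return selection_intensity * prediction_accuracy * sigma_a


# Hypothetical values: a small number of parents is assumed to give higher
# accuracy but a lower additive-genetic standard deviation, and vice versa.
gain_small_np = expected_gain_per_cycle(1.755, 0.8, 0.7)
gain_large_np = expected_gain_per_cycle(1.755, 0.6, 1.0)
print(gain_small_np, gain_large_np)
```

With these made-up inputs, the scenario with lower accuracy still yields the larger expected gain, which illustrates why PA alone does not determine the preferable 𝑁𝑃.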
ACKNOWLEDGMENTS
We thank Chris-Carolin Schön, Tobias Würschum, José Marulanda, Willem Molenaar and three anonymous reviewers for valuable suggestions to improve the content of the manuscript. P.S. acknowledges Syngenta for partially funding this research through a Ph.D. fellowship, and A.E.M. acknowledges the financial contribution of CIMMYT/GIZ through the CRMA Project 15.78600.8-001-00.
DATA AVAILABILITY STATEMENT
The authors state that all simulated data and results necessary for confirming the conclusions presented in the article are represented fully within the article and data supplements. Figure S1 provides a detailed overview of the entire simulation scheme and the assumptions underlying all results presented herein.
LITERATURE CITED
Albrecht, T., H.-J. Auinger, V. Wimmer, J. O. Ogutu, C. Knaak et al., 2014 Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor. Appl. Genet. 127: 1375–1386.
Albrecht, T., V. Wimmer, H. Auinger, M. Erbe, C. Knaak et al., 2011 Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.
Bandillo, N., C. Raghavan, and P. Muyco, 2013 Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice 6: 1–15.
Bradshaw, J. E., 2016 Plant Breeding: Past, Present and Future. Springer International Publishing.
Cavanagh, C., M. Morell, I. Mackay, and W. Powell, 2008 From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11: 215–221.
Clark, S. A., J. M. Hickey, H. D. Daetwyler, and J. H. J. van der Werf, 2012 The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
Delourme, R., C. Falentin, B. F. Fomeju, M. Boillot, G. Lassalle et al., 2013 High-density SNP-based genetic map development and linkage disequilibrium assessment in Brassica napus L. BMC Genomics 14: 120.
Endelman, J. B., 2011 Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome 4: 250–255.
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics (1996 Longman, Ed.). Pearson, Essex.
Flint-Garcia, S. A., J. M. Thornsberry, and E. S. Buckler, 2003 Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54: 357–374.
Ganal, M. W., G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler et al., 2011 A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6: e28334.
Goddard, M. E., and B. J. Hayes, 2009 Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10: 381–391.
Goddard, M. E., B. J. Hayes, and T. H. E. Meuwissen, 2011 Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
Gorjanc, G., J. Jenko, S. J. Hearne, and J. M. Hickey, 2016 Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genomics 17: 30.
Habier, D., R. L. Fernando, and J. C. M. Dekkers, 2007 The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
Habier, D., R. L. Fernando, and D. J. Garrick, 2013 Genomic BLUP Decoded: A Look into the Black Box of Genomic Prediction. Genetics 194: 597–607.
Habier, D., J. Tetens, F. Seefried, P. Lichtner, and G. Thaller, 2010 The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
Hagdorn, S., K. R. Lamkey, M. Frisch, P. E. O. Guimarães, and A. E. Melchinger, 2003 Molecular genetic diversity among progenitors and derived elite lines of BSSS and BSCB1 maize populations. Crop Sci. 43: 474–482.
Hallauer, A. R., M. J. Carena, and J. de M. Filho, 2010 Quantitative genetics in maize breeding. Springer.
Hartl, D. L., and A. G. Clark, 2007 Principles of Population Genetics. Sinauer Associates, Inc.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard, 2009a Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.
Hayes, B. J., P. J. Bowman, A. C. Chamberlain, K. Verbyla, and M. E. Goddard, 2009b Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
Hayes, B. J., P. M. Visscher, and M. E. Goddard, 2009c Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. Cambridge 91: 47–60.
Heffner, E. L., J.-L. Jannink, and M. E. Sorrells, 2011 Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. Plant Genome 4: 65–75.
Henderson, C., 1984 Applications of linear models in animal breeding. University of Guelph, ON.
Heslot, N., and J.-L. Jannink, 2015 An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet. Sel. Evol. 47: 93.
Hickey, J. M., S. Dreisigacker, J. Crossa, S. Hearne, R. Babu et al., 2014 Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.
Hill, W. G., 1981 Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38: 209–216.
Hill, W. G., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.
Hill, W. G., and B. S. Weir, 2011 Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. Cambridge 93: 47–64.
Hyten, D. L., I. Y. Choi, Q. Song, R. C. Shoemaker, R. L. Nelson et al., 2007 Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175: 1937–1944.
Jannink, J.-L., A. J. Lorenz, and H. Iwata, 2010 Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.
de Koning, D.-J., 2016 Meuwissen et al. on Genomic Selection. Genetics 203: 5–7.
Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al., 2014 Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.
Lin, Z., B. J. Hayes, and H. D. Daetwyler, 2014 Genomic selection in crops, trees and forages: A review. Crop Pasture Sci. 65: 1177–1191.
Lorenzana, R. E., and R. Bernardo, 2009 Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120: 151–161.
Lorenz, A. J., and K. P. Smith, 2015 Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci. 55: 2657–2667.
de los Campos, G., A. I. Vazquez, R. Fernando, Y. C. Klimentidis, and D. Sorensen, 2013 Prediction of Complex Human Traits Using the Genomic Best Linear Unbiased Predictor. PLoS Genet. 9: 7.
Maccaferri, M., M. C. Sanguineti, E. Noli, and R. Tuberosa, 2005 Population structure and long-range linkage disequilibrium in a durum wheat elite collection. Mol. Breed. 15: 271–289.
Mackay, I., E. Ober, and J. Hickey, 2015 GplusE: beyond genomic selection. Food Energy Secur. 4: 25–35.
Massman, J. M., A. Gordillo, R. E. Lorenzana, and R. Bernardo, 2013 Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126: 13–22.
McMullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. Li et al., 2009 Genetic Properties of the Maize Nested Association Mapping Population. Science 325: 737–740.
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Mikel, M. A., and J. W. Dudley, 2006 Evolution of North American dent corn from public to proprietary germplasm. Crop Sci. 46: 1193–1205.
Powell, J. E., P. M. Visscher, and M. E. Goddard, 2010 Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11: 800–805.
R Core Team, 2012 R: A language and environment for statistical computing. ISBN 3-900051-07-0.
Riedelsheimer, C., J. B. Endelman, M. Stange, M. E. Sorrells, J. L. Jannink et al., 2013 Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.
Rincent, R., D. Laloë, S. Nicolas, T. Altmann, D. Brunel et al., 2012 Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728.
Romay, M. C., M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts et al., 2013 Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 14: R55.
de Roos, A. P. W., B. J. Hayes, and M. E. Goddard, 2009 Reliability of genomic predictions across multiple populations. Genetics 183: 1545–1553.
de Roos, A. P. W., B. J. Hayes, R. J. Spelman, and M. E. Goddard, 2008 Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179: 1503–1512.
Sargolzaei, M., and F. S. Schenkel, 2009 QMSim: a large-scale genome simulator for livestock. Bioinformatics 25: 680–681.
Schopp, P., C. Riedelsheimer, H. F. Utz, C.-C. Schön, and A. E. Melchinger, 2015 Forecasting the accuracy of genomic prediction with different selection targets in the training and prediction set as well as truncation selection. Theor. Appl. Genet. 128: 2189–2201.
Schulz-Streeck, T., J. O. Ogutu, Z. Karaman, C. Knaak, and H. P. Piepho, 2012 Genomic Selection using Multiple Populations. Crop Sci. 52: 2453–2461.
Solberg, T. R., A. K. Sonesson, J. A. Woolliams, and T. H. E. Meuwissen, 2008 Genomic selection using different marker types and densities. J. Anim. Sci. 86: 2447–2454.
Suneson, C. A., 1956 An Evolutionary Plant Breeding Method. Agron. J. 48: 188–191.
Technow, F., A. Bürger, and A. E. Melchinger, 2013 Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3 3: 197–203.
Technow, F., and L. R. Totir, 2015 Using Bayesian Multilevel Whole Genome Regression Models for Partial Pooling of Training Sets in Genomic Prediction. G3 5: 1603–1612.
Unterseer, S., E. Bauer, G. Haberer, M. Seidel, C. Knaak et al., 2014 A powerful tool for genome analysis in maize: development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics 15: 823.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
Vela-Avitúa, S., T. H. Meuwissen, T. Luan, and J. Ødegård, 2015 Accuracy of genomic selection for a sib-evaluated trait using identity-by-state and identity-by-descent relationships. Genet. Sel. Evol. 47: 9.
Wientjes, Y. C. J., R. F. Veerkamp, and M. P. L. Calus, 2013 The Effect of Linkage Disequilibrium and Family Relationships on the Reliability of Genomic Prediction. Genetics 193: 621–631.
Windhausen, V. S., G. N. Atlin, J. M. Hickey, J. Crossa, J.-L. Jannink et al., 2012 Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 2: 1427–1436.
Wright, S., 1922 Coefficients of Inbreeding and Relationship. Am. Nat. 56: 330–338.
Würschum, T., J. C. Reif, T. Kraft, G. Janssen, and Y. Zhao, 2013 Genomic selection in sugar beet breeding populations. BMC Genet. 14: 85.
Zhong, S., J. C. M. Dekkers, R. L. Fernando, and J.-L. Jannink, 2009 Factors Affecting Accuracy From Genomic Selection in Populations Derived From Multiple Inbred Lines: A Barley Case Study. Genetics 182: 355–364.
FIGURES
Figure 1 Linkage disequilibrium (LDA) between pairs of loci plotted against their genetic map distance ∆ in centimorgans (cM), for the two ancestral populations SR (short-range LD) and LR (long-range LD). The two vertical lines represent the average distance between a QTL and its closest SNP for the two marker densities investigated in our study.
Figure 2 Flowchart of the eight scenarios analyzed in this study. Training set and prediction set were either related (“Re”-scenarios) or unrelated (“Un”-scenarios). The arrows represent the changes made between scenarios, e.g., removal of ancestral LD between QTL and SNPs (LDA → LEA) or replacement of the relationship matrix (𝑮 → 𝑸). The background texture indicates whether identity-by-state or identity-by-descent information was used. For the SNP-based scenarios, the green circles show the sources of information that contributed to prediction accuracy (cf. Habier et al. 2013): in addition to LDA, RS refers to pedigree relationships at QTL captured by SNPs and CS refers to co-segregation of QTL and SNPs.
Figure 3 (A) Frequency of QTL-SNP pairs falling into 8 disjoint intervals of linkage disequilibrium (LD) in the parents of synthetics, plotted against their genetic map distance ∆, for three different numbers of parents 𝑁𝑃. (B) Average LD between QTL-SNP pairs, plotted against their genetic map distance ∆, for synthetics generated from different 𝑁𝑃. The mean LD in the respective ancestral population (LDA) is shown for comparison (red curves). The left column in A and B refers to scenarios Re-LEA-SNP and Un-LEA-SNP (independent of the ancestral population), where ancestral LD between QTL and SNPs was eliminated, whereas the other two columns correspond to all other scenarios, for the ancestral populations SR (short-range LD) and LR (long-range LD), respectively.
Figure 4 Linkage phase similarity of QTL-SNP pairs in the training set (TS) and prediction set (PS) for scenarios Re-LDA-SNP and Un-LDA-SNP, plotted against the number of parents 𝑁𝑃 used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD), and for different genetic map distances ∆ (0.5, 5 and 20 cM ± 0.5 cM) between QTL and SNPs.
Figure 5 Prediction accuracy for seven scenarios (scenario Un-LEA-SNP not shown), plotted against the number of parents 𝑁𝑃 used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD). Results refer to a training set size of 𝑁𝑇𝑆 = 250 doubled haploid lines and a marker density of 5 SNPs cM⁻¹.
Figure 6 (A) Frequency of the seven possible values 𝑓𝑖𝑗 of pedigree relationships for different numbers of unrelated inbred parents 𝑁𝑃 used to generate synthetics. (B) Conditional distributions 𝑞𝑖𝑗 |𝑓𝑖𝑗 of actual relationships 𝑞𝑖𝑗 conditional on their pedigree relationship 𝑓𝑖𝑗 between individuals 𝑖 and 𝑗 in the training set and prediction set, respectively, for the two ancestral populations SR (short-range LD) and LR (long-range LD).