Genetics: Early Online, published on November 9, 2016 as 10.1534/genetics.116.193243

Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium

Pascal Schopp*,1, Dominik Müller*,1, Frank Technow*, Albrecht E. Melchinger*

September 29, 2016

*Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, 70599 Stuttgart, Germany

1 These authors contributed equally to this work

Copyright 2016.

Running Head: Genomic prediction in synthetics

Key Words: genomic prediction, synthetic populations, GBLUP, genetic relationships, linkage disequilibrium

Corresponding Author:
A.E. Melchinger
Institute of Plant Breeding, Seed Science and Population Genetics
University of Hohenheim
Fruwirthstr. 21,
Stuttgart 70599, GERMANY
[email protected]
Tel.: 0049 0711 459-22334
Fax.: 0049 0711 459-22343

ABSTRACT

Synthetics play an important role in quantitative genetic research and plant breeding, but few studies have investigated the application of genomic prediction (GP) to these populations. Synthetics are generated by intermating a small number of parents ($N_P$) and thereby possess unique genetic properties, which make them especially suited for systematic investigations of factors contributing to the accuracy of GP. We generated synthetics in silico from $N_P$ = 2 to 32 maize (Zea mays L.) lines taken from an ancestral population with either short- or long-range linkage disequilibrium (LD). In eight scenarios differing in relatedness of the training and prediction sets and in the types of data used to calculate the relationship matrix (QTL, SNPs, tag markers, pedigree), we investigated the prediction accuracy of GBLUP and analyzed contributions from pedigree relationships captured by SNP markers as well as from co-segregation and ancestral LD between QTL and SNPs. The effects of training set size $N_{TS}$ and marker density were also studied. Sampling few parents (2 ≤ $N_P$ < 8) generates substantial sample LD that carries over into synthetics through co-segregation of alleles at linked loci. For fixed $N_{TS}$, $N_P$ influences prediction accuracy most strongly. If the training and prediction set are related, using $N_P$ < 8 parents yields high prediction accuracy regardless of ancestral LD because SNPs capture pedigree relationships and Mendelian sampling through co-segregation. As $N_P$ increases, ancestral LD contributes more information, while other factors contribute less due to lower frequencies of closely related individuals. For unrelated prediction sets, only ancestral LD contributes information and accuracies were poor and highly variable for $N_P$ ≤ 4 due to large sample LD. For large $N_P$, achieving moderate accuracy requires large $N_{TS}$, long-range ancestral LD and high marker density. Our approach for analyzing prediction accuracy in synthetics provides new insights into the prospects of GP for many types of source populations encountered in plant breeding.

INTRODUCTION

Synthetic populations, known as synthetics, have played an important role in quantitative-genetic research on gene action in complex heterotic traits and comparison of selection methods (cf. Hallauer et al. 2010). In many crops, synthetics also serve as cultivars in agricultural production or as source populations for recurrent selection programs (cf. Bradshaw 2016). Synthetics are usually created by crossing a small number of parents ($N_P$) and subsequently cross-pollinating the F1 individuals for one or several generations (Falconer and Mackay 1996). A prominent example is the “Iowa Stiff Stalk Synthetic” (BSSS) generated from 16 parents of maize, from which numerous successful elite inbred lines such as B73 have been derived (Hagdorn et al. 2003). Further examples of synthetics include composite crosses (Suneson 1956) and multi-parental advanced inter-cross (MAGIC, see Table S1 for list of abbreviations) populations (Cavanagh et al. 2008) advocated for breeding purposes in crops (Bandillo et al. 2013). Importantly, two-way and four-way crosses, widely employed as source material in recycling breeding (Mikel and Dudley 2006), can be viewed as special cases of synthetics when $N_P$ = 2 and 4, respectively.

Genomic prediction (GP), proposed by Meuwissen et al. (2001), led to a paradigm shift in animal breeding during the past decade (Hayes et al. 2009a; de Koning 2016) and has also been widely adopted in plant breeding (Lin et al. 2014). In cattle breeding, GP is predominantly applied within closed breeds and training sets (TS) commonly encompass thousands of individuals. By comparison, in plant breeding the TS sizes are much smaller (e.g., hundreds of individuals or fewer) and populations are usually structured into multiple segregating families or subpopulations. Numerous studies addressed the implementation of GP in structured plant breeding populations (cf. Lorenzana and Bernardo 2009; Albrecht et al. 2011; Lehermeier et al. 2014; Technow and Totir 2015), but systematic investigations on the prospects of GP in synthetics are lacking so far, although they were proposed as particularly suitable source material for recurrent genomic selection (Windhausen et al. 2012; Gorjanc et al. 2016).

Genomic best linear unbiased prediction (GBLUP), a modification of the traditional pedigree BLUP devised by Henderson (1984), is a widely used method to implement GP in animal and plant breeding (Mackay et al. 2015). Here, the pedigree relationship matrix is replaced by a marker-derived genomic relationship matrix to estimate actual relationships at QTL (Hayes et al. 2009c). The success of this approach depends on three sources of information, namely (i) pedigree relationships captured by markers, (ii) co-segregation of QTL and markers and (iii) population-wide linkage disequilibrium between QTL and markers (Habier et al. 2007, 2013; Wientjes et al. 2013).

In classical quantitative genetics, pedigree relationships between individuals are calculated as twice the probability of identity-by-descent (IBD) of alleles at a locus, conditional on their pedigree (Wright 1922; Falconer and Mackay 1996). However, actual IBD relationships at QTL deviate from pedigree relationships (which correspond to expected IBD relationships) due to Mendelian sampling (Hill and Weir 2011). In GP, pedigree relationships are captured best with a large number of stochastically independent markers (Habier et al. 2007), whereas capturing the Mendelian sampling term requires co-segregation of QTL and markers (Hayes et al. 2009c; Habier et al. 2013).

In pedigree analysis, the founders of the pedigree are by definition assumed to be unrelated (i.e., IBD equal to zero), but in reality, there usually exist latent similarities at QTL contributing to variation in identity-by-state (IBS) relationships among these individuals. Markers enable capturing these IBS relationships if they are in population-wide LD with the QTL in an ancestral population of founders. Thus, ancestral LD between QTL and markers also provides information between individuals that are unrelated by pedigree to the TS (Wientjes et al. 2013; Habier et al. 2013). Ancestral LD generally results from various population-historic processes like mutation, drift and selection (Flint-Garcia et al. 2003) and varies within species primarily due to different bottlenecks imposed by artificial selection or population admixture (Hill 1981; Hartl and Clark 2007). The influence of different levels of ancestral LD on prediction accuracy (PA) in synthetics and related types of populations has so far received little attention.

The contributions of the three sources of information to PA were demonstrated in theory and simulations by Habier et al. (2013) using half-sib families in cattle breeding and multiple biparental (full-sib) families in maize breeding, where both of these examples consisted of numerous families derived from a large number of parents. However, it is unclear whether these results generalize to other breeding situations, in particular those involving only few parents. In such situations, diverse relationship patterns are generated, new statistical associations between loci arise due to sampling, and ancestral LD might be only partially present in the progeny. These factors are expected to profoundly affect the contributions of the three sources of information to PA and thus affect the application of GP to related and unrelated genotypes. Synthetics represent an ideal framework for examining the influence of these factors on PA, because the number of parents used for producing them can be varied over a wide range. Here, we simulated two ancestral populations differing substantially in their LD and analyzed synthetics generated from different numbers of parents under eight scenarios that enabled dissecting the factors contributing to PA.

The objectives of this study were to (i) examine how PA in synthetics depends on the number of parents and LD in the ancestral population, (ii) assess the importance of the three sources of information for PA and how they are influenced by training set size and marker density, and (iii) analyze the relationship of LD between QTL and markers among the ancestral population, parents, and the synthetics generated from them. Finally, we discuss how our approach provides a general framework for analyzing the factors influencing PA and we draw inferences on the prospects of GP in other scenarios encountered in breeding.

METHODS

Genome properties and genetic map: We used maize (Zea mays L.) as a model species in our study. Physical map positions of the 56K Illumina maize SNP BeadChip were used to account for the markedly reduced recombination rate and lower marker density in the centromere regions (McMullen et al. 2009). These positions were converted into genetic map positions required for simulating meiosis events (File S1). In total we obtained 37,286 SNPs distributed over the 10 chromosomes of length 276, 200, 193, 188, 221, 171, 203, 173, 151 and 137 cM (1913 cM in total), corresponding to an average marker density of 24.4 SNPs cM⁻¹. All subsequent meiosis events were simulated using the count-location model without crossover-interference, where the number of chiasmata was drawn from a Poisson distribution with parameter $\lambda$ equal to the chromosome length in Morgan, and where crossover positions were sampled from a uniform distribution across the chromosome.
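
The count-location model described above can be sketched in a few lines. The following is a minimal illustration, not the authors' simulation code: the function names, the Knuth-style Poisson sampler and the random choice of the starting strand are assumptions made for this example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_crossovers(length_cm, rng):
    """Sorted crossover positions (in cM) for one meiosis on one chromosome."""
    lam = length_cm / 100.0  # Poisson parameter = chromosome length in Morgan
    n_crossovers = sample_poisson(lam, rng)
    # No interference: positions are independent and uniform along the chromosome.
    return sorted(rng.uniform(0.0, length_cm) for _ in range(n_crossovers))


def make_gamete(hap_a, hap_b, positions_cm, length_cm, rng):
    """Recombine two parental haplotypes; positions_cm must be ascending."""
    crossovers = simulate_crossovers(length_cm, rng)
    strand = rng.randrange(2)  # assumption: starting strand chosen at random
    gamete, k = [], 0
    for pos, allele_a, allele_b in zip(positions_cm, hap_a, hap_b):
        # Switch strand once for every crossover located before this locus.
        while k < len(crossovers) and crossovers[k] < pos:
            strand = 1 - strand
            k += 1
        gamete.append(allele_a if strand == 0 else allele_b)
    return gamete
```

For chromosome 1 (276 cM), this sketch produces on average about 2.76 crossovers per meiosis.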

Simulation of ancestral populations: Two ancestral populations that differed substantially in their level and decay of LD (LDA, Figure 1) were simulated with the software QMSim (Sargolzaei and Schenkel 2009). Ancestral population LR displayed extensive long-range LDA, whereas SR displayed only short-range LDA. The simulation of LR was carried out by closely following Habier et al. (2013) and involved the following steps (Figure S1): First, we generated an initial population of 1,500 diploid individuals by sampling alleles at each (biallelic) locus independently from a Bernoulli distribution with probability 0.5. Second, 5,000 loci were randomly sampled from all SNPs and henceforth interpreted as QTL; all remaining loci were considered as SNP markers. Third, these individuals were randomly mated for 3,000 generations using a constant population size of 1,500 and a mutation rate of 2.5×10⁻⁵. Fourth, a severe bottleneck was introduced by reducing population size to 30 randomly chosen individuals, followed by 15 more generations of random mating to generate extensive long-range LDA. Fifth, we conducted three more generations of random mating with a population size of 10,000 individuals to eliminate close pedigree relationships in the ancestral population LR. To produce SR, we randomly mated LR for 100 more generations at a population size of 10,000 individuals to remove long-range LDA. Thus, LR and SR strongly differed in their LDA structure, but only marginally in their allele frequencies (Table S2). Always in the last generation, a single gamete was randomly sampled per individual from both SR and LR and treated as completely homozygous doubled haploid line. These 10,000 lines represented the final ancestral population used for production of the synthetics. All lines were considered unrelated when calculating pedigree relationships among their progeny.

Simulation of synthetics: We generated synthetics differing in $N_P$ by sampling $N_P \in \{2, 3, 4, 6, 8, 12, 16, 24, 32\}$ parent lines from the same ancestral population. From these parents, we produced all possible $\binom{N_P}{2}$ combinations of single crosses (Syn-1 generation, Figure S1), where the number of Syn-1 progenies per cross was chosen to obtain at least 1,000 individuals in total. For production of the Syn-2 generation, the Syn-1 individuals were intermated at random, allowing also for selfing. Finally, a single doubled haploid line was derived from each of the 1,000 individuals of the Syn-2 generation to obtain the genotypes of the final synthetic. This approach was chosen to avoid additional full-sib relationships among doubled haploid lines that arise when deriving them from the same Syn-2 individual.
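
As a small worked example of the Syn-1 crossing scheme, the number of single crosses and one possible number of progenies per cross can be computed as follows. The function name and the rule of rounding up to an equal number of progenies per cross are assumptions for illustration; the text only states that at least 1,000 individuals were obtained in total.

```python
import math


def syn1_design(n_parents, min_total=1000):
    """Return (number of single crosses, progenies per cross, total Syn-1 size)."""
    n_crosses = math.comb(n_parents, 2)  # all possible pairs of parents
    # Assumption: the same number of progenies per cross, rounded up.
    per_cross = math.ceil(min_total / n_crosses)
    return n_crosses, per_cross, n_crosses * per_cross


for n_p in (2, 3, 4, 6, 8, 12, 16, 24, 32):
    print(n_p, syn1_design(n_p))
```

With $N_P$ = 2 there is a single cross, which corresponds to the biparental special case, whereas $N_P$ = 32 gives 496 crosses.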

Genetic model: For simulating the polygenic target trait, we sampled a subset of 1,000 of the 5,000 QTL in each simulation replicate. Following Meuwissen et al. (2001), the corresponding QTL effects were drawn from a gamma distribution with scale and shape parameter 0.4 and 1.66, respectively. Signs of QTL effects were sampled from a Bernoulli distribution with probability parameter 0.5.

The vector $\mathbf{u}$ of true breeding values for all individuals in the synthetic was calculated as $\mathbf{u} = \mathbf{W}\mathbf{a}$, where $\mathbf{W}$ is the matrix of genotypic scores at QTL coded {2,0} depending on whether an individual was homozygous for the 1 or 0 allele, respectively, that were adjusted for twice the frequency of the 1 allele in the ancestral population (cf. Figure S1), and $\mathbf{a}$ is the vector of QTL effects. The corresponding vector $\mathbf{y}$ of phenotypes was obtained as $\mathbf{y} = \mathbf{u} + \mathbf{e}$ (Goddard et al. 2011; de los Campos et al. 2013; Habier et al. 2013), i.e., assuming a null mean and adding a vector of independent normally distributed environmental noise variables $\mathbf{e}$, where variance $\sigma_e^2$ was chosen to be identical for the two ancestral populations and all choices of $N_P$, assuming that environmental effects affect phenotypes independently of additive-genetic variance $\sigma_u^2$ in the synthetic. The value of $\sigma_e^2$ was therefore set equal to the additive-genetic variance $\sigma_{AP}^2$ in ancestral population LR averaged across 1,000 simulation replicates. The heritability $h^2$ of the target trait was then on average equal to 0.5 for LR and SR due to nearly identical allele frequencies at QTL, but lower in the synthetics, because $\sigma_u^2 < \sigma_{AP}^2$ (Table S2). We restricted our simulations to a single level of heritability, because preliminary analyses showed that changing $h^2$ resulted in a fairly constant shift of PA.
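
The genetic model can be illustrated with the short sketch below. It is not the authors' code: the function name and arguments are hypothetical, and the assignment of 1.66 to the shape and 0.4 to the scale of the gamma distribution simply follows the sentence above as written, so both are exposed as arguments.

```python
import random


def simulate_trait(W, p_anc, sigma2_e, rng, shape=1.66, scale=0.4):
    """Simulate QTL effects, true breeding values u and phenotypes y.

    W      -- list of rows, QTL genotypes of doubled haploid lines coded 2 or 0
    p_anc  -- frequency of the 1 allele at each QTL in the ancestral population
    """
    # Gamma-distributed effect sizes with random sign (probability 0.5 each).
    effects = [rng.gammavariate(shape, scale) * rng.choice((-1.0, 1.0))
               for _ in p_anc]
    # u = W a, with genotypic scores adjusted for twice the ancestral frequency.
    u = [sum((w - 2.0 * p) * a for w, p, a in zip(row, p_anc, effects))
         for row in W]
    # y = u + e, with independent normal noise of variance sigma2_e.
    y = [ui + rng.gauss(0.0, sigma2_e ** 0.5) for ui in u]
    return effects, u, y
```

In the study, `sigma2_e` would correspond to the additive-genetic variance of ancestral population LR, which gives a heritability of about 0.5 in the ancestral populations.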

Analysis of the sources of information exploited in genomic prediction: We conceived eight scenarios to evaluate to what extent the three sources of information contribute to PA in synthetics (Figure 2), when actual relationships at QTL are estimated by marker-derived genomic relationships. The scenarios can be differentiated by three factors.

First, individuals in the TS and prediction set (PS) were either related (“Re”-scenarios) or unrelated (“Un”-scenarios), depending on whether the parents of the TS ($P_{TS}$) and of the PS ($P_{PS}$) were identical (i.e., $P_{TS} = P_{PS}$) or disjoint (i.e., $P_{TS} \cap P_{PS} = \emptyset$). For the “Re”-scenarios, we sampled individuals for the TS and PS from the same synthetic, whereas for the “Un”-scenarios, individuals were sampled from two different synthetics produced from disjoint sets $P_{TS}$ and $P_{PS}$, each of size $N_P$. Both sets of parents originated always from the same ancestral population.

Second, pairs of QTL and SNPs were either in LD (“LDA”-scenarios) as found in the ancestral population, or in linkage equilibrium (“LEA”-scenarios). To achieve the latter, we permuted complete QTL haplotypes among the $N_P$ parents (for “Un”-scenarios separately in each set $P_{TS}$ and $P_{PS}$), while keeping their SNP haplotypes unchanged (i.e., conserving their LDA). This procedure eliminates any systematic association between QTL and SNP alleles originating from the ancestral population, but maintains allele frequencies and polymorphic states at QTL, as well as LDA between them. In contrast to previous approaches (cf. Habier et al. 2013), this approach avoids influencing PA by altering actual relationships at QTL. Importantly, after removal of LDA, there is still LD between QTL and SNPs in the parents, but this LD is purely due to the limited sample size and thus subsequently referred to as sample LD.
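
The permutation of QTL haplotypes can be sketched as follows. This is a minimal illustration under the assumption that each parent is stored as one complete QTL haplotype and one complete SNP haplotype; the function name is hypothetical.

```python
import random


def permute_qtl_haplotypes(qtl_haps, snp_haps, rng):
    """Reassign complete QTL haplotypes among parents; SNP haplotypes stay put.

    qtl_haps, snp_haps -- one list of alleles per parent, in the same parent order
    """
    order = list(range(len(qtl_haps)))
    rng.shuffle(order)
    # Whole haplotypes are moved, so allele frequencies at every QTL and the
    # LD among QTL are unchanged; only the QTL-SNP pairing across parents changes.
    permuted_qtl = [qtl_haps[i] for i in order]
    return permuted_qtl, snp_haps
```

For the “Un”-scenarios, the function would be called separately for the parents of the TS and for the parents of the PS.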

Third, four different types of data were used to calculate the relationship matrix $\mathbf{K}$ used in BLUP: (i) For the “SNP”-scenarios, we used SNP genotypes to calculate the marker-derived genomic relationship matrix $\mathbf{K} \triangleq \mathbf{G} = (g_{ij})$ as $g_{ij} = \sum_m (x_{im} - 2p_m)(x_{jm} - 2p_m) / \left(2 \sum_m p_m(1 - p_m)\right)$ (Habier et al. 2007; VanRaden 2008), where $x_{im}$ is the genotype of the $i$-th individual at the $m$-th locus coded {2,0} depending on whether this individual was homozygous for the 1 or 0 allele, respectively, and $p_m$ is the frequency of the 1 allele at the $m$-th SNP marker in the ancestral population. (ii) For the “QTL”-scenarios, the QTL genotypes were used to calculate the actual relationship matrix $\mathbf{K} \triangleq \mathbf{Q} = (q_{ij})$ using the same formula. (iii) For the “Ped”-scenario, pedigree records were used to calculate the pedigree relationship matrix $\mathbf{K} \triangleq \mathbf{A} = (f_{ij})$ with elements $f_{ij}$ being equal to expected IBD relationships (i.e., twice the coefficient of co-ancestry). (iv) For the “Tag”-scenario, tag markers labeling the origin of QTL alleles at each locus from the $N_P$ parents were used to calculate the actual IBD relationship matrix $\mathbf{K} \triangleq \mathbf{T} = (\tau_{ij})$ with elements $\tau_{ij}$ being equal to twice the proportion of identical tag marker alleles between each pair of individuals. Tag markers label each QTL allele, regardless of its state, uniquely with a number $\in \{1, \ldots, N_P\}$ in the parents and thus, they allow tracking the segregation process during intermating and identifying the parental origin of each QTL allele in the synthetic.
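
The formula for the genomic relationship matrix translates directly into code. The sketch below uses plain Python lists to stay dependency-free; the function name is an assumption, and a matrix library would be preferable for realistic marker numbers.

```python
def genomic_relationship(X, p):
    """Marker-derived relationship matrix G.

    X -- list of rows, genotypes coded 2 or 0 (one row per doubled haploid line)
    p -- frequency of the 1 allele at each marker in the ancestral population
    """
    scale = 2.0 * sum(pm * (1.0 - pm) for pm in p)
    # Center genotypes by twice the ancestral allele frequency.
    centered = [[x - 2.0 * pm for x, pm in zip(row, p)] for row in X]
    n = len(X)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / scale
             for j in range(n)]
            for i in range(n)]


# Three fully homozygous lines, two markers, ancestral frequencies of 0.5.
G = genomic_relationship([[2, 0], [0, 2], [2, 2]], [0.5, 0.5])
```

Passing QTL genotypes instead of SNP genotypes yields the actual relationship matrix $\mathbf{Q}$ of the “QTL”-scenarios, since the text states that the same formula was used.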

Scenario Re-LDA-SNP reflects the situation mostly encountered in practical applications of GP and used information from pedigree relationships among individuals in the TS and PS captured by SNPs, deviations from pedigree relationships due to (i) Mendelian sampling at QTL captured by co-segregation of QTL and SNPs and (ii) ancestral LD between QTL and SNPs. Scenario Re-LDA-Ped used only pedigree relationships, but ignored deviations due to Mendelian sampling, whereas Re-LDA-Tag accounted for both pedigree relationships and Mendelian sampling. Both scenarios ignored actual relationships among parents by assuming unrelated founders, and thus, did not account for alleles that are IBS but not IBD in the synthetic. Scenario Re-LEA-SNP was artificial, with the goal of determining the influence of ancestral LD on PA in scenario Re-LDA-SNP. Scenario Re-LDA-QTL was employed to determine for the “Re”-scenarios the maximum PA achievable with GBLUP (cf. de los Campos et al. 2013), when assuming that each QTL explains an equal proportion of the additive-genetic variance. The purpose was thus to quantify the reduction in PA for all other “Re”-scenarios when using a different data type to estimate actual relationships.

Scenarios Un-LDA-SNP and Un-LDA-QTL (“Un”-scenarios) represent the conceptual counterparts to Re-LDA-SNP and Re-LDA-QTL (Figure 2). Un-LDA-SNP reflects the practical situation of predicting the genetic merit of individuals unrelated to the TS, whereas Un-LDA-QTL provides the corresponding upper bound of PA. For both scenarios, alleles in the TS and PS had IBD probability equal to zero and, thus, the only remaining source of information contributing to PA in Un-LDA-SNP was ancestral LD between QTL and SNPs to track actual relationships among parents. Scenario Un-LEA-SNP was employed as negative-control scenario to validate the simulation designs. As expected, PA for Un-LEA-SNP fluctuated around zero for all investigated settings (results not shown), confirming that there are only three sources of information contributing to PA when using $\mathbf{K} \triangleq \mathbf{G}$.

Analysis of linkage disequilibrium and linkage phase similarity: We calculated LD as the squared correlation coefficient ($r^2$, Hill and Robertson 1968) between all pairs of QTL and SNPs in (i) each ancestral population (LDA), (ii) the set of $N_P$ parents sampled from the ancestral population, and (iii) the synthetic generated from the parents. Furthermore, we computed the linkage phase similarity of QTL-SNP pairs in the TS and PS. Here, we adopted a similar approach as de Roos et al. (2008), but replaced the correlation by the cosine similarity

$$\text{Linkage phase similarity} = \frac{\sum_{i}^{n} r_i^{TS} r_i^{PS}}{\sqrt{\sum_{i}^{n} \left(r_i^{TS}\right)^2} \sqrt{\sum_{i}^{n} \left(r_i^{PS}\right)^2}}, \qquad (1)$$

where $i$ refers to the index of the QTL-SNP pair and $n$ is the number of pairs for which linkage phase similarity is calculated. The reason was to account not only for the ranking but also for the absolute size of the $r$ statistics in the two data sets (see File S2 for details). Linkage phase similarity was calculated for all QTL-SNP pairs falling into consecutive bins of 0.5 cM width. LD was first averaged within each bin and subsequently, both LD and linkage phase similarity statistics were averaged across chromosomes and simulation replicates.
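
Equation (1) and the underlying $r$ statistic can be sketched as follows. The function names are hypothetical, alleles of the homozygous lines are assumed to be coded 0/1, and returning 0 for monomorphic loci is a choice made for this example only.

```python
import math


def ld_r(qtl, snp):
    """Signed correlation r between alleles (coded 0/1) at a QTL and a SNP."""
    n = len(qtl)
    p_q, p_s = sum(qtl) / n, sum(snp) / n
    d = sum(a * b for a, b in zip(qtl, snp)) / n - p_q * p_s
    denom = math.sqrt(p_q * (1 - p_q) * p_s * (1 - p_s))
    return d / denom if denom > 0 else 0.0  # assumption: r = 0 if monomorphic


def linkage_phase_similarity(r_ts, r_ps):
    """Cosine similarity of r values for the same QTL-SNP pairs in TS and PS."""
    numerator = sum(a * b for a, b in zip(r_ts, r_ps))
    denominator = (math.sqrt(sum(a * a for a in r_ts))
                   * math.sqrt(sum(b * b for b in r_ps)))
    return numerator / denominator if denominator > 0 else float("nan")
```

Unlike the Pearson correlation, the cosine similarity does not center the $r$ values, so it responds to their absolute size as well as to their ranking, which matches the motivation given above.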

Genomic prediction: The statistical model used for predicting breeding values can be written as

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}, \qquad (2)$$

where $\mathbf{Z}$ is the incidence matrix linking phenotypes with breeding values, $\mathbf{u}$ is the vector of random breeding values with mean zero and variance-covariance matrix $\mathrm{var}(\mathbf{u}) = \mathbf{K}\sigma_u^2$, where $\mathbf{K}$ is a relationship matrix, calculated from different data types as described above, and $\sigma_u^2$ is the additive-genetic variance in the synthetic. Residuals $\boldsymbol{\varepsilon}$ are random with mean zero and $\mathrm{var}(\boldsymbol{\varepsilon}) = \mathbf{I}\sigma_\varepsilon^2$, where $\mathbf{I}$ is an identity matrix and $\sigma_\varepsilon^2$ is the residual variance. Estimates of variance components $\sigma_u^2$ and $\sigma_\varepsilon^2$ were obtained by restricted maximum likelihood and estimated breeding values $\hat{\mathbf{u}}$ were predicted using the mixed.solve function from R-package rrBLUP (Endelman 2011). PA was always calculated as the correlation between $\mathbf{u}$ and $\hat{\mathbf{u}}$ for the PS in each simulation replicate.
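
For orientation, the predictions under model (2) can be written in closed form once the variance components are given. The block below states the standard BLUP solution and Henderson's mixed model equations; it is textbook material added for illustration, not a description of the internals of mixed.solve, and it assumes that $\mathbf{K}$ is invertible and that $\mathbf{u}$ contains the TS and PS individuals.

```latex
\begin{align}
\mathbf{V} &= \mathbf{Z}\mathbf{K}\mathbf{Z}'\sigma_u^2 + \mathbf{I}\sigma_\varepsilon^2, \\
\hat{\mu} &= \left(\mathbf{1}'\mathbf{V}^{-1}\mathbf{1}\right)^{-1}\mathbf{1}'\mathbf{V}^{-1}\mathbf{y}, \\
\hat{\mathbf{u}} &= \sigma_u^2\,\mathbf{K}\mathbf{Z}'\mathbf{V}^{-1}\left(\mathbf{y} - \mathbf{1}\hat{\mu}\right), \\
\begin{bmatrix}
\mathbf{1}'\mathbf{1} & \mathbf{1}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{1} & \mathbf{Z}'\mathbf{Z} + \lambda\mathbf{K}^{-1}
\end{bmatrix}
\begin{bmatrix} \hat{\mu} \\ \hat{\mathbf{u}} \end{bmatrix}
&=
\begin{bmatrix} \mathbf{1}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix},
\qquad \lambda = \frac{\sigma_\varepsilon^2}{\sigma_u^2}.
\end{align}
```

Individuals in the PS have no phenotype, so their columns in $\mathbf{Z}$ are zero and their predictions depend on the data only through the off-diagonal elements of $\mathbf{K}$ that connect them to the TS.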

Following previous studies (Goddard et al. 2011; de los Campos et al. 2013), we also investigated how well estimated relationships $k_{ij}$ (i.e., $g_{ij}$, $f_{ij}$, $\tau_{ij}$) between individuals $i$ and $j$ in the TS and PS reflect the corresponding actual relationships $q_{ij}$ at QTL. We therefore calculated the coefficient of determination $R^2_{k,q}$ of the regression of $k_{ij}$ on $q_{ij}$ in each simulation replicate and all scenarios (except for Re-LDA-QTL and Un-LDA-QTL, where $R^2_{k,q}$ = 1.0).
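
For a simple linear regression, the coefficient of determination equals the squared Pearson correlation, so $R^2_{k,q}$ can be computed as sketched below. The function name is hypothetical, and the two arguments are assumed to be flat lists holding the estimated and actual relationships of the same pairs of individuals.

```python
def r_squared(k, q):
    """Coefficient of determination of the regression of k on q."""
    n = len(k)
    mean_k, mean_q = sum(k) / n, sum(q) / n
    s_kq = sum((a - mean_k) * (b - mean_q) for a, b in zip(k, q))
    s_kk = sum((a - mean_k) ** 2 for a in k)
    s_qq = sum((b - mean_q) ** 2 for b in q)
    # Squared correlation; requires variation in both k and q.
    return s_kq * s_kq / (s_kk * s_qq)
```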

In order to assess the effect of TS size on PA, we sampled $N_{TS}$ = 125, 250, 500 or 750 individuals from the 1,000 lines of the synthetic, where 250 was used as default when another factor (e.g., marker density) was varied. For the PS, we always sampled $N_{PS}$ = 100 individuals from (i) the remaining individuals that were not part of the TS in the “Re”-scenarios or (ii) the second synthetic in the “Un”-scenarios. For all “SNP”-scenarios, the effect of marker density on PA was evaluated for two values of 5 and 0.25 SNPs cM⁻¹, the former being used as default. The number of randomly sampled SNPs per chromosome in each simulation replicate was proportional to the respective chromosome length. The two marker densities of 5 and 0.25 SNPs cM⁻¹ resulted in an average genetic map distance between each QTL and its closest SNP of 0.18 cM and 2.02 cM, respectively (Figure 1).

All reported results are arithmetic means over 1,000 simulation replicates, which were stochastically independent conditional on the ancestral populations. A simulation replicate comprises (i) random sampling of 1,000 QTL from the 5,000 initial QTL and sampling of QTL effects, (ii) sampling of the parents from the ancestral population and, in the case of the “LEA”-scenarios, additionally permuting QTL haplotypes, (iii) creation of synthetics from each set of parents, (iv) sampling of the individuals for the TS and PS, (v) sampling of the noise variable $\mathbf{e}$ and calculation of the breeding and phenotypic values, and (vi) training of the prediction equation and calculation of estimated breeding values as well as PA in the PS (Figure S1). All computations were carried out in the R statistical environment (R Core Team 2012).

RESULTS

Linkage disequilibrium in the ancestral populations: For ancestral population LR, LDA showed a steep decline extending to a genetic map distance ∆ = 0.5 cM and approached an asymptote of about 0.08 for ∆ > 1 cM (Figure 1), reflecting the presence of long-range LDA. By comparison, LDA in ancestral population SR started at slightly smaller values for closely linked loci and showed a similar decline for ∆ < 1 cM. It levelled off at about ∆ = 2 cM, where it almost reached its asymptotic value of zero due to absence of long-range LDA resulting from the 100 additional generations of random mating.

Linkage disequilibrium in the parents and the synthetics: Figure 3A shows the distribution of LD between QTL-SNP pairs in the parents, measured as $r^2$, as a function of ∆. LD in the parents takes on only a limited number of values in the interval [0,1], because only a finite number of genotype configurations is possible for two biallelic loci, which depends exclusively on $N_P$. For $N_P$ = 2, all LD values are equal to 1. For $N_P$ = 3 and 4, possible LD values are {1/4, 1} and {0, 1/9, 1/3, 1}, respectively, whereas for $N_P$ = 16, more than 100 values are possible, resulting in a nearly continuous distribution of LD values in the parents. Under LEA (i.e., ancestral linkage equilibrium due to permutation of QTL haplotypes), the frequency of LD values in the parents was thus almost independent of ∆, except for some small residual deviations due to similarity of ancestral allele frequencies at closely linked loci (see File S4 for details). Under LEA, the high frequencies of pairs of loci in high LD for $N_P$ = 3 and 4 demonstrate the magnitude of sample LD (Figure 3A, left column). If, additionally, ancestral LD was present, large parental LD values occurred more frequently for tightly linked loci (∆ < 1 cM) for both ancestral populations. Under short-range LDA in SR, the frequencies of high parental LD values were almost identical to those found under LEA for ∆ > 1 cM, regardless of $N_P$. Conversely, under long-range LDA in LR, the frequency of high parental LD values was considerably elevated also for ∆ > 1 cM. Altogether, the distribution of LD values in the parents was much more strongly influenced by $N_P$ than by ancestral LD. The proportion of QTL-SNP pairs in high LD diminished as $N_P$ increased, but grew when shifting from short- to long-range LDA (Figure 3A, SR vs. LR).
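
The discrete set of LD values attainable among a small number of parents can be enumerated exactly. The sketch below assumes fully homozygous parents and two polymorphic biallelic loci, and uses the identity $r^2 = (nk - ab)^2 / \left(a(n - a)\,b(n - b)\right)$, where $a$ and $b$ are the numbers of parents carrying the 1 allele at the two loci and $k$ is the number carrying it at both. The function name is hypothetical.

```python
from fractions import Fraction


def possible_r2(n_parents):
    """All r^2 values attainable between two polymorphic loci in n_parents lines."""
    n = n_parents
    values = set()
    for a in range(1, n):          # carriers of the 1 allele at locus 1
        for b in range(1, n):      # carriers of the 1 allele at locus 2
            # k = parents carrying the 1 allele at both loci.
            for k in range(max(0, a + b - n), min(a, b) + 1):
                values.add(Fraction((n * k - a * b) ** 2,
                                    a * (n - a) * b * (n - b)))
    return values


for n_p in (2, 3, 4):
    print(n_p, sorted(possible_r2(n_p)))
```

For $N_P$ = 2, 3 and 4 this gives {1}, {1/4, 1} and {0, 1/9, 1/3, 1}, in agreement with the values stated above.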

Figure 3B shows the average LD between QTL-SNP pairs in synthetics as a function of ∆ and $N_P$. The level of the LD curve dropped rapidly as $N_P$ increased from 2 to 8 and approached the curve of ancestral LD. Under LEA, LD in synthetics was still substantial for $N_P$ = 4 due to sample LD, yet successively approached zero as $N_P$ was increased further. For $N_P$ > 2, the presence of ancestral LD resulted in elevated LD in the synthetics, where the increment was large between tightly linked QTL-SNP pairs (∆ < 1 cM) for both ancestral populations and moderate between loosely linked loci (∆ > 1 cM) for LR.

Linkage phase similarity between training and prediction set: For scenario Re-LDA-SNP ($P_{TS} = P_{PS}$), linkage phase similarity between TS and PS exceeded 0.8 up to ∆ = 20 cM, regardless of the ancestral population (Figure 4). By comparison, values were much lower for Un-LDA-SNP ($P_{TS} \cap P_{PS} = \emptyset$). Increasing $N_P$ reduced linkage phase similarity only marginally for Re-LDA-SNP even for ∆ = 20 cM, but resulted in a substantial increase for Un-LDA-SNP. The higher ancestral LD in LR resulted only in a minor increase in linkage phase similarity in Re-LDA-SNP, but in a large increase for Un-LDA-SNP, irrespective of $N_P$. Since permuting QTL haplotypes eliminated ancestral LD in scenario Re-LEA-SNP, linkage phase similarity was identical for SR and LR and showed similar results as Re-LDA-SNP for SR (results not shown).

Influence of ancestral linkage disequilibrium and number of parents on prediction accuracy: PA declined for all “Re”-scenarios (except Re-LDA-Ped), but increased for all “Un”-scenarios with an increasing number of parents $N_P$ (Figure 5), where the strongest changes occurred between $N_P$ = 2 and 8 for all scenarios. The highest PA was always achieved by scenario Re-LDA-QTL, closely followed by Re-LDA-SNP for small $N_P$, with an increasing difference for larger $N_P$. PA increased when shifting from low (SR) to high (LR) ancestral LD for scenario Re-LDA-SNP, but decreased for Re-LEA-SNP. For scenario Re-LDA-Tag, PA was always intermediate between Re-LDA-SNP and Re-LEA-SNP. For Re-LDA-Ped, PA concavely increased from $N_P$ = 2 up to its maximum value of 0.4 for $N_P$ = 8, followed by a minor decrease. Re-LDA-Ped and Re-LEA-SNP approached identical PA for large $N_P$ under long-range LDA in LR, whereas Re-LEA-SNP retained superior PA under short-range LDA in SR. For Un-LDA-QTL, PA strongly increased for both ancestral populations, especially from $N_P$ = 2 to 8, followed by a moderate increase. For Un-LDA-SNP, the overall level of PA was much lower, but the curve increased in a similar way as for Un-LDA-QTL under long-range LDA, whereas under short-range LDA, PA was almost consistently < 0.2 without sizeable increase for all values of $N_P$.

Influence of training set size and marker density on prediction accuracy: Increasing TS size (𝑁𝑇𝑆) from 125 to 750 individuals was overall most beneficial for all “Re”-scenarios, except for Re-LDA-Ped (Figure S3). Conversely, Re-LDA-Ped, as well as Un-LDA-SNP under short-range LDA, showed only a minor increase in PA for larger 𝑁𝑇𝑆. However, for Un-LDA-SNP under long-range LDA and for Un-LDA-QTL under both short- and long-range LDA, the increase in PA along with 𝑁𝑇𝑆 was notable, especially for 𝑁𝑃 > 8.

Reducing the marker density from 5 SNPs cM⁻¹ to 0.25 SNPs cM⁻¹ resulted in a substantial reduction of PA for all “SNP”-scenarios (Figure S4). This reduction was reinforced for scenarios utilizing ancestral LD (Re-LDA-SNP and Un-LDA-SNP), especially in the presence of long-range LDA and for large values of 𝑁𝑃.

DISCUSSION

In plant breeding, GP has been applied to various types of populations such as single or multiple biparental families or diversity panels of inbred lines. These materials differ fundamentally in their pedigree structure, the number of founder individuals involved in their development, as well as the LD in the ancestral population from which they were taken. Synthetics are especially suited for systematically assessing the influence of these factors on prediction accuracy, because the variable number of parents used for generating synthetics leads to (i) different pedigree relationships among individuals and (ii) a trade-off between ancestral LD and sample LD arising in the parents. Thus, our approach provides new insights into how these factors influence the ability of molecular markers to capture actual relationships at causal loci, which determines the accuracy in various applications of GP.

Influence of the number of parents and ancestral LD on actual relationships at causal loci and prediction accuracy: The accuracy of GP relies on (i) the distribution of actual relationships 𝑞𝑖𝑗 at causal loci (QTL) between individuals in the TS and PS and (ii) the quality of the approximation of 𝑞𝑖𝑗 by marker-derived genomic relationships 𝑔𝑖𝑗 (Goddard et al. 2011; Habier et al. 2013). We first investigated PA using the actual relationship matrix 𝑸, which provides an upper bound of PA given fixed values for 𝑁𝑃, 𝑁𝑇𝑆 and ℎ² (de los Campos et al. 2013). Subsequently, we estimated 𝑸 by the marker-derived genomic relationship matrix 𝑮 and inferred how the three sources of information (pedigree relationships, co-segregation and ancestral LD) contributed to PA.
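As a minimal sketch of how a marker-derived relationship matrix of this kind can be computed, the following assumes a VanRaden-type estimator on 0/1/2 allele counts, with allele frequencies estimated from the sample itself. The exact definition of 𝑮 used in this study may differ, and the function name and toy data are hypothetical. Applying the same function to QTL genotypes instead of SNP genotypes would give an analogue of 𝑸.

```python
import random


def genomic_relationship(genotypes):
    """Return a VanRaden-type relationship matrix as a list of lists.

    genotypes: rows = individuals, columns = loci, entries = allele counts 0/1/2.
    Allele frequencies are estimated from the sample itself (an assumption).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2.0 * n) for k in range(m)]
    # Monomorphic loci carry no information and are dropped.
    keep = [k for k in range(m) if 0.0 < freqs[k] < 1.0]
    # Center allele counts by twice the allele frequency.
    z = [[row[k] - 2.0 * freqs[k] for k in keep] for row in genotypes]
    # Scale by the expected variance of allele counts summed over loci.
    scale = 2.0 * sum(freqs[k] * (1.0 - freqs[k]) for k in keep)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / scale
             for j in range(n)] for i in range(n)]


# Toy data (hypothetical): 10 individuals, 200 unlinked loci, allele frequency 0.3.
rng = random.Random(1)
toy = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(200)]
       for _ in range(10)]
G = genomic_relationship(toy)
```

Because the genotypes are centered with sample allele frequencies, each row of the resulting matrix sums to zero, so relationships are expressed relative to the average of the sample rather than to an ancestral base population.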

Actual relationships 𝑞𝑖𝑗 between two individuals 𝑖 and 𝑗 can be factorized into

𝑞𝑖𝑗 = 𝑓𝑖𝑗 + 𝑚𝑖𝑗 + 𝜉𝑖𝑗 ,    (3)

where 𝑓𝑖𝑗 is their expected IBD relationship at QTL, 𝑚𝑖𝑗 = 𝜏𝑖𝑗 − 𝑓𝑖𝑗 is the deviation of the actual IBD relationship 𝜏𝑖𝑗 from the expected IBD relationship due to Mendelian sampling, and 𝜉𝑖𝑗 is the deviation of the actual (IBS) relationship from the actual IBD relationship. Whereas 𝑓𝑖𝑗 and 𝑚𝑖𝑗 provide information solely with respect to the parents (i.e., the founders of the pedigree), 𝜉𝑖𝑗 accounts also for actual relationships among the parents (Powell et al. 2010; Vela-Avitúa et al. 2015).
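Written out, the three terms of Eq. 3 telescope. The sketch below assumes that 𝜉𝑖𝑗 = 𝑞𝑖𝑗 − 𝜏𝑖𝑗, which is how the verbal definition above reads, with 𝜏𝑖𝑗 denoting the actual IBD relationship:

```latex
\begin{align*}
q_{ij} &= f_{ij} + (\tau_{ij} - f_{ij}) + (q_{ij} - \tau_{ij}) \\
       &= f_{ij} + m_{ij} + \xi_{ij}
\end{align*}
```

Read this way, the first two terms describe identity by descent relative to the parents, and the last term collects everything that is identical by state but not traceable to the pedigree.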

If TS and PS are related (“Re”-scenarios), the distribution of 𝑓𝑖𝑗 depends on 𝑁𝑃 (Figure 6A) and on the mating scheme employed for production of the synthetic (Figure S1). For small 𝑁𝑃, this distribution is dominated by full-sib and half-sib relationships, whereas distantly related and unrelated individuals dominate for larger 𝑁𝑃. The closer the pedigree relationships between individuals, the longer are the chromosome segments they inherit from common ancestors and the larger is the conditional variance in actual IBD relationships, i.e., var(𝑚𝑖𝑗 |𝑓𝑖𝑗) (Figure 6B, Hill and Weir 2011; Goddard et al. 2011). In other words, var(𝑚𝑖𝑗 |𝑓𝑖𝑗) is inversely proportional to the number of independently segregating chromosome segments and, hence, the length and number of chromosomes must be taken into account when transferring our results to other species. For example, in bread wheat (2n = 42), var(𝑚𝑖𝑗 |𝑓𝑖𝑗) – and consequently PA attributable to the Mendelian sampling term – are expected to be smaller than in maize (2n = 20).
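The inverse relation between var(𝑚𝑖𝑗 |𝑓𝑖𝑗) and the number of independently segregating segments can be illustrated with a deliberately crude toy model, which is not the simulation used in this study. Each of n segments is assumed to be shared IBD independently with probability 0.5, so the realized shared fraction has variance 0.25/n. The segment counts 10 and 21 merely mirror the haploid chromosome numbers of maize and bread wheat and ignore recombination within chromosomes; all names are hypothetical.

```python
import random
import statistics


def realized_sharing_variance(n_segments, n_pairs=20000, p_share=0.5, seed=1):
    """Variance of the realized IBD fraction across simulated pairs.

    Toy assumption: every segment is shared independently with probability
    p_share, so the expected variance is p_share * (1 - p_share) / n_segments.
    """
    rng = random.Random(seed)
    realized = []
    for _ in range(n_pairs):
        shared = sum(rng.random() < p_share for _ in range(n_segments))
        realized.append(shared / n_segments)
    return statistics.pvariance(realized)


# Illustrative stand-ins: one segment per chromosome pair.
var_maize_like = realized_sharing_variance(10)
var_wheat_like = realized_sharing_variance(21)
```

In this toy setting the variance for 21 segments comes out at roughly half of that for 10 segments, which is the direction of the wheat versus maize comparison made in the text.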

The contribution of 𝜉𝑖𝑗 to 𝑞𝑖𝑗 depends on the level of ancestral LD. Elevated LDA increases var(𝑞𝑖𝑗) in the ancestral population (Figure S2, LR vs. SR) and in turn increases the variation in similarity of haplotypes among parents sampled therefrom (Habier et al. 2013). Consequently, var(𝑞𝑖𝑗 |𝑓𝑖𝑗) in synthetics increases with ancestral LD (Figure 6B and S2), on top of the variance var(𝑚𝑖𝑗 |𝑓𝑖𝑗) caused by Mendelian sampling. Assuming known actual relationships and fixed TS size, PA therefore decreases if (i) 𝑁𝑃 increases or (ii) ancestral LD decreases (Figure 5). This is because both factors reduce the absolute frequency of close actual relationships among TS and PS (Figure S5). If actual relationships among the parents were not accounted for, the decline in PA was reinforced as 𝑁𝑃 increased (Figure 5, scenario Re-LDA-Tag). The reason for this follows from the factorization (Eq. 3): the larger 𝑁𝑃, the more frequent are pairs of individuals with small or zero pedigree relationship (Figure 6A) and the more important it is to account for actual relationships among parents. Conversely, PA was consistently higher for small 𝑁𝑃 due to strong pedigree relationships and Mendelian sampling, despite the accompanying negative effects of reduced heritability in the TS and the reduced additive-genetic variance in the PS on PA (Table S2).

Restricting predictive information to pedigree relationships (scenario Re-LDA-Ped) resulted in only moderate PA, except for 𝑁𝑃 = 2 (Figure 5). In this case, all individuals in the TS and PS were full-sibs (Figure 6A), which resulted in identical estimated breeding values by pedigree BLUP, so that PA could not be calculated (indicated as PA = 0 in Figure 5). For 𝑁𝑃 > 2, there was variation in pedigree relationships in synthetics (Figure 6A) and thus, PA > 0. Further research is warranted on the importance of variation in pedigree relationships for GP in the presence of Mendelian sampling and ancestral LD, e.g., by considering mating schemes such as MAGIC, which reduce or even entirely avoid variation in pedigree relationships.

If the TS and PS are unrelated (“Un”-scenarios), only 𝜉𝑖𝑗 contributes to variation in 𝑞𝑖𝑗, because 𝑓𝑖𝑗 and 𝑚𝑖𝑗 are equal to zero. Moreover, if 𝑁𝑃 is small, QTL in the TS and PS can (i) be fixed for different alleles (Table S2) or (ii) differ in their LD structure due to sample LD. This limits the occurrence of close actual relationships between TS and PS (Figure S5, Un-LDA-QTL) and reduces the upper bounds of PA compared with the corresponding “Re”-scenarios (Figure 5, Un-LDA-QTL vs. Re-LDA-QTL). As 𝑁𝑃 increases, allele frequencies and LD between loci converge towards those in the ancestral population in both “Re”- and “Un”-scenarios (Table S2). In turn, the closest actual relationships between TS and PS converge as well (Figure S5), resulting ultimately in similar PA for Re-LDA-QTL and Un-LDA-QTL (Figure 5). In conclusion, the difference in predicting related and unrelated genotypes vanishes as 𝑁𝑃 increases for a given TS size, because it is then primarily ancestral information that drives the accuracy of GP.

Sample LD and co-segregation – crucial factors for prediction accuracy in synthetics: LD in the parents represents a combination of LD carrying over from the ancestral population and LD generated anew due to limited 𝑁𝑃. The latter LD, herein referred to as sample LD, results from a bottleneck in population size similar to that used in our simulations for generating long-range LD in the ancestral population (cf. Figure S1), but can be much stronger if 𝑁𝑃 is small (e.g., 4). Co-segregation is defined as the co-inheritance of alleles at linked loci on the same gamete and thus describes the process that prevents parental LD between them from being rapidly eroded by recombination (Figure S6). Together, sample LD and co-segregation result in high LD in synthetics, which for small 𝑁𝑃 exceeds by far the level of ancestral LD (Figure 3B, see File S3 for details). The crucial property of sample LD, however, is that it is specific to a set of parents and thus provides predictive information only for their descendants. Hence, using co-segregation as “source of information” in GP relies on the presence of pedigree relationships (Habier et al. 2013). Conversely, the fraction of parental LD that stems from ancestral LD is a commonality among all descendants of the ancestral population, irrespective of pedigree relationships. The particularly small number of parents used in synthetics makes sample LD and co-segregation crucial factors contributing to PA, a situation that differs greatly from previously investigated scenarios (e.g., Habier et al. 2007, 2013; Wientjes et al. 2013). Hence, knowledge of how ancestral LD and sample LD contribute to parental LD, depending on 𝑁𝑃, is essential for evaluating the applicability of training data to prediction of both related and unrelated genotypes.

The influence of sample LD on parental LD and PA in the “Re”-scenarios is illustrated best by considering different values of 𝑁𝑃: For 𝑁𝑃 = 2, sample LD in the parents is maximized, because all pairs of polymorphic loci are in complete LD (r² = 1.0), irrespective of ancestral LD, linkage or genetic map distance. Co-segregation of linked QTL and SNPs during intermating largely conserves LD, even for loosely linked loci (Figure S6), so that LD in synthetics remained at high levels (Figure 3B). Therefore, replacing 𝑸 with 𝑮 resulted in merely a marginal reduction of PA (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). Previous studies claimed that PA in biparental populations is the maximum obtainable for given TS size (Riedelsheimer et al. 2013; Lehermeier et al. 2014), despite absence of variation in pedigree relationships. Our results demonstrate that this is exclusively attributable to the efficient utilization of sample LD via co-segregation. For 𝑁𝑃 = 3 and 4, LD can take two and four discrete values, respectively (see Results). Thus, sample LD still takes up a large share of parental LD (Figure 3A). However, the occurrence of different LD values (in contrast to 𝑁𝑃 = 2) introduces a dependency on ancestral LD: the frequency of loci with high parental LD increases in the presence of ancestral LD compared with ancestral linkage equilibrium. This difference carries over during intermating and resulted in increased LD in the synthetics, especially under long-range ancestral LD (Figure 3B, LR). However, the increment in PA was only marginal (Figure 5, Re-LDA-SNP vs. Re-LEA-SNP) owing to the overriding contribution of sample LD to parental LD for 𝑁𝑃 = 3 and 4. Nevertheless, the reduction in sample LD for 𝑁𝑃 = 3 or 4, compared with 𝑁𝑃 = 2, impaired co-segregation information and reinforced the decline in PA when relying on markers (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). For 𝑁𝑃 ≥ 16, sample LD becomes negligible (Figure 3A) so that parental LD hardly differed from ancestral LD. This led to (i) reinforced reduction in PA, when using markers rather than known QTL genotypes (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially for short-range ancestral LD, and (ii) convergence of PA of GBLUP and pedigree BLUP in the absence of ancestral LD (Figure 5, Re-LEA-SNP vs. Re-LDA-Ped). The reason for the latter is that under marginal contribution of co-segregation, PA stems primarily from capturing pedigree relationships by SNPs.
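The discrete LD values for small 𝑁𝑃 can be checked by brute-force enumeration. The sketch below assumes fully homozygous parents that contribute one haplotype each, biallelic loci, and r² computed from haplotype frequencies among the parents; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def r2_values(n_parents):
    """Set of possible r^2 values between two polymorphic biallelic loci.

    Each parent is assumed to be a fully inbred line, i.e. one haplotype,
    so an allele configuration at a locus is a 0/1 tuple over parents.
    """
    # Keep only configurations in which both alleles are present.
    patterns = [p for p in product((0, 1), repeat=n_parents)
                if 0 < sum(p) < n_parents]
    values = set()
    for a in patterns:
        for b in patterns:
            p_a = Fraction(sum(a), n_parents)
            p_b = Fraction(sum(b), n_parents)
            p_ab = Fraction(sum(x * y for x, y in zip(a, b)), n_parents)
            d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
            values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return values
```

Under these assumptions the enumeration returns the single value 1 for 𝑁𝑃 = 2, the two values 1/4 and 1 for 𝑁𝑃 = 3, and the four values 0, 1/9, 1/3 and 1 for 𝑁𝑃 = 4, in line with the counts given in the text.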

For the “Un”-scenarios, sample LD is manifested independently in 𝑃𝑇𝑆 and 𝑃𝑃𝑆, which results in different co-segregation “patterns” in TS and PS that cannot reliably be exploited in GP. Therefore, the ancestral LD that is common to both sets of parents – measured by linkage phase similarity in the synthetics (Figure 4) – provides the only source of information connecting the TS and PS. This constraint resulted in a much larger drop in PA when replacing 𝑸 with 𝑮 in the “Un”-scenarios (Figure 5, Un-LDA-QTL vs. Un-LDA-SNP) compared with the corresponding “Re”-scenarios (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially under short-range ancestral LD. This decline in PA when predicting the genetic merit of unrelated instead of related genotypes corroborates previous findings on GP across populations in both animal and plant breeding (Hayes et al. 2009b; de Roos et al. 2009; Technow et al. 2013; Riedelsheimer et al. 2013; Albrecht et al. 2014; Heslot and Jannink 2015).

Variation in linkage phase similarity between TS and PS caused by sample LD affects GP of unrelated genotypes in an unforeseeable manner: while identical and reversed QTL-SNP linkage phases manifested by sample LD cancel out on average, individual TS-PS combinations can show above- or below-average linkage phase similarity and thus, similarity of co-segregation “patterns”. This translates into large variation of PA among different TS-PS combinations. Additional simulations using unequal 𝑁𝑃 to derive the TS and PS showed that variation in PA was even higher when using small 𝑁𝑃 to generate the PS than when using small 𝑁𝑃 to generate the TS (Figure S7). A possible explanation might be that regardless of the TS composition, small 𝑁𝑃 for the PS drastically reduces the frequency of polymorphic loci (Table S2) and thereby increases the variation in linkage phase similarity with the TS for the remaining loci, which in turn increases the variability of prediction. Considering the practical relevance of such prediction scenarios, further research is needed to investigate this finding in greater detail.

Influence of LD on capturing pedigree relationships: The ability to capture pedigree relationships by SNPs increases with the effective number of independently segregating SNPs in the model (Habier et al. 2007). Higher LD between SNPs reduces this number and thus, reduces the contribution of pedigree relationships captured by SNPs to PA. Scenario Re-LEA-SNP demonstrates this fact for large values of 𝑁𝑃, where LD between QTL and SNPs in synthetics was small (Figure 3B) and hence, PA mainly relied on capturing pedigree relationships. In line with this reasoning, PA decreased from SR to LR (Figure 5, Re-LEA-SNP) as well as when marker density was reduced from 5 to only 0.25 SNPs cM⁻¹, because similar to increasing LD, using low marker density reduced the number of independently segregating SNPs (Figure S4, Re-LEA-SNP).

In GBLUP, the consequences of an imprecise estimation of pedigree relationships by SNPs due to strong LD are limited, because the loss in PA compared with pedigree BLUP is mostly overcompensated for by capturing either co-segregation (Figure 5; small 𝑁𝑃, Re-LEA-SNP vs. Re-LDA-Ped) or long-range ancestral LD (Figure 5; large 𝑁𝑃, Re-LEA-SNP vs. Re-LDA-Ped). An exception is the combination of large 𝑁𝑃 and short-range ancestral LD, where the comparatively small contribution of ancestral LD to PA does not necessarily compensate for that loss, so that GBLUP might not provide the desired advantage over pedigree BLUP. Alternative models employing variable selection (e.g., BayesB), which capitalize more on LD than on pedigree relationships (Habier et al. 2007; Zhong et al. 2009; Jannink et al. 2010), might help to improve the prospects of GP in such cases.

Influence of training set size on prediction accuracy: In this study, we varied training set size 𝑁𝑇𝑆 for given values of 𝑁𝑃, because resources devoted to the TS differ between breeding programs and do not necessarily depend on 𝑁𝑃. Under fixed 𝑁𝑃, the absolute frequency of individuals with close actual relationship among TS and PS increases with 𝑁𝑇𝑆 (Figure S5), which led to similar benefits in PA for all “Re”-scenarios (Figure S3, except Re-LDA-Ped). However, the general decline of PA in these scenarios with increasing 𝑁𝑃 was only slightly attenuated even when using 750 instead of 125 individuals in the TS. This is because the need for larger TS size increases rapidly as pedigree relationships with the PS decrease (Habier et al. 2010), which in turn shifts the distribution of actual relationships toward lower values (Figure 6B and S2). Thus, 𝑁𝑇𝑆 must generally be increased along with 𝑁𝑃 to counteract as much as possible the expected decline in PA.

According to Habier et al. (2013), altering TS size affects the contributions of the three sources of information to PA, but this inference is based on the assumption that TS size was increased by adding new families to the TS (unrelated to the initially included families), which is comparable to increasing 𝑁𝑃 in our study. De los Campos et al. (2013) showed that the estimation of actual relationships by SNPs is sufficiently characterized by 𝑅²𝑘,𝑞 (Figure S8) and thus, largely independent of 𝑁𝑇𝑆, apart from estimation error. In synthetics, the distribution of actual relationships 𝑞𝑖𝑗 is defined by 𝑁𝑃 and LDA (Figure 6 and S2). Thus, increasing 𝑁𝑇𝑆 increases the chances for each individual in the PS to have several individuals with close actual relationships 𝑞𝑖𝑗 in the TS, which was previously found to be crucial for achieving high PA (Jannink et al. 2010; Clark et al. 2012). Therefore, the contributions to PA from co-segregation and ancestral LD increase with 𝑁𝑇𝑆, because they are required to capture deviations from pedigree relationships. Conversely, using small 𝑁𝑇𝑆 will tend to hamper the occurrence of high 𝑞𝑖𝑗 values and hence, increase the reliance on pedigree relationships.

If TS and PS are unrelated, the absolute frequency of close actual relationships is low, even if 𝑁𝑇𝑆 is large (Figure S5). Additionally, actual relationships are rather poorly estimated by SNPs when relying solely on ancestral LD (Figure S8, Un-LDA-SNP). Consequently, huge 𝑁𝑇𝑆 (>> 750) and high marker density would be required to substantially elevate PA, especially if there is only short-range ancestral LD (cf. de los Campos et al. 2013).

Influence of marker density on prediction accuracy: High marker density is especially important if LD between QTL and SNPs extends only to short map distances (Solberg et al. 2008; Zhong et al. 2009; Hickey et al. 2014). This applies in our study either if sample LD was negligible (Figure S4; large 𝑁𝑃, Re-LDA-SNP vs. Re-LEA-SNP) or if TS and PS were unrelated (Figure S4, Un-LDA-SNP), so that PA relied heavily on ancestral LD. Our results also show that in the latter case, using high marker density strongly improved PA for both ancestral populations, implying that capturing LD between tightly linked loci (∆ < 1 cM) is beneficial even if long-range ancestral LD prevails. With low marker density, capturing only the “long-range part” of ancestral LD (Figure 1, LR) still provided moderate PA (Figure S4, LR), but PA dropped below 0.1 for short-range ancestral LD (Figure S4, SR). This was likely because most SNPs were no longer in LD with QTL and thus contributed mostly noise to the prediction equation. These results are in agreement with former studies (de los Campos et al. 2013; Habier et al. 2013; Hickey et al. 2014; Lorenz and Smith 2015) reporting that under insufficient marker density, adding individuals unrelated to the PS to the TS can even decrease PA.

In summary, the required marker density for 𝑁𝑃 ≥ 16 should be chosen in compliance with the extent of ancestral LD. While in this case, high density is mandatory if TS and PS are unrelated, moderate PA can still be achieved under low marker density if TS and PS are related, because pedigree relationships then contribute to PA. For small 𝑁𝑃, extensive LD in synthetics (due to sample LD and co-segregation) lowers the requirements on marker density. Although co-segregation is captured optimally if SNPs and QTL are as tightly linked as possible, medium marker density (≥ 1 SNP cM⁻¹, depending on 𝑁𝑃) is likely sufficient to reach PA near the optimum.

Expected impact of ancestral LD on GP in synthetics: In GP of genetic predisposition in humans or breeding values of bulls, the availability of several thousand training individuals, in conjunction with high marker densities, allows for efficient use of rather low levels of ancestral LD, as usually observed in these species (de Roos et al. 2008; Goddard and Hayes 2009; de los Campos et al. 2013). We showed that short-range ancestral LD is generally less valuable in plant breeding, where TS usually comprise only hundreds or fewer individuals. Ancestral LD can differ substantially among crops and different germplasm within crops (Flint-Garcia et al. 2003). Usually, low levels of ancestral LD are found in diversity panels that encompass lines from different breeding programs and/or geographic origin as well as in materials largely unselected by breeders, such as landraces or gene bank accessions (Hyten et al. 2007; Delourme et al. 2013; Romay et al. 2013). Recently, Gorjanc et al. (2016) proposed GP for recurrent selection of synthetics generated from doubled haploid lines derived from landraces. In the light of our findings, such an approach generally requires large TS size and high marker density to outperform pedigree BLUP, unless one chooses small 𝑁𝑃 to ensure satisfactory PA due to co-segregation.

In contrast, extensive long-range ancestral LD is usually found in elite breeding germplasm of major crops such as maize (Windhausen et al. 2012; Unterseer et al. 2014), wheat (Maccaferri et al. 2005), barley (Zhong et al. 2009), soybean (Hyten et al. 2007) or sugar beet (Würschum et al. 2013). If synthetics are derived from such germplasm, ancestral LD is expected to contribute substantially to PA, as shown by our results. However, LD determined from biallelic SNPs might overestimate ancestral LD between QTL-SNP pairs, because their allele frequencies can deviate due to ascertainment bias in discarding SNPs with low minor allele frequencies for the construction of SNP arrays (Ganal et al. 2011; Goddard et al. 2011). Such an overestimation would impair the advantage of GP approaches over pedigree BLUP.

Implications for other scenarios relevant in plant breeding: Research on GP in plant breeding has so far focused primarily on the use of single (e.g., Lorenzana and Bernardo 2009; Riedelsheimer et al. 2013) and multiple segregating biparental families (BF) (e.g., Heffner et al. 2011; Albrecht et al. 2011; Schulz-Streeck et al. 2012; Habier et al. 2013; Lehermeier et al. 2014). For 𝑁𝑃 = 2, our scenarios Re-LDA-SNP and Un-LDA-SNP correspond exactly to GP within and between BF derived from unrelated parents. In practice, breeders mostly derive lines directly from F1 crosses (Mikel and Dudley 2006), whereas we applied a further generation of intermating (Figure S1). This additional meiosis slightly reduces LD in synthetics (see File S3 for details) and in turn, PA (results not shown). While GP within BF generally works well, predicting an unrelated BF can be risky and unreliable (Riedelsheimer et al. 2013) as underlined by our results for scenario Un-LDA-SNP (Figure S7, 𝑁𝑃 = 2). Similar uncertainties might be encountered if new lines from an untested BF are predicted based on pre-existing data from multiple BF (Heffner et al. 2011), diversity panels (Würschum et al. 2013) or populations of experimental hybrids (Massman et al. 2013) to obtain predicted breeding values prior to partially phenotyping the new cross (Figure S7, 𝑁𝑃 > 2 in TS and 𝑁𝑃 = 2 in PS). The risk of such approaches is likely attenuated in advanced breeding cycles, where putatively “unrelated” BF usually share more recent common ancestors than a TS comprising truly unrelated material, as would be the case in an “ideal” diversity panel. However, Hickey et al. (2014) showed that if two BF share only a grand-parent as their most recent common ancestor, PA was not substantially higher than for unrelated BF. This underpins the need for close relatives in the TS (e.g., full-sibs or half-sibs) to warrant high and robust PA across different prediction targets. Accordingly, previous studies on GP in diversity panels concluded that the observed medium to high PAs were partially attributable to latent groups of related germplasm (e.g., Rincent et al. 2012; Schopp et al. 2015).

If a BF is too small for training the prediction equation, multiple BF can be alternatively pooled together (Heffner et al. 2011; Technow and Totir 2015). Such a combined TS can be constructed by sampling lines from each BF to predict the remainder in each BF (“within”) or by using some BF to predict other BF (“across”) (cf. Albrecht et al. 2011). Our scenarios Re-LDA-SNP and Un-LDA-SNP are similar to these “within” and “across” situations for 𝑁𝑃 > 2, but – besides the additional meiosis discussed above – show another important difference to F1-derived multiple BF: generating synthetics by random mating of the Syn-1 generation breaks up the clear pedigree structure in full-sib, half-sib and unrelated families (Figure S9). This reduces both the mean and variance of pedigree relationships, which in turn reduces PA (results not shown). As discussed above, capturing pedigree relationships plays a major role in GP of both synthetics and multiple BF if TS and PS are related, especially if 𝑁𝑃 is large. This is because in both situations, co-segregation is barely used to obtain “accuracy within families” (cf. Habier et al. 2013). In practical breeding programs using multiple BF, the situation might be slightly different, if some parents are overrepresented compared with others and introduce predominant linkage phase patterns that can be exploited in GP. Moreover, one has the opportunity to improve information from co-segregation by (i) clustering related BF into the TS to reflect the co-segregation pattern of the PS or (ii) explicit modelling of co-segregation (cf. Habier et al. 2013) or family-specific effects using hierarchical models (Technow and Totir 2015). However, neither of these strategies is easily accessible in synthetics, unless one replaces random by controlled mating in order to keep track of pedigree relationships. Since ancestral LD persists well over generations (Habier et al. 2007), its contribution to PA is expected to be only marginally affected by additional intermating generations. Thus, ancestral LD can generally be considered of great importance for GP of material related or unrelated to the TS, particularly if 𝑁𝑃 is large.

In the present study, we considered the two most extreme situations of relatedness or unrelatedness of the TS and PS, because their parents were either identical or entirely different. Further research is warranted for situations of partial overlap of parents among families, which occurs frequently in practice, e.g., when proven inbred lines contribute to multiple crosses in subsequent breeding cycles. Moreover, we focused here exclusively on PA, but the genetic gain from genomic selection, which is of ultimate interest to breeders, depends additionally on the genetic variance in the population. Since both parameters are influenced by the choice of 𝑁𝑃, the potential of recurrent genomic selection in synthetics needs to be examined for different values of 𝑁𝑃 and different levels of ancestral LD, ideally across multiple selection cycles.
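For orientation, the joint dependence on accuracy and genetic variance can be read from the standard breeder's equation for truncation selection. This is a textbook relation rather than a result of this study, and it is written here under the assumption that PA is the correlation between predicted and true breeding values:

```latex
\Delta G = i \cdot \mathrm{PA} \cdot \sigma_A
```

Here \Delta G is the expected genetic gain per selection cycle, i the selection intensity, and \sigma_A the additive-genetic standard deviation among the selection candidates. A choice of 𝑁𝑃 that raises PA but lowers \sigma_A can therefore leave the expected gain unchanged or even reduce it.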


ACKNOWLEDGMENTS

We thank Chris-Carolin Schön, Tobias Würschum, José Marulanda, Willem Molenaar and three anonymous reviewers for valuable suggestions to improve the content of the manuscript. PS acknowledges Syngenta for partially funding this research by a Ph.D. fellowship and AEM the financial contribution of CIMMYT/GIZ through the CRMA Project 15.78600.8-001-00.

DATA AVAILABILITY STATEMENT

The authors state that all simulated data and results necessary for confirming the conclusions presented in the article are represented fully within the article and data supplements. Figure S1 provides a detailed overview of the entire simulation scheme and assumptions underlying all results presented herein.

LITERATURE CITED


Albrecht, T., H.-J. Auinger, V. Wimmer, J. O. Ogutu, C. Knaak et al., 2014 Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor. Appl. Genet. 127: 1375–1386.


Albrecht, T., V. Wimmer, H. Auinger, M. Erbe, C. Knaak et al., 2011 Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.


Bandillo, N., C. Raghavan, and P. Muyco, 2013 Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice 6: 1–15.


Bradshaw, J. E., 2016 Plant Breeding: Past, Present and Future. Springer International Publishing.


Cavanagh, C., M. Morell, I. Mackay, and W. Powell, 2008 From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11: 215–221.

Clark, S. A., J. M. Hickey, H. D. Daetwyler, and J. H. J. van der Werf, 2012 The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.


Delourme, R., C. Falentin, B. F. Fomeju, M. Boillot, G. Lassalle et al., 2013 High-density SNP-based genetic map development and linkage disequilibrium assessment in Brassica napus L. BMC Genomics 14: 120.


Endelman, J. B., 2011 Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome 4: 250–255.

Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics. Longman, Essex.

Flint-Garcia, S. A., J. M. Thornsberry, and E. S. Buckler, 2003 Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54: 357–374.


Ganal, M. W., G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler et al., 2011 A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6: e28334.


Goddard, M. E., and B. J. Hayes, 2009 Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10: 381–391.


Goddard, M. E., B. J. Hayes, and T. H. E. Meuwissen, 2011 Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.

Gorjanc, G., J. Jenko, S. J. Hearne, and J. M. Hickey, 2016 Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genomics 17: 30.

Habier, D., R. L. Fernando, and J. C. M. Dekkers, 2007 The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.

Habier, D., R. L. Fernando, and D. J. Garrick, 2013 Genomic BLUP Decoded: A Look into the Black Box of Genomic Prediction. Genetics 194: 597–607.

Habier, D., J. Tetens, F. Seefried, P. Lichtner, and G. Thaller, 2010 The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.

Hagdorn, S., K. Lamkey, M. Frisch, P. E. O. Guimarães, and A. E. Melchinger, 2003 Molecular genetic diversity among progenitors and derived elite lines of BSSS and BSCB1 maize populations. Crop Sci. 43: 474–482.

Hallauer, A. R., M. J. Carena, and J. B. Miranda Filho, 2010 Quantitative Genetics in Maize Breeding. Springer.

Hartl, D. L., and A. G. Clark, 2007 Principles of Population Genetics. Sinauer Associates, Inc.

Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard, 2009a Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.

Hayes, B. J., P. J. Bowman, A. C. Chamberlain, K. Verbyla, and M. E. Goddard, 2009b Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.

Hayes, B. J., P. M. Visscher, and M. E. Goddard, 2009c Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. Cambridge 91: 47–60.

Heffner, E. L., J.-L. Jannink, and M. E. Sorrells, 2011 Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. Plant Genome 4: 65–75.

Henderson, C., 1984 Applications of linear models in animal breeding. University of Guelph, ON.

Heslot, N., and J.-L. Jannink, 2015 An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet. Sel. Evol. 47: 93.

Hickey, J. M., S. Dreisigacker, J. Crossa, S. Hearne, R. Babu et al., 2014 Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476–1488.

Hill, W. G., 1981 Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38: 209–216.

Hill, W. G., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.

Hill, W. G., and B. S. Weir, 2011 Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. Cambridge 93: 47–64.

Hyten, D. L., I. Y. Choi, Q. Song, R. C. Shoemaker, R. L. Nelson et al., 2007 Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175: 1937–1944.

Jannink, J.-L., A. J. Lorenz, and H. Iwata, 2010 Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9: 166–177.

de Koning, D.-J., 2016 Meuwissen et al. on Genomic Selection. Genetics 203: 5–7.

Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al., 2014 Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3–16.

Lin, Z., B. J. Hayes, and H. D. Daetwyler, 2014 Genomic selection in crops, trees and forages: A review. Crop Pasture Sci. 65: 1177–1191.

Lorenzana, R. E., and R. Bernardo, 2009 Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120: 151–161.

Lorenz, A. J., and K. P. Smith, 2015 Adding genetically distant individuals to training populations reduces genomic prediction accuracy in barley. Crop Sci. 55: 2657–2667.

de los Campos, G., A. I. Vazquez, R. Fernando, Y. C. Klimentidis, and D. Sorensen, 2013 Prediction of Complex Human Traits Using the Genomic Best Linear Unbiased Predictor. PLoS Genet. 9: 7.

Maccaferri, M., M. C. Sanguineti, E. Noli, and R. Tuberosa, 2005 Population structure and long-range linkage disequilibrium in a durum wheat elite collection. Mol. Breed. 15: 271–289.

Mackay, I., E. Ober, and J. Hickey, 2015 GplusE: beyond genomic selection. Food Energy Secur. 4: 25–35.

Massman, J. M., A. Gordillo, R. E. Lorenzana, and R. Bernardo, 2013 Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126: 13–22.

McMullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. Li et al., 2009 Genetic Properties of the Maize Nested Association Mapping Population. Science 325: 737–740.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

Mikel, M. A., and J. W. Dudley, 2006 Evolution of North American dent corn from public to proprietary germplasm. Crop Sci. 46: 1193–1205.

Powell, J. E., P. M. Visscher, and M. E. Goddard, 2010 Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11: 800–805.

R Core Team, 2012 R: A language and environment for statistical computing. ISBN 3-900051-07-0.

Riedelsheimer, C., J. B. Endelman, M. Stange, M. E. Sorrells, J. L. Jannink et al., 2013 Genomic predictability of interconnected biparental maize populations. Genetics 194: 493–503.

Rincent, R., D. Laloë, S. Nicolas, T. Altmann, D. Brunel et al., 2012 Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715–728.

Romay, M. C., M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts et al., 2013 Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 14: R55.

de Roos, A. P. W., B. J. Hayes, and M. E. Goddard, 2009 Reliability of genomic predictions across multiple populations. Genetics 183: 1545–1553.

de Roos, A. P. W., B. J. Hayes, R. J. Spelman, and M. E. Goddard, 2008 Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179: 1503–1512.

Sargolzaei, M., and F. S. Schenkel, 2009 QMSim: a large-scale genome simulator for livestock. Bioinformatics 25: 680–681.

Schopp, P., C. Riedelsheimer, H. F. Utz, C.-C. Schön, and A. E. Melchinger, 2015 Forecasting the accuracy of genomic prediction with different selection targets in the training and prediction set as well as truncation selection. Theor. Appl. Genet. 128: 2189–2201.

Schulz-Streeck, T., J. O. Ogutu, Z. Karaman, C. Knaak, and H. P. Piepho, 2012 Genomic Selection using Multiple Populations. Crop Sci. 52: 2453–2461.

Solberg, T. R., A. K. Sonesson, J. A. Woolliams, and T. H. E. Meuwissen, 2008 Genomic selection using different marker types and densities. J. Anim. Sci. 86: 2447–2454.

Suneson, C. A., 1956 An Evolutionary Plant Breeding Method. Agron. J. 6: 1–4.

Technow, F., A. Bürger, and A. E. Melchinger, 2013 Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3 3: 197–203.

Technow, F., and L. R. Totir, 2015 Using Bayesian Multilevel Whole Genome Regression Models for Partial Pooling of Training Sets in Genomic Prediction. G3 5: 1603–1612.

Unterseer, S., E. Bauer, G. Haberer, M. Seidel, C. Knaak et al., 2014 A powerful tool for genome analysis in maize: development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics 15: 823.

VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.

Vela-Avitúa, S., T. H. Meuwissen, T. Luan, and J. Ødegård, 2015 Accuracy of genomic selection for a sib-evaluated trait using identity-by-state and identity-by-descent relationships. Genet. Sel. Evol. 47: 9.

Wientjes, Y. C. J., R. F. Veerkamp, and M. P. L. Calus, 2013 The Effect of Linkage Disequilibrium and Family Relationships on the Reliability of Genomic Prediction. Genetics 193: 621–631.

Windhausen, V. S., G. N. Atlin, J. M. Hickey, J. Crossa, J.-L. Jannink et al., 2012 Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 2: 1427–1436.

Wright, S., 1922 Coefficients of Inbreeding and Relationship. Am. Nat. 56: 330–338.

Würschum, T., J. C. Reif, T. Kraft, G. Janssen, and Y. Zhao, 2013 Genomic selection in sugar beet breeding populations. BMC Genet. 14: 85.

Zhong, S., J. C. M. Dekkers, R. L. Fernando, and J.-L. Jannink, 2009 Factors Affecting Accuracy From Genomic Selection in Populations Derived From Multiple Inbred Lines: A Barley Case Study. Genetics 182: 355–364.

FIGURES

Figure 1 Linkage disequilibrium (LDA) between pairs of loci plotted against their genetic map distance ∆ in centimorgans (cM), for the two ancestral populations SR (short-range LD) and LR (long-range LD). The two vertical lines mark the average distance between a QTL and its closest SNP for the two marker densities investigated in our study.

Figure 2 Flowchart of the eight scenarios analyzed in this study. Training set and prediction set were either related (“Re”-scenarios) or unrelated (“Un”-scenarios). The arrows represent the changes made between scenarios, e.g., removal of ancestral LD between QTL and SNPs (LDA → LEA) or replacing the relationship matrix (𝑮 → 𝑸). The background texture indicates whether identity-by-state or identity-by-descent information was used. The green circles show for the SNP-based scenarios the sources of information that contributed to prediction accuracy (cf. Habier et al. 2013), where in addition to LDA, RS refers to pedigree relationships at QTL captured by SNPs and CS refers to co-segregation of QTL and SNPs.

Figure 3 (A) Frequency of QTL-SNP pairs falling into 8 disjoint intervals of linkage disequilibrium (LD) in the parents of synthetics, plotted against their genetic map distance ∆, for three different numbers of parents 𝑁𝑃. (B) Average LD between QTL-SNP pairs, plotted against their genetic map distance ∆, for synthetics generated from different 𝑁𝑃. The mean LD in the respective ancestral population (LDA) is shown for comparison (red graphs). The left column in A and B refers to scenarios Re-LEA-SNP and Un-LEA-SNP (independent of the ancestral population), where ancestral LD between QTL and SNPs was eliminated, whereas the other two columns correspond to all other scenarios, for the ancestral populations SR (short-range LD) and LR (long-range LD), respectively.

Figure 4 Linkage phase similarity of QTL-SNP pairs in the training set (TS) and prediction set (PS) for scenarios Re-LDA-SNP and Un-LDA-SNP, plotted against the number of parents 𝑁𝑃 used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD), and for different genetic map distances ∆ (0.5, 5 and 20 cM ± 0.5 cM) between QTL and SNPs.

Figure 5 Prediction accuracy for seven scenarios (scenario Un-LEA-SNP not shown), plotted against the number of parents 𝑁𝑃 used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD). Results refer to a training set size of 𝑁𝑇𝑆 = 250 doubled haploid lines and a marker density of 5 SNPs cM⁻¹.

Figure 6 (A) Frequency of the seven possible values 𝑓𝑖𝑗 of pedigree relationships for different numbers of unrelated inbred parents 𝑁𝑃 used to generate synthetics. (B) Conditional distributions 𝑞𝑖𝑗 |𝑓𝑖𝑗 of actual relationships 𝑞𝑖𝑗 conditional on their pedigree relationship 𝑓𝑖𝑗 between individuals 𝑖 and 𝑗 in the training set and prediction set, respectively, for the two ancestral populations SR (short-range LD) and LR (long-range LD).
