Genetics: Early Online, published on January 8, 2019 as 10.1534/genetics.118.301742
1
Deleterious Mutation Burden and its Association with Complex Traits in
2
Sorghum (Sorghum bicolor)
3 4
Ravi Valluru1*, Elodie E. Gazave2, Samuel B. Fernandes3, John N. Ferguson3, Roberto Lozano2,
5
Pradeep Hirannaiah3, Tao Zuo1,4, Patrick J. Brown5, Andrew D.B. Leakey3, Michael A. Gore2,
6
Edward S. Buckler1,2,6, Nonoy Bandillo1*
7 8
1
9
14853, USA. 2Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell
10
University, Ithaca, New York 14853, USA. 3Departments of Plant Biology and Crop Sciences,
11
Institute for Genomic Biology, 1402 Institute for Genomic Biology, University of Illinois at Urbana-
12
Champaign, Illinois, USA. 5Section of Agricultural Plant Biology, Department of Plant Sciences,
13
University of California Davis, Davis, CA 95616, USA. 6USDA-ARS, R. W. Holley Center, 538
14
Tower Road, Ithaca, New York 14853, USA. 4Present address: Monsanto Company, St. Louis,
15
Missouri 63167, USA.
Institute for Genomic Diversity, 175 Biotechnology Building, Cornell University, Ithaca, New York
16 17
ORCID IDs:
18
0000-0001-5725-5766 (RV);
19
0000-0001-8269-535X (SBF);
20
0000-0003-3603-9997 (JNF);
21
0000-0003-0760-4977 (RL);
22
0000-0001-9654-4253 (PH);
23
0000-0002-6581-1192 (TZ);
24
0000-0003-1332-711X (PJB);
25
0000-0001-6251-024X (ABDL);
26
0000-0001-6896-8024 (MAG);
27
0000-0002-3100-371X (ESB);
28
0000-0002-5941-9047 (NB);
29 30 31 32 33 34
1 Copyright 2019.
35
SHORT RUNNING TITLE: Deleterious mutations in sorghum
36 37 38
KEYWORDS: deleterious mutations, genetic load, genome-wide predictions, mutation burden,
39
sorghum
40 41 42
*CORRESPONDANCE:
43
Ravi Valluru
44
Institute for Genomic Diversity,
45
175 Biotechnology Building
46
Cornell University
47
Ithaca 14853, USA
48
[email protected];
[email protected]
49 50 51 52
Nonoy Bandillo
53
Institute for Genomic Diversity,
54
175 Biotechnology Building
55
Cornell University
56
Ithaca 14853, USA
57
[email protected]
58 59 60 61 62 63 64 65 66 67 68 69
2
70 71
ABSTRACT
72
Sorghum (Sorghum bicolor L.) is a major food cereal for millions of people worldwide. The sorghum
73
genome, like other species, accumulates deleterious mutations, likely impacting its fitness. The
74
lack of recombination, drift, and the coupling with favorable loci impede the removal of deleterious
75
mutations from the genome by selection. To study how deleterious variants impact phenotypes, we
76
identified putative deleterious mutations among ~5.5M segregating variants of 229 diverse biomass
77
sorghum lines. We provide the whole-genome estimate of the deleterious burden in sorghum,
78
showing that about 33 percent of nonsynonymous substitutions are putatively deleterious. The
79
pattern of mutation burden varies appreciably among racial groups. Across racial groups, the
80
mutation burden correlated negatively with biomass, plant height, specific leaf area (SLA), and
81
tissue starch content (TSC), suggesting deleterious burden decreases trait fitness. Putatively
82
deleterious variants explain roughly half of the genetic variance. However, there is only moderate
83
improvement in total heritable variance explained for biomass (7.6%) and plant height (average of
84
3.1% across all stages). There is no advantage in total heritable variance for SLA and TSC. The
85
contribution of putatively deleterious variants to phenotypic diversity therefore appears to be
86
dependent on the genetic architecture of traits. Overall, these results suggest that incorporating
87
putatively deleterious variants into genomic models slightly improve prediction accuracy because
88
of extensive linkage. Knowledge of deleterious variants could be leveraged for sorghum breeding
89
through either genome editing and/or conventional breeding that focus on the selection of progeny
90
with fewer deleterious alleles.
91 92 93 94 95 96 97 98 99 100 101 102 103 104
3
105 106
Introduction
107
Plant genomes continually accumulate new mutations due to population demographic history
108
(Brandvain et al. 2013), random drift (Lynch and Gabriel 1990), the mating system (Hartfield and
109
Glémin 2014), domestication (Lu et al. 2006; Ramu et al. 2017), and linked selection due to genetic
110
interactions (Felsenstein 1974). While a sizeable portion of such new mutations are neutral (Shaw
111
et al. 2002; Covert et al. 2013), a small portion of new mutations are likely to be deleterious
112
because they disrupt evolutionarily conserved sites, protein function (Yampolsky et al. 2005;
113
Doniger et al. 2008), or gene expression (Kremling et al. 2018) in a way that results in negative
114
impacts on fitness. The elimination of deleterious mutations from breeding populations has
115
therefore been suggested as a prospective avenue for crop improvement (Morrell et al. 2012;
116
Moyers et al. 2018).
117
Sorghum (Sorghum bicolor (L.), 2n = 20) is an important and versatile crop that is grown for food,
118
forage, and fuel. It was domesticated from its wild ancestor about 8,000 years ago in Africa
119
(Wendorf et al. 1992). Five major morphological forms have traditionally been recognized: bicolor,
120
caudatum, durra, guinea, and kafir. While these races are widespread in distinct regions of Africa
121
reflecting the diverse agro-eco-environments (Dillon et al. 2007; Evans et al. 2013), sorghum has
122
maintained minimal genome redundancy due to the absence of any whole genome duplication for
123
over 70 million years (Paterson et al. 2004, 2009). However, inbreeding sorghum is likely to
124
accumulate more weakly deleterious mutations when compared to an outcrossing species, which
125
accumulates strong recessive deleterious mutations that reduce the mean fitness of the species
126
over time (Moyers et al. 2018). Nonetheless, there is accumulating evidence showing that
127
enhanced homozygosity (Kumaravadivel and Rangasamy 1994), relaxed selection (Arunkumar et
128
al. 2015), and low levels of outcrossing (Pamilo et al. 1987; Nakayama et al. 2012) can act to
129
purge deleterious mutations leading to lower mutation burden in selfing populations. Though the
130
relative contributions of these processes to mutation burden has long been debated, both
131
theoretical and experimental evidence suggests that reduced population size effects usually
132
outcompete processes that enhance purging of deleterious mutations caused by selfing
133
(Bustamante et al. 2002; Slotte et al. 2010, 2013; Arunkumar et al. 2015) leading to an influx of
134
deleterious mutations into selfing species.
135
Modern breeding and domestication results in an increased mutation burden in domesticates when
136
compared to their wild progenitors, and a decreased mutation burden in elite cultivars when
137
compared to landraces (Gaut et al. 2015; Ramu et al. 2017; Yang et al. 2017). The demographic
4
138
history and inbreeding allow deleterious variants of weaker effect to reach appreciable frequencies
139
owing to random drift, which can contribute significantly to mutation burden and affect fitness-
140
related traits (Kono et al. 2016). An estimated 20 to 30% of nonsynonymous variants are
141
deleterious in rice (Lu et al. 2006), Arabidopsis (Günther and Schmid 2010), maize (Mezmouk and
142
Ross-Ibarra 2014), and cassava (Ramu et al. 2017). Renaut and Rieseberg (2015) identified an
143
excess of nonsynonymous Single Nucleotide Polymorphisms (SNPs) segregating in domesticated
144
sunflower and globe artichoke relative to natural populations. Similarly, about 20 to 40% of protein-
145
coding SNPs are predicted to have a deleterious allele in maize (Mezmouk and Ross-Ibarra 2014).
146
Indeed, deleterious mutations are predicted to be enriched near regions of strong selection (Chun
147
and Fay, 2011; Gaut et al. 2015; Kono et al. 2016), pointing to a potentially important role for
148
deleterious variants in shaping agronomic phenotypes.
149
Genomic Selection (GS) can help to accelerate crop breeding when compared to conventional
150
phenotype-based selection approaches. In Genome-Wide Prediction (GWP) models employed in
151
GS, the genetic variance is modeled by accounting for either the biological additive and dominant
152
effects of the markers that can potentially improve the prediction accuracy of phenotypic traits
153
(Vitezica et al. 2013, 2016). Genes associated with complex traits carry an uncertain number of
154
deleterious mutations distributed across the genome, and such a mutation burden contributes
155
significantly to the total phenotypic variation of traits (Yang et al. 2017). Because deleterious
156
mutations can occur in both homozygous and heterozygous states depending on the genetic
157
context, trait-specific and genetic-context based GWP models could be expected to capture the
158
phenotypic effects of deleterious mutations. Therefore, GWP models encompassing deleterious
159
variants are expected to account for the total genetic contribution to and improve the prediction
160
accuracy of complex traits (Yang et al. 2017). However, the improvement of GWP will depend on
161
how strongly correlated deleterious variants are to all other variants.
162
In this study, we examine the contribution of putatively deleterious variants to phenotypic variation
163
in sorghum. We used a racially, geographically, and phenotypically diverse biomass sorghum
164
population that represents the ancestry of four major sorghum types (Brenton et al. 2016). All
165
accessions were phenotyped for two agronomic traits, dry biomass (DBM) and plant height (PH),
166
and for two physiological traits, specific leaf area (SLA) and tissue starch content (TSC) under field
167
conditions. We performed whole-genome resequencing (WGS) on 229 sorghum lines and
168
identified putative deleterious mutations in the genome. The main objectives of this study were to
169
determine (1) whether empirical patterns of deleterious mutation burden differ among sorghum
170
racial groups; and (2) whether deleterious variants improve prediction accuracy of complex traits,
5
171
and if so, whether such accuracy differs among phenotypic traits that have different genetic
172
architecture. To address these questions, we first identified the putative deleterious mutations and
173
their biological effect sizes and then, estimated an individual mutation burden and its relationship
174
with phenotypic traits. Taking advantage of a Bayesian genomic selection framework (Habier et al.
175
2011), we tested the biological significance of deleterious variants on prediction of DBM, PH, SLA,
176
and TSC.
177 178
Materials and methods
179
Plant material, field experiments and phenotypic data
180
A biomass sorghum diversity panel assembled for the TERRA-MEPP and TERRA-WEST projects
181
was used in this study. This panel was composed of 869 lines, 339 lines coming from Fernandes et
182
al. (2018), 117 lines coming from Brenton et al. (2016), 273 lines coming from Yu et al. (2016), and
183
140 additional lines obtained from John Burke (USDA - Lubbock, TX). Although phenotypic data for
184
the entire panel was collected, only a subset of 229 lines for which WGS data were available were
185
used in the study. These 229 lines belong to four major races of sorghum (caudatum, durra,
186
guinea, and kafir) with representatives from the African continent, Asia, and the Americas (Fig. S1).
187 188
Field experiments were conducted in Illinois during 2016 in an augmented block design that
189
consisted of 960 four-row plots with a row length of 3 m, 1.5 m alleys and 0.76 m row spacing. All
190
plots were arranged in 40 rows and 24 columns. Target density of plant population was
191
approximately 270,368 plants ha-1 and experiments were planted in late May and harvested in
192
early October. Plant height was measured from the ground to the uppermost leaf whorl at 7
193
developmental stages starting 4 weeks after planting (WAP) up to 16 WAP with an interval of 2
194
weeks (7 stages) and averaged across the plot. Biomass data was collected at harvesting using a
195
four-row Kemper head attached to the John Deere 5830 tractor. A plot sampler equipment with
196
near infra-red sensor (model 130S, RCI engineering) was used to measure the wet weight of total
197
biomass (lb) and to quantify biomass moisture (%) and starch (%) contents of plants (Li et al. 2015)
198
in the 2 middle rows of each four-row plot. Biomass yield in dry US tons per acre was calculated
199
as: dry US tons per acre = total plot wet weight (lb) x (1 − plot moisture) / (plot area in acre) x
200
0.0005. Because some accessions had flowered (38 accessions), flowering data were recorded in
201
2018 (flowering data was not available for 2016). We conducted an additional set of analyses that
202
had excluded these 38 accessions to assess the potential confounding effect of flowering time on
203
plant height.
204
6
205
To estimate specific leaf area (SLA), the youngest fully expanded leaf from two randomly selected
206
plants of the middle two rows of each plot were excised just above the ligule 60 to 70 days after
207
planting. Damaged leaves were avoided. Excised leaves were then re-cut under water, and the cut
208
surface kept immersed. In the laboratory, three 1.6 cm leaf discs were collected from the middle of
209
each leaf whilst avoiding the mid-rib. Leaf discs were immediately transferred to an oven set at
210
60°C for two weeks. The dry mass of leaf discs was determined, and SLA was expressed as the
211
ratio of fresh leaf area to dry leaf mass (cm2 g-1). Considering 10 days interval among the SLA
212
sampling, we used ‘date of sampling’ as a term in the model to generate BLUPs.
213 214
Statistical analysis of phenotypic data
215
Phenotypic data analysis was conducted according to experimental design, which consisted of a
216
series of incomplete blocks connected through common checks. The following model was used to
217
get best linear unbiased prediction (BLUPs) for all genotypes included in the field trial:
218
yijk = μ + gi + ej + b + geij + ε k(j) ijk
219
where μ is the overall mean, gi is the random effect of the ith genotype, ej is the random effect of
220
the jth location, bk(j) is the random effect of the kth incomplete block nested within the jth location, geij
221
represents the effect of genotype-by-environment interaction, and εijk is the residual error for the ith
222
genotype in the kth incomplete block in the jth location.
223 224
For specific leaf area (SLA), we fit another model that accounted for the sampling date:
225
yijkl = μ + gi + ej + b + d + geij + ε k(j) l(kj) ijkl
226
where μ is the overall mean, gi is the random effect of the ith genotype, ej is the random effect of
227
the jth location, bk(j) is the random effect of the kth incomplete block nested within the jth location, dl(j)
228
is the random effect of the lth sampling date nested within kth incomplete block and the jth location,
229
geij represents the effect of genotype-by-environment interaction, and εijkl is the residual error for
230
the ith genotype in the kth incomplete block and lth sampling date in the jth location.
231 232
For the purpose of estimating the broad-sense heritability (H2) of each phenotype, we estimated
233
variance components using the restricted maximum likelihood. All effects were assumed to be
234
random. Broad-sense heritability on an entry-mean basis was calculated as H2 = σ2G / (σ2G + σ2GXE /
235
number of locations + σ2e / number of locations x number of replicates), where σ2G is the variance
236
among accessions, σ2GXE is the accession-by-environment variance, and σ2e is the error variance.
237
All analyses were conducted in R software (R Development Core Team, 2015). 7
238 239 240 241 242
Genotyping
243
Genomic DNA (gDNA) was extracted using the CTAB method and quantified using picogreen
244
(Molecular Probes, Eugene Oregon, USA) on a microplate reader of Synergy HT (BioTek,
245
Vermont, USA). After preprocessing steps of the genomic DNA samples, ten libraries were
246
prepared (24 samples in each library) and sequenced on HiSeq 4000 (PE_2x150) using
247
sequencing kit version 1. Fastq files were demultiplexed with the bcl2fastq v2.17.1.14 conversion
248
software of Illumina. We used Sentieon DNAseq (Freed et al. 2017) and a series of custom bash
249
scripts to process the raw reads. Briefly, fastq files were aligned to the Sorghum bicolor reference
250
genome version 3.1 (https://phytozome.jgi.doe.gov). PCR duplicates were removed, base quality
251
was recalibrated based on a ‘known SNPs’ file, and recalibrated files were processed through the
252
Haplotype Caller (HC). No realignment around indels was performed. The dataset therefore
253
contains 239 samples, corresponding to 229 unique accessions, of which 7 had 1 or 2 replicates.
254
To create a list of “known SNPs” for the recalibration step, the HC pipeline was run without
255
recalibration on the list of 239 BAM files. The output was filtered removing SNPs that had a
256
number of heterozygote genotypes across all accessions greater than 10% and/or a number of
257
heterozygote genotypes greater than two times the number of minor alleles (hereafter referred to
258
as “homozygosity-based filter” (Chia et al. 2012)). In addition, “SNP clusters”, defined as three or
259
more SNPs located within five base pairs (bp) were also filtered out. Clusters of SNPs are often
260
generated by misalignment and were conservatively considered as spurious. The filtered list of
261
SNPs was used as “known SNPs” to recalibrate the BAM files and to generate a final list of SNPs.
262
The vcf file generated by the HC contained biallelic SNPs (n=22,359,733) and were further filtered
263
to only retain SNPs with at least 4X coverage (n=21,865,512), and with a non-missing genotype in
264
at least 40% of the samples (n=14,535,156). After removing SNP clusters and applying
265
homozygosity-based filters, the final dataset contained 5,512,653 SNPs, which were used for
266
further analyses.
267 268
Identifying putatively deleterious mutations
269
The substitution of amino acid effect on protein function was predicted with the SIFT algorithm
270
(Vaser et al. 2016). A nonsynonymous mutation with a SIFT score 2) (Davydov et al. 2010) estimated from a multi-species
273
whole-genome alignment of six species including Zea mays, Oryza sativa, Setaria italica,
274
Brachypodium distachyon, Hordeum vulgare, and Musa acuminate. We therefore used both an
275
estimate of sequence conservation (GERP>2) and protein conservation (SIFT