Deleterious Mutation Burden and its Association with

1 downloads 0 Views 3MB Size Report
Jan 8, 2019 - Tower Road, Ithaca, New York 14853, USA. .... wild progenitors, and a decreased mutation burden in elite cultivars ..... quantitative measures of an observed variant for its long-term fitness ..... low statistical power (Park et al.
Genetics: Early Online, published on January 8, 2019 as 10.1534/genetics.118.301742

1

Deleterious Mutation Burden and its Association with Complex Traits in

2

Sorghum (Sorghum bicolor)

3 4

Ravi Valluru1*, Elodie E. Gazave2, Samuel B. Fernandes3, John N. Ferguson3, Roberto Lozano2,

5

Pradeep Hirannaiah3, Tao Zuo1,4, Patrick J. Brown5, Andrew D.B. Leakey3, Michael A. Gore2,

6

Edward S. Buckler1,2,6, Nonoy Bandillo1*

7 8

1

9

14853, USA. 2Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell

10

University, Ithaca, New York 14853, USA. 3Departments of Plant Biology and Crop Sciences,

11

Institute for Genomic Biology, 1402 Institute for Genomic Biology, University of Illinois at Urbana-

12

Champaign, Illinois, USA. 5Section of Agricultural Plant Biology, Department of Plant Sciences,

13

University of California Davis, Davis, CA 95616, USA. 6USDA-ARS, R. W. Holley Center, 538

14

Tower Road, Ithaca, New York 14853, USA. 4Present address: Monsanto Company, St. Louis,

15

Missouri 63167, USA.

Institute for Genomic Diversity, 175 Biotechnology Building, Cornell University, Ithaca, New York

16 17

ORCID IDs:

18

0000-0001-5725-5766 (RV);

19

0000-0001-8269-535X (SBF);

20

0000-0003-3603-9997 (JNF);

21

0000-0003-0760-4977 (RL);

22

0000-0001-9654-4253 (PH);

23

0000-0002-6581-1192 (TZ);

24

0000-0003-1332-711X (PJB);

25

0000-0001-6251-024X (ABDL);

26

0000-0001-6896-8024 (MAG);

27

0000-0002-3100-371X (ESB);

28

0000-0002-5941-9047 (NB);

29 30 31 32 33 34

1 Copyright 2019.

35

SHORT RUNNING TITLE: Deleterious mutations in sorghum

36 37 38

KEYWORDS: deleterious mutations, genetic load, genome-wide predictions, mutation burden,

39

sorghum

40 41 42

*CORRESPONDANCE:

43

Ravi Valluru

44

Institute for Genomic Diversity,

45

175 Biotechnology Building

46

Cornell University

47

Ithaca 14853, USA

48

[email protected]; [email protected]

49 50 51 52

Nonoy Bandillo

53

Institute for Genomic Diversity,

54

175 Biotechnology Building

55

Cornell University

56

Ithaca 14853, USA

57

[email protected]

58 59 60 61 62 63 64 65 66 67 68 69

2

70 71

ABSTRACT

72

Sorghum (Sorghum bicolor L.) is a major food cereal for millions of people worldwide. The sorghum

73

genome, like other species, accumulates deleterious mutations, likely impacting its fitness. The

74

lack of recombination, drift, and the coupling with favorable loci impede the removal of deleterious

75

mutations from the genome by selection. To study how deleterious variants impact phenotypes, we

76

identified putative deleterious mutations among ~5.5M segregating variants of 229 diverse biomass

77

sorghum lines. We provide the whole-genome estimate of the deleterious burden in sorghum,

78

showing that about 33 percent of nonsynonymous substitutions are putatively deleterious. The

79

pattern of mutation burden varies appreciably among racial groups. Across racial groups, the

80

mutation burden correlated negatively with biomass, plant height, specific leaf area (SLA), and

81

tissue starch content (TSC), suggesting deleterious burden decreases trait fitness. Putatively

82

deleterious variants explain roughly half of the genetic variance. However, there is only moderate

83

improvement in total heritable variance explained for biomass (7.6%) and plant height (average of

84

3.1% across all stages). There is no advantage in total heritable variance for SLA and TSC. The

85

contribution of putatively deleterious variants to phenotypic diversity therefore appears to be

86

dependent on the genetic architecture of traits. Overall, these results suggest that incorporating

87

putatively deleterious variants into genomic models slightly improve prediction accuracy because

88

of extensive linkage. Knowledge of deleterious variants could be leveraged for sorghum breeding

89

through either genome editing and/or conventional breeding that focus on the selection of progeny

90

with fewer deleterious alleles.

91 92 93 94 95 96 97 98 99 100 101 102 103 104

3

105 106

Introduction

107

Plant genomes continually accumulate new mutations due to population demographic history

108

(Brandvain et al. 2013), random drift (Lynch and Gabriel 1990), the mating system (Hartfield and

109

Glémin 2014), domestication (Lu et al. 2006; Ramu et al. 2017), and linked selection due to genetic

110

interactions (Felsenstein 1974). While a sizeable portion of such new mutations are neutral (Shaw

111

et al. 2002; Covert et al. 2013), a small portion of new mutations are likely to be deleterious

112

because they disrupt evolutionarily conserved sites, protein function (Yampolsky et al. 2005;

113

Doniger et al. 2008), or gene expression (Kremling et al. 2018) in a way that results in negative

114

impacts on fitness. The elimination of deleterious mutations from breeding populations has

115

therefore been suggested as a prospective avenue for crop improvement (Morrell et al. 2012;

116

Moyers et al. 2018).

117

Sorghum (Sorghum bicolor (L.), 2n = 20) is an important and versatile crop that is grown for food,

118

forage, and fuel. It was domesticated from its wild ancestor about 8,000 years ago in Africa

119

(Wendorf et al. 1992). Five major morphological forms have traditionally been recognized: bicolor,

120

caudatum, durra, guinea, and kafir. While these races are widespread in distinct regions of Africa

121

reflecting the diverse agro-eco-environments (Dillon et al. 2007; Evans et al. 2013), sorghum has

122

maintained minimal genome redundancy due to the absence of any whole genome duplication for

123

over 70 million years (Paterson et al. 2004, 2009). However, inbreeding sorghum is likely to

124

accumulate more weakly deleterious mutations when compared to an outcrossing species, which

125

accumulates strong recessive deleterious mutations that reduce the mean fitness of the species

126

over time (Moyers et al. 2018). Nonetheless, there is accumulating evidence showing that

127

enhanced homozygosity (Kumaravadivel and Rangasamy 1994), relaxed selection (Arunkumar et

128

al. 2015), and low levels of outcrossing (Pamilo et al. 1987; Nakayama et al. 2012) can act to

129

purge deleterious mutations leading to lower mutation burden in selfing populations. Though the

130

relative contributions of these processes to mutation burden has long been debated, both

131

theoretical and experimental evidence suggests that reduced population size effects usually

132

outcompete processes that enhance purging of deleterious mutations caused by selfing

133

(Bustamante et al. 2002; Slotte et al. 2010, 2013; Arunkumar et al. 2015) leading to an influx of

134

deleterious mutations into selfing species.

135

Modern breeding and domestication results in an increased mutation burden in domesticates when

136

compared to their wild progenitors, and a decreased mutation burden in elite cultivars when

137

compared to landraces (Gaut et al. 2015; Ramu et al. 2017; Yang et al. 2017). The demographic

4

138

history and inbreeding allow deleterious variants of weaker effect to reach appreciable frequencies

139

owing to random drift, which can contribute significantly to mutation burden and affect fitness-

140

related traits (Kono et al. 2016). An estimated 20 to 30% of nonsynonymous variants are

141

deleterious in rice (Lu et al. 2006), Arabidopsis (Günther and Schmid 2010), maize (Mezmouk and

142

Ross-Ibarra 2014), and cassava (Ramu et al. 2017). Renaut and Rieseberg (2015) identified an

143

excess of nonsynonymous Single Nucleotide Polymorphisms (SNPs) segregating in domesticated

144

sunflower and globe artichoke relative to natural populations. Similarly, about 20 to 40% of protein-

145

coding SNPs are predicted to have a deleterious allele in maize (Mezmouk and Ross-Ibarra 2014).

146

Indeed, deleterious mutations are predicted to be enriched near regions of strong selection (Chun

147

and Fay, 2011; Gaut et al. 2015; Kono et al. 2016), pointing to a potentially important role for

148

deleterious variants in shaping agronomic phenotypes.

149

Genomic Selection (GS) can help to accelerate crop breeding when compared to conventional

150

phenotype-based selection approaches. In Genome-Wide Prediction (GWP) models employed in

151

GS, the genetic variance is modeled by accounting for either the biological additive and dominant

152

effects of the markers that can potentially improve the prediction accuracy of phenotypic traits

153

(Vitezica et al. 2013, 2016). Genes associated with complex traits carry an uncertain number of

154

deleterious mutations distributed across the genome, and such a mutation burden contributes

155

significantly to the total phenotypic variation of traits (Yang et al. 2017). Because deleterious

156

mutations can occur in both homozygous and heterozygous states depending on the genetic

157

context, trait-specific and genetic-context based GWP models could be expected to capture the

158

phenotypic effects of deleterious mutations. Therefore, GWP models encompassing deleterious

159

variants are expected to account for the total genetic contribution to and improve the prediction

160

accuracy of complex traits (Yang et al. 2017). However, the improvement of GWP will depend on

161

how strongly correlated deleterious variants are to all other variants.

162

In this study, we examine the contribution of putatively deleterious variants to phenotypic variation

163

in sorghum. We used a racially, geographically, and phenotypically diverse biomass sorghum

164

population that represents the ancestry of four major sorghum types (Brenton et al. 2016). All

165

accessions were phenotyped for two agronomic traits, dry biomass (DBM) and plant height (PH),

166

and for two physiological traits, specific leaf area (SLA) and tissue starch content (TSC) under field

167

conditions. We performed whole-genome resequencing (WGS) on 229 sorghum lines and

168

identified putative deleterious mutations in the genome. The main objectives of this study were to

169

determine (1) whether empirical patterns of deleterious mutation burden differ among sorghum

170

racial groups; and (2) whether deleterious variants improve prediction accuracy of complex traits,

5

171

and if so, whether such accuracy differs among phenotypic traits that have different genetic

172

architecture. To address these questions, we first identified the putative deleterious mutations and

173

their biological effect sizes and then, estimated an individual mutation burden and its relationship

174

with phenotypic traits. Taking advantage of a Bayesian genomic selection framework (Habier et al.

175

2011), we tested the biological significance of deleterious variants on prediction of DBM, PH, SLA,

176

and TSC.

177 178

Materials and methods

179

Plant material, field experiments and phenotypic data

180

A biomass sorghum diversity panel assembled for the TERRA-MEPP and TERRA-WEST projects

181

was used in this study. This panel was composed of 869 lines, 339 lines coming from Fernandes et

182

al. (2018), 117 lines coming from Brenton et al. (2016), 273 lines coming from Yu et al. (2016), and

183

140 additional lines obtained from John Burke (USDA - Lubbock, TX). Although phenotypic data for

184

the entire panel was collected, only a subset of 229 lines for which WGS data were available were

185

used in the study. These 229 lines belong to four major races of sorghum (caudatum, durra,

186

guinea, and kafir) with representatives from the African continent, Asia, and the Americas (Fig. S1).

187 188

Field experiments were conducted in Illinois during 2016 in an augmented block design that

189

consisted of 960 four-row plots with a row length of 3 m, 1.5 m alleys and 0.76 m row spacing. All

190

plots were arranged in 40 rows and 24 columns. Target density of plant population was

191

approximately 270,368 plants ha-1 and experiments were planted in late May and harvested in

192

early October. Plant height was measured from the ground to the uppermost leaf whorl at 7

193

developmental stages starting 4 weeks after planting (WAP) up to 16 WAP with an interval of 2

194

weeks (7 stages) and averaged across the plot. Biomass data was collected at harvesting using a

195

four-row Kemper head attached to the John Deere 5830 tractor. A plot sampler equipment with

196

near infra-red sensor (model 130S, RCI engineering) was used to measure the wet weight of total

197

biomass (lb) and to quantify biomass moisture (%) and starch (%) contents of plants (Li et al. 2015)

198

in the 2 middle rows of each four-row plot. Biomass yield in dry US tons per acre was calculated

199

as: dry US tons per acre = total plot wet weight (lb) x (1 − plot moisture) / (plot area in acre) x

200

0.0005. Because some accessions had flowered (38 accessions), flowering data were recorded in

201

2018 (flowering data was not available for 2016). We conducted an additional set of analyses that

202

had excluded these 38 accessions to assess the potential confounding effect of flowering time on

203

plant height.

204

6

205

To estimate specific leaf area (SLA), the youngest fully expanded leaf from two randomly selected

206

plants of the middle two rows of each plot were excised just above the ligule 60 to 70 days after

207

planting. Damaged leaves were avoided. Excised leaves were then re-cut under water, and the cut

208

surface kept immersed. In the laboratory, three 1.6 cm leaf discs were collected from the middle of

209

each leaf whilst avoiding the mid-rib. Leaf discs were immediately transferred to an oven set at

210

60°C for two weeks. The dry mass of leaf discs was determined, and SLA was expressed as the

211

ratio of fresh leaf area to dry leaf mass (cm2 g-1). Considering 10 days interval among the SLA

212

sampling, we used ‘date of sampling’ as a term in the model to generate BLUPs.

213 214

Statistical analysis of phenotypic data

215

Phenotypic data analysis was conducted according to experimental design, which consisted of a

216

series of incomplete blocks connected through common checks. The following model was used to

217

get best linear unbiased prediction (BLUPs) for all genotypes included in the field trial:

218

yijk = μ + gi + ej + b + geij + ε k(j) ijk

219

where μ is the overall mean, gi is the random effect of the ith genotype, ej is the random effect of

220

the jth location, bk(j) is the random effect of the kth incomplete block nested within the jth location, geij

221

represents the effect of genotype-by-environment interaction, and εijk is the residual error for the ith

222

genotype in the kth incomplete block in the jth location.

223 224

For specific leaf area (SLA), we fit another model that accounted for the sampling date:

225

yijkl = μ + gi + ej + b + d + geij + ε k(j) l(kj) ijkl

226

where μ is the overall mean, gi is the random effect of the ith genotype, ej is the random effect of

227

the jth location, bk(j) is the random effect of the kth incomplete block nested within the jth location, dl(j)

228

is the random effect of the lth sampling date nested within kth incomplete block and the jth location,

229

geij represents the effect of genotype-by-environment interaction, and εijkl is the residual error for

230

the ith genotype in the kth incomplete block and lth sampling date in the jth location.

231 232

For the purpose of estimating the broad-sense heritability (H2) of each phenotype, we estimated

233

variance components using the restricted maximum likelihood. All effects were assumed to be

234

random. Broad-sense heritability on an entry-mean basis was calculated as H2 = σ2G / (σ2G + σ2GXE /

235

number of locations + σ2e / number of locations x number of replicates), where σ2G is the variance

236

among accessions, σ2GXE is the accession-by-environment variance, and σ2e is the error variance.

237

All analyses were conducted in R software (R Development Core Team, 2015). 7

238 239 240 241 242

Genotyping

243

Genomic DNA (gDNA) was extracted using the CTAB method and quantified using picogreen

244

(Molecular Probes, Eugene Oregon, USA) on a microplate reader of Synergy HT (BioTek,

245

Vermont, USA). After preprocessing steps of the genomic DNA samples, ten libraries were

246

prepared (24 samples in each library) and sequenced on HiSeq 4000 (PE_2x150) using

247

sequencing kit version 1. Fastq files were demultiplexed with the bcl2fastq v2.17.1.14 conversion

248

software of Illumina. We used Sentieon DNAseq (Freed et al. 2017) and a series of custom bash

249

scripts to process the raw reads. Briefly, fastq files were aligned to the Sorghum bicolor reference

250

genome version 3.1 (https://phytozome.jgi.doe.gov). PCR duplicates were removed, base quality

251

was recalibrated based on a ‘known SNPs’ file, and recalibrated files were processed through the

252

Haplotype Caller (HC). No realignment around indels was performed. The dataset therefore

253

contains 239 samples, corresponding to 229 unique accessions, of which 7 had 1 or 2 replicates.

254

To create a list of “known SNPs” for the recalibration step, the HC pipeline was run without

255

recalibration on the list of 239 BAM files. The output was filtered removing SNPs that had a

256

number of heterozygote genotypes across all accessions greater than 10% and/or a number of

257

heterozygote genotypes greater than two times the number of minor alleles (hereafter referred to

258

as “homozygosity-based filter” (Chia et al. 2012)). In addition, “SNP clusters”, defined as three or

259

more SNPs located within five base pairs (bp) were also filtered out. Clusters of SNPs are often

260

generated by misalignment and were conservatively considered as spurious. The filtered list of

261

SNPs was used as “known SNPs” to recalibrate the BAM files and to generate a final list of SNPs.

262

The vcf file generated by the HC contained biallelic SNPs (n=22,359,733) and were further filtered

263

to only retain SNPs with at least 4X coverage (n=21,865,512), and with a non-missing genotype in

264

at least 40% of the samples (n=14,535,156). After removing SNP clusters and applying

265

homozygosity-based filters, the final dataset contained 5,512,653 SNPs, which were used for

266

further analyses.

267 268

Identifying putatively deleterious mutations

269

The substitution of amino acid effect on protein function was predicted with the SIFT algorithm

270

(Vaser et al. 2016). A nonsynonymous mutation with a SIFT score 2) (Davydov et al. 2010) estimated from a multi-species

273

whole-genome alignment of six species including Zea mays, Oryza sativa, Setaria italica,

274

Brachypodium distachyon, Hordeum vulgare, and Musa acuminate. We therefore used both an

275

estimate of sequence conservation (GERP>2) and protein conservation (SIFT