Predicting gene expression using DNA methylation in two human populations

Huan Zhong 1, Soyeon Kim 2, Degui Zhi 2, Xiangqin Cui (Corresp.) 3

1 Department of Biology, Hong Kong Baptist University, Hong Kong, China
2 School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, United States
3 Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States

Corresponding Author: Xiangqin Cui
Email address: [email protected]

Background. DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially negative regulation in the promoter region. However, its correlation with gene expression at the population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her or his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts with DNA samples but not RNA samples. Results. We studied two human population datasets, adipose tissue from the Multiple Tissue Human Expression Resource (MuTHER) project and peripheral blood mononuclear cells (PBMC) from asthmatic and healthy individuals, for predicting gene expression using methylation of all CpG sites from the gene region. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, even with LASSO regression, a very small prediction R² was obtained for the majority of genes, and only about one thousand genes had prediction R² greater than 0.1. GO term and pathway analyses of these more predictable genes showed that they are enriched for immune and defense genes. Conclusion. In human populations, DNA methylation of CpG sites in the gene region has weak prediction power for gene expression. The relatively more predictable genes tend to be defense and immune genes.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018


Keywords

DNA methylation, methylation microarray, transcriptome, LASSO


Background

DNA methylation has long been recognized as an important epigenetic modification in regulating gene expression (Razin & Riggs, 1980). This process often occurs at CG dinucleotide sites (CpG sites), adding a methyl group to the cytosine residue (You & Jones, 2012; Wagner et al., 2015). In mammals, more than 70% of CpG sites are methylated.

The regulatory role of DNA methylation has mostly been studied with a small number of CpG sites in a limited number of genes. The more recent application of microarrays and next-generation sequencing enables large-scale analysis of DNA methylation and gene expression across the whole genome (Krueger et al., 2012). However, most genome-wide methylation and expression studies have small sample sizes for comparing controlled groups. Only a few studies have profiled both genome-wide DNA methylation and gene expression in larger human populations and examined their relationship. Del Rey et al. (2013) studied genome-wide DNA methylation and gene expression in 83 patients with low-risk subtypes of myelodysplastic syndrome (MDS) and 36 controls using microarrays. While they found negative correlations between methylation and gene expression in a large proportion of differentially expressed and differentially methylated genes, they also found substantial positive correlations among them. In another study of 648 twins, overall negative correlations were found in adipose tissue for the promoter region (-0.018), gene body (-0.013), and 3′ UTR (-0.007) (Grundberg et al., 2013). More recently, Wagner et al. (2014) profiled genome-wide DNA methylation and gene expression in forearm skin fibroblasts from 62 unrelated individuals. They found that the association between gene expression and methylation is not always negative in the promoter region or positive in the gene body.

In this study, we examine the relationship between DNA methylation and gene expression in two large human datasets. One is from the MuTHER project, which includes 856 female European individuals from the TwinsUK Adult Twin Registry (Spector & Williams, 2006). The other is the Childhood Asthma study of 194 inner-city children (Yang et al., 2015), which we call the PBMC dataset for convenience throughout the paper. We determine the overall relationship between DNA methylation and gene expression for each gene and evaluate the predictive potential of DNA methylation for gene expression. We demonstrate that a penalized regression improves the overall prediction. However, the prediction power is still low for most genes. DNA methylation has substantial prediction power for only a small set of genes, especially the immune genes. Similar to PrediXcan’s prediction of gene expression using genotype variations (Gamazon et al., 2015), we wrapped our predictions based on DNA methylation into a package named “MethylXcan”.

Methods

1) Datasets

Adipose Dataset

This dataset is from the MuTHER study, consisting of 856 female European-descent individuals enrolled in the TwinsUK Adult Twin Registry. The quantile-normalized gene expression and DNA methylation data from subcutaneous fat were downloaded from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/). The gene expression data (accession number E-TABM-1140) were generated for 25160 genes using Illumina HumanHT-12 v3.0 on 825 individuals. The log2-transformed signals were quantile normalized for each tissue, followed by quantile normalization across the whole population (Grundberg et al., 2012). The DNA methylation data (accession number E-MTAB-1866) were generated using Illumina Infinium HumanMethylation450 from 649 female twins. The methylation beta values were already quantile normalized for each type of probe, ranging from 0 (unmethylated) to 1 (totally methylated) [6].

PBMC Dataset

This dataset was downloaded from Gene Expression Omnibus (GSE40736). It includes 194 inner-city children, with 97 cases of atopy and persistent asthma and 97 healthy controls. All the study subjects were 6 to 12 years old and from African American, Dominican-Hispanic and Haitian-Hispanic backgrounds [9]. DNA methylation data were generated using Illumina’s Infinium Human Methylation450k BeadChip. The normalized data matrix was downloaded from ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40576/matrix/ and the methylation M-values were converted to beta values ranging from 0 to 1. Gene expression data were generated for 23612 genes using Nimblegen Human Gene Expression arrays (12x135k). The normalized data matrix was downloaded from ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40732/matrix/. According to the publication, one outlier sample was removed after principal component analysis, and SWAN normalization was used for the methylation data. Log2 transformation and RMA normalization were used for the gene expression data. For each gene, the expression level was standardized across samples.
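The two preprocessing steps mentioned above (M-value to beta-value conversion and per-gene standardization) can be illustrated with a minimal sketch. It assumes the commonly used relationship M = log2(beta / (1 - beta)) without any offset term, and it assumes standardization means centering and scaling by the sample standard deviation; the paper does not state either formula, and the function names are hypothetical.

```python
from statistics import mean, stdev


def m_to_beta(m_value: float) -> float:
    """Convert a methylation M-value to a beta value in (0, 1).

    Assumes M = log2(beta / (1 - beta)), i.e. beta = 2^M / (2^M + 1).
    """
    return 2.0 ** m_value / (2.0 ** m_value + 1.0)


def standardize(values):
    """Center one gene's expression values and scale to unit sample SD."""
    mu = mean(values)
    sd = stdev(values)  # assumption: sample (n - 1) standard deviation
    return [(v - mu) / sd for v in values]


if __name__ == "__main__":
    print([round(m_to_beta(m), 3) for m in (-2.0, 0.0, 2.0)])
    print(standardize([1.0, 2.0, 3.0]))
```

An M-value of 0 corresponds to a half-methylated site (beta of 0.5), and large positive or negative M-values approach beta values of 1 or 0.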

2) Dataset cleaning and filtering

To assess the effect of DNA methylation in predicting gene expression, we defined the “methylation probes” as the 344,303 probes in Table S1 of the Grundberg et al. paper. The probes on the methylation array that are excluded from Table S1, which have potential SNP effects or cross-hybridization effects, are termed “S&C probes”. The combination of these two types of probes is termed “all probes”.

The Adipose dataset has 32,478 missing values in the DNA methylation data. Samples with missing values were excluded from regression analysis. Among the 485679 probes in the dataset, 344201 probes remained in the Adipose dataset after filtering. For the PBMC dataset, 344180 out of the 485461 probes remained after filtering. For method comparisons and consistency of the analysis, only genes with LASSO models were used: 8121 genes and 150349 CpG sites in the Adipose dataset, and 4251 genes and 73553 CpG sites in the PBMC dataset (Supplementary Table 1).

3) Modeling the relationship between gene expression and DNA methylation

CpG probes were mapped to genes using the UCSC RefGene annotation. Gene expression and DNA methylation data for each gene were extracted using an in-house Perl script. Since there were no missing values in the methylation data of the PBMC dataset, all samples were used in the regression analysis.

We used three types of regression, single linear regression, multiple regression, and least absolute shrinkage and selection operator (LASSO) regression (Tibshirani, 1996), to model the linear relationship between gene expression and DNA methylation. The squared correlation (R²) between predicted and observed data was used to compare the three types of regression. In the single linear regression, each CpG site was modeled separately to predict the gene expression level, and the CpG site that provided the maximum R² was used to represent each gene. In the multiple regression, all the CpG sites in each gene were used as predictors and the R² was calculated. In the LASSO regression, all CpG sites were likewise used to predict gene expression. We used the glmnet package in R to fit the LASSO model, in which the penalty parameter was obtained using 10-fold cross-validation to minimize the mean squared error, with the predictors and response variables all standardized.
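As an illustration of the per-gene models described above, the following is a minimal pure-Python sketch, not the authors' R/Perl implementation. It shows the single-CpG model (pick the probe with the largest squared correlation) and a LASSO fit by cyclic coordinate descent on standardized data for a fixed penalty `lam`, minimizing (1/2n)·RSS + lam·Σ|b_j|. The penalty is supplied by hand here rather than chosen by 10-fold cross-validation, and all function names and probe IDs are hypothetical.

```python
from statistics import mean, pstdev


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def best_single_cpg(cpg_columns, expression):
    """Single regression: return the probe with the largest R² and that R²."""
    scores = {probe: squared_correlation(col, expression)
              for probe, col in cpg_columns.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def standardize(values):
    """Center and scale to unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(cpg_columns, expression, lam, n_sweeps=200):
    """LASSO coefficients (standardized scale) by cyclic coordinate descent."""
    probes = list(cpg_columns)
    X = [standardize(cpg_columns[p]) for p in probes]
    residual = standardize(expression)  # residual of the all-zero model
    n = len(residual)
    beta = [0.0] * len(probes)
    for _ in range(n_sweeps):
        for j, xj in enumerate(X):
            # Covariance of predictor j with the partial residual
            rho = sum(x * (r + beta[j] * x) for x, r in zip(xj, residual)) / n
            updated = soft_threshold(rho, lam)
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, xj)]
                beta[j] = updated
    return dict(zip(probes, beta))


if __name__ == "__main__":
    cpgs = {"cg_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
            "cg_b": [0.5, 0.1, 0.4, 0.2, 0.6, 0.3, 0.7, 0.2]}
    expr = [2 * v + 1 for v in cpgs["cg_a"]]
    print(best_single_cpg(cpgs, expr))
    print(lasso_fit(cpgs, expr, lam=0.1))
```

Because the columns are standardized to unit variance, each coordinate update reduces to a single soft-thresholding step; a larger `lam` sets more coefficients exactly to zero, which is the variable-selection behavior the text relies on.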

4) Cross validation

In addition to calculating the R² from fitting the models (fitting R²), we also conducted five-fold cross validation to compare the prediction power of the three regression models using the validation R² (R2.cv). Specifically, the samples were randomly separated into a training set (4/5 of the data) and a testing set (1/5 of the data). The procedure was iterated 10 times and the mean R² of the 10 five-fold cross validations was used as our final cross validation R² for each model. For single regression, cross validation was conducted for each CpG site and the maximum R² was used for each gene. For LASSO regression analysis, we first obtained the optimal penalty parameter using ten-fold cross validation and then used another five-fold cross validation to evaluate the predictive performance of the model.
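The repeated five-fold procedure can be sketched as follows for the single-CpG model. This is an illustrative reading of the description above, not the authors' code: it assumes that out-of-fold predictions are pooled across the five folds before computing the squared correlation for each repeat (the paper does not say whether R² was pooled or averaged per fold), and the function names and the `seed` parameter are hypothetical.

```python
import random
from statistics import mean


def squared_correlation(x, y):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def fit_simple(x, y):
    """Ordinary least squares with one predictor: (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def repeated_kfold_r2(x, y, k=5, repeats=10, seed=0):
    """Mean cross-validated R² over `repeats` random k-fold splits."""
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # each sample in one fold
        predicted = [0.0] * n
        for test_idx in folds:
            held_out = set(test_idx)
            train = [i for i in order if i not in held_out]
            intercept, slope = fit_simple([x[i] for i in train],
                                          [y[i] for i in train])
            for i in test_idx:
                predicted[i] = intercept + slope * x[i]
        # Assumption: pool out-of-fold predictions, then correlate once
        scores.append(squared_correlation(predicted, y))
    return mean(scores)


if __name__ == "__main__":
    beta_values = [i / 20 for i in range(20)]
    expression = [3 * b + 2 for b in beta_values]
    print(repeated_kfold_r2(beta_values, expression))
```

For the LASSO model, the same outer loop would apply, with the penalty parameter selected by an inner cross-validation on each training set.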

Note that we also calculated the fitting R² in the LASSO cross validation models. We used the entire dataset as the testing set in the LASSO cross validation models in order to obtain the fitting R² in a fashion consistent with the multiple and single regression cross-validation models. In this case, all the R² values in the paper are squared correlations between the predicted and the true values in the training set.

5) Model comparisons on significant genes

We first identified genes that showed overall model prediction p-values less than 0.05 in LASSO regression and then compared the three regression models on these genes.

6) Gene Ontology (GO) and pathway enrichment analysis

For genes with R² greater than 0.1, we used the Database for Annotation, Visualization and Integrated Discovery (DAVID) at https://david.ncifcrf.gov/ (Huang, Sherman & Lempicki, 2008) to conduct GO term enrichment analysis based on a modified Fisher Exact Test. The background genes were set to be the genes on the expression array, HumanHT-12_V3_0_R2_11283641_A. The significantly overrepresented GO terms were selected based on the EASE Score, which is the geometric mean of p-values on logarithm scale for the member terms. We applied medium classification stringency on the DAVID website to our data. “GOTERM_BP_FAT” was used to obtain more information on biological processes in the Gene Ontology enrichment analysis. “KEGG_PATHWAY” was selected for pathway enrichment analysis in the same fashion. The most enriched GO terms and pathways with low p-values or FDR are shown in the results.
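To make the enrichment test concrete, here is a minimal sketch of a one-sided Fisher exact (hypergeometric) test and a conservative variant. It assumes that the "modified Fisher Exact Test" refers to DAVID's EASE adjustment, in which one gene is removed from the list hits before the test; that reading comes from our understanding of DAVID's documentation, not from this paper. The function names and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, list_total, pop_hits, pop_total):
    """P(X >= k) when list_total genes are drawn from pop_total genes,
    pop_hits of which are annotated with the term."""
    lower = max(k, 0)
    upper = min(list_total, pop_hits)
    denom = comb(pop_total, list_total)
    numer = sum(comb(pop_hits, i) * comb(pop_total - pop_hits, list_total - i)
                for i in range(lower, upper + 1))
    return numer / denom


def fisher_exact_one_sided(list_hits, list_total, pop_hits, pop_total):
    """Standard one-sided Fisher exact p-value for over-representation."""
    return hypergeom_upper_tail(list_hits, list_total, pop_hits, pop_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Assumed EASE variant: same test with one list hit removed."""
    return hypergeom_upper_tail(list_hits - 1, list_total, pop_hits, pop_total)


if __name__ == "__main__":
    # Hypothetical counts: 3 of 300 listed genes carry a term that
    # annotates 40 of 30000 background genes.
    print(fisher_exact_one_sided(3, 300, 40, 30000))
    print(ease_score(3, 300, 40, 30000))
```

Removing one hit makes terms supported by only a handful of genes much less likely to appear significant, which is the intent of the conservative adjustment.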

7) Gene expression prediction using different types of probes on the methylation microarray

The probes excluded from Table S1 of the Grundberg et al. paper were treated as probes with SNP and/or hybridization effects (S&C probes). We compared these probes, the methylation probes, and the combination of these two types of probes in predicting gene expression.

8) Software Package: MethylXcan

We created a software package, MethylXcan, for predicting gene expression using DNA methylation array data, as done in the analyses of this paper. This package includes all three regressions and calculates the squared correlation for each model. The program was written in R and Perl, and has been tested under Linux and Mac OS X systems. Users can apply this package to their own data after formatting their methylation data, expression profiling data, and annotation files as specified by the package. URL: https://github.com/dorothyzh/MethylXcan

Results

Most DNA methylation studies analyze one CpG at a time in explaining gene expression. In this study we set out to determine whether combining all CpG sites in a gene can better predict gene expression in a human population. We obtained two human datasets, the Adipose dataset generated from subcutaneous fat tissue and the Childhood Asthma dataset generated from PBMC. To evaluate the predictive power of DNA methylation for gene expression, we conducted three types of linear regression analysis for each gene: single regression, multiple regression, and LASSO regression. The squared correlations (R²) were used for model comparisons. To focus on the DNA methylation effect, we first left out CpG probes that overlap SNPs or cross-hybridize to multiple locations. In addition, since a LASSO model could not be established for some genes, we focus only on genes with LASSO models when comparing the regression methods.

Multiple regressions using all CpGs from a gene predict gene expression the best in model fitting.

For the Adipose dataset, single regression identified a large number of genes with significant CpG sites: about half (3509) have at least one CpG site significant at a significance level of 0.0001, and most (7423) have at least one CpG site significant at a significance level of 0.05. However, the prediction power represented by the maximum R² for these CpG sites is generally low. Only 20 genes have a maximum R² greater than 0.3 and 491 (3.0%) genes have a maximum R² larger than 0.1 (Table 1). Since multiple CpG sites from each gene were assayed on the methylation microarray, we applied multiple linear regression to use all CpG sites as predictors simultaneously. The R² explained by the regression model did improve substantially for the majority of genes compared with that of the single linear regression model (Figure 1A). The significant genes (red color) have relatively higher R² than the non-significant genes (Figure 1). For the PBMC data, a much smaller number of significant CpG sites were identified at the same significance levels for the 4251 genes analyzed (582 with p-value less than 0.0001 and 3112 with p-value less than 0.05 for single regression), potentially due to the smaller sample size. However, the numbers of genes with larger R² values are comparable with those from the Adipose dataset (Table 1). The improvement of R² from multiple regression over single regression in the PBMC dataset is also similar to that in the Adipose dataset (Figure 1A).

In both datasets, LASSO regression did not generate R² values as high as those from multiple regression (Figure 1B and D). For example, the number of genes with R² greater than 0.2 decreased from 472 to 343 for the Adipose dataset and from 1163 to 561 for the childhood asthma PBMC dataset (Table 2). With LASSO regression, only about 19.7% of the genes (1608 out of 8123) from the Adipose dataset and about 56% of the genes (2382 out of 4251) from the PBMC dataset have prediction R² greater than 0.1.

LASSO regression shows best prediction in cross-validation.

To better assess the accuracy of the predictive models, we performed 5-fold cross validation on single, multiple, and LASSO regressions to estimate the R². For comparison purposes, we present R² for genes for which LASSO models are available. The results showed that LASSO regression produced much larger R² values than single regression and less dramatic but discernible increases over multiple regression (Figure 2). These differences are also reflected in the number of genes with R² exceeding certain thresholds. For example, 893 genes (11%) from the Adipose dataset have R² greater than 0.1 from LASSO regression, while 791 genes (9.7%) and 125 genes (1.5%) have R² greater than 0.1 from multiple regression and single regression, respectively (Table 1). For genes with R² greater than 0.3, LASSO regression has 44 genes (0.54%) while multiple regression and single regression have 40 (0.49%) and 2 genes (0.03%), respectively. These results indicate that penalized regression predicts better than multiple or single regression on cross-validation test datasets. Cross validation tends to overcome bias and over-fitting issues. As expected, cross-validation R² values are generally lower than those from the model fittings, as reflected by the smaller number of genes with R² values greater than certain cut-offs (Table 1). In addition, the significant genes from multiple regression fitting have relatively higher cross-validation R² than non-significant genes (red dots in Figure 2), as expected. Similar results were obtained from the PBMC dataset (Table 1 and Figure 2C and D).

To make sure that the prediction R² values are larger than those expected from random chance, we compared the cumulative R² from the two real datasets with those from the null distribution of correlations based on the Fisher z-transformation in quantile-quantile plots (Figure 3). Both datasets show that the observed R² values are much larger than the expected R² values from random chance (Figure 3A and 3C). In addition, the Adipose dataset has a larger departure than the PBMC dataset, indicating that the LASSO model could capture a larger proportion of the transcriptome variability in the Adipose dataset. This is potentially due to the larger sample size or the regulatory nature of the different tissues.
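One way to obtain expected R² values under the null for such a quantile-quantile comparison is sketched below. It relies on the standard result that, when the true correlation is zero, the Fisher-transformed correlation z = atanh(r) is approximately normal with mean 0 and standard deviation 1/sqrt(n - 3). The exact construction used for Figure 3 is not described in the text, so the plotting positions and the function name are assumptions.

```python
from math import sqrt, tanh
from statistics import NormalDist


def expected_null_r2_quantiles(n_samples, n_genes):
    """Expected ordered R² values for n_genes genes when the true
    correlation is zero, using the Fisher z approximation."""
    sd = 1.0 / sqrt(n_samples - 3)       # SD of atanh(r) under the null
    std_normal = NormalDist()
    quantiles = []
    for i in range(1, n_genes + 1):
        p = (i - 0.5) / n_genes          # assumed plotting position
        # Quantile of |z|: P(|Z| <= q) = p  <=>  q = Phi^{-1}((1 + p) / 2)
        abs_z = std_normal.inv_cdf((1.0 + p) / 2.0) * sd
        quantiles.append(tanh(abs_z) ** 2)  # back-transform to r, then square
    return quantiles


if __name__ == "__main__":
    null_r2 = expected_null_r2_quantiles(n_samples=194, n_genes=5)
    print([round(q, 5) for q in null_r2])
```

Plotting the sorted observed R² values against these expected values gives the kind of quantile-quantile display described above; points far above the diagonal indicate genes predicted better than chance.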

Using all probes improves prediction power for gene expression

In order to evaluate the power of DNA methylation in predicting gene expression, we first left out a large proportion of probes potentially affected by genetic or cross-hybridization effects (Supplementary Table 1). However, using all probes on the array is preferred if the goal is to achieve better prediction accuracy for gene expression. To evaluate the prediction power from all probes, we included all available probes in LASSO regression and found that the overall prediction power did increase compared with the models using only the methylation probes (Figure 3B and 3D). We observed more genes with R² values exceeding the thresholds (Table 1). In addition, the largest R² value is much larger when all probes are used compared with that from only the methylation probes. For example, the largest R² is 0.92 from all probes compared with 0.54 from only methylation probes in the Adipose dataset (Figure 3B and Figure 3A). Likewise, the largest R² is 0.88 from all probes compared with 0.46 from methylation probes alone in the PBMC dataset (Figure 3C and 3D). Furthermore, LASSO models are available for more genes when all probes are used in both datasets (Supplementary Figure 2).

The increase in prediction power for gene expression from using all probes on the methylation microarray suggests that there is a contribution from the probes with potential SNP effects or cross-hybridization effects (S&C probes) that we initially excluded in estimating methylation prediction. To further examine the size and nature of their contribution, we separately estimated the prediction power of the methylation probes, the S&C probes, and their combination (all probes). The results showed that the S&C probes have prediction power independent of the methylation probes and that the combination of both has increased prediction power over the methylation probes alone (red points vs black line in the left two panels in Figure 4). The prediction power from the S&C probes was estimated for genes with enough SNP probes to form a LASSO model, and their prediction power is mostly above zero (blue points in Figure 4). The fact that the blue points are randomly distributed instead of following the black line suggests that the two sources of R² are not correlated; therefore, larger genetic effects and larger epigenetic effects do not seem to coexist in the same genes. Figure 5 shows some examples of genes with large prediction power from either methylation probes or S&C probes. As expected, the methylation probes show continuous methylation values while the S&C probes show categorical values due to the limited genotypes of the samples.

Methylation probes tend to better predict the expression of defense and immune genes

To examine the potential biological function of the genes whose expression is well predicted by DNA methylation, we conducted gene ontology (GO) and pathway enrichment analysis using DAVID. For the Adipose dataset, we examined the 1608 genes with R² greater than 0.1 from LASSO regression. At a false discovery rate (FDR) of 0.1, the significantly enriched biological process (GOTERM_BP_FAT) GO terms are responses to inflammation, defense activities, and wounding, as well as antigen-related terms (Table 2), which seems to be consistent with previous findings for subcutaneous fat cells (Berg and Scherer 2005). For the PBMC dataset, we conducted the same gene set enrichment analysis on the 2382 genes with R² greater than 0.1 and found that the significantly enriched terms are mostly related to defense and immune functions, lymphocyte activation, leukocyte activation, and immune response (Supplementary Table 2). These results seem reasonable for blood cells in an atopy and persistent asthma cohort. Similar GO terms and pathways were enriched in the results obtained from all probes (Supplementary Tables 3 and 4).

Discussion

We examined the relationship between gene expression and DNA methylation across the genome using data from two large human studies. We explored three linear regression models for predicting gene expression and found that shrinkage-based LASSO multiple regression provides the best prediction. However, even with LASSO regression, the methylation probes can predict expression in only a small proportion of genes, and with moderate prediction power. These genes are mostly immune and defense genes. Using all probes on the methylation array does improve prediction power to some degree.

We used squared correlation (R²) to assess predictive power in the regression models. The single linear regression is based only on the most predictive CpG in each gene, while the multiple regression model uses all CpG sites in each gene. However, the multiple regression model has a substantial over-fitting problem for genes with a large number of CpGs. The shrinkage-based LASSO regression model overcomes the over-fitting problem without losing predictability. LASSO imposes sparsity among the coefficients by constraining the sum of the absolute values of the regression coefficients, which forces certain coefficients to be zero. This property is beneficial for variable selection and improves model interpretability. LASSO is not the only shrinkage-based regression method. There are other penalized regression models, such as ridge regression (Hoerl & Kennard, 1970) and the elastic net (Zou & Hastie, 2005). However, ridge regression shrinks all coefficients without setting any of them to zero, which is less ideal for variable selection. The elastic net method is a hybrid of ridge and LASSO. There are also other advanced shrinkage methods, such as elastic net with rescaled coefficients and grouped lasso (Yuan & Lin, 2006; Meier, Van De Geer & Bühlmann, 2008), as well as other variants of LASSO (Zou et al., 2013). Further evaluation is needed of their merits in improving prediction of gene expression in this setting.
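The contrast between the two penalties is easiest to see in the idealized case of orthonormal predictors, where both estimators have closed forms in terms of the least-squares coefficient: ridge rescales it, while LASSO soft-thresholds it. The sketch below illustrates that textbook special case only; real CpG predictors are correlated, the exact constants depend on how the penalty is scaled, and the function names are hypothetical.

```python
def ridge_shrink(ols_coef: float, lam: float) -> float:
    """Ridge under an orthonormal design: proportional shrinkage."""
    return ols_coef / (1.0 + lam)


def lasso_shrink(ols_coef: float, lam: float) -> float:
    """LASSO under an orthonormal design: soft-thresholding."""
    magnitude = max(abs(ols_coef) - lam, 0.0)
    return magnitude if ols_coef >= 0 else -magnitude


if __name__ == "__main__":
    for b in (-1.0, -0.3, 0.05, 0.3, 1.0):
        print(b, round(ridge_shrink(b, 0.5), 4), round(lasso_shrink(b, 0.5), 4))
```

With a penalty of 0.5, a least-squares coefficient of 0.3 stays nonzero under ridge but becomes exactly zero under LASSO, which is why LASSO doubles as a variable-selection method.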

One potential reason for the low prediction power of DNA methylation for gene expression is the complexity of gene expression regulation. In addition to DNA methylation, transcription factors, histone modification (Verdin & Ott, 2015), and non-coding RNAs (Janowski et al., 2005; Ting et al., 2005; Ting, McGarvey & Baylin, 2006; Kaikkonen, Lam & Glass, 2011) all play critical roles in the regulation of gene transcription (Jones, 2015). Another potential reason for the low prediction power of methylation is that the landscape of DNA methylation differs dramatically across cell types, tissues (Lokk et al., 2014), ages (Teschendorff et al., 2010), and races (Song et al., 2015). The relationship between gene expression and DNA methylation could also vary substantially across these factors. The correlation between gene expression and DNA methylation from bulk studies at the population level encompasses all of this variability; therefore, it is not surprising to see low prediction power based on DNA methylation alone. The potential of DNA methylation alone as a surrogate for gene expression is therefore likely to be limited in general. The combination of DNA methylation and other predictors, such as genotype, may be more powerful for this purpose.

Although the overall prediction power of DNA methylation on gene expression was low in our analyses of both datasets, much better prediction power was found for immune and defense genes. This result is consistent with the extensive evidence linking aberrant DNA methylation patterns with genes involved in immune deficiencies, autoimmune disorders, and inflammatory disease (Bayarsaihan, 2011). For some key immune genes, such as mb1 (CD79a) (Gao et al., 2009) and MHC II (HLA-DR, DQ, DP) (Seguín-Estévez et al., 2009; Suárez-Álvarez et al., 2010; Majumder & Boss, 2011), DNA methylation has been shown to be strongly associated with their key immune functions. At the cellular level, hypomethylation in T cell promoter regions plays an important role in lupus (Deng et al., 1998; Gorelik & Richardson, 2010). In addition, DNA methylation remodeling is critical in the differentiation of naïve immune cells into antigen-specific effector cells in response to a pathogen or virus (Scharer et al., 2013). By altering their DNA methylation patterns, cells can adapt to external stimuli via changes in gene expression. Given this evidence and the results of our study, DNA methylation could be more useful in predicting gene activity in the immune and defense systems.

Conclusions

We explored three regression methods for predicting gene expression from DNA methylation: single linear regression, multiple linear regression, and LASSO penalized regression. LASSO regression reduced over-fitting and improved the prediction power. Both datasets we analyzed showed low prediction power overall. The more predictable genes (those with squared correlations larger than 0.1 in the LASSO model) are mostly involved in immune, inflammatory, and wound responses. Overall, we recommend caution in using an individual's methylation profile to predict that individual's transcriptome.
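The selection criterion mentioned above can be illustrated with a minimal sketch. It assumes that the prediction R2 of a gene is the squared Pearson correlation between observed and predicted expression, following the definition in the List of Abbreviations. The function names, gene labels, and toy numbers are hypothetical and are not taken from the study data.

```python
def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov * cov / (var_o * var_p)


def predictable_genes(r2_by_gene, threshold=0.1):
    """Return the sorted names of genes whose prediction R2 exceeds the threshold."""
    return sorted(gene for gene, r2 in r2_by_gene.items() if r2 > threshold)


# Toy expression values for four individuals and two hypothetical genes.
observed = [1.0, 2.0, 3.0, 4.0]
r2_by_gene = {
    "gene_a": squared_correlation(observed, [1.1, 1.9, 3.2, 3.8]),
    "gene_b": squared_correlation(observed, [2.0, 1.0, 1.0, 2.0]),
}
selected = predictable_genes(r2_by_gene)
print(r2_by_gene, selected)
```

In practice the predicted values would come from held-out folds of a cross-validation, so that the R2 reflects out-of-sample prediction and not the fit to the training data.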

List of Abbreviations

SNP: single nucleotide polymorphisms
LASSO: least absolute shrinkage and selection operator
R2: squared correlation
S&C: the probes have potential SNP effects or cross hybridization effects
MuTHER: Multiple Tissue Human Expression Resource Project
PBMC: peripheral blood mononuclear cell
MDS: myelodysplastic syndrome
GO: gene ontology

Availability of data and material

For the Adipose dataset, both gene expression and DNA methylation data can be found at http://www.ebi.ac.uk/arrayexpress/
For the PBMC dataset, gene expression data are at ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40732/matrix/, while DNA methylation data are at ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40576/matrix/.
The software package ‘MethylXcan’ is available at https://github.com/dorothyzh/MethylXcan.

References

Bayarsaihan D. 2011. Epigenetic mechanisms in inflammation. Journal of Dental Research 90:9–17.
Deng C., Yang J., Scott J., Hanash S., Richardson BC. 1998. Role of the ras-MAPK signaling pathway in the DNA methyltransferase response to DNA hypomethylation. Biological Chemistry 379:1113–1120.
Gamazon ER., Wheeler HE., Shah KP., Mozaffari S V., Aquino-Michaels K., Carroll RJ., Eyler AE., Denny JC., Nicolae DL., Cox NJ. 2015. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 47:1091–1098.
Gao H., Lukin K., Ramírez J., Fields S., Lopez D., Hagman J. 2009. Opposing effects of SWI/SNF and Mi-2/NuRD chromatin remodeling complexes on epigenetic reprogramming by EBF and Pax5. Proceedings of the National Academy of Sciences 106:11258–11263.
Gorelik G., Richardson B. 2010. Key role of ERK pathway signaling in lupus. Autoimmunity 43:17–22.
Grundberg E., Meduri E., Sandling JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W., Nisbet J., Sekowska M. 2013. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. The American Journal of Human Genetics 93:876–890.
Grundberg E., Small KS., Hedman ÅK., Nica AC., Buil A., Keildson S., Bell JT., Yang T-P., Meduri E., Barrett A. 2012. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44:1084–1089.
Hoerl AE., Kennard RW. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12:55–67.
Huang DW., Sherman BT., Lempicki RA. 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 4:44.
Janowski BA., Huffman KE., Schwartz JC., Ram R., Hardy D., Shames DS., Minna JD., Corey DR. 2005. Inhibiting gene expression at transcription start sites in chromosomal DNA with antigene RNAs. Nat Chem Biol 1:216–222.
Jones B. 2015. Gene expression: Layers of gene regulation. Nat Rev Genet 16:128–129. DOI: 10.1038/nrg3918.
Kaikkonen MU., Lam MTY., Glass CK. 2011. Non-coding RNAs as regulators of gene expression and epigenetics. Cardiovascular Research 90:430–440.
Krueger F., Kreck B., Franke A., Andrews SR. 2012. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 9:145–151.
Lokk K., Modhukur V., Rajashekar B., Märtens K., Mägi R., Kolde R., Koltšina M., Nilsson TK., Vilo J., Salumets A. 2014. DNA methylome profiling of human tissues identifies global and tissue-specific methylation patterns. Genome Biol 15:3248.

Majumder P., Boss JM. 2011. DNA methylation dysregulates and silences the HLA-DQ locus by altering chromatin architecture. Genes and Immunity 12:291–299.
Meier L., Van De Geer S., Bühlmann P. 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70:53–71.
Razin A., Riggs AD. 1980. DNA methylation and gene function. Science 210:604–610. DOI: 10.1126/science.6254144.
Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME., Diez-Campelo M., Armstrong RN., Sharpe DJ., Gutierrez NC. 2013. Genome-wide profiling of methylation identifies novel targets with aberrant hypermethylation and reduced expression in low-risk myelodysplastic syndromes. Leukemia 27:610–618.
Scharer CD., Barwick BG., Youngblood BA., Ahmed R., Boss JM. 2013. Global DNA methylation remodeling accompanies CD8 T cell effector function. The Journal of Immunology 191:3419–3429.
Seguín-Estévez Q., De Palma R., Krawczyk M., Leimgruber E., Villard J., Picard C., Tagliamacco A., Abbate G., Gorski J., Nocera A. 2009. The transcription factor RFX protects MHC class II genes against epigenetic silencing by DNA methylation. The Journal of Immunology 183:2545–2553.
Song M-A., Brasky TM., Marian C., Weng D., Taslim C., Dumitrescu RG., Llanos AA., Freudenheim JL., Shields PG. 2015. Racial differences in genome-wide methylation profiling and gene expression in breast tissues from healthy women. Epigenetics:0.
Spector TD., Williams FMK. 2006. The UK adult twin registry (TwinsUK). Twin Research and Human Genetics 9:899–906.
Suárez-Álvarez B., Rodriguez RM., Calvanese V., Blanco-Gelaz MA., Suhr ST., Ortega F., Otero J., Cibelli JB., Moore H., Fraga MF. 2010. Epigenetic mechanisms regulate MHC and antigen processing molecules in human embryonic and induced pluripotent stem cells. PLoS One 5:e10192.
Teschendorff AE., Menon U., Gentry-Maharaj A., Ramus SJ., Weisenberger DJ., Shen H., Campan M., Noushmehr H., Bell CG., Maxwell AP. 2010. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res 20:440–446.
Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological):267–288.
Ting AH., McGarvey KM., Baylin SB. 2006. The cancer epigenome—components and functional correlates. Genes Dev 20:3215–3231.
Ting AH., Schuebel KE., Herman JG., Baylin SB. 2005. Short double-stranded RNA induces transcriptional gene silencing in human cancer cells in the absence of DNA methylation. Nat Genet 37:906–910.
Verdin E., Ott M. 2015. 50 years of protein acetylation: from gene regulation to epigenetics, metabolism and beyond. Nature Reviews Molecular Cell Biology 16:258–264.
Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M. 2014. The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts. Genome Biol 15:R37.

Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M., Grundberg E., Meduri E., Sandling JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W., Nisbet J., Sekowska M., Hoerl AE., Kennard RW., Huang DW., Sherman BT., Tan Q., Kir J., Liu D., Bryant D., Guo Y., Stephens R., Baseler MW., Lane HC., Ting AH., McGarvey KM., Baylin SB., Razin A., Riggs AD., Zou H., Hastie T., Tibshirani R., Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME., Diez-Campelo M., Armstrong RN., Sharpe DJ., Gutierrez NC., Bayarsaihan D., Deng C., Yang J., Scott J., Hanash S., Richardson BC., Gamazon ER., Wheeler HE., Shah KP., Mozaffari S V., Aquino-Michaels K., Carroll RJ., Eyler AE., Denny JC., Nicolae DL., Cox NJ., Gao H., Lukin K., Ramírez J., Fields S., Lopez D., Hagman J., Gorelik G., Richardson BC., Grundberg E., Small KS., Hedman ÅK., Nica AC., Buil A., Keildson S., Bell JT., Yang T-P., Meduri E., Barrett A., Janowski BA., Huffman KE., Schwartz JC., Ram R., Hardy D., Shames DS., Minna JD., Corey DR., Jones B., Kaikkonen MU., Lam MTY., Glass CK., Krueger F., Kreck B., Franke A., Andrews SR., Lokk K., Modhukur V., Rajashekar B., Märtens K., Mägi R., Kolde R., Koltšina M., Nilsson TK., Vilo J., Salumets A., Majumder P., Boss JM., Meier L., Van De Geer S., Bühlmann P., Robertson KD., Scharer CD., Barwick BG., Youngblood BA., Ahmed R., Boss JM., Seguín-Estévez Q., De Palma R., Krawczyk M., Leimgruber E., Villard J., Picard C., Tagliamacco A., Abbate G., Gorski J., Nocera A., Song M-A., Brasky TM., Marian C., Weng D., Taslim C., Dumitrescu RG., Llanos AA., Freudenheim JL., Shields PG., Spector TD., Williams FMK., Suárez-Álvarez B., Rodriguez RM., Calvanese V., Blanco-Gelaz MA., Suhr ST., Ortega F., Otero J., Cibelli JB., Moore H., Fraga MF., Teschendorff AE., Menon U., Gentry-Maharaj A., Ramus SJ., Weisenberger DJ., Shen H., Campan M., Noushmehr H., Bell CG., Maxwell AP., Tibshirani R., Ting AH., Schuebel KE., Herman JG., Baylin SB., Verdin E., Ott M., Yang I V., Pedersen BS., Liu A., O’Connor GT., Teach SJ., Kattan M., Misiak RT., Gruchalla R., Steinbach SF., Szefler SJ., You JS., Jones PA., Yuan M., Lin Y. 2015. DNA methylation and human disease. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 12:906–910. DOI: 10.1038/nrg3918.

Yang I V., Pedersen BS., Liu A., O’Connor GT., Teach SJ., Kattan M., Misiak RT., Gruchalla R., Steinbach SF., Szefler SJ. 2015. DNA methylation and childhood asthma in the inner city. Journal of Allergy and Clinical Immunology 136:69–80.
You JS., Jones PA. 2012. Cancer genetics and epigenetics: two sides of the same coin? Cancer Cell 22:9–20.
Yuan M., Lin Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68:49–67.
Zou H., Hastie T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67:301–320.
Zou H., Hastie T., Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M., Ting AH., McGarvey KM., Baylin SB., Tibshirani R., Razin A., Riggs AD., Hoerl AE., Kennard RW., Grundberg E., Meduri E., Sandling JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W., Nisbet J., Sekowska M., Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME., Diez-Campelo M., Armstrong RN., Sharpe DJ., Gutierrez NC. 2013. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 12:R37. DOI: 10.1126/science.6254144.


Figure 1. R2 comparison among three regression models. Multiple regression is compared to single (left two panels) and LASSO (right two panels) regressions in two datasets, Adipose and PBMC. R2 values shown here are on the cubic root scale for visualization clarity. “single”, single linear regression with the most significant CpG site as predictor; “multiple”, multiple regression with all methylation CpG sites in a gene as predictors. Red points represent significant genes from multiple regressions at a significance level of 0.05. The blue solid line is the identity line and the dashed lines represent R2 of 0.1.

Figure 2. R2 comparison among regression models in cross validation. LASSO regression is compared to single (left two panels) and multiple (right two panels) regressions for two datasets, Adipose and PBMC. Five-fold validation was used for all regression models. The red points represent the significant genes from multiple regressions (p [...]

Dataset   Method   SNP-probe Inclusion   >=0.1   >=0.1   >=0.2   >=0.2   >=0.3   >=0.3
Adipose   Single   NO                      491     125      85      18      20       2
Adipose   Single   YES                     591     137     114      25      34       7
Adipose   Multi    NO                     2211     791     472     177     114      40
Adipose   Multi    YES                    3105     966     765     232     217      67
Adipose   LASSO    NO                     1608     893     343     197     110      44
Adipose   LASSO    YES                    2189    1094     538     272     195      75
PBMC      Single   NO                      851     381     108      33      14       4
PBMC      Single   YES                    1330     666     212      90      58      34
PBMC      Multi    NO                     3358     746    1163     109     382      30
PBMC      Multi    YES                    4455     994    1902     197     694      64
PBMC      LASSO    NO                     2382    1022     561     165     142      30
PBMC      LASSO    YES                    3207    1465     870     289     235      66

Table 2. Gene Ontology and KEGG pathway enrichment analysis for more predictive genes in the Adipose dataset.

Genes with R2 greater than 0.1 (1680) were selected for conducting the GO and Pathway enrichment analyses. Enriched and representative terms with FDR less than 0.01 are included in the table.

Term                                                                                                      Fold Enrich   P-value    FDR
GO:0006952~defense response                                                                               2.28          1.50E-08   2.70E-05
GO:0006954~inflammatory response                                                                          2.69          2.72E-07   4.80E-04
GO:0009611~response to wounding                                                                           2.22          5.79E-07   0.001
GO:0048002~antigen processing and presentation of peptide antigen                                         8.93          8.43E-07   0.0015
GO:0002504~antigen processing and presentation of peptide or polysaccharide antigen via MHC class II      7.58          3.94E-06   0.007
KEGG_PATHWAY:hsa05330~Allograft rejection                                                                 5.91          8.58E-06   0.01
KEGG_PATHWAY:hsa05322~Systemic lupus erythematosus                                                        3.52          9.17E-06   0.011
KEGG_PATHWAY:hsa04612~Antigen processing and presentation                                                 3.73          1.65E-05   0.02
KEGG_PATHWAY:hsa05332~Graft-versus-host disease                                                           5.45          1.87E-05   0.02
KEGG_PATHWAY:hsa04940~Type I diabetes mellitus                                                            5.06          3.77E-05   0.046
