Predicting gene expression using DNA methylation in two human populations Huan Zhong 1 2 3
1
, Soyeon Kim
2
, Degui Zhi
2
, Xiangqin Cui Corresp.
3
Department of Biology, Hong Kong Baptist University, Hong Kong, China School of Biomendical Informatics, University of Texas Health Center at Houston, Houston, Texas, United States Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States
Corresponding Author: Xiangqin Cui Email address:
[email protected]
Background. DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially the negative regulation in the promoter region. However, its correlation with gene expression at population level has not been well studied. In particular, it is unclear if genome-wide DNA methylation profile of an individual can predict her/his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts with DNA samples but not RNA samples. Results. We studied two human population datasets, Multiple Tissue Human Expression Resource Projects (MuTHER)’s Adipose tissue as well as asthma and normal peoples’ peripheral blood mononuclear cell (PBMC), for predicting gene expression using methylation of all CpG sites from the gene region. Three prediction models were investigated; single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression has superior performance among these methods. However, even with LASSO regression, very small prediction R2 was obtained for the majority of genes and only about one thousand genes had prediction R2 greater than 0.1. GO term and pathway analyses of these more predictable genes showed that they are enriched for immune and defense genes. Conclusion. In human populations, DNA methylation of CpG sites at gene region have weak prediction power for gene expression. The relatively more predictable genes tend to be defense and immune genes.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
1
Predicting Gene Expression Using DNA methylation in Two
2
Human Populations
3 4
Huan Zhong1, Soyeon Kim2, Degui Zhi2, Xiangqin Cui3
5 6
1
Department of Biology, Hong Kong Baptist University, HONG KONG
7
2
School of Biomedical Informatics, University of Texas Health Science Center at Houston,
8
Houston, TX, USA
9
3
Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, USA
10 11
Corresponding author:
Xiangqin Cui
12
Email:
[email protected]
13
14
Keywords
15 16
DNA methylation, Methylation Microarray, transcriptome, LASSO
17
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
18
Abstract
19
Background
20
DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene
21
expression, especially the negative regulation in the promoter region. However, its correlation
22
with gene expression at population level has not been well studied. In particular, it is unclear if
23
genome-wide DNA methylation profile of an individual can predict her/his gene expression
24
profile. Previous studies were mostly limited to association analyses between single CpG site
25
methylation and gene expression. It is not known whether DNA methylation of a gene has
26
enough prediction power to serve as a surrogate for gene expression in existing human study
27
cohorts with DNA samples but not RNA samples.
28
Results
29
We studied two human population datasets, Multiple Tissue Human Expression Resource
30
Projects (MuTHER)’s Adipose tissue and Asthma and normal peoples’ peripheral blood
31
mononuclear cell (PBMC), for predicting gene expression using methylation of all CpG sites
32
from the gene region. Three prediction models were investigated; single linear regression,
33
multiple linear regression, and least absolute shrinkage and selection operator (LASSO)
34
penalized regression. Our results showed that LASSO regression has superior performance
35
among these methods. However, even with LASSO regression, very small prediction R2 was
36
obtained for the majority of genes and only about one thousand genes had prediction R2 greater
37
than 0.1. GO term and pathway analyses of these more predictable genes showed that they are
38
enriched for immune and defense genes.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
39
Conclusion
40
In human populations, DNA methylation of CpG sites at gene region have weak prediction
41
power for gene expression. The relatively more predictable genes tend to be defense and immune
42
genes.
43 44 45
46
Background
47 48
DNA methylation has long been recognized as an important epigenetic modification in
49
regulating gene expression (Razin & Riggs, 1980). This process often occurs at CG dinucleotides
50
sites (CpG site), adding a methyl group to the cytosine residue(You & Jones, 2012; Wagner et
51
al., 2015). In mammals, more than 70% of CpG sites are methylated.
52 53
The regulatory role of DNA methylation has been mostly studied with a small number of CpG
54
sites in a limited number of genes. The more recent application of microarrays and next
55
generation sequencing enables Large-scale analysis of DNA methylation and gene expression
56
across the whole genome(Krueger et al., 2012). However, most genome-wide methylation and
57
expression studies have small sample sizes for comparing controlled groups. Only a few studies
58
profiled both genome-wide DNA methylation and gene expression in larger human populations
59
and examined their relationship. Del Rey et al., (Del Rey et al., 2013)studied the genome-wide
60
DNA methylation and gene expression in 83 low-risk subtypes of Myelodysplastic syndrome
61
(MDS) patients and 36 controls using microarrays. While they found negative correlations
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
62
between methylation and gene expression in a large proportion of differentially expressed and
63
differentially methylated genes, they also found substantial positive correlations in them. In
64
another study of 648 twins, overall negative correlations were found in the adipose tissue,
65
promoter region (-0.018), gene body (-0.013) and 3-prime UTR (-0.007)(Grundberg et al., 2013).
66
More recently, Wagner et al.(Wagner et al., 2014) profiled the genome wide DNA methylation
67
and gene expression in forearm skin fibroblast among 62 unrelated individuals. They found that
68
the association between gene expression and methylation is not always negative in promoter
69
region or positive in gene body.
70
In this study, we examine the DNA methylation and gene expression relationship in two large
71
human datasets. One is from the MuTHER project that includes 856 female European individuals
72
from TwinsUK Adult Twin Registry(Spector & Williams, 2006). Another dataset is the
73
Childhood Asthema study that has 194 inner city children(Yang et al., 2015), which we call
74
PBMC dataset for convenience throughout the paper. We determine the overall relationship
75
between DNA methylation and gene expression for each gene and evaluate the predictive
76
potential of DNA methylation for gene expression. We demonstrate that a penalized regression
77
improves the overall prediction. However, the prediction power is still low for most genes. For
78
only a small set of genes, DNA methylation has substantial prediction power, especially the
79
immune genes. Similar to PrediXcan’s prediction of gene expression using genotype
80
variations(Gamazon et al., 2015), we wrapped our predictions based on DNA methylation into a
81
package named “MethylXcan”.
82
Methods
83
1) Datasets:
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
84
Adipose Dataset
85
This dataset is from the MuTHER study, consisting of 856 female European-descent individuals
86
enrolled in the TwinsUK Adult Twin Registry. The quartile normalized gene expression and
87
DNA methylation data from subcutaneous fat were downloaded from ArrayExpress
88
(http://www.ebi.ac.uk/arrayexpress/).
89
1140) were generated for 25160 genes using Illumina HumanHT-12 v3.0 on 825 individuals.
90
The log2-transformed signals were quartile normalized for each tissue followed by quartile
91
normalized across the whole population(Grundberg et al., 2012). The DNA methylation data
92
(accession number E-MTAB-1866) were generated using Illumina Infinium
93
HumanMethylation450 from 649 female twins. The methylation beta values were already
94
quantile normalized for each type of probe, ranging from 0 (unmethylated) to 1 (total-
95
methylated) [6].
96
PBMC Dataset
97
This dataset was downloaded from Gene Expression Ominbus (GSE40736). It includes 194
98
inner-city children with 97 cases of atopy and persistent asthma and 97 healthy controls. All the
99
study subjects were 6 to 12 years old from African American, Dominican-Hispanic and Haitian-
The gene expression data (accession number E-TABM-
100
Hispanic background [9]. DNA methylation data were generated using Illumina’s Infinium
101
Human Methylation450k BeadChip. The normalized data matrix was downloaded from
102
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40576/matrix/ and the methylation M-
103
values were converted to beta values ranging from 0 to 1. Gene expression data were generated
104
for 23612 genes using Nimblegen Human Gene Expression arrays (12x135k). The normalized
105
data matrix was downloaded from
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
106
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40732/matrix/. According to the
107
publication, one outlier sample has been removed after principle component analysis, SWAN
108
normalization was used for methylation data. Log2 transformation and RMA normalization were
109
used for gene expression data. For each gene, expression level was standardized across samples.
110
2) Dataset cleaning and filtering
111
To assess the DNA methylation effect in prediction gene expression, we defined the
112
“methylation probes” as the 344,303 probes in Table S1 of the Grundberg’s paper. The probes on
113
the methylation array but excluded from the Table S1, which have potential SNP effects or cross
114
hybridization effects, are termed “S&C probes”. The combinations of these two types of probes
115
are termed “all probes”.
116
The Adipose dataset has 32,478 missing values in the DNA methylation data. Samples with
117
missing values were excluded from regression analysis. Among the 485679 probes in the dataset,
118
344201 probes remained in Adipose dataset after filtering. For the PBMC dataset, 344180 out of
119
the 485461 probes in the dataset remained after filtering. For method comparisons and the
120
consistency of the analysis, only genes that have the LASSO models were used, 8121 genes and
121
150349 CpG sites in Adipose dataset and 4251 genes and 73553 CpG sites in PBMC datasets.
122
(Supplementary Table 1).
123 124
3) Modeling the relationship between gene expression and DNA methylation.
125
CpG probes were mapped to genes using UCSC RefGene annotation. Gene expression and DNA
126
methylation data for each gene were extracted using in-house perl script. Since there was no
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
127
missing value for methylation of the PBMC dataset, all samples were used in the regression
128
analysis.
129 130
We used three types of regressions, single linear regression, multiple regression, and least
131
absolute shrinkage and selection operator (LASSO) regression(Tibshirani, 1996), to model the
132
linear relationship between gene expression and DNA methylation. Squared correlation (R2)
133
between predicted and observed data was used to compare the three types of regressions. In the
134
single linear regression, each CpG site was modeled separately to predict gene expression level.
135
The CpG site that provides maximum R2 was used to represent each gene. In multiple
136
regressions, all the CpG sites in each gene were used as predictors and the R2 was calculated. In
137
the LASSO regression, all CpG sites were used to predict the gene expression. We used the
138
GLMNET package in R to fit the LASSO model in which penalized parameters were obtained
139
using 10-fold cross-validation to minimize the mean squared error, while the predictors and
140
response variables were all standardized.
141 142
4) Cross validation
143
In addition to calculating the R2 from fitting the models (fitting R2), we also conducted five-fold
144
cross validation to compare the prediction power of the three regression models using the
145
validation R2 (R2.cv). Specifically, the samples were randomly separated into training set (4/5 of
146
data) and testing set (1/5 of data). The procedure was iterated 10 times and the mean R2 of the 10
147
five-fold cross validations was used as our final cross validation R2 for each model. For single
148
regression, cross validation was conducted for each CpG site and the maximum R2 was used for
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
149
each gene. For LASSO regression analysis, we first obtained the optimal penalty parameter using
150
ten-fold cross validation and then used another five-fold cross validation to evaluate the
151
predictive performance of the model.
152
Note that we calculated fitting R2 in the LASSO cross validation models. We used the entire
153
datasets as testing in the LASSO cross validation models in order to obtain the fitting R2 in a
154
fashion consistent with the multiple and single cross-validation models. In this case, all the R2
155
values in the paper are squared correlation of the predicted and the true values in the training set.
156
5) Model comparisons on significant genes
157
We first identified genes that showed overall model prediction p values less than 0.05 in Lasso
158
regressions and then compared the three regression models on these genes.
159
6) Gene Ontology (GO) and pathway enrichment analysis.
160
For genes with R2 greater than 0.1, we use The Database for Annotation, Visualization and
161
Integrated Discovery (DAVID ) at https://david.ncifcrf.gov/ (Huang, Sherman & Lempicki,
162
2008) to conduct GO term enrichment analysis based on modified Fisher Exact Test. The
163
background genes were set to be the genes on the expression array, HumanHT-12_V3_0_
164
R2_11283641_A. The significantly overrepresented GO terms were selected based on the EASE
165
Score, which is the geometric mean of p-values on logarithm scale for the member terms. We
166
applied medium classification stringency in the DAVID website to our data.
167
“GOTERM_BP_FAT” was used to obtain more information in biological processes of the Gene
168
Ontology enrichment analysis. “KEGG_PATHWAY” was selected for pathway enrichment
169
analysis in the same fashion. The most enriched GO terms and pathways with low p-value or
170
FDR were shown in the results.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
171
7) Gene expression prediction using different type of probes on the methylation microarray
172
The probes excluded by Table S1 of the Grundberg’s paper were treated as probes with SNP
173
and/or hybridization effects (S&C probes). We compared these probes, the methylation probes,
174
and the combination of these two types of probes in predicting gene expression.
175
8) Software Package: MethylXcan
176
We created a software package, MethylXcan, for predicting gene expression using DNA
177
methylation array data in the analyses of this paper. This package includes all three regressions
178
and calculates the squared correlation for each model. The program was written in R and Perl,
179
and has been tested under linux or MACSOX system. Users can use this package on their own
180
data after formatting their methylation data, expression profiling data, and annotation files as
181
specified by the package. URL: https://github.com/dorothyzh/MethylXcan
182
183
Results
184 185
Most DNA methylation studies analyze one CpG at a time in explaining gene expression. In this
186
study we set out to find whether combining all CpG sites in a gene can better predict the gene
187
expression in a human population. We obtained two human datasets, the Adipose dataset
188
generated from subcutaneous fat tissue and the Childhood Asthma dataset generated from
189
PBMC. To evaluate the predicting power of DNA methylation on gene expression, we conducted
190
three types of linear regression analyses, single regression, multiple regression, and LASSO
191
regression for each gene. The squared correlations (R2) were used for model comparisons. To
192
focus on DNA methylation effect, we first left out CpG probes that overlap SNPs or cross-
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
193
hybridize to multiple locations. In addition, since some genes fail to establish a LASSO model,
194
we only focus on genes with LASSO models for comparing different regression methods.
195 196
Multiple regressions using all CpGs from a gene predict gene expression the best in model
197
fitting.
198 199
For the Adipose dataset, the single regression identified a large number of genes with significant
200
CpG sites, about half (3509) with at least one CpG site significant at significance level of 0.0001
201
and most of them (7423) have at least one CpG site significant at significance level of 0.05.
202
However, the prediction power represented by the maximum R2 for these CpG sites is generally
203
low. Only 20 genes have maximum R2 greater than 0.3 and 491 (3.0%) genes have maximum R2
204
larger than 0.1 (Table 1). Since multiple CpG sites from each gene were assayed on the
205
methylation microarray, we applied multiple linear regression to utilize all CpG sites as
206
predictors simultaneously. The R2 explained by the regression model did improve substantially
207
for the majority of genes compared with that of the single linear regression model (Figure 1A).
208
The significant genes (red color) have relatively higher R2 compared with the non-significant
209
genes (Figure 1). For the PBMC data, a much smaller number of significant CpG sites were
210
identifies at the same significance level for the 4251 genes analyzed (582 with p value less than
211
0.0001 and 3112 with p value less than at 0.05 for single regression) potentially due to the
212
smaller sample size. However, the numbers of genes with larger R2 values are comparable with
213
those from the Adipose dataset (Table 1). The improvement of R2 from the multiple regressions
214
over single regress in the PBMC dataset is also similar to that in the Adipose dataset (Figure
215
1A).
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
216 217
Compared with multiple regressions, LASSO regression did not generate R2 as high as that from
218
multiple regressions in both datasets (Figure 1B and D). For example, the number of genes with
219
R2 greater than 0.2 decreased from 472 to 343 for the Adipose dataset and from 1163 to 561 for
220
the childhood asthma PBMC datasets (Table 2). With LASSO regression, only about 19.7% of
221
the genes (1608 out of 8123) from the Adipose dataset and about 56% of the genes (2382 out of
222
4251) from the PBMC dataset have prediction R2 greater than 0.1.
223 224
LASSO regression shows best prediction in cross-validation.
225 226
To better assess the accuracy of the predictive models, we performed 5-fold cross validation on
227
single, multiple regressions, and LASSO regressions to estimate the R2. For comparison purpose,
228
we present R2 with genes for which the LASSO models are available. The results showed that the
229
LASSO regression produced much larger R2 values than the single regression and less dramatic
230
but discernable increases over multiple regressions (Figure 2). These differences are also
231
reflected in the number of genes with R2 exceeding certain thresholds. For example, 893 genes
232
(11%) from the Adipose dataset have R2 greater than 0.1 from LASSO regression, while 791
233
genes (9.7%) and 125 genes (1.5%) have R2 greater than 0.1 from multiple regression and single
234
regression, respectively (Table1). For genes with R2 greater than 0.3, LASSO regression has 44
235
genes (0.54%) while multiple regression and single regression have 40 (0.49%) and 2 genes
236
(0.03%), respectively. These results indicate that penalized regression has better prediction than
237
multiple or single regressions for cross-validation test datasets. Cross validation tends to
238
overcome bias and over-fitting issues. As expected, cross-validation R2 values are generally
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
239
lower than those from the model fittings reflected by the smaller number of genes with R2 values
240
greater than certain cut-offs (Table 1).
241
regression fitting have relatively higher cross-validation R2 than non-significant genes (red dots
242
in Figure 2) as expected. Similar results were obtained from the PBMC dataset (Table 1 and
243
Figure 2 C and D).
In addition, the significant genes from multiple
244 245
To make sure that the prediction R2 is larger than thoses from random chance, we compared the
246
cumulative R2 from the two real datasets with those from the the null distribution of correlations
247
based on Fisher z-transfromation in quantile-quantile plots (Figure 3). Both datasets show that
248
the observed R2 values are much larger than the expected R2 values from random chance (Figure
249
3A and 3C). In addition, the Adipose dataset has a larger departure than the PBMC dataset,
250
indicating that the LASSO model could capture more proportion of the transcriptome variability
251
in the Adipose dataset. This is potentially due to the larger sample size or the regulative nature of
252
the different tissues.
253 254
Using all probes improves prediction power for gene expression
255 256
In order to evaluate DNA methylation power in predicting gene expression, we first left out a
257
large proportion of probes potentially affected by genetic or cross-hybridization effects
258
(Supplementary Table 1). However, using all probes on the array is preferred if our goal is to
259
achieve better prediction accuracy of gene expression. To evaluate the prediction power from all
260
probes, we included all available probes in LASSO regression and found that the overall
261
prediction power did increase compared to the models using only the methylation probes (Figure
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
262
3B and 3D). We observed more genes with R2 values exceeding the thresholds (Table 1). In
263
addition, the largest R2 value is much larger when all probes are used compared that from only
264
the methylation probes. For example, the largest R2 is 0.92 from all probes compared to 0.54
265
from only methylation probes in the Adipose dataset (Figure 3B and Figure 3A). Likewise, the
266
largest R2 is 0.88 from all probes compared to 0.46 from methylation probes alone in the PBMC
267
dataset (Figure 3C and 3D). Furthermore, LASSO models are available for more genes when all
268
probes are used in both datasets (Supplementary Figure 2).
269 270
The increase of prediction power on gene expression from using all probes on the methylation
271
microarray suggests that there is contribution from the probes with potential SNP effects or cross
272
hybridization effects (S&C probes) we initially excluded in estimating methylation prediction.
273
To further examine the size and nature of their contribution, we separately estimated the
274
prediction power of the methylation probes, S&C probes, and the combination of them (all
275
probes). The results showed that the S&C probes have independent prediction power from the
276
methylation probes and the combination of both has increased prediction power over the
277
methylation probes alone (red points vs black line in the left two panels in Figure 4). The
278
prediction power from the S&C probes was estimated for genes with enough SNP probes to form
279
a LASSO model and their prediction power are mostly above zero (blue points in Figure 4). The
280
fact that the blue points are randomly distributed instead of following the black line suggests that
281
the two sources of R2 are not correlated; therefore, the larger genetic effect and larger epigenetic
282
effect do not seem to coexist in the same genes.
283
large prediction powers from either methylation probes or S&C probes.
284
methylation probes show continuous methylation values while the S&C probes show categorical
285
values due to limited genotypes of the samples.
Figure 5 shows some examples of genes with As expected, the
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
286 287
Methylation probes tend to better predict the expression of defense and immune gene
288
expressions
289 290
To examine the potential biological function of the genes showing good predictability of gene
291
expression by DNA methylation, we conducted gene ontology (GO) and pathway enrichment
292
analysis using DAVID. For the Adipose dataset, we examined the 1608 genes with R2 greater
293
than 0.1 from LASSO regression. With false discovery rate (FDR) of 0.1, the significantly
294
enriched biological process (GOTERM_BP_FAT) GO terms are response to inflammatory,
295
defense activities and wounding, as well as antigen (Table 2), which seem to be consistent with
296
the previous findings for subcutaneous fat cells (Berg and Scherer 2005). For the PBMC dataset,
297
we conducted the same gene set enrichment analysis on the 2382 genes with R2 greater than 0.1
298
and found that the significantly enriched terms are mostly related to defense and immune
299
functions, lymphocyte activation, leukocyte activation, and immune response (Supplementary
300
Table 2). These results seem to be reasonable for atopy and persistent asthma blood cells.
301
Similar GO terms and pathways were enriched in results obtained from all probes
302
(Supplementary Tables 3 and 4).
303
304
Discussion
305 306
We examined the relationship between gene expression and DNA methylation across the genome
307
using data from two large human studies. We explored three linear regression models for
308
predicting gene expression and found that shrinkage based LASSO multiple regression provides
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
309
the best prediction. However, even with LASSO regression, the methylation probes can predict
310
expression in only a small proportion of genes with moderate prediction power. These genes are
311
mostly immune and defense genes. Using all probes on the methylation array does improve
312
prediction power to some degree.
313 314
We used squared correlation (R2) to assess predictive power in regression models. The single
315
linear regression is based only on the best predictive CpG in each gene while the multiple
316
regression models uses all CpG sites in each gene. However, the multiple regression model has
317
substantial over-fitting problem for genes with large number of CpGs. The shrinkage based
318
LASSO regression model overcomes the over-fitting problem without losing predictability.
319
LASSO imposes sparsity among the coefficients and puts constraint on the overall absolute
320
values of the regression coefficients, which forces certain coefficients to be zero. This property is
321
beneficial in variable selection and improves model interpretability. LASSO is not the only
322
shrinkage-based regression method. There are other penalty regression models, such as the Ridge
323
(Hoerl & Kennard, 1970) and elastic net(Zou & Hastie, 2005). However, ridge regression scales
324
all coefficients by a constant without setting some of them to zero, which is less ideal for
325
variable selection. The elastic net method is a hybrid of the ridge and LASSO. There are also
326
other advanced shrinkage methods, such as elastic net with rescaled-coefficients and grouped
327
lasso(Yuan & Lin, 2006; Meier, Van De Geer & Bühlmann, 2008), as well as other variants of
328
LASSO(Zou et al., 2013). Further evaluation is needed for their merits in improving prediction
329
of gene expression in this setting.
330
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
331
One potential reason for low prediction power from DNA methylation on gene expression is the
332
complex mechanisms of gene expression regulation. In addition to DNA methylation,
333
transcription factors, histone modification(Verdin & Ott, 2015), and non-coding RNAs
334
(Janowski et al., 2005; Ting et al., 2005; Ting, McGarvey & Baylin, 2006; Kaikkonen, Lam &
335
Glass, 2011) all play critical roles in gene transcription regulation(Jones, 2015). Another
336
potential reason for low prediction power of methylation is that the landscape of DNA
337
methylation differs dramatically across cell types, tissues(Lokk et al., 2014), ages(Teschendorff
338
et al., 2010), and races(Song et al., 2015). The relationship between gene expression and DNA
339
methylation could also vary substantially across these factors. The correlation between gene
340
expression and DNA methylation from bulk studies at population level encompasses all these
341
variability; therefore, it is not surprising to see low prediction power based on DNA methylation
342
alone. It is likely to be limited in general that the potential of DNA methylation alone as
343
surrogate for gene expression. The combination of DNA methylation and other predictors, such
344
as genotype, may be more powerful for this purpose.
345 346
Although the overall prediction power of DNA methylation on gene expression was low in our
347
analyses of both datasets, much better prediction power was found with the immune and defense
348
genes. This result is consistent with the extensive evidence linking aberrant DNA methylation
349
patterns with genes involved in immune deficiencies, autoimmune disorders, and inflammatory
350
disease(Bayarsaihan, 2011). Some key immune genes, such as mb1 (CD79a)(Gao et al., 2009)
351
and MHCII, (HLA-DR, DQ, DP) (Seguín-Estévez et al., 2009; Suárez-Álvarez et al., 2010;
352
Majumder & Boss, 2011)have been shown to have strong association between DNA methylation
353
and their key immune functions. On cellular level, hypomethylation in T cell promoter regions
354
plays an important role in lupus(Deng et al., 1998; Gorelik & Richardson, 2010). In addition,
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
355
DNA methylation remodeling is critical in the differentiation of naïve immune cells into antigen-
356
specific effector cells in the response to the pathogen or virus(Scharer et al., 2013). By altering
357
DNA methylation pattern, cells can adapt to external stimuli via changes in gene expression.
358
Given these evidences and the results in our study, DNA methylation could be more useful in
359
predicting the gene activities in the immune and defense system.
360
361
Conclusions
362
We explored three regression methods to predict gene expression using DNA methylation, single
363
regressions, multiple regressions, and LASSO penalized regression. LASSO regression reduces
364
over-fitting and improved the prediction power. Both datasets we analyzed shows overall low
365
prediction power. The better predictive genes (with squared correlations larger than 0.1 in
366
LASSO model) are mostly involved in immune, inflammatory, and wound responses. Overall,
367
we will recommend caution for using one’s methylation profile to predict one’s transcriptome.
368 369
List of Abbreviations
370
SNP: single nucleotide polymorphisms
371
LASSO: least absolute shrinkage and selection operator
372
R2: squared correlation
373
S&C: the probes have potential SNP effects or cross hybridization effects
374
MuTHER: Multiple Tissue Human Expression Resource Project
375
PBMC: peripheral blood mononuclear cell
376
MDS: myelodysplastic syndrome
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
377
GO: gene ontology
378 379
Availability of data and material
380
For Adipose Dataset, both gene expression and DNA methylation data can be found at
381
http://www.ebi.ac.uk/arrayexpress/
382
For PBMC Dataset, gene expression data are at
383
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40732/matrix/., while DNA methylation
384
data are at ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE40nnn/GSE40576/matrix/.
385
Software package, ‘MethylXcan’ are at https://github.com/dorothyzh/MethylXcan.
386 387
388
References
389 390
Bayarsaihan D. 2011. Epigenetic mechanisms in inflammation. Journal of dental research 90:9–17.
391
Deng C., Yang J., Scott J., Hanash S., Richardson BC. 1998. Role of the ras-MAPK signaling pathway in
392
the DNA methyltransferase response to DNA hypomethylation. Biological chemistry 379:1113–
393
1120.
394
Gamazon ER., Wheeler HE., Shah KP., Mozaffari S V., Aquino-Michaels K., Carroll RJ., Eyler AE.,
395
Denny JC., Nicolae DL., Cox NJ. 2015. A gene-based association method for mapping traits using
396
reference transcriptome data. Nat Genet 47:1091–1098.
397
Gao H., Lukin K., Ramírez J., Fields S., Lopez D., Hagman J. 2009. Opposing effects of SWI/SNF and
398
Mi-2/NuRD chromatin remodeling complexes on epigenetic reprogramming by EBF and Pax5.
399
Proceedings of the National Academy of Sciences 106:11258–11263.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
400
Gorelik G., Richardson B. 2010. Key role of ERK pathway signaling in lupus. Autoimmunity 43:17–22.
401
Grundberg E., Meduri E., Sandling JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W., Nisbet J.,
402
Sekowska M. 2013. Global analysis of DNA methylation variation in adipose tissue from twins
403
reveals links to disease-associated variants in distal regulatory elements. The American Journal
404
of Human Genetics 93:876–890.
405
Grundberg E., Small KS., Hedman ÅK., Nica AC., Buil A., Keildson S., Bell JT., Yang T-P., Meduri E.,
406
Barrett A. 2012. Mapping cis-and trans-regulatory effects across multiple tissues in twins. Nat
407
Genet 44:1084–1089.
408 409 410 411 412
Hoerl AE., Kennard RW. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12:55–67. Huang DW., Sherman BT., Lempicki RA. 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4:44. Janowski BA., Huffman KE., Schwartz JC., Ram R., Hardy D., Shames DS., Minna JD., Corey DR. 2005.
413
Inhibiting gene expression at transcription start sites in chromosomal DNA with antigene
414
RNAs. Nat Chem Biol 1:216–222.
415 416 417 418 419 420 421
Jones B. 2015. Gene expression: Layers of gene regulation. Nat Rev Genet 16:128–129. DOI: 10.1038/nrg3918. Kaikkonen MU., Lam MTY., Glass CK. 2011. Non-coding RNAs as regulators of gene expression and epigenetics. Cardiovascular research 90:430–440. Krueger F., Kreck B., Franke A., Andrews SR. 2012. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 9:145–151. Lokk K., Modhukur V., Rajashekar B., Märtens K., Mägi R., Kolde R., Koltšina M., Nilsson TK., Vilo J.,
422
Salumets A. 2014. DNA methylome profiling of human tissues identifies global and tissue-
423
specific methylation patterns. Genome Biol 15:3248.
424
Majumder P., Boss JM. 2011. DNA methylation dysregulates and silences the HLA-DQ locus by
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
425 426 427 428 429 430
altering chromatin architecture. Genes and immunity 12:291–299. Meier L., Van De Geer S., Bühlmann P. 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70:53–71. Razin A., Riggs AD. 1980. DNA methylation and gene function. Science 210:604–610. DOI: 10.1126/science.6254144. Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME., Diez-Campelo M., Armstrong
431
RN., Sharpe DJ., Gutierrez NC. 2013. Genome-wide profiling of methylation identifies novel
432
targets with aberrant hypermethylation and reduced expression in low-risk myelodysplastic
433
syndromes. Leukemia 27:610–618.
434
Scharer CD., Barwick BG., Youngblood BA., Ahmed R., Boss JM. 2013. Global DNA methylation
435
remodeling accompanies CD8 T cell effector function. The Journal of Immunology 191:3419–
436
3429.
437
Seguín-Estévez Q., De Palma R., Krawczyk M., Leimgruber E., Villard J., Picard C., Tagliamacco A.,
438
Abbate G., Gorski J., Nocera A. 2009. The transcription factor RFX protects MHC class II genes
439
against epigenetic silencing by DNA methylation. The Journal of Immunology 183:2545–2553.
440
Song M-A., Brasky TM., Marian C., Weng D., Taslim C., Dumitrescu RG., Llanos AA., Freudenheim JL.,
441
Shields PG. 2015. Racial differences in genome-wide methylation profiling and gene
442
expression in breast tissues from healthy women. Epigenetics:0.
443 444 445
Spector TD., Williams FMK. 2006. The UK adult twin registry (TwinsUK). Twin Research and Human Genetics 9:899–906. Suárez-Álvarez B., Rodriguez RM., Calvanese V., Blanco-Gelaz MA., Suhr ST., Ortega F., Otero J.,
446
Cibelli JB., Moore H., Fraga MF. 2010. Epigenetic mechanisms regulate MHC and antigen
447
processing molecules in human embryonic and induced pluripotent stem cells. PLoS One
448
5:e10192.
449
Teschendorff AE., Menon U., Gentry-Maharaj A., Ramus SJ., Weisenberger DJ., Shen H., Campan M.,
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
450
Noushmehr H., Bell CG., Maxwell AP. 2010. Age-dependent DNA methylation of genes that are
451
suppressed in stem cells is a hallmark of cancer. Genome Res 20:440–446.
452 453 454 455 456
Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological):267–288. Ting AH., McGarvey KM., Baylin SB. 2006. The cancer epigenome—components and functional correlates. Genes Dev 20:3215–3231. Ting AH., Schuebel KE., Herman JG., Baylin SB. 2005. Short double-stranded RNA induces
457
transcriptional gene silencing in human cancer cells in the absence of DNA methylation. Nat
458
Genet 37:906–910.
459 460 461
Verdin E., Ott M. 2015. 50 years of protein acetylation: from gene regulation to epigenetics, metabolism and beyond. Nature Reviews Molecular Cell Biology 16:258–264. Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M. 2014. The relationship between
462
DNA methylation, genetic and expression inter-individual variation in untransformed human
463
fibroblasts. Genome Biol 15:R37.
464
Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M., Grundberg E., Meduri E., Sandling
465
JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W., Nisbet J., Sekowska M., Hoerl AE.,
466
Kennard RW., Huang DW., Sherman BT., Tan Q., Kir J., Liu D., Bryant D., Guo Y., Stephens R.,
467
Baseler MW., Lane HC., Ting AH., McGarvey KM., Baylin SB., Razin A., Riggs AD., Zou H., Hastie
468
T., Tibshirani R., Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME., Diez-
469
Campelo M., Armstrong RN., Sharpe DJ., Gutierrez NC., Bayarsaihan D., Deng C., Yang J., Scott J.,
470
Hanash S., Richardson BC., Gamazon ER., Wheeler HE., Shah KP., Mozaffari S V., Aquino-
471
Michaels K., Carroll RJ., Eyler AE., Denny JC., Nicolae DL., Cox NJ., Gao H., Lukin K., Ramírez J.,
472
Fields S., Lopez D., Hagman J., Gorelik G., Richardson BC., Grundberg E., Small KS., Hedman ÅK.,
473
Nica AC., Buil A., Keildson S., Bell JT., Yang T-P., Meduri E., Barrett A., Janowski BA., Huffman
474
KE., Schwartz JC., Ram R., Hardy D., Shames DS., Minna JD., Corey DR., Jones B., Kaikkonen MU.,
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
475
Lam MTY., Glass CK., Krueger F., Kreck B., Franke A., Andrews SR., Lokk K., Modhukur V.,
476
Rajashekar B., Märtens K., Mägi R., Kolde R., Koltšina M., Nilsson TK., Vilo J., Salumets A.,
477
Majumder P., Boss JM., Meier L., Van De Geer S., Bühlmann P., Robertson KD., Scharer CD.,
478
Barwick BG., Youngblood BA., Ahmed R., Boss JM., Seguín-Estévez Q., De Palma R., Krawczyk
479
M., Leimgruber E., Villard J., Picard C., Tagliamacco A., Abbate G., Gorski J., Nocera A., Song M-A.,
480
Brasky TM., Marian C., Weng D., Taslim C., Dumitrescu RG., Llanos AA., Freudenheim JL.,
481
Shields PG., Spector TD., Williams FMK., Suárez-Álvarez B., Rodriguez RM., Calvanese V.,
482
Blanco-Gelaz MA., Suhr ST., Ortega F., Otero J., Cibelli JB., Moore H., Fraga MF., Teschendorff
483
AE., Menon U., Gentry-Maharaj A., Ramus SJ., Weisenberger DJ., Shen H., Campan M.,
484
Noushmehr H., Bell CG., Maxwell AP., Tibshirani R., Ting AH., Schuebel KE., Herman JG., Baylin
485
SB., Verdin E., Ott M., Yang I V., Pedersen BS., Liu A., O’Connor GT., Teach SJ., Kattan M., Misiak
486
RT., Gruchalla R., Steinbach SF., Szefler SJ., You JS., Jones PA., Yuan M., Lin Y. 2015. DNA
487
methylation and human disease. Journal of the Royal Statistical Society: Series B (Statistical
488
Methodology) 12:906–910. DOI: 10.1038/nrg3918.
489
Yang I V., Pedersen BS., Liu A., O’Connor GT., Teach SJ., Kattan M., Misiak RT., Gruchalla R., Steinbach
490
SF., Szefler SJ. 2015. DNA methylation and childhood asthma in the inner city. Journal of Allergy
491
and Clinical Immunology 136:69–80.
492 493 494 495 496 497 498 499
You JS., Jones PA. 2012. Cancer genetics and epigenetics: two sides of the same coin? Cancer cell 22:9–20. Yuan M., Lin Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68:49–67. Zou H., Hastie T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67:301–320. Zou H., Hastie T., Wagner JR., Busche S., Ge B., Kwan T., Pastinen T., Blanchette M., Ting AH., McGarvey KM., Baylin SB., Tibshirani R., Razin A., Riggs AD., Hoerl AE., Kennard RW.,
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
500
Grundberg E., Meduri E., Sandling JK., Hedman ÅK., Keildson S., Buil A., Busche S., Yuan W.,
501
Nisbet J., Sekowska M., Del Rey M., O’Hagan K., Dellett M., Aibar S., Colyer HAA., Alonso ME.,
502
Diez-Campelo M., Armstrong RN., Sharpe DJ., Gutierrez NC. 2013. Regression shrinkage and
503
selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 12:R37.
504
DOI: 10.1126/science.6254144.
505
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
Figure 1 R 2 comparison among three regression models Multiple regression is compared to single (left two panels) and lasso (right two panels) regressions in two datasets, Adipose and PBMC. R 2 shown here are on cubic root scale for visualization clarity. “single”, single linear regression with the most significant CpG site as predictor; “multiple”, multiple regression with all methylation CpG sites in a gene as predictors. Red points represent significant genes from multiple regressions at significance level of 0.05. Blue solid line is the identity line and the dashed lines represent R 2 of 0.1.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
Figure 2 R 2 comparisonamong regression models in cross validation. LASSO regression is compared to single (left two panels) and multiple (right two panels) regressions for two datasets, Adipose and PBMC. Five-fold validation was used for all regression models. The red points represent the significant genes from multiple regressions (p =0.1
>=0.1
>=0.2
>=0.2
>=0.3
>=0.3
NO
491
125
85
18
20
2
YES
591
137
114
25
34
7
NO
2211
791
472
177
114
40
YES
3105
966
765
232
217
67
NO
1608
893
343
197
110
44
YES
2189
1094
538
272
195
75
NO
851
381
108
33
14
4
YES
1330
666
212
90
58
34
NO
3358
746
1163
109
382
30
YES
4455
994
1902
197
694
64
NO
2382
1022
561
165
142
30
YES
3207
1465
870
289
235
66
SNP-probe
3
Dataset
Method Inclusion
4 Single
5 6 7
Multi Adipose
8 LASSO
9 10
Single
11 12
PBMC
Multi
13 14
LASSO
15 16
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
Table 2(on next page) Gene Ontology andKEGG pathway enrichment analysis for more predictive genes in the Adipose dataset.
Genes with R2 greater than 0.1 (1680) were selected for conducting the GO and Pathway enrichment analyses. Enriched and representative terms with FDR less than 0.01 are included in the table.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018
1 Fold Term
P-value
FDR
Enrich GO:0006952~defense response
2.28
1.50E-08
2.70E-05
GO:0006954~inflammatory response
2.69
2.72E-07
4.80E-04
GO:0009611~response to wounding
2.22
5.79E-07
0.001
8.93
8.43E-07
0.0015
peptide or polysaccharide antigen via MHC class II
7.58
3.94E-06
0.007
KEGG_PATHWAY:hsa05330~Allograft rejection
5.91
8.58E-06
0.01
3.52
9.17E-06
0.011
presentation
3.73
1.65E-05
0.02
KEGG_PATHWAY:hsa05332~Graft-versus-host disease
5.45
1.87E-05
0.02
KEGG_PATHWAY:hsa04940~Type I diabetes mellitus
5.06
3.77E-05
0.046
GO:0048002~antigen processing and presentation of peptide antigen GO:0002504~antigen processing and presentation of
KEGG_PATHWAY:hsa05322~Systemic lupus erythematosus KEGG_PATHWAY:hsa04612~Antigen processing and
2
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27055v1 | CC BY 4.0 Open Access | rec: 26 Jul 2018, publ: 26 Jul 2018