Received: 8 March 2016
Revised: 4 October 2016
Accepted: 26 November 2016
DOI 10.1002/jgm.2936
RESEARCH ARTICLE
Genomic data integration in chronic lymphocytic leukemia Juan Luis Fernández‐Martínez1
|
Enrique J. deAndrés‐Galiana1,2
|
Stephen T. Sonis3
1
Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, Oviedo, Asturias, Spain
2
Artificial Intelligence Center, University of Oviedo, Oviedo, Asturias, Spain
3
Biomodels LLC, Watertown, MA, USA
Correspondence J. L. Fernández‐Martínez, Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, Oviedo, Asturias, Spain and Jesús Arias de Velasco s/n, 33005 Oviedo, Spain. Email:
[email protected]
Abstract Background
B‐cell chronic lymphocytic leukemia (CLL) is a heterogeneous disease and the
most common adult leukemia in western countries. IgVH mutational status distinguishes two major types of CLL, each associated with a different prognosis and survival. Sequencing identified NOTCH1 and SF3B1 as the two main recurrent mutations. We described a novel method to clarify how these mutations affect gene expression by finding small‐scale signatures that predict the IgVH, NOTCH1 and SF3B1 mutations. We subsequently defined the biological pathways and correlation networks involved in disease development, with the potential goal of identifying new drugable targets.
Methods
We modeled a microarray dataset consisting of 48807 probes derived from 163
samples. The use of Fisher's ratio and fold change combined with feature elimination allowed us to identify the minimum number of genes with the highest predictive mutation power and, subsequently, we applied network and pathway analyses of these genes to identify their biological roles.
Results
The mutational status of the patients was accurately predicted (94–99%) using small‐
scale gene signatures: 13 genes for IgVH, 60 for NOTCH1 and 22 for SF3B1. LPL plays an important role in the case of the IgVH mutation, whereas MSI2, LTK, TFEC and CNTAP2 are involved in the NOTCH1 mutation, and RPL32 and PLAGL1 are involved in the SF3B1 mutation. Four high discriminatory genes (IGHG1, MYBL1, NRIP1 and RGS1) are common to these three mutations. The IL‐4‐mediated signaling events pathway appears to be involved as a common mechanism and suggests an important role of the immune response mechanisms and antigen presentation.
Conclusions
This retrospective analysis served to provide a deeper understanding of the
effects of the different mutations in CLL disease progression, with the expectation that these findings will be clinically applied in the near future to the development of new drugs. KEY W ORDS
cancer, gene expression, hematologic, leukemia, mathematical modeling, oncology
1
|
I N T RO D U CT I O N
discerned by immunoglobulin variable‐region heavy chain (IgVH) gene mutation status.5 Because determination of IgVH mutation status is
B‐cell chronic lymphocytic leukemia (CLL) is a complex heterogeneous
very labor‐intensive and expensive, alternative markers have been
disease characterized by the accumulation of malignant B‐cells in the
investigated aiming to achieve a better prognosis with respect to dis-
blood and lymphoid organs.1 Clinical diagnosis of CLL is based on the
ease progression. ZAP‐70 became a very popular surrogate marker of
demonstration of an abnormal population of B lymphocytes in the
IgVH mutational status6–10 and cell membrane expression of CD38
blood, bone marrow or tissues displaying an unusual but characteristic
has been described as a reliable prognostic value for CLL.11 Both
pattern of molecules on the cell surface (CD5 and CD23 clusters of dif-
ZAP‐70 and CD38 have been established as good predictors of IgVH
ferentiation). The staging systems of Rai et al.2 or Binet et al.3 are cur-
mutational status.12,13
rently used to determine the extent of the disease and are primarily based on a low platelet or red cell counts.
Gene expression profiles were also used to understand the genesis and progression of CLL. As a result of the high dimensionality of the
DNA analysis distinguishes two major types of CLL with different
data, on first inspection, both subtypes of CLL showed quite homoge-
survival times.4 This distinction is based on lymphocyte maturity, as
neous expression profiles irrespective of their IgVH mutational
J Gene Med. 2017;19:e2936. https://doi.org/10.1002/jgm.2936
wileyonlinelibrary.com/journal/jgm
Copyright © 2016 John Wiley & Sons, Ltd.
1 of 23
2 of 23
FERNÁNDEZ‐MARTÍNEZ
status.4,14–16 This suggested that both CLL types derive from a com-
ET AL.
scale (log2) after the corresponding RMA preprocess. Of the original
mon pathogenic pathway. Nevertheless, different subsets of genes
cohort of 163 patients, 92 had mutated IgVH, which was associated
specifically expressed by CLL cells with potential pathogenesis and
with a favorable prognosis, whereas IgVH was not mutated in the
clinical relevance were identified.17–21
remainder (n = 71) and suggested an unfavorable outcome. The exome
Subsequently four major genomic aberrations have been identi-
sequencing data are described by Quesada et al.,25 who identified
22
1246 mutations resulting in protein coding changes. Six genes
fied in CLL cells that are strongly associated with disease behavior.
More recently, NOTCH1 and SF3B1 have been described as the most
appeared to be most frequently mutated (> 5%): NOTCH1, SF3B1,
frequently mutated genes predictive of CLL prognosis.23 Addition-
NOP16, CHD2, ATM and LRP1B. Amongst the 163 samples evaluated,
ally, disease progression has been associated with a number of
NOTCH1 and SF3B1 mutational status were determined for 117
genetic alterations, including cytogenetic abnormalities and specific
patients. Of these, 106 were unmutated for NOTCH1 and 107 were
gene mutations,23–29 and epigenetic alterations, including aberrant
unmutated for SF3B1. In the present study, this dataset is referred to
methylation CLL.30
as the reduced cohort and is used to explore the possible relationships
Given the low incidence of NOTCH1 (9%) and SF3B1 (8%)
between mutations.
mutations, it is unlikely that CLL progression could be solely ascribed
Table 1 shows the degree of mutational overlap between the
to the two. We therefore aimed to identify shared/synergistic mech-
patients in the reduced cohort (117 samples). It can be observed that
anisms among the three most common mutations (IgVH, NOTCH1
most of the patients with IgVH mutated (n = 66) do not show the
and SF3B1) that might better predict and explain disease progression
NOTCH1 (n = 63) and SF3B1 mutations (n = 61). This corresponds to
and behavior.
what is expected in terms of prognosis. Also, the NOTCH1 and SF3B1
Obviously, CLL is a multicause disease and, for that reason, it is important to integrate the modelling of different mutations. In the
mutations appear to be independent factors for CLL progression because only two patients exhibit both mutations at the same time.
present study, we describe the development of a novel a methodol-
These two classification problems (NOTCH1 and SF3B1 muta-
ogy aiming to clarify how these mutations affect gene expression
tional status) are naturally unbalanced, and the classifier has to take
by identifying small‐scale signatures to predict the IgVH, NOTCH1
this feature into account. In the case where the predictive accuracies
and SF3B1 mutations (genomic data integration). We subsequently
of the small‐scale signatures found are higher than those provided by
applied our methodology to define and understand the biological
the corresponding majority class classifiers, we could affirm that the
pathways and correlation networks that are involved in disease
classifier has learnt the set of discriminatory genes for these muta-
development, with the potential goal of identifying new druggable
tions. These signatures have been validated by cross‐validation. For
targets. We have identified four high discriminatory genes (IGHG1,
that purpose, the dataset was divided into two folds according to
MYBL1, NRIP1 and RGS1) that are common to these three mutations.
the different mutations: one fold was used for training and the other
Furthermore, the IL‐4‐mediated signaling events pathway appears to
for validation. This process is repeated until the whole dataset is
be a common mechanism to these genes. We confirm that the inte-
processed and it is termed leave‐one‐out‐cross‐validation (LOOCV).
gration of microarray data and the main mutational status of patients
Furthermore, we have confirmed the stability of these results using
achieved by CLL is a novel approach for understanding the main
different random holds‐outs validation experiments (75% for learning
causes of disease progression.
and 25% for validation).
2 2.1
MATERIALS AND METHODS
|
|
Dataset
We used a publically accessible dataset (European Bioinformatics Institute EGAS00001000374) in which microarray results consisting of 48807 probes were derived from 163 patients with a diagnosis of CLL.31 The expression data were originally presented in logarithmic TABLE 1 Degree of mutational overlap in the reduced cohort of patients (117 samples)
IgVH (+)
IgVH (−)
NOTCH1 (+)
NOTCH1 (−)
SF3B1 (+)
SF3B1 (−)
IgVH (+)
66
*
3
63
5
61
IgVH (−)
*
51
8
43
5
46
NOTCH1 (+)
3
8
11
*
2
9
NOTCH1 (−)
63
43
*
106
8
98
SF3B1 (+)
5
5
2
8
10
*
SF3B1 (−)
61
46
9
98
*
107
*No overlapping.
FIGURE 1
Flow diagram of the analytical procedure: (1) Data input. (2) Gene ranking according the discriminatory power. (3) Small‐scale signature predictive accuracy (LOOCV) and hold‐out analysis (stability of the signature). (4) Analysis of the biological pathways and correlation networks
FERNÁNDEZ‐MARTÍNEZ
2.2
|
3 of 23
ET AL.
Analysis
processes that are impacted by each mutation. The algorithm designed to perform the pathways sampling is outlined below.
Feature selection is one of the major challenges of any genotype‐ based phenotypic prediction problem because the number of genetic
3. The training (75%) and validation (25%) sets are randomly selected
probes far exceeds the number of samples. To directly address this
from the study cohort. We have performed 500 different
issue, we used a combination of two well‐known ranking algorithms:
simulations.
32
33
(for further details, see ).
4. For each simulation, the list of most discriminatory genes
We first ranked genes according to their discriminatory power (FR or
(smallest‐scale signature) is determined in the training setand
absolute FC value) and then applied a Nearest Neighbor (k‐NN) based
applied to the validation set to establish the predictive accuracy.34
algorithm to establish the accuracy of the different ranked sets of
If the validation accuracy is higher that a given minimum accuracy
genes via LOOCV modeling. The combination of this procedure with
(90% in this case), the current list is stored.
Fisher's ratio (FR)
and fold change (FC)
a feature elimination algorithm produced the shortest list of high dis-
5. After all the simulations are performed, a frequency analysis of
criminatory genes and served to validate the prognostic value of these
the most discriminatory genes found in the different simulations
gene signatures over the existing dataset by LOOCV.34 The stability of
is performed. The hold‐out experiment serves to analyze how
these signatures is also examined by random 75–25 hold‐out
the lack of information (samples) impact gene selection. Subse-
experiments.
quently, the most frequently sampled genes are used to infer
The aim of the hold‐out procedure is two‐fold.
the genetic pathways, which are the robust way of linking disease genes to phenotypes.
1. Checking the stability of the predictive accuracy of the small‐ scales signatures found via LOOCV when the number of training
Finally, the correlation networks between the set of highly dis-
samples is decreased. In this case, the minimum‐scale signature
criminatory genes determined via LOOCV using FR and FC as featured
is read in the training data set for training (75% of the whole set) and applied for blind validation in the validation set (25%).
TABLE 3
IgVH mutational status prediction using FC
The cumulative distribution function of the small‐scale predictive
Gene
μ1
σ1
fc (log)
Acc (%)
accuracies found in different hold‐outs is finally presented and
RGS13
261
568
22
25
3.6
57.7
serves to account for the variability in its predictive accuracy with
LPL
40
70
380
272
–3.2
86.5
partial information. A statistical analysis is performed providing
PXDNL
269
838
29
52
3.2
87.1
the minimum, maximum and median bounds that could be
SEPT10
22
50
177
232
–3.0
88.3
expected in an independent dataset.
SEPT10
21
46
172
232
–3.0
86.5
LOC100128252
29
43
224
194
–3.0
89.0
SEPT10
22
48
158
206
–2.9
85.3
LOC100128252
30
42
220
172
–2.9
86.5
KANK2
40
75
258
335
–2.7
87.1
NPTX1
28
32
168
363
–2.6
85.9
LMNA
30
54
178
446
–2.6
85.9
PTCH1
140
163
24
42
2.6
89.0
IGHG1
162
687
942
1893
–2.5
88.3
MYBL1
424
526
73
88
2.5
89.6
87.1
PPP1R9A
45
90
260
278
–2.5
90.2
62
125
352
298
–2.5
90.8
2. Sampling the uncertainty space in the corresponding phenotype 35
prediction problems to perform pathways analysis.
In this pro-
cedure, we look for the most predictive genes found in each random hold‐out. Based on these predictive high discriminatory signatures, we provide pathway analysis to unravel the biological
TABLE 2
IgVH mutational status prediction using FR
Gene LPL
μ1
σ1 40
70
μ2 380
σ2 272
FR (log) 4.6
Acc (%)
LPL
26
33
146
102
3.7
86.5
CRY1
CRY1
62
125
352
298
3.1
90.2
LPL
LOC100128252
29
43
224
194
3.0
90.2
TFEC
LOC100128252 SPG20 ZBTB20 NRIP1 SPG20
30 24
42 35
220 111
172 85
3.0 2.9
μ2
σ2
26
33
146
102
–2.5
92.0
412
580
77
230
2.4
92.6
89.6
PPP1R9A
33
63
171
175
–2.4
92.6
91.4
SPG20
30
53
148
126
–2.3
92.6
32
43
153
301
–2.3
92.6
1943
505
982
417
2.8
91.4
STK32B
275
183
63
81
2.7
91.4
ANKRD57
20
28
98
97
–2.3
91.4
91.4
SPG20
24
35
111
85
–2.2
92.0
205
203
45
97
2.2
92.0
30
53
148
126
2.6
ZAP70
103
151
273
140
2.4
92.6
ADAM29
LDOC1
20
19
50
27
2.3
92.6
RBMS3
19
27
84
122
–2.2
91.4
92.6
PPP1R9A
27
36
120
126
–2.2
91.4
93.3
NRIP1
275
183
63
81
2.1
92.6
PTCH1
115
123
27
33
2.1
93.3
COBLL1 NRIP1
186 85
107 60
85 24
100 24
2.3 2.1
List of the 13 most discriminatory genes list with the highest predictive accuracy (93.3%), ordered by decreasing FR. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1, (mutated IgVH), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy.
List of the 28 most discriminatory genes with the highest predictive accuracy (93.3%), ordered by decreasing absolute FC. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated IgVH), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.
4 of 23
FERNÁNDEZ‐MARTÍNEZ
ET AL.
FIGURE 2
Correlation networks of the most discriminatory genes (FR) for the IgVH mutational status prediction. A, Using the PC coefficient. B, Using the NMI
selection methods, are calculated via the Pearson correlation (PC)
3
RESULTS
|
coefficient36 and normalized mutual information (NMI).37 These correlation networks are two different measures of the mutual dependence
3.1
|
IgVH mutational status
between gene expressions (for further details, see ). They serve to analyze inter‐relationships between genes, which impact both expression
We determined the best set of genes that discriminates IgVH
and function. We then identified the pathways ontology using
mutational status based on microarray expression and the class infor-
GeneAnalytics
38
to cover the altered and disease pathways. Figure 1
shows the flow‐diagram of the procedure that has been adopted.
mation defined by the IgVH phenotype using 92 mutated and 71 unmutated samples (whole cohort).
FERNÁNDEZ‐MARTÍNEZ
ET AL.
5 of 23
FIGURE 3 Correlation networks of the most differentially expressed genes (FC) for the IgVH mutational status prediction. A, Using the PC coefficient. B, Using the NMI
3.1.1
|
Gene ranking
The shortest list with the highest predictive accuracy (93.3%) was com-
of the expressions. Table 3 shows the first 28 genes with the highest predictive accuracy (93.3%), ordered by decreasing absolute FC
posed of 13 first probes: LPL (two probes), CRY1, LOC100128252 (two
(under‐ or over‐expressed) in logarithmic scale, fc (log), the mean
probes), SPG20 (two probes), ZBTB20, NRIP1 (two probes), ZAP‐70,
(μ1, μ2) and the SD (σ1, σ2) for each IgVH group, and the LOOCV accu-
LDOC1 and COBLL1. Table 2 shows a list of these genes, their associ-
racy [Acc (%)]. In this case, over‐expression implies that the average
ated FR and the mean LOOCV accuracy. FR was applied to the log2
expression is higher in the mutated group. The gene with the highest
6 of 23
TABLE 4
FERNÁNDEZ‐MARTÍNEZ
ET AL.
NOTCH1 mutational status prediction using FR
Gene
μ1
σ1
μ2
σ2
FR (log)
Acc (%)
MSI2
157
74
43
26
4.6
93.2
MSI2
238
123
62
49
4.1
94.9
MSI2
73
25
31
16
3.0
91.5
MSI2
283
149
92
61
2.8
90.6
MSI2
58
19
32
15
2.7
92.3
C10orf137
193
86
392
135
2.4
90.6
LAG3
236
155
77
103
2.4
90.6
LPL
357
250
170
254
2.3
92.3
NCK2
838
219
1560
529
2.2
93.2
CNTNAP2
66
96
667
799
2.1
92.3
ST3GAL1
38
11
85
36
2.1
90.6
CCDC24
109
73
48
44
2.0
92.3
LTK
216
96
103
132
2.0
90.6
FLNB
59
30
33
17
1.9
94.0
ZNF333
38
5
57
16
1.9
92.3
PREPL
190
62
329
108
1.9
93.2
C19orf28
120
37
217
80
1.9
93.2
C1orf38
365
148
189
109
1.8
91.5
LTK
107
52
52
64
1.8
91.5
SPG20
182
150
71
106
1.8
92.3
SAP30L
74
38
111
32
1.8
94.0
MYST1
248
37
322
60
1.7
93.2
C10orf137
99
41
187
66
1.7
94.9
ATP6V0B
831
198
596
183
1.7
91.5
LPL
130
89
75
99
1.7
92.3
SLC4A7
47
39
150
120
1.7
90.6
161
126
112
156
1.7
89.7
HNRNPR
57
22
110
48
1.7
89.7
REEP5
41
18
80
39
1.6
90.6
SRSF1
110
60
175
52
1.6
94.0
37
8
64
24
1.6
94.0
SHPRH
270
64
383
83
1.6
94.0
CNTNAP2
101
140
804
1105
1.6
94.9
PHF2
119
44
175
60
1.6
92.3
FCRL1
234
180
525
308
1.6
93.2
WSB2
804
329
489
258
1.6
93.2
ATP6V0B
624
145
448
134
1.6
94.9
LOC100128252
GNPNAT1
LYL1
87
31
140
47
1.5
94.9
ACSL5
230
85
332
106
1.5
94.9
STX17
50
21
75
25
1.5
94.0
SPG20
125
98
55
74
1.5
94.0
NHEJ1
29
7
37
8
1.5
94.0
ZNF248
48
25
89
45
1.5
93.2
MPST
55
20
35
10
1.5
93.2
CDK13
69
42
132
75
1.5
93.2
TRMT1
58
17
86
30
1.5
92.3
PI4K2A
224
101
115
84
1.5
93.2
ELOVL5
254
97
504
188
1.5
93.2
FAM30A
588
900
1535
1495
1.5
93.2
PTDSS1
129
21
190
44
1.5
94.0
PLGLB1
74
47
152
103
1.5
94.0 (Continues)
FERNÁNDEZ‐MARTÍNEZ
TABLE 4
7 of 23
ET AL.
(Continued)
μ1
Gene
σ1
μ2
σ2
FR (log)
Acc (%)
C5orf53
51
22
125
74
1.5
94.0
PSMD7
608
175
414
141
1.5
94.9
NASP
117
26
176
52
1.5
94.0
ATP6V0B
768
170
566
172
1.5
94.9
WDR36
108
36
164
43
1.4
94.9
LTN1
511
52
645
99
1.4
94.9
22
2
19
2
1.4
94.9
GAL3ST3 PDE7A CAPRIN2
102
67
214
120
1.4
94.9
1098
345
1511
368
1.4
95.7
List of the 60 most discriminatory genes to predict the NOTCH1 mutation list with the highest predictive accuracy (95.7%), ordered by decreasing FR. Class 1 corresponds to samples with mutated NOTCH1 and class 2 corresponds to those with unmutated NOTCH1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated NOTCH1), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy.
under‐expression is LPL, and the gene with the highest over‐expres-
network (Figure 4B) demonstrates a main connection through NCK2.
sion is RSG13.
Figures 5A and 5B show these networks constructed with the list of the most differentially expressed genes defined by FC.
3.1.2
|
Correlation networks
Figure 2A shows the PC network of the most discriminatory genes
3.3
SF3B1 mutation status
|
(defined by FR) of IgVH mutational status. The NMI correlation network is shown in Figure 2B. Figures 3A and 3B show the same networks constructed with the list of most differentially expressed genes provided by FC.
SF3B1 gene (Splicing Factor 3b, Subunit 1) is located in chromosome 2. Its importance in CLL has been analyzed previously.26,41 As with NOTCH1, the SF3B1 classification problem was also highly unbalanced because 107 CLL samples (out of 117) in the reduced cohort did not
3.2
Modeling the NOTCH1 mutation status
|
It has been demonstrated that NOTCH1 mutation influences survival in CLL patients.
39,40
show this mutation.
3.3.1
|
Gene ranking
We recognized the challenge in analyzing those
The shortest list with the highest predictive accuracy (99.1%) was com-
genes for which the NOTCH1 mutation impacted expression given
posed of 22 probes with FR values of between 2.6 and 1.7. The most
the highly unbalanced sample mix (106 of 117 samples did not show
discriminatory gene was RPL32 (Table 6). The FC provided a predictive
the NOTCH1 mutation in the reduced cohort).
accuracy of 96.6% using the list of the first 118 genes ranked by absolute FC (Table 7). This accuracy was lower compared to the list of the
3.2.1
|
Gene ranking
22 most discriminatory genes provided by the FR. Seven different
The shortest list with the highest predictive accuracy (95.7%) was com-
probes of gene ANXA4 appeared in this list, which accounts for the
posed of 60 probes with FR values of between 4.6 and 1.4 (Table 4).
importance of this gene.
The first five probes of this list corresponded to MSI2. Also, using the two first probes of MSI2, the NOTCH1 mutation is predicted with 94.9% accuracy. All MSI2 probes had lower expression in NOTCH1‐ mutation negative patients. One probe of LPL appeared in eighth position within this list. Therefore, the incremental accuracy from probes 5 to 60 was minimal (0.8%). The genes from the sixth position to the 60th position serve to add high frequency details in the discrimination (helper genes).34 All of these genes show correlation networks that are analyzed as described further below. The FC provided a longer list (n = 126) with the same predictive accuracy as the FR list (Table 5). Different probes for TFEC and
3.3.2
|
Correlation networks
Figure 6A shows the PC network of the most discriminatory genes of the SF3B1 mutation. In general, correlations between discriminatory genes are low, implying that these genes are independent predictors for this mutation. Two main networks were noted to be associated with the most discriminatory gene RPL32 through YWHAB and KLF8. Conversely, the correlation network using the NMI (Figure 6B) demonstrated a single network associated with CNPY2‐STK38. Figures 7A and 7B show these networks constructed with the list of the most differentially expressed genes defined by FC.
CNTNAP2 were consistently expressed the most differentially. The high variability of these genes in samples with unmutated NOTCH1 precluded their inclusion among the leaders in the FR list.
3.4
|
Hold‐outs experiments and pathway analysis
Figures 8A, 8B and 8C show the cumulative probability distribution
3.2.2
|
Correlation networks
function (CDF) of the predictive accuracy for the small‐scale signatures
Figure 4A shows the PC network of the most discriminatory genes of
found for the three mutations, obtained via 500 different random
the NOTCH1 mutation in which three main networks associated with
75–25 hold‐out experiments. The CDF gives the probability that the
MSI2 through WSB2, ACSL5 and CNTNAP2 are apparent. The NMI
predictive accuracy of the minimum‐scale signature is smaller than a
8 of 23
TABLE 5
FERNÁNDEZ‐MARTÍNEZ
TABLE 5
NOTCH1 mutational status prediction using FC
Gene
μ1
σ1
μ2
σ2
fc (log)
Acc (%)
ET AL.
(Continued)
Gene
μ1
σ1
μ2
σ2
fc (log)
Acc (%)
TFEC
19
4
289
488
–3.9
85.5
IgVH5–78
56
46
166
232
–1.6
91.5
TFEC
15
2
180
331
–3.6
85.5
IGHG1
1369
2081
468
1313
1.5
91.5
CNTNAP2
66
96
667
799
–3.3
82.9
FCRL2
460
334
1334
872
–1.5
92.3
RASSF6
15
1
144
352
–3.2
84.6
FCRL5
226
185
656
590
–1.5
93.2
PXDNL
18
3
168
598
–3.2
82.1
FCGR3A
65
76
189
269
–1.5
93.2
CNTNAP2
101
140
804
1105
–3.0
79.5
FCRL2
420
303
1215
788
–1.5
91.5
DEFA1
272
816
1866
3452
–2.8
77.8
CD9
74
70
215
302
–1.5
92.3
ADAM29
24
20
156
199
–2.7
76.9
FCRL5
564
492
1629
1333
–1.5
93.2
NRIP1
31
10
200
187
–2.7
84.6
FCRL2
469
323
1351
877
–1.5
91.5
IGF2BP3
33
47
186
281
–2.5
88.0
NBPF3
27
19
78
88
–1.5
91.5
MYBL1
47
65
262
407
–2.5
86.3
FCRL2
220
261
629
541
–1.5
90.6
ZNF208
18
6
100
143
–2.5
89.7
CD9
37
27
106
152
–1.5
90.6
FGL2
67
84
359
422
–2.4
91.5
FCRL5
266
210
749
628
–1.5
91.5
CLC
29
17
141
184
–2.3
90.6
CD9
151
180
425
595
–1.5
91.5
ZNF208
18
4
81
108
–2.2
91.5
FCRL2
725
482
2017
1253
–1.5
92.3
APOD
32
34
142
204
–2.1
90.6
FGL2
17
7
46
50
–1.5
92.3
APOD
115
139
459
637
–2.0
89.7
C21orf7
66
100
178
225
–1.4
92.3
MSI2
238
123
62
49
1.9
90.6
FCRL3
1692
1627
4567
3202
–1.4
92.3
SORL1
58
72
221
415
–1.9
90.6
FCRL2
739
560
1992
1191
–1.4
91.5
NRIP1
17
2
64
60
–1.9
90.6
MAP4K4
33
16
90
65
–1.4
90.6
2509
5892
657
2843
1.9
88.9
TRIB2
51
55
135
214
–1.4
89.7
46
31
173
440
–1.9
88.0
EDNRB
326
368
123
280
1.4
91.5
MSI2
157
74
43
26
1.9
94.0
FCRL3
1653
1573
4376
3115
–1.4
91.5
IGJ
299
429
1079
1859
–1.9
92.3
TUBB6
409
540
156
210
1.4
92.3
68
39
239
337
–1.8
91.5
ATF5
186
192
71
68
1.4
91.5
420
808
121
382
1.8
92.3
LOC728175
69
67
26
26
1.4
92.3
TFEC
15
1
51
71
–1.8
92.3
FAM30A
588
900
1535
1495
–1.4
92.3
PTCH1
25
29
86
135
–1.8
93.2
ACSM3
483
397
185
208
1.4
93.2
FCER1A
27
15
93
145
–1.8
93.2
PYHIN1
262
264
683
447
–1.4
92.3
FCRL2
81
85
276
214
–1.8
92.3
C1orf38
593
309
229
195
1.4
88.9
CD9
94
118
316
466
–1.7
91.5
MEF2C
286
166
739
393
–1.4
88.0
LAIR1
31
27
105
114
–1.7
92.3
SPG20
182
150
71
106
1.4
90.6
IGJ
82
106
273
463
–1.7
91.5
FCRL5
128
98
325
300
–1.3
89.7
1561
4218
471
2074
1.7
90.6
LAIR1
29
16
75
81
–1.3
89.7
135
131
41
52
1.7
91.5
PLGLB2
78
52
198
187
–1.3
89.7
78
97
258
379
–1.7
92.3
TCF7
316
337
802
780
–1.3
89.7
PRKCA
119
132
36
54
1.7
92.3
H3F3C
47
24
119
118
–1.3
89.7
TSHZ2
28
39
92
143
–1.7
92.3
SLC15A2
53
33
133
107
–1.3
91.5
FCRL2
650
530
2102
1440
–1.7
93.2
PLGLB2
88
66
219
136
–1.3
92.3
RGS1
1049
1956
327
670
1.7
91.5
MEF2C
193
107
481
252
–1.3
91.5
SLC4A7
47
39
150
120
–1.7
91.5
CD24
1070
624
2656
1495
–1.3
93.2
CD9
89
86
281
433
–1.7
92.3
BMP6
26
19
65
60
–1.3
93.2
SORL1
49
34
153
273
–1.6
92.3
C5orf53
51
22
125
74
–1.3
92.3
THRB
78
150
25
30
1.6
91.5
C1orf38
1176
588
477
377
1.3
90.6
MSI2
283
149
92
61
1.6
91.5
SERPINB6
245
176
100
119
1.3
93.2
LAG3
236
155
77
103
1.6
91.5
HIST1H2BD
413
362
1014
1104
–1.3
90.6
FCRL3
133
125
400
362
–1.6
93.2
RORA
347
546
142
256
1.3
92.3
FGL2
64
57
191
193
–1.6
92.3
TUBB6
410
630
168
344
1.3
91.5
CNTNAP2
21
4
62
79
–1.6
92.3
C21orf7
207
281
504
587
–1.3
92.3
FCRL2
342
261
1018
724
–1.6
93.2
SESN3
29
7
70
61
–1.3
92.3
ATF5
146
152
49
45
1.6
92.3
IGKV3D‐11 TCTN1
HOMER3 RGS13
IGKC LOC728175 CD9
(Continues) (Continues)
FERNÁNDEZ‐MARTÍNEZ
TABLE 5
9 of 23
ET AL.
(Continued)
Gene
μ1
phenotypes using all of the genes available in the microarray and also σ1
μ2
σ2
fc (log)
Acc (%)
an unsupervised classification method that serves to project the
76
75
185
197
–1.3
92.3
ATF5
247
251
102
93
1.3
90.6
MSI2
73
25
31
16
1.3
92.3
FCRL2
2055
1421
4932
2543
–1.3
92.3
FCRL5
1062
723
2521
1444
–1.2
92.3
TCTN1
469
238
1103
1509
–1.2
93.2
PPAPDC1B
474
394
1113
832
–1.2
93.2
ACSM3
839
726
358
424
1.2
94.0
ZADH2
50
26
118
71
–1.2
93.2
MNDA
168
139
391
235
–1.2
94.0
PER1
110
102
48
39
1.2
93.2
SLAMF1
55
30
127
104
–1.2
92.3
SERPINB6
224
149
97
108
1.2
91.5
PYHIN1
161
150
370
223
–1.2
92.3
FCRL1
850
631
1955
1048
–1.2
92.3
KCTD7
62
37
143
96
–1.2
93.2
SLC30A1
65
60
150
266
–1.2
93.2
IL15
167
176
73
113
1.2
93.2
SPG20
125
98
55
74
1.2
92.3
HIST1H2AC
132
156
301
367
–1.2
91.5
DNASE1L3
63
98
28
60
1.2
92.3
CRIP3
SERPINB6
341
219
151
164
1.2
91.5
B4GALT2
135
95
60
59
1.2
94.0
29
12
64
46
–1.2
94.0
387
282
172
124
1.2
95.7
KLF7 ABCA9
the dataset reduced to their respective small‐scale signatures. PCA is
List of the most discriminatory genes (126) to predict the NOTCH1 mutation ordered by decreasing absolute FC with an accuracy of 95.7%. Class 1 corresponds to samples with mutated NOTCH1, and class 2 corresponds to those with unmutated NOTCH1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated NOTCH1), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.
dataset in a reduced dimensional space. In this case, we used a two‐ dimensional plot to show that the classification problem approaches a linearly separable behavior in the PCA space (it is possible to separate both populations by a plane) when the dataset is reduced to the corresponding small‐scale signatures, in contrast that observed when all the genetic probes are taken into account. This can be considered as graphical proof of the discriminatory power of these small‐scale signatures.35 Figures 9, 10 and 11 show the PCA plots for IgVH, NOTCH1 and SF3B1 mutational status.
3.6 | Gene intersections for IgVH, NOTCH and SF3B1 mutations We analyzed the intersection between the most discriminatory genes for IgVH, NOTCH1 and SF3B1 mutations as defined by FR and FC analyses. We consolidated both lists for each mutation, and then performed pairwise intersections to establish shared genes. This would serve to determine which effects might be amplified by these mutations. Figure 12 shows the results of these intersections. The intersection with the greater number of genes is NOTCH1–SF3B1 (19 genes), followed by IgVH–NOTCH1 (11 genes) and IgVH–SF3B1 with only five genes. Only four genes were common to all mutations: IGHG1, MYBL1, NRIP1 and RGS13.
4
|
DISCUSSION AND CONCLUSIONS
In the present study, we demonstrate genomic data integration in CLL patients by linking together microarray expression data and their IgVH, NOTCH1 and SF3B1 mutational status. The present study focuses on a novel methodological approach aiming to define hierarchical gene rela-
given cut‐off (x‐value). The lower and upper percentiles and the
tionships among CLL patients expressing these three different muta-
median accuracy can be read directly in this plot, which provides stabil-
tions and to establish the predictive accuracy of gene clusters
ity in the predictive accuracy as a function of the variability in the train-
relative to each mutation. Because of the high dimensionality of the
ing and validation sets. Table 8 shows the main statistics related to this
microarray data with respect to the number of available samples, these
stability analysis. The minimum accuracy that has been obtained was
types of phenotype prediction problems have a very high
80.5% in the case of the IgVH mutation. In all cases, the median accu-
underdetermined character. Therefore, simple and efficient methods
racy is close to the accuracy obtained via LOOCV. The SF3B1 muta-
are needed to rank genes according to their discriminatory power
tional status has been more accurately predicted, as was already the
and establish their predictive accuracy. We used two well‐known filter
case in the LOOCV experiment.
techniques: FR and FC. For each mutation and ranking method, we
Table 9 shows the frequency analysis of the most sampled dis-
determined the shortest list of high discriminatory genes with its corre-
criminatory genes found in the hold‐out analysis for all the mutations.
sponding LOOCV predictive accuracy. Using this methodology, we
Most of the genes also appeared in the lists of the most discriminatory
predicted IgVH mutational status and how the NOTCH1 and SF3B1
genes given by FR and FC via LOOCV. These lists were also used to
mutations affect the expression of different genes and their correlation
perform pathway analysis and for comparison with those that were
networks via the PC and NMI similarity coefficients. In this discussion,
obtained for the FC and FR lists.
we also provide the top ontological pathways that are involved in the disease progression, using GeneAnalytics. Correlation networks and
3.5 | Principal component analysis (PCA) analysis of the small‐scale signatures
canonical pathways provide effective methodologies for understanding the mechanisms that are involved in disease progression. This methodology served us to depict the gene clusters that are most
To show the discriminatory power of the small‐scale signatures for
strongly associated with the expression of each selective mutation
IgVH, NOTCH1 and SF3B1, we performed PCA of these three
(networks of synergistically working genes) and their relationship
10 of 23
FERNÁNDEZ‐MARTÍNEZ
ET AL.
FIGURE 4
Correlation networks of the most discriminatory genes (FR) for the NOTCH1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI
between mutation expression and a particular clinical outcome (survival).
polymerase chain reaction.43 Particularly, CRY1, LDOC1 and LPL were over‐expressed in IgVH‐unmutated compared with IgVH‐mutated
The main conclusions for each of these mutations are presented below.
cases. Conversely, COBLL1 and ZBTB20 were under‐expressed. The analysis of differential expression shows that LPL is also the second gene with the highest absolute FC (3.24) and is over‐expressed in the
4.1
|
group with unmutated IgVH of worst prognosis. RSG13 is the other
IgVH
gene that is highly under‐expressed in the same group. Other high dis-
The IgVH mutational status was predicted with very high accuracy
criminatory genes of the IgVH mutational status are CRY1, SPG20,
(94%) using a small‐scale signature composed of 13 genes. LPL (Lipo-
ZBTB20, NRIP1 and ZAP‐70, confirming findings reported previ-
protein Lipase) is the most discriminatory gene of the list, with two
ously.18–20
probes having a FR of 4.6 and 3.7. The predictive power of the first
The hold‐out analysis (Figure 8A) has shown a minimum predictive
LPL probe is also very high, providing a LOOCV predictive accuracy
accuracy of the small‐scale signature of 80.5%, a median accuracy of
of 87.1%. LPL has a lower expression in the patients with mutated
92.7% and a maximum accuracy of 100%. Also, the most frequently
20
in a study that also found
sampled genes in the hold‐out experiment coincide with expanded lists
high LPL mRNA expression to be associated with shorter treatment‐
of the most discriminatory genes found via FC and FR. This analysis
free survival. LPL is a very specific biomarker because its activity is
has also highlighted the importance of FCRL2 and FCRL3 genes related
IgVH. This was also reported previously
42
low or absent in other blood cells types. Kaderi et al
noted that LPL
was the strongest prognostic factor in comparative analysis of RNA‐
to immune receptor translocation proteins that may play a role in the regulation of the immune system.
based markers in early CLL. The results shown in the present study
The 2D PCA plots (Figures 9A and 9B) of both groups for IgVH
confirm that LPL has almost twice the discriminatory power of ZAP‐
mutational status indicate that the classification approaches a linearly
70. Several genes of this list have several probes involved and belong
separable behavior when the small‐scale signature is used because it
to the list of 24 genes differentially expressed in studies identifying
becomes very easy to predict the mutational status of a new incoming
and validating biomarkers of IgVH mutational status using the
sample by projecting it into the 2D PCA space shown in Figure 9B.
FERNÁNDEZ‐MARTÍNEZ
ET AL.
FIGURE 5 Correlation networks of the most differentially expressed genes (FC) for the NOTCH1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI
11 of 23
12 of 23
TABLE 6
Gene
FERNÁNDEZ‐MARTÍNEZ
SF3B1 mutational status prediction using FR μ1
σ1
μ2
σ2
Therefore, it can be concluded that the small‐scale signature gathers
FR (log)
Acc (%)
RPL32
859
228
513
115
2.6
94.0
KLF8
131
45
59
30
2.4
94.0
PDGFD
85
34
42
20
2.2
95.7
PLAGL1
171
87
336
118
2.2
94.0
KLF3
40
29
239
221
2.2
94.0
UQCC
27
7
41
7
2.1
94.9
HBA1
3650
2978
755
2218
2.1
96.6
CNPY2
206
73
317
70
2.1
97.4
TMC6
322
74
546
155
2.0
97.4
CSNK2B
71
37
141
38
2.0
97.4
PLAGL1
282
135
507
174
2.0
97.4
PIP5K1B
55
32
212
200
1.9
98.3
DGKG
44
16
115
70
1.9
97.4
12044
6627
2783
5082
1.9
98.3
PLAGL1
138
83
252
92
1.9
98.3
ZNF76
34
8
61
20
1.8
98.3
AMT
48
8
97
41
1.8
97.4
206
108
368
156
1.8
97.4
HBB
8359
5278
1777
3669
1.8
97.4
ACTR2
3113
266
3789
506
1.8
97.4
GLIPR1
115
107
359
261
1.7
97.4
MAST4
136
89
59
60
1.7
99.1
HBB
STK38
List of most discriminatory genes (22) to predict the SF3B1 mutation, ordered by decreasing FR with an accuracy of 99.1%. Class 1 corresponds to samples with mutated SF3B1, and class 2 corresponds to those with unmutated SF3B1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated SF3B1), whereas μ2 and σ2 indicate the same for the unmutated group. FR (log), logarithmic FR; Acc, accuracy. TABLE 7
ET AL.
important genetic features involved in the mutational status of the immoglobulin variable region heavy chain (IgVH). The PC network shows a main branch relating LPL and ZBTB20. ZBTB20 is a transcription factor that may be involved in hematopoiesis, oncogenesis and innate immune responses.44 ZBTB20 and LPL have been shown to predict survival in B‐cell CLL.45 Diseases associated with ZBTB20 include bone lymphoma. ZBTB20 expression is increased in hepatocellular carcinoma associated with poor prognosis.46 This network also shows connections between LPL and COBLL1, LDOC1, LOC100128252, KANK2, WSB2 and ANKRD57‐SEPT10. Some of these genes are known to have important roles in CLL and cancer. COBLL1 (Cordon‐Bleu Protein‐Like 1) is a gene related to actin binding. Actins participate in important processes such as muscle contraction, cell motility, cell division and cytokinesis, and cell signaling, etc. This gene is down‐regulated in CLL groups with a poor prognosis.47 LDOC1 (Leucine Zipper, Down‐Regulated in Cancer 1) has been proposed as a tumor suppressor gene whose protein product may have an important role in the development and/or progression of some cancers.48 It is considered to regulate the transcriptional responses by NF‐kappa B that play a key role in regulating the immune response to infection. It has been also shown that LDOC1 is differentially expressed in CLL and is a good predictor for overall survival in untreated patients.49 The NMI shows two main branches with TBC1D2B and SEPT10. SEPT10 has been associated with B‐cell CLL.50,51 GO annotations related to TBC1D2B include phospholipid binding. The Mutual information network appears to delineate the gene clusters better than the PC network. The analysis of the networks for the most differentially expressed genes provided by FC analysis (Figures 3A and 3B) shows the
SF3B1 mutational status prediction using FC
Gene ANXA4
μ1
σ1 42
μ2
σ2
19
430
524
fc (log)
Acc (%)
–3.3
87.2
IGHG1
63
37
599
1472
–3.3
88.0
ANXA4
41
17
333
408
–3.0
85.5
ANXA4
55
26
419
498
–2.9
87.2
ANXA4
47
17
335
408
–2.8
88.9
ANKRD36BP2
32
31
227
683
–2.8
90.6
TSPAN13
29
17
204
338
–2.8
86.3
ANXA4
52
24
364
460
–2.8
87.2
ANXA4
44
17
279
343
–2.7
88.9
MYBL1
41
36
261
406
–2.7
93.2
ANXA4
28
10
173
218
–2.6
91.5
KLF3
40
29
239
221
–2.6
92.3
NT5E
17
2
96
210
–2.5
92.3
GNB4
24
7
130
253
–2.4
90.6
TRPV3
198
525
38
67
2.4
92.3
ZNF608
16
2
82
166
–2.4
91.5
RBM20
26
25
132
212
–2.3
93.2
KLRK1 CNTNAP2 HBA1
53
63
266
610
–2.3
93.2
135
188
655
801
–2.3
91.5
3650
2978
755
2218
2.3
90.6 (Continues)
FERNÁNDEZ‐MARTÍNEZ
TABLE 7
Gene PPP1R9A HBB HTRA3 KLRK1 HBB TUBB6 CNTNAP2
13 of 23
ET AL.
(Continued)
μ1
σ1
μ2
σ2
fc (log)
Acc (%) 90.6
22
18
103
145
–2.2
8359
5278
1777
3669
2.2
87.2
202
409
45
55
2.2
88.9 88.9
51
48
227
512
–2.2
12044
6627
2783
5082
2.1
88.9
48
42
204
398
–2.1
88.9
189
266
789
1105
–2.1
88.0 88.0
TCTN1
42
32
172
438
–2.0
PTPRK
14
1
58
117
–2.0
88.9
HOMER3
59
31
239
336
–2.0
88.0
PIP5K1B
55
32
212
200
–2.0
89.7
39
43
148
227
–1.9
90.6
1224
2650
339
591
1.9
90.6
15100
6062
4201
5929
1.8
90.6
39
44
136
161
–1.8
88.9
166
259
48
56
1.8
88.9
21
9
71
99
–1.8
89.7
CD69
362
484
106
153
1.8
88.9
FCRL3
118
154
399
359
–1.8
89.7
PPP1R9A S100A9 HBB VASH1 CD69 PPP1R9A
43
39
145
212
–1.7
91.5
FOSB
659
1112
199
418
1.7
91.5
RASSF6
362
720
110
274
1.7
92.3
51
67
168
308
–1.7
93.2
MAP3K8
SDPR CYBB
91
112
295
433
–1.7
93.2
PSD3
235
211
74
175
1.7
93.2
ADM
153
206
48
84
1.7
93.2
CD72
148
94
471
297
–1.7
91.5
FOS
928
1556
292
513
1.7
91.5
25
8
80
123
–1.7
91.5
365
1057
1151
2084
–1.7
90.6
BCL7A
61
56
193
248
–1.7
92.3
GLIPR1
115
107
359
261
–1.6
93.2
96
160
296
533
–1.6
94.9
1105
1870
360
631
1.6
94.0
30
16
92
144
–1.6
93.2
RAB20 IGKC
SDPR FOS FCER1A CX3CR1
32
14
97
141
–1.6
92.3
GNG11
60
85
179
322
–1.6
94.0
GLIPR1
127
79
373
246
–1.6
94.0
FCRL3
1498
1626
4365
3100
–1.5
90.6
64
42
185
196
–1.5
90.6
FCRL3
1565
1755
4552
3186
–1.5
90.6
ADM
163
221
56
102
1.5
89.7
RPL31
803
501
279
140
1.5
95.7
TUBB1
176
297
505
870
–1.5
95.7
BCL7A
20
3
57
77
–1.5
94.9
IGKV1–5
367
1058
1047
2114
–1.5
94.0
HOMER3
17
2
49
80
–1.5
94.0
SLAMF1
PLAC8
559
386
1588
1047
–1.5
94.0
BTNL9
201
384
71
300
1.5
94.0
FCRL1
186
105
527
307
–1.5
94.0 (Continues)
14 of 23
TABLE 7
Gene CNR1
FERNÁNDEZ‐MARTÍNEZ
ET AL.
(Continued)
μ1
σ1
μ2
σ2
fc (log)
Acc (%)
301
362
106
236
1.5
94.0
CLEC4C
24
9
69
95
–1.5
94.9
FCGBP
51
26
145
209
–1.5
95.7
GLIPR1
189
117
532
361
–1.5
94.9
CYBB
28
19
80
108
–1.5
94.9
CLLU1
449
597
161
355
1.5
94.9
GEN1
77
31
215
374
–1.5
94.0
GAPT
64
38
177
139
–1.5
94.0
PPBP
624
1065
1730
2686
–1.5
94.9
RASSF6
124
258
45
100
1.5
94.9
NRIP1
23
17
63
60
–1.5
94.9 94.0
22
9
62
78
–1.5
355
777
130
399
1.5
93.2
SIGLEC6
37
47
100
164
–1.4
92.3
GLIPR1
207
120
554
354
–1.4
94.0
CNTNAP2 RGS13
SMAD3
51
33
136
198
–1.4
94.0
PRDM1
128
201
48
51
1.4
94.9
NRIP1
73
119
195
187
–1.4
94.0
RNGTT
981
825
370
627
1.4
94.0 93.2
PF4
85
114
226
319
–1.4
CYB5A
49
17
130
104
–1.4
93.2
FHDC1
52
43
137
148
–1.4
94.0
GLIPR1
168
135
444
292
–1.4
94.0
FCGR3A
71
82
187
268
–1.4
94.9
GLIPR1
346
186
907
595
–1.4
95.7
RNGTT
673
558
257
419
1.4
94.9
CYB5A
62
34
163
135
–1.4
94.9
TCF7
306
247
798
780
–1.4
95.7
GEN1
38
8
99
213
–1.4
95.7
RNGTT
972
755
374
622
1.4
95.7
ITGAX
196
217
510
356
–1.4
95.7
DGKG
44
16
115
70
–1.4
94.0
TCF7
100
54
259
267
–1.4
94.9
RASGRP1
177
137
455
315
–1.4
94.9
44
29
112
87
–1.4
94.0
PDK4
31
14
78
74
–1.3
92.3
KLF3
22
4
56
73
–1.3
92.3
SSBP2
KLF3
27
6
68
56
–1.3
92.3
PF4
107
163
272
399
–1.3
93.2 92.3
FGL2
138
152
349
424
–1.3
IPCEF1
101
100
40
34
1.3
95.7
FCRL1
775
460
1952
1051
–1.3
95.7
FCRL1
642
350
1610
883
–1.3
95.7 95.7
FRMD5
58
82
23
14
1.3
NSUN7
171
168
69
95
1.3
95.7
FCRL2
109
75
271
216
–1.3
95.7
HBM
41
31
103
708
–1.3
95.7
RGS1
871
1001
351
864
1.3
96.6
List of the most discriminatory genes (118) to predict the SF3B1 mutation ordered by decreasing absolute FC with an accuracy of 96.6%. Class 1 corresponds to samples with mutated SF3B1, and class 2 corresponds to those with unmutated SF3B1. μ1 and σ1 refer, respectively, to the mean expression and SD in class 1 (mutated SF3B1), whereas μ2 and σ2 indicate the same for the unmutated group. fc, FC; Acc, accuracy.
FERNÁNDEZ‐MARTÍNEZ
15 of 23
ET AL.
FIGURE 6 Correlation network of the most discriminatory genes (FR) for the SF3B1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI
importance of the RGS13–SPG20 connection. RGS13 encodes a pro-
within the set of the 13 most discriminatory genes. The mutual infor-
tein of the RGS family and has been associated with mantle cell lym-
mation network also highlights the connection RGS13–LMNA–SPG20.
phoma. SPG20 encodes the protein called Spartin that appears to be
LMNA is a gene that in involved in many pathways, including apoptosis
related to endocytosis. This gene has been associated to ZAP70 and
and the caspase cascade.
LPL as a good prognostic biomarker in CLL survival.18 Hyper‐methyla-
GeneAnalytics software has shown that the most important path-
tion of SPG20 in early stage CLL has been positively associated with
ways involved are the inflammatory response and the PAK pathways.
progression‐free survival, supporting the idea that epigenetic changes
The main GO biological processes involved were related to aging and
have clinical impact in CLL.52 Two different probes of this gene are
vasoconstriction, and the main GO molecular functions were
16 of 23
FERNÁNDEZ‐MARTÍNEZ
ET AL.
FIGURE 7 Correlation networks of the most differentially expressed genes (FC) for the SF3B1 mutational status prediction. A, Using the PC coefficient. B, Using the NMI
FERNÁNDEZ‐MARTÍNEZ
17 of 23
ET AL.
synthase binding. Finally, the main compounds that were identified target IGHG1 and IGKC. Finally, pathway analysis using the list of the first 300 most‐frequently sampled genes in the hold‐outs analysis demonstrated the importance of certain networks: B cell development pathways; the inflammatory response pathway; the gastric cancer network 2; the T cell co‐signaling pathway (ligand‐receptor interactions); and the Th17 differentiation pathway. Both analyses have provided common mechanisms.
4.2
|
NOTCH1
The NOTCH1 mutation was also predicted with very high accuracy (96%) using a set of 60 most discriminatory genes. MSI2 was the most important gene, with five probes affected by this mutation. This gene plays an important role in post‐transcriptional gene regulation and tumorigenesis. MSI1 and MSI2 are cooperatively involved in the proliferation and maintenance of central nervous system stem cell populations.53 Also, their expression levels in human myeloid leukemia directly correlate with decreased survival in patients, defining MSI2 as a new prognostic marker and a new target for therapy.54 The Musashi‐Numb pathway can control the differentiation of chronic myeloid leukemia cells. MSI2 expression is upregulated during human chronic myeloid leukemia progression and is also an early indicator of poorer prognosis.55 The analysis of most differentially expressed genes in this mutation has showed the importance of TFEC and CNTNAP2. TFEC (Transcription Factor EC) encodes a member of the microphthalmia (MiT) family. MiT transcription factors play an important role in multiple cellular processes, including survival, growth and differentiation. CNTNAP2 encodes a member of the neurexin family with functions in cell adhesion. This protein contains epidermal growth factor and laminin G domains. Annotations related to this gene include enzyme binding and receptor binding. This gene has been associated with genetic risk prediction for acute myeloid leukemia56 and is involved in the genomic abnormalities for this illness.57 The hold‐out analysis (Figure 8B) has shown a minimum predictive accuracy of the small‐scale signature of 82.8%, a median accuracy of 96.6% and a maximum accuracy of 100%. As in the previous case, the most frequently sampled genes in the hold‐out experiment coincide with expanded lists of most discriminatory genes found via FC and FR. This analysis has highlighted the importance of ST3GAL1, as FIGURE
8 Hold‐out stability analysis. Cumulative probability functions for the predictive accuracy of the small‐scale lists in the IgHV (A), NOTCH1 (B) and SF3B1 (C) mutations
involved in glycosphingolipids biosynthesis and associated with colorectal cancer and immune diseases (HIV‐1). The PC network of the most discriminatory genes (Figure 4A) shows three main connections: MSI2‐ACSL5, MSI2‐CNTNAP2 and
cholesterol binding, antigen binding, patched binding and nitric‐oxide
MSI2‐WSB2, whereas the NMI network (Figure 4B) shows the connections to NCK2, LPL, WSB2 and SPG20. Some of these genes are not
TABLE 8
Statistical analysis of the predictive accuracy of the minimum‐scale signature in these three mutations, obtained after 500 random 75–25 hold outs: minimum, maximum, mean, median, SD and interquartile range (IQR) Min
Max
Mean
Median
STD
IQR
IgVH
80.5
100
93.1
92.7
3.6
4.9
NOTCH1
82.8
100
94.6
96.6
3.6
3.5
SF3B1
89.7
100
98.3
2.2
3.5
known to play a role in CLL. The FC networks (Figures 5A and 5B) are simpler and, in both cases (PC and NMI), show one main connection between TFEC and NRIP1. NRIP1 encodes a nuclear protein that specifically interacts with the hormone‐dependent activation domain AF2 of nuclear receptors
100
and modulates transcriptional activity of the estrogen receptor. This gene has been shown to have predictive value in CLL together with LPL and ZAP‐70.18
18 of 23
TABLE 9
FERNÁNDEZ‐MARTÍNEZ
ET AL.
Hold‐out frequency analysis
IgVH Gene name
NOTCH1 Frequency
Gene name
SF3B1 Frequency
Gene name
Frequency
WSB2
2.79
ST3GAL1
0.18
HBB
0.33
LPL
2.79
LAG3
0.18
TMC6
0.33
LPL
2.79
MSI2
0.18
PIP5K1B
0.33
NRIP1
2.79
MSI2
0.18
UQCC
0.33
FCRL2
2.79
LTK
0.18
KLF3
0.33
PYHIN1
2.79
MSI2
0.18
MAST4
0.33
ZBTB20
2.79
CNTNAP2
0.18
FAM114A1
0.33
FCRL2
2.79
KIAA1715
0.18
KLF8
0.33
ZBTB20
2.79
MSI2
0.18
AMT
0.33
SH3PXD2A
2.51
MSI2
0.18
HBB
0.33
COBLL1
2.51
NCK2
0.18
HBA1
0.33
FCRL2
2.51
LPL
0.18
RPL32
0.33
LDOC1
2.23
ZNF333
0.18
CSNK2B
0.33
ZAP70
2.23
PI4K2A
0.18
ZNF76
0.33
PYHIN1
2.23
CNTNAP2
0.18
DGKG
0.33
USP6NL
2.23
DCLK2
0.18
ACTR2
0.33
COBLL1
2.23
LTK
0.17
DLG2
0.33
FCRL2
2.23
SRSF1
0.17
PLAGL1
0.33
FCRL2
2.23
C19orf28
0.17
YWHAB
0.33
FCRL2
2.23
HNRNPR
0.17
ANXA4
0.32
FCRL3
2.23
RMND5A
0.17
ZC3H12B
0.32
KLF12
1.96
C10orf137
0.17
CNTNAP2
0.31
FCRL2
1.96
AQR
0.17
PLAGL1
0.31
FCRL3
1.96
SRSF1
0.17
PDGFD
0.31
PYHIN1
1.96
PTDSS1
0.17
PDGFD
0.30
KLF12
1.68
PREPL
0.16
DGKG
0.30
DCLK2
1.68
GNPNAT1
0.16
PSD3
0.30
WSB2
1.68
EXD2
0.16
PLAGL1
0.29
FCRL2
1.68
C10orf137
0.16
VPS29
0.29
LOC100128252
1.68
FNIP1
0.16
LOC100287515
0.28
FCRL2
1.68
LTN1
0.16
HBB
0.28
SPG20
1.40
DTX1
0.16
FKBP3
0.28
MBOAT1
1.40
REEP5
0.16
HMGB2
0.28
LOC100128252
1.40
CCDC24
0.16
ATP2B1
0.28
TBC1D2B
1.12
ZNF544
0.16
FBXO38
0.28
ATOX1
1.12
FCRL2
0.16
RNF135
0.28
RPS4Y1
0.84
LDOC1
0.16
XPC
0.27
ADRBK2
0.84
C1orf38
0.16
JAZF1
0.27
DDX3Y
0.84
ICK
0.16
GLIPR1
0.27
List of the most sampled discriminatory genes for the IgVH, NOTCH1 and SF3B1 mutational status prediction, obtained after 500 random simulations.
Pathways analysis using GeneAnalytics has shown that the main
triglyceride biosynthetic processes and neural tube formation. The
discriminatory and differentially expressed genes are related to multi-
main GO molecular function is antigen binding and the compounds
ple cellular processes, including survival, growth and differentiation,
found also target the genes IGHG1 and IGKC. Other types of com-
apoptosis and host defense. The main super‐pathways were the crea-
pounds are retinoid (APOD, BMP6, NRIP1, PRKCA, RORA, THRB) and
tion of C4 and C2 activators, FCGR‐dependent phagocytosis and adi-
valine (FCGR3A, IGKC, LPL, PRKCA, THRB). Finally, pathway analysis
pogenesis. The main GO biological processes involve the regulation
using the list of the first 300 most‐frequently sampled genes in the
of the smoothened signaling pathway, the immune response, retina
hold‐out analysis has shown the importance of the certain networks:
homeostasis, positive regulation of cardiac muscle hypertrophy, the
the MRNA splicing – major pathway; gene expression; the notch sig-
Fc‐gamma receptor signaling pathway involved in phagocytosis,
naling pathway (KEGG); TCR signaling (REACTOME); and the MRNA
FERNÁNDEZ‐MARTÍNEZ
FIGURE 9
19 of 23
ET AL.
IgVH mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature
FIGURE 10
NOTCH1 mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature
FIGURE 11
SF3B1 mutational status. A, 2D PCA plot using all the genes. B, 2D PCA plot in the small‐scale signature
surveillance pathway. The Notch signaling pathway occurs within
named ZAC1), and two different probes of HBB. PLAGL1 encodes a
these
is
C2H2 zinc finger protein with transactivation and DNA‐binding activ-
progressing in the right direction. Both analyses highlight the impor-
ities and has been shown to have anti‐proliferative properties as a
tance of different signaling pathways and the immune response.
tumor suppressor.58 HBB (hemoglobin beta) expression in CLL sam-
pathways,
confirming
that
bioinformatics
modeling
ples with SF3B1 mutation is almost six times greater than HBB
4.3
|
SF3B1
Finally, the SF3B1 mutation was predicted with almost 100% accu-
expression in the unmutated samples. The analysis of most differentially expressed genes in this mutation showed the importance of ANXA4, with seven different probes in this list. ANXA4 (Annexin IV)
racy using a list of the 22 most discriminatory genes, including
belongs to the annexin family of calcium‐dependent phospholipid
RPL32, KLF8 and PDGFD, three different probes of PLAGL1 (also
binding proteins.
20 of 23
FERNÁNDEZ‐MARTÍNEZ
ET AL.
FIGURE 12 Intersection among the most discriminatory genes of the IgVH, NOTCH1 and SF3B1 mutations. The three main mutations are represented with a rectangle and the most discriminatory genes are surrounded by ellipses. An edge represents that the gene appears as most discriminatory for a specific mutation. Genes with three edges (surrounded by a dot rectangle) are common to these three main mutations
The hold‐out analysis (Figure 8C) has shown a minimum predictive
lipoxins on super‐oxide production in neutrophils; the HIF‐1 alpha
accuracy of the small‐scale signature of 89.7%, a median accuracy of
transcription factor network; the Fc‐gamma receptor signaling path-
98.3% and a maximum accuracy of 100%. This analysis has highlighted
way; the PAK pathway; the immune response CCR signaling in eosin-
the importance of HBB that encodes the beta globin involved in oxy-
ophils; and the GPCR pathway. The main GO molecular function
gen transport.
involved is oxygen transporter activity and the main GO biological
The Pearson network of the most discriminatory genes (Figure 6A)
processes are oxygen transport, the innate immune response, the
shows two main networks of RPL32, with YWHAB and KLF8. YWHAB
Fc‐gamma receptor signaling pathway involved in phagocytosis, and
(Tyrosine 3‐Monooxygenase/Tryptophan 5‐Monooxygenase Activa-
signal transduction. Finally, the main compounds found were thymi-
tion Protein, Beta) encodes a protein that has been shown to interact
dine (ADM, CD69, CD72, FOS, HBB, NT5E, PF4, PPBP, RNGTT and
with RAF1 and CDC25 phosphatases, suggesting that it may play a role
SMAD3) and the drugs targeting the genes HBA1 and HBB
in linking mitogenic signaling and the cell cycle machinery. KLF8
concerning hemoglobin. Finally, pathway analysis using the list of
(Kruppel‐Like Factor 8) encodes a protein that is a member of the
the first 300 most‐frequently sampled genes in the hold‐out analysis
Sp/KLF family of transcription factors, and is considered to play an
has shown the importance of certain networks: WNT mediated acti-
59
important role in metastasis.
Diseases associated with KLF8 include
ovarian epithelial cancer and mental retardation.
vation of DVL; Epstein–Barr virus infection; actin nucleation by ARP‐ WASP complex; rheumatoid arthritis; and CD28 co‐stimulation.
The NMI correlation (Figure 6B) shows one main network relating
Most of the genes affected by the SF3B1 mutation are not known
RPL32 with CNPY2‐STK38. Related to this last gene are hemoglobin
to play a role in CLL. In these three classification problems, the FR
HBA1 and HBB. Also, the analysis of the networks for the most differ-
always provided the shortest list with the highest discriminatory
entially expressed genes (Figures 7A and 7B) shows the importance of
power, whereas FC gave longer lists of genes with lower predictive
the connections of ANXA4 with CYBB and FCRL3 (PC network) and one
accuracy in general terms. Therefore, we can conclude that the most
main connection with ADM that is involved in vasodilation, regulation
differentially expressed genes are not the most discriminatory as a
of hormone secretion, promotion of angiogenesis and antimicrobial
result of the high variability of these genes in the groups of patients
activity. The NMI network has a peculiar circular geometry where all
with unmutated NOTCH1 and SF3B1.
the genes relate to ADM. The most important pathways found using GeneAnalytics were
Regarding commonalities, the NOTCH1 and SF3B1 mutations share a longer list of high discriminatory genes than the IgVH mutation.
NFAT in the immune response; ERK signaling; the immune response
Furthermore, the relationship IgVH–NOTCH1 is stronger than the rela-
of DAP12; creation of C4 and C2 activators; inhibitory action of
tionship IgVH–SF3B1. Using an expanded list of high discriminatory
FERNÁNDEZ‐MARTÍNEZ
21 of 23
ET AL.
and differentially expressed genes, we have shown that these three mutations share only four genes (IGHG1, MYBL1, NRIP1 and RGS1) that are related to immune diseases, blood diseases, rare diseases and cancer. To better understand the functional and biological ramifications of the most discriminatory and common genes associated with the IgVH, NOTCH1 and SF3B1 mutations (IGHG1, MYBL1, NRIP1 and RGS13) (Figure 12), we evaluated the super‐pathway fit using GeneAnalytics software and identified the IL‐4‐mediated signaling events pathway as the one that was highly matched. This finding is particularly relevant because IL‐4 regulation of B‐cell receptor signaling has been identified as integral to CLL (in CLL, both B‐ and T‐cells highly express IL‐4).60–62 We then performed a secondary analysis to identify KEGG pathways using BinoX63 and noted thyroid hormone signaling, notch signaling, viral carcinogenesis and, interestingly, chronic myeloid leukemia. Clearly, the identification of notch signaling is consistent with the findings that we reported earlier in the present study, and viral carcinogenesis is not unexpected given the possible pathogenesis of CLL. We suspect that the other two pathways were noted because of common components, although definitive interpretation requires additional investigation. Finally, using GeneAnalytics, we have identified the main pathways, as well as GO molecular and biological functions, that are involved in each mutation, and also the compounds that are at disposal and the genes that could be targeted. These analyses suggest an important role of the immune response and antigen presentation in the three cases. This methodology could also be applied to analyze the effect of other mutations in CLL and to understand the genesis of other illnesses with a genetic background. The aim of this retrospective analysis was to provide a deeper understanding of the effects of the different mutations in CLL disease progression, with the aim of applying these findings clinically in the near future with respect to the development of new drugs. A future verification of these findings with other independent cohorts could lead to a better design of therapeutic targets.
The authors declare that there are no conflicts of interest. They have also read and accepted the journal's authorship agreement. We would like to thank different researchers from the Biochemistry Department (University of Oviedo, Spain) for having introduced us to this interesting problem. Enrique J. de Andrés‐Galiana salary has assisted
1. Rodríguez‐Vicente AE, Díaz MG, Hernández‐Rivas JM. Chronic lymphocytic leukemia: a clinical and molecular heterogeneous disease. Cancer Genet. 2013;206:49–62. 2. Rai KR, Sawitsky A, Cronkite EP, Chanana AD, Levy RN, Pasternack BS. Clinical staging of chronic lymphocytic leukemia. Blood. 1975;46:219– 234. 3. Binet JL, Auquier A, Dighiero G, et al. A new prognostic classification of chronic lymphocytic leukemia derived from a multivariate survival analysis. Cancer. 1981;48:198–206. 4. Rosenwald A, Alizadeh AA, Widhopf G, et al. Relation of gene expression phenotype to immunoglobulin mutation genotype in b cell chronic lymphocytic leukemia. J Exp Med. 2001;194:1639–1647. 5. Hamblin TJ, Davis Z, Gardiner A, Oscier DG, Stevenson FK. Unmutated ig v(h) genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood. 1999;94:1848–1854. 6. Crespo M, Bosch F, Villamor N, et al. Zap‐70 expression as a surrogate for immunoglobulin‐variable‐region mutations in chronic lymphocytic leukemia. N Engl J Med. 2003;348:1764–1775. 7. Wiestner A, Rosenwald A, Barry TS, et al. Zap‐70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood. 2003;101:4944–4951. 8. Orchard JA, Ibbotson RE, Davis Z, et al. Zap‐70 expression and prognosis in chronic lymphocytic leukemia. Lancet. 2004;363:105–111. 9. Kim SZ, Chow KU, Kukoc‐Zivojnov N, et al. Expression of zap‐70 protein correlates with disease stage in chronic lymphocytic leukemia and is associated with, but not generally restricted to, non‐mutated ig vh status. Leuk Lymphoma. 2004;45:2037–2045. 10. Smolej L, Saudkova L, Spacek M, Kozak T. Zap‐70 in b‐cell chronic lymphocytic leukemia: clinical significance and methods of detection. Vnitr Lek. 2006;52:1194–1199. 11. Mainou‐Fowler T, Dignum HM, Proctor SJ, Summerfield GP. The prognostic value of cd38 expression and its quantification in b cell chronic lymphocytic leukemia (b‐cell). Leuk Lymphoma. 2004;45:455–462. 12. Cruse JM, Lewis RE, Webb RN, Sanders CM, Suggs JL. Zap‐70 and cd38 as predictors of IgVH mutation in CLL. Exp Mol Pathol. 2007;83:459–461. 13. Shanafelt TD, Byrd JC, Call TG, Zent CS, Kay NE. Narrative review: initial management of newly diagnosed, early‐stage chronic lymphocytic leukemia. Ann Intern Med. 2006;145:435–447. 14. Klein U, Tu Y, Stolovitzky GA, et al. Gene expression profiling of b cell chronic lymphocytic leukemia reveals a homogeneous phenotype related to memory b cells. J Exp Med. 2001;194:1625–1638.
ACKNOWLEDGMENTS
been
RE FE RE NC ES
by
the
Spanish
Ministerio
de
Economía
y
Competitividad (grant TIN2011‐23558). No additional research funds were provided to perform this research. JLFM and EJAG prepared the data, designed the machine learning methodology, carried out the experiments, analyzed and interpreted the results and drafted the manuscript. SS revised the design of the methodology critically, analyzed and interpreted the results, and drafted the manuscript. All authors read and approved the final manuscript submitted for publication. There are no competing interests (political, personal, religious, ideological, academic, intellectual, commercial or any other) to declare in relation to this manuscript.
15. Ferrer A, Ollila J, Tobin G, et al. Different gene expression in immunoglobulin‐mutated and immunoglobulin‐ unmutated forms of chronic lymphocytic leukemia. Cancer Genet Cytogenet. 2004;153:69–72. 16. Haslinger C, Schweifer N, Stilgenbauer S, et al. Microarray gene expression profiling of b‐cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and vh mutation status. J Clin Oncol. 2004;22:3937–3949. 17. Vasconcelos Y, De Vos J, Vallat L, et al. Gene expression profiling of chronic lymphocytic leukemia can discriminate cases with stable disease and mutated ig genes from those with progressive disease and unmutated ig genes. Leukemia. 2005;19:2002–2005. 18. Van't Veer MB, Brooijmans AM, Langerak AW, et al. The predictive value of lipoprotein lipase for survival in chronic lymphocytic leukemia. Haematologica. 2006;91:56–63. 19. Nuckel H, Hüttmann A, Klein‐Hitpass L, et al. Lipoprotein lipase expression is a novel prognostic factor in b‐cell chronic lymphocytic leukemia. Leuk Lymphoma. 2006;47:1053–1061. 20. Hartman ML, Kilianska ZM. Lipoprotein lipase: a new prognostic factor in chronic lymphocytic leukaemia. Contemp Oncol (Pozn). 2012;16:474–479.
22 of 23
FERNÁNDEZ‐MARTÍNEZ
ET AL.
21. Oppezzo P, Vasconcelos Y, Settegrana C, et al. The LPL/ADAM29 expression ratio is a novel prognosis indicator in chronic lymphocytic leukemia. Blood. 2005;106:650–657.
42. Kaderi MA, Kanduri M, Buhl AM, et al. Lpl is the strongest prognostic factor in a comparative analysis of rna‐based markers in early chronic lymphocytic leukemia. Haematologica. 2011;96:1153–1160.
22. Döhner H, Stilgenbauer S, Benner A, et al. Genomic aberrations and survival in chronic lymphocytic leukemia. N Engl J Med. 2000;343:1910–1916.
43. Abruzzo LV, Barron LL, Anderson K, et al. Identification and validation of biomarkers of IGV(h) mutation status in chronic lymphocytic leukemia using microfluidics quantitative real‐time polymerase chain reaction technology. J Mol Diagn. 2007;9:546–555.
23. Puente XS, Pinyol M, Quesada V, et al. Whole‐genome sequencing identifies recurrent mutations in chronic lymphocytic leukemia. Nature. 2011;475:101–105. 24. Fabbri G, Rasi S, Rossi D, et al. Analysis of the chronic lymphocytic leukemia coding genome: role of notch1 mutational activation. J Exp Med. 2011;208:1389–1401. 25. Quesada V, Conde L, Villamor N, et al. Exome sequencing identifies recurrent mutations of the splicing factor sf3b1 gene in chronic lymphocytic leukemia. Nat Genet. 2012;44:47–52. 26. Wang L, Lawrence MS, Wan Y, et al. Sf3b1 and other novel cancer genes in chronic lymphocytic leukemia. N Engl J Med. 2011;365:2497–2506. 27. Oscier DG, Rose‐Zerilli MJ, Winkelmann N, et al. The clinical significance of notch1 and sf3b1 mutations in the uk lrf cll4 trial. Blood. 2013;121:468–475. 28. Ramsay AJ, Quesada V, Foronda M, et al. Pot1 mutations cause telomere dysfunction in chronic lymphocytic leukemia. Nat Genet. 2013;45:526–530. 29. Ghia P, Guida G, Stella S, et al. The pattern of cd38 expression defines a distinct subset of chronic lymphocytic leukemia (cll) patients at risk of disease progression. Blood. 2003;101:1262–1269. 30. Rush LJ, Raval A, Funchain P, et al. Epigenetic profiling in chronic lymphocytic leukemia reveals novel methylation targets. Cancer Res. 2004;64:2424–2433. 31. Ferreira PG, Jares P, Rico D, et al. Transcriptome characterization by RNA sequencing identifies a major molecular and clinical subdivision in chronic lymphocytic leukemia. Genome Res. 2014;24:212–226. 32. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7:179–188. 33. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116–5121.
44. Liu X, Zhang P, Bao Y, et al. Zinc finger protein ZBTB20 promotes Toll‐ like receptor‐triggered innate immune responses by repressing IκBα gene transcription. Proc Natl Acad Sci U S A. 2013;110:11097–11102. 45. Nikitin EA, Malakho SG, Biderman BV, et al. Expression level of lipoprotein lipase and dystrophin genes predict survival in B‐cell chronic lymphocytic leukemia. Leuk Lymphoma. 2007;48:912–922. 46. Wang Q, Tan YX, Ren YB, et al. Zinc finger protein ZBTB20 expression is increased in hepatocellular carcinoma and associated with poor prognosis. BMC Cancer. 2011;11:271 47. Ronchetti D, Mosca L, Cutrona G, et al. Small nucleolar RNAs as new biomarkers in chronic lymphocytic leukemia. BMC Med Genomics. 2013;6:27 48. Nagasaki K, Manabe T, Hanzawa H, Maass N, Tsukada T, Yamaguchi K. Identification of a novel gene, LDOC1, down‐regulated in cancer cell lines. Cancer Lett. 1999;140:227–234. 49. Duzkale H, Schweighofer CD, Coombes KR, et al. LDOC1 mRNA is differentially expressed in chronic lymphocytic leukemia and predicts overall survival in untreated patients. Blood. 2011;117:4076–4084. 50. Benedetti D, Bomben R, Dal‐Bo M, et al. Are surrogates of IgVH gene mutational status useful in b‐cell chronic lymphocytic leukemia? The example of septin‐10. Leukemia. 2008;22:224–226. 51. Travella A, Ripollés L, Aventin A, et al. Structural alterations in chronic lymphocytic leukaemia. Cytogenetic and fish analysis. Hematol Oncol. 2013;31:79–87. 52. Ronchetti D, Tuana G, Rinaldi A, et al. Distinct patterns of global promoter methylation in early stage chronic lymphocytic leukemia. Genes Chromosomes Cancer. 2014;53:264–273. 53. Sakakibara S, Nakamura Y, Yoshida T, et al. RNA‐binding protein Musashi family: roles for CNS stem cells and a subpopulation of ependymal cells revealed by targeted disruption and antisense ablation. Proc Natl Acad Sci U S A. 2002;99:15194–15199. 54. Kharas MG, Lengner CJ, Al‐Shahrour F, et al. Musashi‐2 regulates normal hematopoiesis and promotes aggressive myeloid leukemia. Nat Med. 2010;16:903–908.
34. Saligan L, Fernández‐Martínez JL, deAndrés‐Galiana EJ, Sonis S. Supervised classification by filter methods and recursive feature elimination predicts risk of radiotherapy‐related fatigue in patients with prostate cancer. Cancer Inform. 2014;13:141–152.
55. Ito T, Kwon HY, Zimdahl B, et al. Regulation of myeloid leukaemia by the cell‐fate determinant Musashi. Nature. 2010;466:765–768.
35. deAndrés‐Galiana EJ, Fernández‐Martínez JL, Sonis ST. Design of biomedical robots for phenotype prediction problems. J Comput Biol. 2016;23:678–692.
56. Heo SG, Hong EP, Park JW. Genetic risk prediction for normal‐karyotype acute myeloid leukemia using whole‐exome sequencing. Genomics Inform. 2013;11:46–51.
36. Witten IH, Eiben F, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Burlington, MA: Morgan Kaufmann; 2011.
57. De Weer A, Poppe B, Vergult S, et al. Identification of two critically deleted regions within chromosome segment 7q35‐q36 in EVI1 deregulated myeloid leukemia cell lines. PLoS One. 2010;5:e8676
37. Peng HC, Long F, Ding C. Feature selection based on mutual information: criteria of max‐dependency, max‐relevance, and min‐ redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27:1226–1238.
58. Varrault A, Ciani E, Apiou F, et al. hZac encodes a zinc finger protein with antiproliferative properties and maps to a chromosomal region frequently lost in cancer. Proc Natl Acad Sci U S A. 1998;95:8835–8840.
38. Stelzer G, Inger A, Olender T, et al. GeneDecks: paralog hunting and gene‐set distillation with GeneCards annotation. OMICS. 2009;13: 477–487.
59. Wan W, Zhu J, Sun X, Tang W. Small interfering RNA targeting Krüppel‐like factor 8 inhibits U251 glioblastoma cell growth by inducing apoptosis. Mol Med Rep. 2012;5:347–350.
39. Rossi D, Rasi S, Fabbri G, et al. Mutations of notch1 are an independent predictor of survival in chronic lymphocytic leukemia. Blood. 2012;119:521–529.
60. Douglas RS, Capocasale RJ, Lamb RJ, Nowell PC, Moore JS. Chronic leukemia B cells are resistant to the apoptotic effects of transforming growth factor‐beta. Blood. 1997;89:941–947.
40. Willander K, Dutta RK, Ungerbäck J, et al. Notch1 mutations influence survival in chronic lymphocytic leukemia patients. BMC Cancer. 2013;13:274
61. Rossman ED, Lewin N, Jeddi‐Tehrani M, Osterborg A, Mellstedt H. T‐cell cytokines in patients with B cell chronic lymphocytic leukemia. Eur J Hematol. 2002;68:299–306.
41. Wan Y, Wu CJ. SF3B1 mutations in chronic lymphocytic leukemia. Blood. 2013;121:4627–4634.
62. Guo B, Zhang L, Chiorazzi N, Rothstein TL. IL‐4 rescues surface IgM expression in chronic lymphocytic leukemia. Blood. 2016;128:553–562.
FERNÁNDEZ‐MARTÍNEZ
23 of 23
ET AL.
63. Ogris C, Guaia D, Helleday T, Sonnhammer EL. A novel method for crosstalk analysis of biological networks: improving accuracy of pathway annotation. Nucleic Acids Res. 2016. doi: 10.1093/nargl/gkw849
SUPPORTING I NFORMATION Additional Supporting Information may be found online in the supporting information tab for this article.
How to cite this article: Fernández‐Martínez JL, deAndrés‐ Galiana EJ, Sonis ST. Genomic data integration in chronic lymphocytic leukemia. J Gene Med. 2017;19:e2936. https://doi. org/10.1002/jgm.2936