Genomics 103 (2014) 48–55

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno

Gene expression profile based classification models of psoriasis

Pi Guo a,b,c,1, Youxi Luo a,b,1, Guoqin Mai a,b, Ming Zhang f,g, Guoqing Wang d,e,⁎⁎, Miaomiao Zhao a,b, Liming Gao a,b, Fan Li d,e, Fengfeng Zhou a,b,⁎

a Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
b Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
c Department of Public Health, Shantou University Medical College, No. 22 Xinling Road, Shantou, Guangdong 515041, PR China
d Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China
e Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Changchun, Jilin 130021, PR China
f Department of Epidemiology and Biostatistics, Faculty of Infectious Diseases, University of Georgia, Athens, GA 30605, USA
g Institute of Bioinformatics, University of Georgia, Athens, GA 30605, USA

Article info

Article history: Received 25 March 2013; Accepted 1 November 2013; Available online 13 November 2013

Keywords: Gene expression profiles; Classification; Psoriasis

Abstract

Psoriasis is an autoimmune disease whose symptoms can significantly impair the patient's quality of life. It is mainly diagnosed through visual inspection of the lesional skin by experienced dermatologists. Currently no cure for psoriasis is available, owing to limited knowledge about its pathogenesis and development mechanisms. Previous studies have profiled hundreds of differentially expressed genes related to psoriasis, but no robust psoriasis prediction model is available. This study integrated three feature selection algorithms, which together revealed 21 features belonging to 18 genes as candidate markers. The final psoriasis classification model was established using the novel Incremental Feature Selection algorithm and utilizes only 3 features from 2 unique genes, IGFL1 and C10orf99. This model demonstrated highly stable prediction accuracy (averaging 99.81%) over three independent validation strategies. The two marker genes, IGFL1 and C10orf99, were revealed as upstream components of the growth signal transduction pathway of psoriatic pathogenesis.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Psoriasis is a widespread chronic autoimmune disease with major symptoms in the skin and joints [1,2]. Psoriasis may increase the risks of stroke and myocardial infarction [3,4] and shares an effective drug, p40-neutralizing antibodies, with the neurodegenerative Alzheimer's disease [5]. The prevalence of psoriasis varies considerably between geographic regions and ethnic groups, with a higher rate in the Caucasian population (e.g. 2.2% in the United States) and a lower rate in the Asian population (e.g. less than 1% in China) [6,7]. Along with the cutaneous manifestations, psoriasis is accompanied by inflammatory arthritis in up to 40% of cases [8]. Its phenotypic symptoms may also include epidermal hyperplasia, keratinocyte differentiation, angiogenesis, and infiltration of T lymphocytes, dendritic cells, neutrophils, cytokines as well as chemokines [9–11]. The current diagnosis of psoriasis relies heavily on the visual observation of the skin lesion biopsy by an experienced dermatologist [12], a procedure that is inefficient and labor intensive.

⁎ Correspondence to: Fengfeng Zhou, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China. Fax: +86-75586392299.
⁎⁎ Correspondence to: Guoqing Wang, Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China.
E-mail addresses: [email protected], [email protected] (F. Zhou).
URL: http://www.healthinformaticslab.org/ffzhou/ (F. Zhou).
1 These authors contributed equally to this work.
0888-7543/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ygeno.2013.11.001

The majority of the large-scale psoriasis studies have employed microarray or qPCR techniques to compare the gene expression patterns of psoriatic lesion samples with those of healthy controls. Gudjonsson et al. systematically screened ~54,000 probe sets on the Affymetrix HG-U133 Plus 2 platform, and detected 179 unique genes for 223 probe sets with differential expression in the uninvolved psoriatic skin samples [13]. Lipid metabolism was shown to be the process most strongly associated with psoriasis, and three transcription factors related to lipid metabolism, peroxisome proliferator-activated receptor alpha (PPARA), sterol regulatory element-binding protein (SREBF), and estrogen receptor 2 (ESR2), seem to regulate the dysregulated genes in the uninvolved psoriatic samples. Reischl et al. chose the Affymetrix HG-U133A platform to profile the expression patterns of ~22,000 probe sets in the human genome, and detected 179 genes for 203 probe sets with more than 2-fold expression changes in the psoriatic lesion skin [14]. Their data strongly suggested that the Wnt pathway, which regulates stem cell proliferation, is associated with the development of psoriasis. However, it remains unclear whether the Wnt pathway plays a causative role or merely reflects the human immune response to the psoriatic lesion. The human immune response to environmental stimuli, the type I interferon system, was screened in both psoriatic lesion and uninvolved skin samples on the Affymetrix HG-U133 Plus 2 platform [15]. There were 1408 up-regulated and 1465 down-regulated probe sets detected in the psoriatic lesion skin samples, among which multiple T cell marker genes and two type I interferon inducible genes (STAT1 and ISG15) showed expression changes of ~3-fold or higher. Swindell et al.


[16,17] compared the gene expression profiles of psoriasiform versus normal samples between human and mouse. The calculated lists of the top 5000 fold-change ranked up-regulated and down-regulated probe sets showed consistent overlaps among different mouse psoriasiform phenotypes. Unfortunately, the lists of marker genes barely overlapped between different studies, leaving little information from which to construct a psoriatic classification model.

We hypothesized that a robust disease classification model can be constructed based on marker genes. This study employed microarray based gene expression profiles, to which three widely used feature selection algorithms were applied to screen for psoriasis associated features (probe sets). The three algorithms agreed on 21 features, the majority of which were known to be associated with psoriasis. We further conducted a comprehensive performance evaluation of neural network based binary classifiers trained with the Incremental Feature Selection strategy. The classifier based on the three features of IGFL1 and C10orf99 outperformed the other models, with at least 99.43% accuracy in all three validation tests.

2. Methods

2.1. Data collection and preprocessing

Two microarray data sets with accession numbers GSE14905 [15] and GSE13355 [16,17] were retrieved from the Gene Expression Omnibus (GEO) Database [18]. The two data sets provide gene expression profiles for the same disease type (psoriasis) and the corresponding controls, but share no samples. This makes them an ideal pair for detecting consistent psoriasis biomarkers. There are 176 participating subjects in total, including 91 psoriatic patients and 85 healthy controls, denoted by P = {P1, P2, …, P91} and N = {N1, N2, …, N85}. Samples from both data sets were processed by the standard protocol for the Affymetrix Human Genome U133 Plus 2.0 platform. Detailed information on the samples used in this study is listed in Supplementary Table S1.
The raw fluorescence intensity data within the CEL files were processed by background correction, log2 transformation, and quantile normalization using the Robust Multichip Analysis (RMA) algorithm [19], as implemented in the program Affymetrix Expression Console™, which was downloaded from http://www.affymetrix.com/estore/index.jsp. The systematic batch biases were corrected by the Distance Weighted Discrimination (DWD) algorithm [20]. DWD removes source effects across different studies by finding a hyperplane that separates the two batches; the microarray data are then adjusted by projecting each batch onto the DWD plane and subtracting out the DWD plane multiplied by the batch mean.

We removed the probe sets with less discriminating power by measuring their overall variance. This procedure was performed using the function varFilter of the Bioconductor R package genefilter. We chose the default setting, the inter-quartile range function IQR, for the parameter var.func, and 0.5 for var.cutoff. The function IQR was chosen for its robustness to outliers. This filtering step was applied to the DWD-processed data sets, and 27,336 probe sets of the 176 participating subjects were retained for further analysis.

This study investigated the binary classification problem between N1 = 91 psoriatic patients and N2 = 85 healthy controls. The total number of samples was N = 91 + 85 = 176. The expression level of each probe set was denoted as Fi, where i ∈ {1, 2, …, 27,336}. The following sections aimed to detect m features F′j, where j ∈ {1, 2, …, m}, from the M = 27,336 features for the binary classification model.

We investigated various feature selection algorithms to detect psoriasis related features from the 27,336 probe sets, with classification accuracy as the most important criterion. The following three algorithms were chosen for their complementary characteristics.
ROC screens for the most significant phenotype-associated individual features, by investigating each feature's parameter-independent association [21]. SVM-RFE further optimizes the inter-feature associations, and can top-rank important features that have weak individual phenotype-association [22]. Boruta removes the features that contribute less to classification accuracy by introducing random variables as competitors [23,24]. We believe that the consensus features derived from the three algorithms represent a reliable list of biomarker candidates. The R implementations of the three algorithms were used in this study.
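The variance filtering step of Section 2.1 can be illustrated with a minimal Python sketch. It is not the genefilter implementation: it assumes that var.cutoff = 0.5 is interpreted as a quantile of the per-probe IQR values, and the names `iqr`, `variance_filter` and the toy probe IDs are hypothetical. Quantile interpolation may also differ slightly from R's defaults.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range of one probe set across all samples."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def variance_filter(expr, cutoff=0.5):
    """Keep probe sets whose IQR exceeds the `cutoff` quantile of all IQRs.

    expr: dict mapping a probe-set ID to its list of log2 expression values.
    Assumption: `cutoff` is a quantile (0 < cutoff < 1), not an absolute value.
    """
    spreads = {probe: iqr(vals) for probe, vals in expr.items()}
    # 99 percentile cut points; index 49 is the median when cutoff = 0.5.
    cuts = quantiles(spreads.values(), n=100, method="inclusive")
    threshold = cuts[round(cutoff * 100) - 1]
    return [probe for probe, spread in spreads.items() if spread > threshold]


# Toy data: four hypothetical probe sets measured on four samples.
toy = {
    "a": [1, 1, 1, 1],
    "b": [0, 1, 2, 3],
    "c": [0, 2, 4, 6],
    "d": [0, 4, 8, 12],
}
kept = variance_filter(toy)
```

With a cutoff of 0.5, roughly half of the probe sets are retained, which is in line with the 27,336 probe sets kept in this study.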

2.2. SVM-RFE feature selection algorithm

SVM-RFE is a Recursive Feature Elimination (RFE) strategy that utilizes the weight vector produced by the Support Vector Machine classification model [25]. RFE is a widely used strategy that recursively reduces the number of features by removing those with the lowest phenotype correlation rankings [22,26,27]. The Support Vector Machine algorithm was invented by Vladimir N. Vapnik in 1979 for the classification problem [28]. After being trained to optimize the classification accuracy between the psoriatic and healthy samples, it produces a weight vector for the input features. The generated weight vector is used to rank the features, and the RFE strategy recursively eliminates the lowest-ranked features. The R package mSVM-RFE was chosen as the implementation of this algorithm [22].
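The elimination loop described above can be sketched as follows. This is only an outline of the RFE control flow, not the mSVM-RFE package: `weight_fn` is a hypothetical callback that, in a real analysis, would retrain a linear SVM on the remaining features and return its weight vector. In the toy call it simply returns fixed weights so that the example runs without any SVM library. Ranking by squared weight follows the usual SVM-RFE criterion.

```python
def recursive_feature_elimination(features, weight_fn, n_keep, drop=1):
    """Repeatedly remove the features with the smallest squared weight.

    features : list of feature names
    weight_fn: callable(list of names) -> dict name -> weight (hypothetical;
               stands in for training a linear SVM on those features)
    n_keep   : number of features to retain
    drop     : number of features removed per round
    Returns (remaining, eliminated); `eliminated` is in order of removal.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        weights = weight_fn(remaining)  # re-estimated in every round
        order = sorted(remaining, key=lambda f: weights[f] ** 2)
        dropped = order[: min(drop, len(remaining) - n_keep)]
        eliminated.extend(dropped)
        remaining = [f for f in remaining if f not in dropped]
    return remaining, eliminated


# Toy stand-in for an SVM weight vector (fixed values, illustration only).
toy_weights = {"a": 0.1, "b": -2.0, "c": 0.5, "d": -0.05}
remaining, eliminated = recursive_feature_elimination(
    ["a", "b", "c", "d"],
    weight_fn=lambda names: {n: toy_weights[n] for n in names},
    n_keep=2,
)
```

Features eliminated last are ranked highest, so the reverse of the elimination order gives the feature ranking.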

2.3. ROC feature selection algorithm

The receiver operating characteristic (ROC) is a graphical plot from signal detection theory that illustrates the relationship between the true positive rate (TPR) and the false positive rate (FPR) of a binary classifier [29]. The ROC curve is one of the best measurements for ranking features between multiple groups of tissues [30]. We ranked the features based on their ROC-based binary classification performance on the psoriatic and healthy samples. In brief, the expression distributions of a given feature and/or gene were calculated for the psoriatic and healthy samples, respectively. Next, the area under the ROC curve (AUC) was calculated to rank the features. The R package rocc was used as the implementation of this algorithm [21].
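As an illustration of AUC-based ranking, the sketch below computes the AUC of a single feature as the fraction of (psoriatic, healthy) sample pairs in which the psoriatic value is higher, with ties counted as one half. It is not the rocc package. Ranking by the distance of the AUC from 0.5, so that strongly down-regulated features also rank highly, is an assumption made here, and all names and values are hypothetical.

```python
def auc(pos, neg):
    """AUC of one feature: P(value in pos > value in neg), ties count 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_by_auc(expr_pos, expr_neg):
    """Rank features by how far their AUC lies from the chance level 0.5.

    expr_pos / expr_neg: dict feature -> list of values in each class.
    """
    scores = {f: auc(expr_pos[f], expr_neg[f]) for f in expr_pos}
    ranked = sorted(scores, key=lambda f: (-abs(scores[f] - 0.5), f))
    return ranked, scores


# Toy expression values for three hypothetical probe sets.
psoriatic = {"up": [3, 4], "flat": [1, 2], "down": [1, 3]}
healthy = {"up": [1, 2], "flat": [1, 2], "down": [2, 4]}
ranked, scores = rank_by_auc(psoriatic, healthy)
```

The pairwise definition is quadratic in the number of samples, which is unproblematic for 91 versus 85 samples per feature.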

2.4. Boruta feature selection algorithm

The Boruta feature selection algorithm is a random forest based wrapper strategy that removes features shown to be less informative than random probes [31]. The random forest classification algorithm was chosen because it is computationally inexpensive and requires no manual parameter tuning. A random forest [22] collects votes over multiple decision tree based weak classifiers, which are independently trained on the binary classification data set of psoriatic and healthy samples. The importance of a probe set is measured by how much the classification performance decreases when the values of the probe set are randomly permuted among the samples. The trained random forest classifier provides an importance estimate for all features [22]. The R package implementation of Boruta was used for this study [24].

2.5. Classification model

This study utilized a feed-forward neural network classifier with one hidden layer of nodes [32] to distinguish the psoriatic samples from the healthy controls. A neural network is a mathematical model with a function f : X → Y, where X is a collection of input neurons and Y is a collection of output neurons. X = {F′j}, where j ∈ {1, 2, …, m}, is the set of features selected by the aforementioned feature selection algorithms. There are two output nodes, Y = {0, 1}, representing the prediction of a healthy or a psoriatic sample, respectively. The input and output neurons are connected by the hidden layers of neurons. To avoid model over-fitting, the number of hidden layers was set to one in this study. The remaining parameters were left at their default values.
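The structure of such a classifier can be sketched as a single forward pass. The sketch assumes logistic hidden units and a softmax over the two output nodes; the text does not state the activation functions, so both are assumptions, and the weights below are placeholders rather than trained values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer feed-forward pass.

    x        : list of m selected feature values
    w_hidden : one weight row (length m) per hidden node
    w_out    : two weight rows (one per output node: healthy, psoriatic)
    Returns the two output probabilities [P(healthy), P(psoriatic)].
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    # Numerically stable softmax over the two output nodes.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, w_hidden, b_hidden, w_out, b_out):
    """Return 0 (healthy) or 1 (psoriatic), whichever output is larger."""
    probs = forward(x, w_hidden, b_hidden, w_out, b_out)
    return 0 if probs[0] >= probs[1] else 1


# Placeholder network: 3 input features, 2 hidden nodes, untrained weights.
zeros_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_out = [[0.0, 0.0], [0.0, 0.0]]
probs = forward([1.0, 2.0, 3.0], zeros_hidden, [0.0, 0.0], zeros_out, [0.0, 0.0])
label = predict([1.0, 2.0, 3.0], zeros_hidden, [0.0, 0.0], zeros_out, [0.0, 5.0])
```

With only three input features and a single hidden layer, the number of trainable weights stays far below the number of samples, which is the motivation given above for keeping the model small.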


2.6. Measurement of prediction performance

A neural network based binary classifier was evaluated using the following three performance measurements [33,34]. Let P = {P1, P2, …, P91} and N = {N1, N2, …, N85} be the positive and negative data sets, respectively. TP is the number of psoriatic samples predicted to be psoriatic, and FN is the number of psoriatic samples incorrectly predicted to be healthy. TN and FP are the numbers of healthy samples predicted to be healthy and psoriatic, respectively. The classification model's sensitivity and specificity are defined as Sn = TP/(TP + FN) and Sp = TN/(TN + FP), respectively. The model's overall accuracy is Ac = (TP + TN)/(TP + FN + TN + FP).

A neural network classification model may be over-fitted if the number of features used in the model is similar to or larger than the number of training samples. This can be overcome by restricting the number of selected features and by evaluating the classification model's cross-validation performance. This study measures how a classification model performs by the commonly used 10-fold cross-validation strategy, as in [35–37]. Its principle is to randomly split the 91 psoriatic and 85 healthy samples into ten subsets of approximately equal size, P = {P1, P2, …, P10} and N = {N1, N2, …, N10}, respectively. For the pair of psoriatic and healthy sample subsets Pk and Nk, where k ∈ {1, 2, …, 10}, a single-hidden-layer feed-forward neural network classifier was trained over the remaining subsets Pl and Nl,

where l ≠ k, and the prediction performance was measured over the data sets Pk and Nk. The overall prediction performances, Sn, Sp, and Ac, were aggregated over the ten rounds of calculation.

3. Results

3.1. DWD adjustment results

The systematic batch bias between the two microarray data sets used in this study, GSE14905 [15] and GSE13355 [16,17], was corrected by the DWD algorithm [20]. Before correction, the samples were clearly partitioned by data set of origin in the unsupervised hierarchical clustering, as shown in Fig. 1(A). This separation suggests that the samples were more similar to those from the same microarray data set than to those from the other data set, whereas samples with the same phenotype label, i.e. healthy or psoriatic, did not cluster together. A similar separation is achieved by the first principal component in the principal component analysis (PCA) [38,39], as shown in Supplementary Fig. S1(a). The batch bias needs to be corrected before the two independently produced microarray data sets can be used as cross-validation data sets. After the DWD-based projection removed the batch bias, the two data sets were reasonably mixed together, as shown in the dendrogram in Fig. 1(B). Supplementary Fig. S1(b) shows

Fig. 1. Hierarchical clustering plots of the samples from the two microarray data sets (A) before and (B) after the DWD batch effect correction. The data sets GSE14905 and GSE13355 are colored in red and green, respectively. The three features finally chosen, i.e. 227736_at (gene: C10orf99), 227735_s_at (gene: C10orf99) and 239430_at (gene: IGFL1), are highlighted in red in the right vertical panel.


that at least the first four principal components and their pairs cannot separate the samples from the two different microarray data sets, thus supporting the hypothesis that the batch bias was corrected.

3.2. Gene sets determined by feature selection algorithms

Different feature selection algorithms rank the features with different phenotype-association measurements, and we hypothesized that features chosen by all three feature selection algorithms, i.e. SVM-RFE, ROC and Boruta, have stable discrimination power. We chose the 80 features top-ranked by each of the three algorithms. The intersection of the three top-ranked feature lists consisted of 21 features, corresponding to 18 human genes, as shown in Fig. 2. The probe set IDs, official gene symbols, and functional annotations of these features (probe sets) are detailed in Supplementary Table S2.

We further investigated the discriminating power of the combined list of all 21 features. The expression patterns of the 21 features were visualized as an unsupervised hierarchical clustering heatmap of all 176 samples, as shown in Fig. 3. The 176 samples can be easily separated into their correct labels, with only two psoriatic samples incorrectly predicted as healthy. This visual separation gave a detection accuracy of Ac = 174/176 = 98.86% for the 176 samples. The 10-fold cross-validation of the neural network binary classifier rendered the same prediction result. Fig. 3 also shows that all of the 21 features (probe sets) were up-regulated in the psoriatic samples.

3.3. Robustness test of the binary classifier

The prediction robustness of the binary classifier was evaluated using the Incremental Feature Selection (IFS) strategy. The selected 21 features were ranked by the averaged scores calculated from the three feature selection algorithms as {f1, f2, …, f21}.
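The accuracy reported above for the 21-feature clustering can be checked against the definitions of Section 2.6. The sketch below assumes, as the text states, that the only errors are two psoriatic samples predicted as healthy, which gives TP = 89, FN = 2, TN = 85 and FP = 0; the function name `performance` is hypothetical.

```python
def performance(tp, fn, tn, fp):
    """Sensitivity, specificity and overall accuracy from confusion counts."""
    sn = tp / (tp + fn)                    # Sn = TP / (TP + FN)
    sp = tn / (tn + fp)                    # Sp = TN / (TN + FP)
    ac = (tp + tn) / (tp + fn + tn + fp)   # Ac = (TP + TN) / all samples
    return sn, sp, ac


# 91 psoriatic samples (2 missed) and 85 healthy samples (none missed).
sn, sp, ac = performance(tp=89, fn=2, tn=85, fp=0)
ac_percent = round(ac * 100, 2)
```

Under these counts the overall accuracy is 174/176, matching the quoted 98.86%, while the two errors affect only the sensitivity.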
We calculated the overall accuracy Ac using the features {f1, f2, …, fi} (i ≤ 21) over three independent validations. The number of features i was chosen where the average accuracy AvgAc over the three validations reaches its peak, as shown in Fig. 4. The three validation strategies are described below. For the two independently generated microarray data sets, GSE14905 and GSE13355, we first conducted a 10-fold cross-validation of the binary classifier over the combined data set, denoted here as 10FCV. In the second validation strategy, we used GSE14905 as the training data set and GSE13355 as the test set to validate the classifier, denoted here as

Fig. 2. The overlapping features top-ranked by the three feature selection algorithms. Each of the three algorithms, i.e. Boruta, SVM-RFE and ROC, chooses 80 features. There are 21 features that are in the overlapping region of the three algorithms.


tGSE14905. The third validation strategy, tGSE13355, was defined analogously to tGSE14905. The latter two validations evaluated the stability of the classification model on data sets with batch effects.

The binary classifier achieved its best average overall accuracy (99.81%) over the three validation strategies, as shown in Fig. 4. The top-ranked feature, 227736_at (gene C10orf99), achieved 100% in Ac for both the 10FCV and tGSE13355 validations, but did not work equally well in the tGSE14905 validation. The second best feature, 227735_s_at, also resides within C10orf99. However, the classifier based on the two features 227736_at and 227735_s_at performed worse than 227736_at alone in the 10FCV validation. The top three ranked features, 227736_at and 227735_s_at (gene C10orf99) and 239430_at (gene IGFL1), produced a binary classifier that performed well, with overall accuracies (Ac) of 99.43%, 100% and 100% for the three validation strategies. Classifiers based on more features produced overall accuracies fluctuating between 80% and 100% (Fig. 4). The three features are highlighted in bold blue font in Supplementary Table S2, and their functions are discussed in detail below.

3.4. Enrichment analysis of differentially expressed genes

The Gene Ontology (GO) annotations were screened for terms enriched in the 21 psoriasis associated features (18 human genes), as shown in Table 1. The enrichment analysis was conducted using the system DAVID with default parameters (E-value cutoff = E−2) [40]. The P-values of these multiple tests were statistically corrected by the Bonferroni and Benjamini algorithms [41]. Only the GO categories with P-values ≤ 0.05 after both multi-test corrections were kept for further analysis, as shown in Table 1. Genes involved in the development and differentiation processes of skin cells, in particular keratinocytes and epidermal cells, were significantly enriched in the 21 features that were chosen by the three different feature selection algorithms.
This result is consistent with the clinical observation that uncontrolled differentiation of skin cells is the major symptom of psoriasis [42,43]. There is also a sub-cellular compartment, the cornified envelope (GO:0001533, with P-value = 5.65E−03 after both correction algorithms), which was enriched with 3 psoriasis associated genes.

4. Discussion

4.1. A robust prediction signature for psoriasis

This study aimed to establish a robust psoriasis prediction model based on the gene expression profiles of human samples. Multiple sets of genes have been proposed to be associated with psoriasis, or to be differentially expressed in psoriasis samples [13–15,44]. Such gene sets consist of dozens or even hundreds of candidate markers, which may potentially be used for psoriasis diagnosis. However, studies of psoriasis classification models are scarce. To our knowledge, this study is the first to construct a feed-forward neural network based binary classifier for psoriasis.

We hypothesized that the association between the marker features and psoriasis is strong enough to pass the filtering criteria of different feature selection algorithms. Therefore we applied three widely used feature selection algorithms, i.e. SVM-RFE, ROC and Boruta, to screen the 27,336 features that passed the data quality filtering criteria and have statistically significant variations between the disease and control samples. The top 80 features ranked by each of the three feature selection algorithms, run with default parameters, were retained for further analysis. The intersection of these three lists resulted in 21 features (probe sets) that belong to 18 human genes. The majority of the 21 features were known to be associated with the initiation and development of psoriasis, as shown in Table 1 and Supplementary Table S2.
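The consensus step summarized above, taking the top 80 features of each algorithm and intersecting the three lists, can be sketched as follows. Ordering the consensus features by their mean rank position across the lists is an assumption made for illustration; the paper ranks them by averaged scores without giving the formula. The feature names are hypothetical.

```python
def consensus_features(*ranked_lists, top_n=80):
    """Features present in the top `top_n` of every ranked list.

    Each ranked list is ordered from most to least important.
    The result is sorted by mean rank position (assumption), ties by name.
    """
    tops = [set(ranking[:top_n]) for ranking in ranked_lists]
    common = set.intersection(*tops)

    def mean_rank(feature):
        return sum(r.index(feature) for r in ranked_lists) / len(ranked_lists)

    return sorted(common, key=lambda f: (mean_rank(f), f))


# Toy rankings from three hypothetical selectors, using top_n = 3.
svm_rfe = ["g1", "g2", "g3", "g4"]
roc = ["g2", "g1", "g4", "g5"]
boruta = ["g1", "g2", "g5", "g3"]
consensus = consensus_features(svm_rfe, roc, boruta, top_n=3)
```

Requiring agreement of all three lists is deliberately strict: a feature favored by only one algorithm is discarded even if it ranks first there.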
A comprehensive classification validation was conducted to investigate how a subset of these 21 features may be combined to build a binary classifier that performs equally well on different data sets. We incrementally added one feature at a time to the feature subset, in ranking order. We evaluated the feed-forward neural network


Fig. 3. The hierarchical clustering heatmap of samples based on the expression patterns of the 21 features. Each row corresponds to a feature and each column corresponds to a sample. The status of each subject, psoriatic or normal, is shown in the bar above the heatmap. The psoriatic and healthy samples are in red and blue, respectively.

classifier with one hidden layer of nodes in three independent validation strategies, as described in the above section. The binary classifiers based on the top one or two features excelled in some validations, but were not stable enough on the other data sets. With one more feature, the trained binary classifier achieved overall accuracies of 99.43%, 100% and 100% for the three validation strategies, respectively, and the best averaged overall accuracy (99.81%) among all the incrementally added feature subsets, as shown in Fig. 4. We also observed that adding more features made the overall accuracies fluctuate more. In all, the binary classifier trained over the top three ranked features (two human genes, C10orf99 and IGFL1) represents a robust and highly accurate psoriasis prediction model.

4.2. Functional insights of the two marker genes, IGFL1 and C10orf99

Psoriasis was regarded as a T-cell-driven disease with autoimmune mechanisms as its primary cause [45]. This view was supported by

Fig. 4. Prediction accuracy assessment based on the feature subset in three scenarios by the Incremental Feature Selection (IFS) strategy. The overall accuracy (Ac) values of the validations 10FCV, tGSE14905 and tGSE13355 are plotted in different line types. The average accuracy is averaged over the three validation strategies for each number of features i, and is plotted as a dashed line with circle points.

clinical data and genetic studies, such as those on HLA-C*06, ERAP1, the IL-23 pathway, and NF-κB signaling [46]. The PI3K/Akt pathway may be continuously sustained by the C10orf99 stimulus, promoting keratinocyte proliferation and increasing anti-apoptotic NF-κB signaling, which not only makes psoriatic keratinocytes resistant to cell death [47] but also triggers cytokine expression, resulting in inflammatory responses [48].

At present, both conventional and biologic agents exist to treat psoriasis. Conventional treatments include methotrexate, cyclosporine A, and acitretin. Five biologic agents are FDA approved for psoriasis treatment: alefacept (T-cell modulator), adalimumab, etanercept and infliximab (TNF-α inhibitors), and ustekinumab (IL-12/IL-23 inhibitor) [49]. These biologic agents may only control the development of the symptoms.

The psoriasis classification model developed in this study suggested an association between psoriasis and two marker genes, IGFL1 and C10orf99. A molecular pathway of psoriasis development was manually curated from the literature [50,51], as shown in Fig. 5. IGFL1 is one of the four members of the human insulin-like growth factor (IGF)-like (IGFL) family, and is the ligand of the cell membrane associated receptor IGFLR1. Structurally similar to the IGF family members, IGFLs are small secreted proteins (shorter than 80 aa on average in the mature polypeptide) and contain 11 cysteines, including two CC motifs [52]. Together with other hormones or growth factors, IGFLs promote cell proliferation and differentiation, and suppress apoptosis in various cell types [53]. The continuous stimulation by IGFL1 induces a prolonged association of phosphatidylinositol 3 kinase (PI3K) with the IGFL1 receptor, which was shown to be essential for IGFL1-induced cell proliferation, including promoting DNA synthesis and increasing cell number [54]. Two IGFLR1 molecules can dimerize when bound by the growth factor IGFL1, inducing auto-phosphorylation.
The post-translational modification of IGFLR1 transduces the growth signal to the downstream PI3K/Akt pathway. Based on multiple sources of evidence, C10orf99 may be a serine/threonine kinase that can phosphorylate IGFLR1 and induce the same signal transduction. No functional annotation of C10orf99 was documented in databases such as UniProtKB [55] and the Human Protein Atlas [56]. We structurally aligned C10orf99 to known 3D structures using I-TASSER version 1.1 [57]. The top five enzyme homologs of C10orf99 were found to belong to the Enzyme


Table 1. Enriched Gene Ontology (GO) terms. Gene Ontology (GO) terms from the groups of biological process (BP), cellular component (CC) and molecular function (MF) that were significantly enriched in the 21 selected features (probe sets).

Group  GO ID       Term                             P-value    FC      Bonferroni  Benjamini
BP     GO:0031424  keratinization                   6.45E−06   96.8    8.13E−04    8.13E−04
BP     GO:0009913  epidermal cell differentiation   3.07E−05   57.81   3.86E−03    1.29E−03
BP     GO:0030216  keratinocyte differentiation     2.36E−05   63.07   2.98E−03    1.49E−03
BP     GO:0030855  epithelial cell differentiation  2.09E−04   30.38   2.60E−02    6.57E−03
CC     GO:0001533  cornified envelope               2.18E−04   124.5   5.65E−03    5.65E−03

Classification (EC) of serine/threonine kinases (2.7.11.14, 2.7.11.14, 2.7.11.1, 2.7.11.13 and 2.7.11.1). The best two matches (PDBs 3C4W and 3C51) are G protein coupled receptor (GPCR) kinases (GRKs). The best match, enzyme PDB 3C4W, deviates from the predicted structure of C10orf99 by an RMSD of only 3.38, as shown in Fig. 6 (within the reasonable RMSD cutoff of 5.5 [58]), where RMSD is the root mean square deviation between two 3D structures. GRK was proposed to facilitate the phosphorylation of IGFLR1, and to activate the downstream PI3K/Akt pathway [59]. By suppressing GRK expression with siRNA, Zheng et al. demonstrated that decreased expression of GRK5/6 abolished IGFL1-mediated AKT activation, and that GRK2 inhibition partially inhibited AKT signaling [59]. IGFL1 binding to its receptor IGFLR1 can stimulate the phosphorylation of the insulin receptor substrate (IRS) [60] and activate the PI3K/Akt pathway [61]. Therefore, the elevated expression of C10orf99 can activate the downstream IGFL1-mediated AKT pathway, as shown in Fig. 5.

This hypothesis was supported by a number of recent studies at the genomic and epigenetic levels. C10orf99 was shown to be up-regulated by 1.7-fold in the uninvolved skin of psoriatic patients and even by 30-fold in the lesional skin [62]. The two microarray data sets used in this study also showed elevated expression of C10orf99 in the psoriatic lesion skin [16,17]. The up-regulation of

C10orf99 in psoriasis was independently supported by epigenetic studies, which showed significantly less methylation of C10orf99 in the psoriatic samples than in the healthy controls [63].

5. Conclusions

This study constructed a robust two-gene expression signature to distinguish psoriasis samples from healthy ones. The functional characterization of the two genes, IGFL1 and C10orf99, suggests that the putative kinase C10orf99 activates IGFLR1, the growth signal receptor of IGFL1, by phosphorylation, triggering a growth signal to the cell proliferation machinery. Further experiments will be conducted to investigate the regulatory roles of IGFL1 and C10orf99 in psoriasis pathogenesis, and their potential use as psoriasis drug targets.

List of abbreviations used

DWD: Distance Weighted Discrimination
RFE: Recursive Feature Elimination
ROC: receiver operating characteristic
TPR: true positive rate
FPR: false positive rate

Fig. 5. Molecular pathway of psoriasis development. The details of the abbreviations are listed here: IGFL1, insulin-like growth factor (IGF) like family member 1; IGFLR1, IGF-like family receptor 1; IRS, insulin receptor substrate; C10orf99, chromosome 10 open reading frame 99; P85, p85 regulatory subunit; P110α, PtdIns-3-kinase subunit p110-alpha; PI3K, phosphoinositide 3 kinase, catalytic, alpha polypeptide; PIP2, phosphatidylinositol (4,5)-bisphosphate; PIP3, phosphatidylinositol (3,4,5)-trisphosphate; NF-κB, nuclear factor of kappa light polypeptide gene enhancer in B-cells; P, phosphate.



References

Fig. 6. The predicted structure of C10orf99. Alignment of the predicted C10orf99 structure to the best-matched enzyme, GRK (PDB 3C4W), gives an RMSD of only 3.38.
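The RMSD comparison in the Fig. 6 caption can be illustrated with a minimal Python sketch. It assumes the two structures have already been superimposed and that their atoms are paired one-to-one, which simplifies what structure-alignment tools actually do. The function names `rmsd` and `within_cutoff` and the toy coordinates are hypothetical; the 5.5 cutoff is the value quoted from [58], and treating the boundary as inclusive is an assumption.

```python
import math

# Cutoff quoted in the text from [58]; the unit is assumed to match the coordinates.
RMSD_CUTOFF = 5.5


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired lists of (x, y, z) points.

    Assumes the structures are already superimposed and atom i in coords_a
    corresponds to atom i in coords_b (no alignment step is performed here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and equally sized")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def within_cutoff(value, cutoff=RMSD_CUTOFF):
    """True when an RMSD value is no larger than the cutoff (inclusive by assumption)."""
    return value <= cutoff


# Toy coordinates for illustration only; they are not taken from the paper.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
toy_value = rmsd(predicted, template)
```

Under these assumptions, the reported value of 3.38 would pass `within_cutoff`, which matches the statement that the best match falls within the 5.5 cutoff.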

AUC area under the ROC curve
PCA principal component analysis
IFS Incremental Feature Selection
GO Gene Ontology
IGFL insulin growth factor (IGF)-like
GRKs G protein coupled receptor (GPCR) kinases

Authors' contributions

FZ conceived and designed the study. PG, MZ, YL, and LG performed the classification training and cross-validation. GM, GW and FL performed the functional analysis and clinical characterization of the results. FZ, PG and GM drafted the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare no potential competing interests.

Acknowledgments

This study was supported in part by the Shenzhen Research Grant ZDSY20120617113021359, China 973 program (2011CB512003 and 2010CB732606-6) and NSFC 31000447. Computing resources were partly provided by the Dawning supercomputing clusters at SIAT CAS. We would like to thank the Health Informatics Laboratory (HILab) members for their helpful discussions. Constructive comments from the two anonymous reviewers are also appreciated.

Appendix A. Supplementary data

The data sets supporting the results of this study are cited in the text and additional files. Supplementary Figs. S1–S2 and Supplementary Tables S1–S2 are available at http://www.healthinformaticslab.org/supp/. Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.ygeno.2013.11.001.

[1] J. Villasenor-Park, D. Wheeler, L. Grandinetti, Psoriasis: evolving treatment for a complex disease, Cleve. Clin. J. Med. 79 (2012) 413–423.
[2] T.W. Chu, T.F. Tsai, Psoriasis and cardiovascular comorbidities with emphasis in Asia, G. Ital. Dermatol. Venereol. 147 (2012) 189–202.
[3] L. Puig, Cardiovascular risk and psoriasis: the role of biologic therapy, Actas Dermosifiliogr. 103 (2012) 853–862.
[4] J.M. Gelfand, E.D. Dommasch, D.B. Shin, R.S. Azfar, S.K. Kurd, X. Wang, A.B. Troxel, The risk of stroke in patients with psoriasis, J. Invest. Dermatol. 129 (2009) 2411–2418.
[5] J. Vom Berg, S. Prokop, K.R. Miller, J. Obst, R.E. Kalin, I. Lopategui-Cabezas, A. Wegner, F. Mair, C.G. Schipke, O. Peters, Y. Winter, B. Becher, F.L. Heppner, Inhibition of IL-12/IL-23 signaling reduces Alzheimer's disease-like pathology and cognitive decline, Nat. Med. 18 (2012) 1812–1819.
[6] R.G. Langley, G.G. Krueger, C.E. Griffiths, Psoriasis: epidemiology, clinical features, and quality of life, Ann. Rheum. Dis. 64 (Suppl. 2) (2005) ii18–ii23 (discussion ii24-15).
[7] Psoriasis-Aid.com, Psoriasis Statistics, 2012.
[8] D.D. Gladman, Natural history of psoriatic arthritis, Baillieres Clin. Rheumatol. 8 (1994) 379–394.
[9] F. Chamian, M.A. Lowes, S.L. Lin, E. Lee, T. Kikuchi, P. Gilleaudeau, M. Sullivan-Whalen, I. Cardinale, A. Khatcherian, I. Novitskaya, K.M. Wittkowski, J.G. Krueger, Alefacept reduces infiltrating T cells, activated dendritic cells, and inflammatory genes in psoriasis vulgaris, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 2075–2080.
[10] F. Chamian, J.G. Krueger, Psoriasis vulgaris: an interplay of T lymphocytes, dendritic cells, and inflammatory cytokines in pathogenesis, Curr. Opin. Rheumatol. 16 (2004) 331–337.
[11] J.K. Kulski, W. Kenworthy, M. Bellgard, R. Taplin, K. Okamoto, A. Oka, T. Mabuchi, A. Ozawa, G. Tamiya, H. Inoko, Gene expression profiling of Japanese psoriatic skin reveals an increased activity in molecular stress and immune response signals, J. Mol. Med. 83 (2005) 964–975.
[12] Wikipedia, Psoriasis, 2012.
[13] J.E. Gudjonsson, J. Ding, X. Li, R.P. Nair, T. Tejasvi, Z.S. Qin, D. Ghosh, A. Aphale, D.L. Gumucio, J.J. Voorhees, G.R. Abecasis, J.T. Elder, Global gene expression analysis reveals evidence for decreased lipid biosynthesis and increased innate immunity in uninvolved psoriatic skin, J. Invest. Dermatol. 129 (2009) 2795–2804.
[14] J. Reischl, S. Schwenke, J.M. Beekman, U. Mrowietz, S. Sturzebecher, J.F. Heubach, Increased expression of Wnt5a in psoriatic plaques, J. Invest. Dermatol. 127 (2007) 163–169.
[15] Y. Yao, L. Richman, C. Morehouse, M. de los Reyes, B.W. Higgs, A. Boutrin, B. White, A. Coyle, J. Krueger, P.A. Kiener, B. Jallal, Type I interferon: potential therapeutic target for psoriasis? PLoS One 3 (2008) e2737.
[16] W.R. Swindell, A. Johnston, S. Carbajal, G. Han, C. Wohn, J. Lu, X. Xing, R.P. Nair, J.J. Voorhees, J.T. Elder, X.J. Wang, S. Sano, E.P. Prens, J. DiGiovanni, M.R. Pittelkow, N.L. Ward, J.E. Gudjonsson, Genome-wide expression profiling of five mouse models identifies similarities and differences with human psoriasis, PLoS One 6 (2011) e18266.
[17] R.P. Nair, K.C. Duffin, C. Helms, J. Ding, P.E. Stuart, D. Goldgar, J.E. Gudjonsson, Y. Li, T. Tejasvi, B.J. Feng, A. Ruether, S. Schreiber, M. Weichenthal, D. Gladman, P. Rahman, S.J. Schrodi, S. Prahalad, S.L. Guthery, J. Fischer, W. Liao, P.Y. Kwok, A. Menter, G.M. Lathrop, C.A. Wise, A.B. Begovich, J.J. Voorhees, J.T. Elder, G.G. Krueger, A.M. Bowcock, G.R. Abecasis, Collaborative Association Study of Psoriasis, Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways, Nat. Genet. 41 (2009) 199–204.
[18] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, R.N. Muertter, M. Holko, O. Ayanbule, A. Yefanov, A. Soboleva, NCBI GEO: archive for functional genomics data sets—10 years on, Nucleic Acids Res. 39 (2011) D1005–D1010.
[19] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, T.P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003) 249–264.
[20] M. Benito, J. Parker, Q. Du, J. Wu, D. Xiang, C.M. Perou, J.S. Marron, Adjustment of systematic microarray data biases, Bioinformatics 20 (2004) 105–114.
[21] M. Lauss, A. Frigyesi, T. Ryden, M. Hoglund, Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier, BMC Cancer 10 (2010) 532.
[22] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[23] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, Journal, 2010.
[24] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[25] Y. Ding, D. Wilkins, Improving the performance of SVM-RFE to select genes in microarray data, BMC Bioinforma. 7 (Suppl. 2) (2006) S12.
[26] C. Furlanello, M. Serafini, S. Merler, G. Jurman, Entropy-based gene ranking without selection bias for the predictive classification of microarray data, BMC Bioinforma. 4 (2003) 54.
[27] C. Furlanello, M. Serafini, S. Merler, G. Jurman, An accelerated procedure for recursive feature ranking on microarray data, Neural Netw. 16 (2003) 641–648.
[28] V.N. Vapnik, The Nature of Statistical Learning Theory, 2 ed. Springer-Verlag, New York, 1999.
[29] J.A. Swets, Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers, Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, England, 1996.
[30] M.S. Pepe, G. Longton, G.L. Anderson, M. Schummer, Selecting differentially expressed genes from microarray experiments, Biometrics 59 (2003) 133–142.
[31] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, J. Stat. Softw. 36 (2010) 1–13.

[32] D.A. Elizondo, R. Birkenhead, M. Gongora, E. Taillard, P. Luyima, Analysis and test of efficient methods for building recursive deterministic perceptron neural networks, Neural Netw. 20 (2007) 1095–1108.
[33] K. Li, M. Yang, G. Sablok, J. Fan, F. Zhou, Screening features to improve the class prediction of acute myeloid leukemia and myelodysplastic syndrome, Gene 512 (2013) 348–354.
[34] F. Zhou, Y. Xu, cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data, Bioinformatics 26 (2010) 2051–2052.
[35] X. Zhou, J. Mao, J. Ai, Y. Deng, M.R. Roth, C. Pound, J. Henegar, R. Welti, S.A. Bigler, Identification of plasma lipid biomarkers for prostate cancer by lipidomics and bioinformatics, PLoS One 7 (2012) e48889.
[36] A.P. Thrift, B.J. Kendall, N. Pandeya, D.C. Whiteman, A model to determine absolute risk for esophageal adenocarcinoma, Clin. Gastroenterol. Hepatol. 11 (2013) 138–144.e2.
[37] X. Hu, H. Cammann, H.-A. Meyer, K. Miller, K. Jung, C. Stephan, Artificial neural networks and prostate cancer—tools for diagnosis and management, Nat. Rev. Urol. 10 (2013) 174–182.
[38] S. Pegolo, G. Gallina, C. Montesissa, F. Capolongo, S. Ferraresso, C. Pellizzari, L. Poppi, M. Castagnaro, L. Bargelloni, Transcriptomic markers meet the real world: finding diagnostic signatures of corticosteroid treatment in commercial beef samples, BMC Vet. Res. 8 (2012) 205.
[39] H. Akanuma, X.Y. Qin, R. Nagano, T.T. Win-Shwe, S. Imanishi, H. Zaha, J. Yoshinaga, T. Fukuda, S. Ohsako, H. Sone, Identification of stage-specific gene expression signatures in response to retinoic acid during the neural differentiation of mouse embryonic stem cells, Front. Genet. 3 (2012) 141.
[40] G. Dennis Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki, DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol. 4 (2003) P3.
[41] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B Methodol. 57 (1995) 289–300.
[42] D. Globe, M.S. Bayliss, D.J. Harrison, The impact of itch symptoms in psoriasis: results from physician interviews and patient focus groups, Health Qual. Life Outcomes 7 (2009) 62.
[43] M.G. Boy, C. Wang, B.E. Wilkinson, V.F. Chow, A.T. Clucas, J.G. Krueger, A.S. Gaweco, S.H. Zwillich, P.S. Changelian, G. Chan, Double-blind, placebo-controlled, dose-escalation study to evaluate the pharmacologic effect of CP-690,550 in patients with psoriasis, J. Invest. Dermatol. 129 (2009) 2299–2302.
[44] E. Guttman-Yassky, M. Suarez-Farinas, A. Chiricozzi, K.E. Nograles, A. Shemer, J. Fuentes-Duculan, I. Cardinale, P. Lin, R. Bergman, A.M. Bowcock, J.G. Krueger, Broad defects in epidermal cornification in atopic dermatitis identified through genomic analysis, J. Allergy Clin. Immunol. 124 (2009) 1235–1244 (e1258).
[45] J.G. Bergboer, P.L. Zeeuwen, J. Schalkwijk, Genetics of psoriasis: evidence for epistatic interaction between skin barrier abnormalities and immune deviation, J. Invest. Dermatol. 132 (2012) 2320–2321.
[46] J.G. Bergboer, P.L. Zeeuwen, J. Schalkwijk, Genetics of psoriasis: evidence for epistatic interaction between skin barrier abnormalities and immune deviation, J. Invest. Dermatol. 132 (2013) 2320–2331.


[47] S. Madonna, C. Scarponi, S. Pallotta, A. Cavani, C. Albanesi, Anti-apoptotic effects of suppressor of cytokine signaling 3 and 1 in psoriasis, Cell Death Dis. 3 (2012) e334.
[48] M. Pasparakis, Role of NF-kappaB in epithelial biology, Immunol. Rev. 246 (2012) 346–358.
[49] E. Krulig, K.B. Gordon, Ustekinumab: an evidence-based review of its effectiveness in the treatment of psoriasis, Core Evid. 5 (2010) 11–22.
[50] E.D. Roberson, Y. Liu, C. Ryan, C.E. Joyce, S. Duan, L. Cao, A. Martin, W. Liao, A. Menter, A.M. Bowcock, A subset of methylated CpG sites differentiate psoriatic from normal skin, J. Invest. Dermatol. 132 (2012) 583–592.
[51] A.A. Lobito, S.R. Ramani, I. Tom, J.F. Bazan, E. Luis, W.J. Fairbrother, W. Ouyang, L.C. Gonzalez, Murine insulin growth factor-like (IGFL) and human IGFL1 proteins are induced in inflammatory skin conditions and bind to a novel tumor necrosis factor receptor family member, IGFLR1, J. Biol. Chem. 286 (2011) 18969–18981.
[52] P. Emtage, P. Vatta, M. Arterburn, M.W. Muller, E. Park, B. Boyle, S. Hazell, R. Polizotto, W.D. Funk, Y.T. Tang, IGFL: a secreted family with conserved cysteine residues and similarities to the IGF superfamily, Genomics 88 (2006) 513–520.
[53] S.M. Jones, A. Kazlauskas, Growth-factor-dependent mitogenesis requires two distinct phases of signalling, Nat. Cell Biol. 3 (2001) 165–172.
[54] T. Fukushima, Y. Nakamura, D. Yamanaka, T. Shibano, K. Chida, S. Minami, T. Asano, F. Hakuno, S. Takahashi, Phosphatidylinositol 3-kinase (PI3K) activity bound to insulin-like growth factor-I (IGF-I) receptor, which is continuously sustained by IGF-I stimulation, is required for IGF-I-induced cell proliferation, J. Biol. Chem. 287 (2012) 29713–29721.
[55] M. Magrane, UniProt Consortium, UniProt Knowledgebase: a hub of integrated protein data, Database (Oxford) 2011 (2011) bar009.
[56] F. Ponten, J.M. Schwenk, A. Asplund, P.H. Edqvist, The Human Protein Atlas as a proteomic resource for biomarker discovery, J. Intern. Med. 270 (2011) 428–446.
[57] A. Roy, A. Kucukural, Y. Zhang, I-TASSER: a unified platform for automated protein structure and function prediction, Nat. Protoc. 5 (2010) 725–738.
[58] Y. Zhang, J. Skolnick, Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins, Biophys. J. 87 (2004) 2647–2655.
[59] H. Zheng, C. Worrall, H. Shen, T. Issad, S. Seregard, A. Girnita, L. Girnita, Selective recruitment of G protein-coupled receptor kinases (GRKs) controls signaling of the insulin-like growth factor 1 receptor, Proc. Natl. Acad. Sci. U. S. A. 109 (2012) 7055–7060.
[60] T. Ogihara, T. Isobe, T. Ichimura, M. Taoka, M. Funaki, H. Sakoda, Y. Onishi, K. Inukai, M. Anai, Y. Fukushima, M. Kikuchi, Y. Yazaki, Y. Oka, T. Asano, 14-3-3 protein binds to insulin receptor substrate-1, one of the binding sites of which is in the phosphotyrosine binding domain, J. Biol. Chem. 272 (1997) 25267–25274.
[61] Y. Benomar, A.F. Roy, A. Aubourg, J. Djiane, M. Taouis, Cross down-regulation of leptin and insulin receptor expression and signalling in a human neuronal cell line, Biochem. J. 388 (2005) 929–939.
[62] Y. Liu, Identification of Genetic and Epigenetic Risk Factors for Psoriasis and Psoriatic Arthritis in: Graduate School of Arts and Sciences, Washington University in St. Louis, St. Louis, MO, 2011. 193.
[63] S.H. Chu, D.F. Feng, Y.B. Ma, H. Zhang, Z.A. Zhu, Z.Q. Li, P.C. Jiang, Promoter methylation and downregulation of SLC22A18 are associated with the development and progression of human glioma, J. Transl. Med. 9 (2011) 156.
