Gene-Markers Representation for Microarray Data Integration

Enrico Macii, Elena Baralis, Elisa Ficarra, Alessandro Fiori
Politecnico di Torino, Torino, Italy
Email: elena.baralis@polito.it

Abstract-When analyzing the relationship between genes under different scenarios, the integration of different microarray experiments becomes a relevant task. This paper presents a framework to address some intrinsic problems of integration, due for instance to scaling issues, error bias, different experimental conditions or technology and protocols. Our approach projects original microarray data in a common transformed space to create a common representation of different microarray datasets. This approach allows us to integrate data from various microarray platforms or microarrays based on different experimental conditions. We validate our framework with experiments on real microarray datasets. The results suggest that our approach can be profitably exploited for microarray data integration and further gene expression analysis applications.

I. INTRODUCTION

The integration of microarray data based on single or combined experimental conditions is a critical task for understanding the relationships between genes under specific conditions and pathologies, for observing how synergies (co-expression) among genes evolve with changing experimental conditions and, finally, for defining more reliable biological hypotheses. Moreover, it allows a reduction of noise and the use of redundant information to overcome one of the current limitations of microarrays, that is, the reproducibility of the results. Microarray data integration leads to a systemic approach that exploits the abundance and the complexity of several types of data for the study of gene activities, relationships and network evolutions.

The main challenge of microarray data analysis is the explosion in the volume and complexity of gene expression data. Different experiments utilize different tissue types, examine different treatment strategies, and consider different stages of disease development. Putting this together with differences in microarray platforms, technologies and protocols used in the labs leads to difficulties in integrating microarray data across experiments. How to combine data (gene expression levels) from different microarrays is a challenging problem, since these gene expression levels are not necessarily directly comparable. Thus, directly integrating the microarrays according to the gene ids would result in inconsistency.

Several studies try to combine different datasets for discovering the relationships between genes in two datasets, but under the same experimental conditions, as in [1], [2] and [3]. The method presented in [4] combines different microarray platforms using an analysis of the 2nd expression order. This approach can discover only doublets of genes that are significantly correlated. Moreover, it is useful to discover the genes which are most probably responsible for the analysed process, but it only handles common genes (i.e., genes shared by all studies) and it does not handle genes with a correlation under a threshold. Other methods, such as [5], use the gene signature to evaluate the cross-platform integration, but use information from biological studies to select the markers. The system described in [6] integrates heterogeneous kinds of information, represented as graph-structured data such as ontologies and taxonomies, with experimental data. However, it does not allow the integration of different microarray platforms.

In this paper, we present a novel method that overcomes the problem of integrating heterogeneous gene expression datasets under different experimental conditions. Our approach is based only on gene expression values and on information which can be derived from the original datasets, without any other a-priori knowledge. We tackle the integration problem by performing a projection of the microarray data in a common space. In particular, our method captures the dataset-wise characteristics of a gene in terms of its correlations with a set of reference genes (gene-markers). The gene-markers are common to all the microarray studies and are selected without any a-priori knowledge about the distribution of the data or their biological meaning. The expression level of a gene in a microarray is then converted into its similarity (or correlation) to the set of gene-markers. As a result, if we have n gene-markers, the expression values of each gene are transformed into an n-dimensional vector whose components represent the correlation with the gene-markers, as in a space transformation where the gene-markers are the axes. This approach projects semantically non-conforming data from disparate sources into common dimensions, allowing the coherent interpretation and integration of the data. Moreover, since we perform a projection of all the microarray data, we do not need to reduce the biological information content. In our system it is possible to identify clusters of genes with a low influence of noise and outliers. For this reason, the presented approach can also be used on single studies in order to perform gene expression analyses and network discovery.

The remainder of the paper is organized as follows.

1-4244-1509-8/07/$25.00 ©2007 IEEE


Section II describes our method for space transformation and data integration, Section III presents the experimental results, and Section IV draws the conclusions.

II. MICROARRAY DATA INTEGRATION FRAMEWORK

This framework aims at creating a new space for the representation of microarray data and the integration of different studies, in order to analyze the behavior of genes across experimental designs and tissues and to infer correlations between groups of genes. The creation of a new space starts from the intrinsic problems of each single dataset: the noise and the scale of representation. Exploiting this transformation, we want to eliminate the link between data representation and a-priori knowledge about data distribution, experimental design, class labels, biological meaning and all the other types of information which can be collected for a single study. For this reason, the selection of the axes for the new space is a critical task for the space transformation. Our method selects a number of features, called gene-markers, in order to determine the new space topology.

The gene-markers are genes selected among the set of most representative common genes for all the studies analyzed in the process, as we show later in this section. These genes should be characterized by a low percentage of correlation between them. This allows the creation of a space with a low inter-dependency between axes and with a high contribution of information. In this preliminary version, we take the power of a gene to discriminate between class labels as the measure of how representative it is. As future work, we are implementing a new feature selection approach which is independent of class labels and considers the behavior and the biological relevance of the genes.

A. Overview

The architecture of our framework is shown in Figure 1. Each building block represents a phase of the global process of microarray integration. The phases are the following:

• Block 1: Datasets selection for the integration
• Block 2: Gene-markers identification
  - Step 1: filtering
  - Step 2: feature selection
  - Step 3: gene-markers integration
• Block 3: Space transformation and gene representation

Fig. 1. Framework for the integration process

B. Gene-markers selection

The gene-markers are selected through the following three steps (see Figure 1).

Filtering. In this step the search space for the selection of gene-markers is reduced. The variance filter removes the genes characterized by a low variation of their expression over the dataset samples. This filter is basically used to remove flat genes, i.e. genes whose expression does not vary significantly over the conditions and the experiments. We consider these genes as not representative of the considered process. Note that this step is performed only to select the gene-markers. The microarray data represented in the new space will instead include all the genes in the microarray studies, without any filtering or feature selection. To eliminate flat genes we first calculate the variance of each gene with the unbiased estimator:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1} \qquad (1)$$

where $x_i$ is the expression value of the considered gene in sample $i$, $\bar{x}$ is its mean over the samples, and $N$ is the number of samples. Then, we filter out the features characterized by a small variance to obtain a subset of $K$ genes which gives the maximum contribution to the global variance of the process across samples. A parameter $\alpha$, which represents a percentage of the total variance of the dataset, is set as the boundary that determines this subset. The maximization of (2) is performed to select the $K$ features:

$$\max_{K} \; \frac{\sum_{i=1}^{K} \sigma_i^2}{\sum_{i=1}^{N} \sigma_i^2} \le \alpha \qquad (2)$$

where $\sigma_i^2$ is the variance of the $i$-th gene, $N$ is the number of genes (features) and $K$ is the number of selected genes. Given a threshold $\alpha$, it returns the $K$ features which have the minimum probability of superposition between expression intervals of classes. The threshold $\alpha$ is typically set to 0.9 in order to eliminate a minimum number of features. On this subset we select the gene-markers through the following feature selection step.

Feature selection. Feature selection is exploited to eliminate less relevant features in the $K$-gene set of the single dataset. Relevance is defined by the specific feature selection technique which is adopted. Hence, it is possible to integrate in our framework different types of techniques (supervised or unsupervised) which implement different notions of relevance. This first version of our work uses the supervised ANOVA method [7] to select the relevant features for the considered scenario. Genes are ranked in decreasing order of their F-value, which indicates the power of the gene to discriminate between classes. The higher the F-value, the higher the probability that the expression values of that gene can distinguish different classes. We selected the ANOVA method for this step.
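The filtering and feature selection steps described above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`unbiased_variance`, `variance_filter`, `anova_f`) and the toy data are hypothetical, Eq. (2) is read as "keep the highest-variance genes until their cumulative share of the total variance reaches α", and the F-value is the textbook one-way ANOVA statistic.

```python
def unbiased_variance(values):
    """Sample variance with the N-1 denominator, as in Eq. (1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_filter(expression, alpha=0.9):
    """Keep the highest-variance genes until their cumulative share of the
    total variance reaches `alpha` (an assumed reading of Eq. (2)).

    `expression` maps a gene name to its list of expression values.
    """
    variances = {g: unbiased_variance(v) for g, v in expression.items()}
    total = sum(variances.values())
    ranked = sorted(variances, key=variances.get, reverse=True)
    kept, cumulative = [], 0.0
    for gene in ranked:
        if total == 0 or cumulative / total >= alpha:
            break
        kept.append(gene)
        cumulative += variances[gene]
    return kept


def anova_f(values, labels):
    """One-way ANOVA F-value of one gene (assumes at least two classes)."""
    groups = {}
    for x, c in zip(values, labels):
        groups.setdefault(c, []).append(x)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups.values() for x in g)
    if ss_within == 0:
        return float("inf")  # classes are perfectly separated
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical toy data: two variable genes, one flat gene, two classes.
toy = {
    "geneA": [1.0, 5.0, 9.0, 3.0],
    "geneB": [2.0, 2.5, 6.0, 1.0],
    "flat":  [4.0, 4.0, 4.1, 4.0],
}
labels = ["x", "x", "y", "y"]

selected = variance_filter(toy, alpha=0.9)        # flat gene is dropped
ranked = sorted(selected, key=lambda g: anova_f(toy[g], labels), reverse=True)
```

With this toy input the flat gene is removed by the variance filter, and the two remaining genes are then ordered by decreasing F-value, which is the ranking that the later gene-marker integration step consumes.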


It was preferred because it addresses both binary and multi-class problems, while the t-test or other statistical methods usually address only binary problems. The result of this step is a set of gene-markers which appropriately represents the information in a given microarray dataset.

Gene-marker integration. This step is needed to perform gene-marker selection when we are interested in comparing and integrating many microarray studies. To find gene-markers common to all the datasets, in order to build the new space for the data integration, we have to merge the ranks generated in the previous step. The gene-markers are selected in a two-step procedure. First, the features selected in the previous step are ranked by computing a unique score. For feature $i$, its global rank is computed by the following formula:

$$rank_i = \sum_{j=1}^{K} rank_{ij} \qquad (3)$$

where $K$ is the number of studies analyzed and $rank_{ij}$ is the position of feature $i$ in the rank of dataset $j$. The global rank is sorted first on the frequency of feature $i$ in all the ranks and then on the score calculated in (3). Second, we extract from this rank the set of gene-markers that minimizes the correlation between its elements. The first gene of the global rank is removed from the rank and inserted in the gene-markers set. Then the global rank is pruned of the features whose average quadratic correlation with the markers is higher than a threshold selected by the user (e.g., 20%). The quadratic correlation, which represents the percentage of linear dependence between features, is calculated with the Pearson correlation. The first gene remaining in the global rank after the pruning step is inserted in the set of gene-markers. This procedure is repeated until the number of gene-markers requested by the user has been extracted or no features remain in the global rank. In this way we create a space made of uncorrelated axes, minimizing the inter-dependency.

C. Space transformation

Given the set of $N$ gene-markers, a new representation for microarray data can be defined. The new representation can be modeled by the matrix $G$, an $M \times N$ matrix, where $N$ is the number of gene-markers and $M$ is the total number of genes. An arbitrary element $g_{ij}$ of the matrix is given by the distance, in the original matrix representing microarray data, between the expression data for gene $i$ and the expression data for gene-marker $j$ (which is also an element of the same matrix). Let $a$ be the expression data for gene $i$ and $b$ be the expression data for gene-marker $j$, which is a gene as well. Both vectors are characterized by $I$ elements, where $I$ is the number of samples for both genes. Then, the element $g_{ij}$ of matrix $G$ is given by

$$g_{ij} = d(a, b) \qquad (4)$$

where $d$ is a distance measure. Distance may be measured by means of the following (in the formulas, $n$ denotes the number of elements of each vector):

• Euclidean distance. It considers a Gaussian distribution of data and is used for a Hilbert space.

• Cosine correlation

$$\cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

It measures similarity as the angle between gene vectors. It is robust with respect to the presence of outliers. It returns values in the range [-1, +1].

• Pearson correlation

$$r(a, b) = \frac{n \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{n \sum a_i^2 - \left(\sum a_i\right)^2}\,\sqrt{n \sum b_i^2 - \left(\sum b_i\right)^2}}$$

It measures the linear dependence between two features. The presence of noise may hide correlation.

• Manhattan distance. It measures distance between two points in a Euclidean space. This measure is affected by a rotation of the coordinate system, but not by a translation of the system or a reflection with respect to an axis. It is more robust to outliers than the Euclidean metric.

In the new n-dimensional space, each gene expression value is transformed into an n-dimensional vector whose elements are the distances (or the correlations, depending on the chosen metric) between the gene and each gene-marker. This common representation for microarray data allows the integration of datasets and the analysis of heterogeneous microarray studies.

III. PRELIMINARY EXPERIMENTAL RESULTS

We performed the experiments on the four datasets DLBCL, Leukemia1, Brain1 and Tumor9, which were analyzed in [8]. They are characterized by a variable number of features, samples and classes, as detailed in Table I at [9].

We performed two sets of experiments. We evaluated:
1) the effectiveness of our method in reducing the amount of noise and outliers in the dataset
2) the integration of two microarray datasets. Our purpose is to evaluate whether the space transformation is conservative with respect to the biological information content of the data.

A. Entropy evaluation

An important objective of data transformation is to reduce the entropy of the representation. The entropy of a system describes the data distribution in space. A high value indicates a uniform distribution of data, which makes it difficult to identify clusters of genes with the same behavior [10]. We computed the entropy of the datasets before the transformation, on the so-called raw data, and after the transformation. We performed three kinds of tests:
1) comparison of data entropy in raw and transformed data
2) impact of the Filtering phase on entropy
3) cardinality of the gene-marker set

We compute the distance-based entropy [11]:

$$E = -\sum_{X_i} \sum_{X_j} \left[ D_{ij} \log_2 D_{ij} + (1 - D_{ij}) \log_2 (1 - D_{ij}) \right]$$
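The space transformation of Eq. (4) and the distance-based entropy can be illustrated together with a small standard-library Python sketch. It is a sketch under assumptions rather than the code used for the experiments: the helper names (`cosine`, `transform`, `distance_entropy`) and the toy vectors are hypothetical, pairwise distances are taken to be Euclidean and rescaled to [0, 1] by the largest distance, the sum runs over unordered pairs, and the result is divided by the number of distances.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine correlation between two expression vectors, in [-1, +1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def transform(expression, markers, measure=cosine):
    """Rows of G (Eq. 4): gene i -> [d(gene i, marker j) for each marker j]."""
    return {g: [measure(v, expression[m]) for m in markers]
            for g, v in expression.items()}


def distance_entropy(points):
    """Distance-based entropy averaged over the pairwise distances.

    Assumptions of this sketch: Euclidean distances, rescaled to [0, 1]
    by the largest one; terms with D = 0 or D = 1 contribute 0.
    """
    def euclid(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    dists = [euclid(p, q) for p, q in combinations(points, 2)]
    d_max = max(dists, default=0.0)
    if d_max == 0:
        return 0.0
    total = 0.0
    for d in dists:
        d /= d_max
        if 0.0 < d < 1.0:
            total -= d * math.log2(d) + (1 - d) * math.log2(1 - d)
    return total / len(dists)


# Hypothetical toy data: two markers and two genes that follow them
# with different scales.
toy = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.0, 1.0, 0.0],
    "g1": [2.0, 0.0, 0.0],
    "g2": [0.0, 3.0, 0.0],
}
G = transform(toy, markers=["m1", "m2"])
entropy_raw = distance_entropy(list(toy.values()))
entropy_transformed = distance_entropy(list(G.values()))
```

In this toy input each gene has the same direction as one of the markers, so the transformed points collapse onto two locations and the transformed entropy is 0, while the raw vectors, which differ in scale, give a positive entropy. Real data will not separate this cleanly; the example only shows the mechanics.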

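For the gene-marker integration step of Section II-B (Eq. 3 followed by the correlation-based pruning), a minimal Python sketch is given here. Several points are assumptions that the text does not fix: the function names (`pearson`, `global_rank`, `select_markers`) are hypothetical, the correlations are computed on a single expression table passed in by the caller, and ties in the global rank are broken by first appearance.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    sa, sb = sum(a), sum(b)
    num = n * sum(x * y for x, y in zip(a, b)) - sa * sb
    den = (math.sqrt(n * sum(x * x for x in a) - sa ** 2)
           * math.sqrt(n * sum(y * y for y in b) - sb ** 2))
    return num / den if den else 0.0


def global_rank(ranks):
    """Merge per-dataset rankings (lists of gene names, best first).

    Genes are ordered by the number of rankings that contain them
    (descending), then by the summed positions of Eq. (3) (ascending).
    """
    freq, score = {}, {}
    for ranking in ranks:
        for pos, gene in enumerate(ranking, start=1):
            freq[gene] = freq.get(gene, 0) + 1
            score[gene] = score.get(gene, 0) + pos
    return sorted(freq, key=lambda g: (-freq[g], score[g]))


def select_markers(ranked, expression, n_markers, max_r2=0.2):
    """Greedy selection: take the top gene, then prune every remaining gene
    whose mean squared Pearson correlation with the markers exceeds max_r2."""
    markers, remaining = [], list(ranked)
    while remaining and len(markers) < n_markers:
        markers.append(remaining.pop(0))
        remaining = [
            g for g in remaining
            if sum(pearson(expression[g], expression[m]) ** 2
                   for m in markers) / len(markers) <= max_r2
        ]
    return markers


# Hypothetical rankings from three studies and a toy expression table.
ranks = [["g1", "g2", "g3"], ["g1", "g3", "g2"], ["g2", "g1"]]
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],    # almost a rescaled copy of g1
    "g3": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with g1
}
merged = global_rank(ranks)
markers = select_markers(merged, expression, n_markers=2, max_r2=0.2)
```

Here g2 is ranked above g3 but is pruned because it is almost perfectly correlated with the first marker g1, so the second marker becomes g3. This is the behavior the pruning threshold (20% in the experiments) is meant to enforce.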

In the entropy formula, $D_{ij}$ is the normalized distance, in [0,1], between points $X_i$ and $X_j$, and $N$ is the total number of distances. The entropy is normalized with respect to this value. In the following, we present the results of the three experiments.

Comparison of data entropy in raw and transformed data. Using different metrics in the space transformation, we want to analyze the effect of the choice of the metric on the entropy values. In this test we used the cosine and the Pearson correlation as metrics for the space transformation, because we are interested in evaluating the behavior of the genes more than their position in space. Since the space built using the gene-markers is not a Euclidean space, the Manhattan (or Euclidean) metric may be more error prone than the other two metrics. We evaluated the entropy of the original dataset and compared it with the entropy of the transformed dataset, as shown in Table I. The same table shows the impact of the distance measure (cosine or Pearson correlation) on the entropy of the new space.

TABLE I
EFFECT OF DIFFERENT DISTANCE MEASURE ON ENTROPY

Dataset      Cosine corr          Pearson corr
             raw      transf      raw      transf
DLBCL        0.750    0.127       0.957    0.639
Leukemia1    0.722    0.245       0.940    0.707
Brain1       0.813    0.305       0.943    0.664
Tumor9       0.813    0.292       0.976    0.762

For this test we filtered all the original datasets with the variance filter, then applied ANOVA feature selection and gene-marker extraction with the following parameters. The number of gene-markers is 5, the maximum correlation between gene-markers is 20% and the threshold for the variance filter is 90%. These last two parameters have been set to the same values for all the experiments discussed in this section.

All the transformed datasets are characterized by an entropy much lower than the original one. This is independent of the metric chosen for the transformation process. The results show that the proposed space transformation highlights the existing correlations between genes and reduces the influence of noise and outliers. Hence our method can also be profitably used for improving analysis on a single dataset.

The representation based on the cosine correlation reduces the entropy more than the Pearson correlation. This behavior is due to the fact that the Pearson correlation is less robust with respect to the presence of outliers than the cosine correlation. Furthermore, the cosine correlation measures the similarity of features as the angle between them, so it is not affected by the order of the expression values in the computation. For this reason, we only considered this metric in the following experiments.

TABLE II
EFFECT OF FILTERING ON ENTROPY

Dataset      Raw data    Transformed without filter    Transformed with filter
DLBCL        0.750       0.270                         0.127
Leukemia1    0.722       0.296                         0.245
Brain1       0.813       0.299                         0.305
Tumor9       0.813       0.371                         0.292

Impact of the filtering phase on entropy. We evaluated the impact on the dataset entropy of the Filtering phase in the gene-marker selection. For this purpose, we computed the entropy i) in the original datasets, ii) in the transformed ones, without the filtering phase for the selection of the gene-markers, and iii) in the transformed datasets obtained with the filtering phase. We set the parameters of our experiments as follows. Cosine correlation is the metric for the space

transformation. The number of gene-markers is 5. The results are presented in Table II. The filter generally improves the performance of the system: the entropy obtained by applying filtering before feature selection in the extraction of the gene-markers is lower than that obtained without the filter for DLBCL, Leukemia1 and Tumor9, while for Brain1 the two values are nearly equal (0.305 with the filter versus 0.299 without). Furthermore, the filtering phase reduces the search space for the gene-marker selection and thus the computational effort required for the feature selection.

Cardinality of the gene-marker set. As a final test, we evaluated the entropy variation due to the number of gene-markers. For this test we measured the correlation between genes and gene-markers using the cosine correlation. The number of gene-markers weakly affects the entropy of the system (i.e., about 6 to 9%). In general, a representation with a high number of gene-markers is richer from a biological viewpoint, because the gene-markers represent different characteristics of the analyzed process.

B. Testing the stability of the model

To evaluate the effectiveness of our method in preserving the biological information content of the microarray data, we integrated two microarray datasets. This experiment allowed us to show that the space transformation does not affect the relevant biological content of the dataset. Gene clustering aims at highlighting the synergies (co-expression) of the genes in the datasets. Thus, it is a method to highlight relevant biological information in the data.

We considered the DLBCL dataset and split it into two independent datasets with a similar distribution of class labels. To observe the effect of the space transformation, we transformed and integrated the two datasets. Next, we applied a hierarchical clustering technique on the raw dataset and on the integrated dataset. We expected the clusters to include approximately the same genes. Clusters have been extracted by means of the R package pvclust [12], considering the correlation as the distance between points and clusters.

The two datasets are composed of 5469 features. To ease readability of the figures, we only represent a subset of the actual genes included in clusters. We have chosen 15 genes, among which 5 (MINOR_1, MINOR_2, PRKCB1, PDE4B, PDE4A) regulate the outcome in diffuse large B-cell lymphoma, as discussed in [13]. Other genes, TI, HMG_I, MIF, LDHA and ENO1, are the first ranked with ANOVA. The complete reference for these genes is summarized in [9].

We obtained similar results for both the first and the second dataset. In detail, the results (see Figure 2) show two main clusters composed of the same elements before and after the transformation. The only exception is for the gene SLC that

we will discuss later in this section. The discovered clusters include well known markers for DLBCL such as the high-mobility group protein isoforms I and Y (HMGIY) and LDHA [13]. Moreover, it can be noticed that the height of the dendrogram is smaller in the transformed data than in the raw data. The transformation highlights links among the genes, whereas in the original data the synergies between the genes are hidden and affected by noise and outliers. Furthermore, the correlation between the five genes (MINOR_1, MINOR_2, PRKCB1, PDE4B, PDE4A) is much higher in the transformed datasets than in the original ones, as can be seen in Figure 2. Hence the space transformation method does not affect the biological information content of microarray data. It highlights meaningful information and improves gene expression analysis.

Fig. 2. Cluster dendrogram of a subset of 15 features with AU (Approximately Unbiased) p-value and BP (Bootstrap Probability) value (%) for raw and transformed data of the second dataset. The hierarchical clustering is computed using the R tool.

The presence of a prominent T-cell and follicular dendritic-cell signature in the dataset also indicates that microarray profiling can be used to capture additional non-malignant components of the tumor microenvironment. However, these components cannot be considered as markers for DLBCL outcome predictions. In particular, the secondary lymphoid-tissue chemokine (SLC), which participates in the emigration pathway of mature dendritic cells from the skin to regional lymph nodes, is a non-malignant component of the DLBCL signature and thus should not be directly correlated to DLBCL markers [13]. In Figure 2, the SLC is correlated in the raw data with the genes MINOR_1, PDE4A and PRKCB1, which have been studied as markers for fatal/refractory DLBCLs [13]. In the transformed data (see again Figure 2), instead, the SLC is not strongly correlated to MINOR_1, PDE4A, PRKCB1 and the other markers of DLBCLs. This confirms that the proposed method improves the quality of biological information content by highlighting the link between correlated genes and by isolating the outliers.

We performed similar experiments on the Leukemia1 and Tumor9 datasets, reaching the same conclusions. In particular, we found that the transformation preserves the same clusters, characterized by the same genes, leaving the information content unchanged.

IV. CONCLUSIONS

The presented approach, based only on gene expression values and on information which can be derived from the original datasets, tackles the microarray integration problem


by performing a projection of the microarray data in a common space. The projection is performed on all the microarray data. Thus, no reduction of the space of biological information is needed.

We tested the entropy of several microarray datasets before and after the space transformation. The entropy of a system describes the distribution of the data in space. A high value highlights the difficulty in separating clusters of genes with the same behavior. This method reduces the influence of noise and outliers and highlights the correlation and the synergies among genes in the datasets. The results show a reduction of the entropy, which is a measurement of the uncertainty of the information content of the datasets.

When integrating different microarray data, we tested the effectiveness of the method in keeping the biological information content of the data after the space transformation. Our method is able to improve the quality of this information by highlighting the link between correlated genes and by isolating the outliers.

As future work, we will implement different feature selection methods (supervised and unsupervised) to improve the gene-marker extraction and reduce the effect of this step on the entropy of the system. A future version of this approach will also allow integration of different kinds of information, such as ontologies, with integrated microarray data.

REFERENCES

[1] H. Jiang et al., "Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes," BMC Bioinformatics, vol. 5, 2004.
[2] F. Warnat, R. Eils, and B. Brors, "Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes," BMC Bioinformatics, vol. 4, no. 6, p. 265, Nov. 2005.
[3] L. Xu and A. C. Tan, "Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data," Bioinformatics, vol. 21, no. 20, pp. 3905-3911, 2005.
[4] X. Zhou et al., "Functional annotation and network reconstruction through cross-platform integration of microarray data," Nature Biotechnology, vol. 23, no. 2, pp. 238-243, 2005.
[5] J. Kang et al., "Integrating heterogeneous microarray data sources using correlation signatures," Data Integration in the Life Sciences, vol. 3615/2005, pp. 105-120, 2006.
[6] M. Baitaluk et al., "Biologicalnetworks: visualization and analysis tool for systems biology," Website, 2006, http://biologicalnetworks.net.
[7] I. Jeffery et al., "Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data," BMC Bioinformatics, vol. 7, no. 1, p. 359, July 2006.
[8] A. Statnikov et al., "A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis," BMC Bioinformatics, vol. 21, no. 5, pp. 631-643, March 2005.
[9] E. Baralis, E. Ficarra, A. Fiori, and E. Macii, "Gene-markers representation for microarray data integration: on-line appendix," Website, 2007, https://tasmania.polito.it/twiki/bin/view/Public/DownloadMaterial.
[10] D. Manoranjan and H. L., "Feature selection for clustering," Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 110-121, 2000.
[11] D. Manoranjan et al., "Feature selection for clustering - a filter solution," IEEE International Conference on Data Mining (ICDM), pp. 115-122, 2002.
[12] R. Suzuki and H. Shimodaira, "pvclust," Website, 2006, http://www.is.titech.ac.jp/~shimo/prog/pvclust/.
[13] M. Shipp et al., "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning," Nature Medicine, vol. 8, no. 1, pp. 68-74, Jan. 2002.
