Exploiting Statistical Redundancy in Expression Microarray Data to Foster Biological Relevancy
Lei Yu1, Jessica L Rennert2, Huan Liu1, Michael E Berens2
1 Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287-8809, USA {leiyu,hliu}@asu.edu
2 Translational Genomics Research Institute, Phoenix, AZ 85004, USA {jrennert,mberens}@tgen.org
Abstract
Discriminating meaningful biochemical differences between normal and tumor cells, as well as discerning nuances in the gene expression patterns among tumor cells of differential behaviors are likely to lead to foundational insights empowering improved diagnosis, prognosis, and therapeutics for cancer patients. Laboratory techniques now are routinely employed which portray the expression levels of tens of thousands of genes under experimental conditions, affording an unprecedented glimpse into the controlling influences underlying pathological behaviors. Basic statistical methods assess the differential expression of individual genes without considering the statistical redundancy between genes and often produce an overwhelming number of candidate genes for subsequent biological and clinical validation. This work points out the necessity to exploit statistical redundancy to foster biological relevance. It describes a method RSVP (Reporter Surrogate Variable Program) which reduces the number of selected candidate genes while increasing the overall discriminative power; use of the tool also fosters guided biological relevance of the candidate gene set for subsequent validation. From a set of differentially expressed genes, RSVP identifies a subset of reporter genes which are mutually non-redundant and jointly provide a signature profile for discriminating distinct specimen types or cellular phenotypes.
In addition, the method exploits the correlation between each of the reporter genes and non-reporter genes and enables biologists to select candidate genes of biological relevance according to biological knowledge. The effectiveness of this method is demonstrated through results from computational experiments on glioma microarray data.
1 Introduction
The pathological features of cancer include unregulated cell growth creating destructive masses of abnormal cells; a propensity of tumor cells to migrate away from the central tumor mass, invading normal tissue and seeding secondary destructive tumors; and the suppressed or lost capacity to undergo programmed cell death (apoptosis). Comparisons of genes expressed between normal cells and tumor cells, as well as comparisons of global gene expression between tumors of different histopathologies, demonstrate an ability to use molecular techniques for diagnosis and prognosis [25]. Significant efforts are also being applied to develop gene expression signatures of tumors that may inform an appreciation of specific chemotherapeutic vulnerabilities of individual tumors, heralding the implementation of personalized medicine [12]. Since tumors comprise heterogeneous cell populations with disparate genetic aberrations and various behaviors, applications of such gene expression analysis to further an understanding of the regulatory processes underlying malignant progression are vigorously sought. Inherent to such research is the confounding challenge of specimen-to-specimen variations impacting the potentially subtle changes in the expression levels of specific genes that drive the malignant behavior. In anticipation of this challenge, bioinformatics techniques that accelerate and refine the detection of critical genes underlying tumor behavior are in demand. An empiric laboratory technique by which to reproducibly impose two behavioral patterns on tumor cells is a radial cell migration assay [3], in which tumor cells from well-established cell lines or from primary tumor explants are configured as crowded, nonmotile “cores” of cells and as dispersed “rims” of migrating cells [4]. From ten specimens, such cells were collected as these two discrete populations (cores and rims), and the mRNA was isolated and processed for whole genome expression analysis using oligonucleotide microarrays. A crucial step in analyzing the obtained gene expression profiles is to identify genes which show statistically interesting expression patterns and hence may serve as candidate genes for biological and clinical validation. In many cases, genes of statistical interest are those which are differentially expressed between two types of tissues or between two experimental conditions [2, 7, 8, 19].
Various statistical methods such as Student’s t-tests, nonparametric tests, and ANOVA have been employed to identify genes of differential expression [6, 14, 20, 24]. One important issue inherent to statistical tests of microarray data is to properly control the family-wise error rate (FWER) when simultaneously testing thousands of genes. Several robust methods have been recently introduced [10, 17, 18] to address this issue. Although statistical methods for selection of differentially expressed genes are widely employed as preliminary analysis tools for microarray data, follow-up validation is critical to translate statistical findings into a better understanding of the biological phenomena. As biological and clinical validation of candidates is very time consuming and expensive, a pressing challenge faced by researchers is to decide how many and which genes should be selected for validation from the pool of statistically meaningful candidates. Ideally, the candidate gene set should contain a sufficiently small number of genes, each with biological relevance. However, it has been shown in many studies that the number of genes identified as differentially expressed by various statistical methods is often several hundred to over a thousand [6, 10, 17, 18], which is far beyond the capacity of biological and clinical validation. The reason for this overwhelming number of candidates is that univariate statistical tests assess the differential expression of genes one by one without considering the correlation between genes. If a gene is one of the true genetic “drivers” for the differences between two phenotypes under study and has been found differentially expressed, some other genes highly correlated with this gene are also likely to be identified as differentially expressed genes.
On the one hand, we want to reduce the number of selected candidate genes while increasing the overall discriminative power; on the other hand, we want to select more genes of biological relevance as candidates for subsequent validation. In this paper, we propose to achieve these two goals by the following method. First, we identify from differentially expressed genes a small
subset of reporter genes which are mutually non-redundant and jointly provide a signature profile for discriminating two phenotypes under study. Second, we exploit the correlation between each of the reporter genes and non-reporter genes to determine a candidate gene set of enhanced biological relevance. The remainder of this paper is organized as follows. Section 2 provides some theoretical background on statistical relevance and redundancy. Section 3 describes our proposed method RSVP. Section 4 presents a case study on a glioma data set from a radial migration assay and discusses results from applying RSVP to foster biological relevance in glioma migration. Section 5 concludes this work.
2 Statistical Relevance and Redundancy
A microarray data set can be viewed as a gene expression matrix, in which each column represents a gene and each row represents a sample (or experiment) with a class label. Each value $f_{ij}$ is the measurement of the expression level of the $j$th gene for the $i$th sample, where $i = 1, \ldots, M$ and $j = 1, \ldots, N$. This format conforms to the usual data format of machine learning, where a gene can be regarded as a feature and a sample as an instance. For a real-world data set with hundreds or thousands of features, it is common that a large number of features are not informative in discriminating the target class because they are either irrelevant or redundant with respect to the class. Some recent work in feature selection [5, 27] has been able to consider the correlation between features in selecting a minimum subset of features with maximum discriminative power for the class. In this section, we discuss statistical relevance and redundancy of genes in terms of statistical relevance and redundancy of features.
2.1 Statistical Relevance
In [15], features are classified into three disjoint categories, namely, strongly relevant, weakly relevant, and irrelevant features. Let $F$ be a full set of features, $F_i$ be the domain of the $i$th feature, and $p(C \mid F)$ be the probability distribution of the class values given the feature values in $F$. These categories of relevance can be formalized as follows.
Definition 1 (Strong relevance) A feature $F_i$ is strongly relevant iff $p(C \mid F) \neq p(C \mid F - \{F_i\})$.
Definition 2 (Weak relevance) A feature $F_i$ is weakly relevant iff $p(C \mid F) = p(C \mid F - \{F_i\})$, and $\exists\, S \subset F$ ($F_i \notin S$), such that $p(C \mid F_i, S) \neq p(C \mid S)$.
Corollary 1 (Irrelevance) A feature $F_i$ is irrelevant iff $\forall\, S \subset F$ ($F_i \notin S$), $p(C \mid F_i, S) = p(C \mid S)$.
Strong relevance of a feature indicates that the feature is indispensable; it cannot be removed without loss of discriminative power. Weak relevance suggests that the feature is not always necessary but may become necessary to the discrimination of the class. Irrelevance (following Definitions 1 and 2) indicates that the feature can never contribute to the discrimination of the class. To achieve maximum discriminative power with a minimum subset of features, all strongly relevant features, none of the irrelevant features, and a subset of the weakly relevant features should be selected. However, the definitions do not specify which of the weakly relevant features should be selected and which should be removed. Therefore, there is also a need for feature redundancy analysis.
2.2 Statistical Redundancy
In [26], we formally defined feature redundancy based on the definition of a feature’s Markov blanket [16].
Definition 3 (Markov blanket) Given a feature $F_i$, let $M_i \subset F$ ($F_i \notin M_i$). $M_i$ is said to be a Markov blanket for $F_i$ iff $p(F - M_i - \{F_i\}, C \mid F_i, M_i) = p(F - M_i - \{F_i\}, C \mid M_i)$.
The Markov blanket condition requires that $M_i$ subsume not only the information that $F_i$ has about $C$, but also the information it has about all of the other features. It is proved in [16] that a feature removed based on the existence of a Markov blanket in an earlier phase will still find a Markov blanket in any later phase when other features are removed based on the Markov blanket criterion. According to the previous definitions of feature relevance, we can also prove that strongly relevant features cannot find any Markov blanket. Since irrelevant features should be removed anyway, we exclude them from our definition of redundant features. Hence, our definition of a redundant feature is given as follows.
Definition 4 (Redundant feature) A feature is redundant iff it is weakly relevant and has a Markov blanket in the current set.
Figure 1: A view of feature relevance and redundancy. The feature set is divided into four regions: I, irrelevant features; II, weakly relevant and redundant features; III, weakly relevant but non-redundant features; IV, strongly relevant features.
From this property of the Markov blanket, it is easy to see that a redundant feature removed earlier remains redundant when more features are removed. Figure 1 depicts the relationships between the definitions of feature relevance and redundancy. It shows that an entire feature set can be conceptually divided into four basic disjoint parts: irrelevant features, redundant features (part of the weakly relevant features), weakly relevant but non-redundant features, and strongly relevant features. Our goal is to find all strongly relevant features and weakly relevant but non-redundant features to form a minimum feature set of maximum discriminative power.
3 Reporter Surrogate Variable Program (RSVP)
The method RSVP proposed in this section consists of two steps: it first identifies a subset of reporter genes based on statistical redundancy analysis and then further exploits statistical redundancy to identify surrogate genes for each reporter gene.
3.1 Searching for Reporter Genes
When searching for statistically relevant genes and statistically redundant genes according to the definitions in Section 2, efficient approximation methods are needed for two reasons. First, an exhaustive or complete search is prohibitive with a large number of features due to the combinatorial nature of these definitions. Second, these definitions are based on the full population where the true data distribution is known. It is generally assumed that a training data set is only a small portion of the full population, especially in a high-dimensional space as in microarray data. Basic statistical tests for differentially expressed genes are commonly adopted approximation methods to find statistically relevant genes. However, they assess the differential expression of each gene independently, without considering redundancy between genes. We next present an effective correlation-based method to identify statistically redundant genes from a set of statistically relevant genes.
There exist two major types of measures for the correlation between genes or between a gene and the class: linear correlation measures and information-theoretical measures. Of the linear correlation measures, the best known is the linear correlation coefficient [21], which has many variations. Linear correlation measures may not be able to capture correlations that are not linear in nature. In addition, since these measures are calculated on numerical values, they cannot directly measure the correlation between two sets of genes or between a set of genes and the class. We therefore adopt measures based on the information-theoretical concept of information gain [21]. Since information gain tends to favor variables with more values, we employ symmetrical uncertainty ($SU$) [21], defined as
$$SU(X, Y) = 2\,\frac{IG(X \mid Y)}{H(X) + H(Y)},$$
where $H(\cdot)$ denotes entropy and $IG(X \mid Y)$ denotes the information gain of $X$ given $Y$. Symmetrical uncertainty compensates for information gain’s bias toward variables with more values and restricts its values to the range [0, 1].
A value of 1 indicates that knowing the values of either variable completely predicts the values of the other; a value of 0 indicates that $X$ and $Y$ are independent. In the following, the correlation between any gene $F_i$ and the class $C$ is called individual C-correlation, measured by $ISU_i$ (individual symmetrical uncertainty); the correlation between any pair of genes ($F_i$ and $F_j$, $i \neq j$) and the class $C$ is called combined C-correlation, measured by $CSU_{i,j}$ (combined symmetrical uncertainty). For combined C-correlation, genes $F_i$ and $F_j$ are virtually treated as one single gene $F_{i,j}$, and the Cartesian product of the domains of $F_i$ and $F_j$ is the domain of $F_{i,j}$. Just as individual C-correlation captures the discriminative power of a single gene, combined C-correlation captures the joint discriminative power of two genes.
Our method first ranks each gene according to its individual C-correlation and then determines the statistical redundancy between these genes based on pair-wise correlation analysis. It approximately determines the redundancy between two genes based on both their individual C-correlation and combined C-correlation. It assumes that a gene with a larger individual C-correlation value contains by itself more information about the class than a gene with a smaller individual C-correlation value. For two genes $F_i$ and $F_j$ with $ISU_i \geq ISU_j$, it chooses to evaluate whether gene $F_j$ can be approximately redundant to gene $F_i$ (instead of $F_i$ to $F_j$) in order to maintain more information about the class. In addition, if combining $F_j$ with $F_i$ does not provide more discriminative power than $F_i$ alone, it heuristically decides that $F_j$ is approximately redundant to $F_i$ and hence a candidate for being removed from the current gene set. An approximately redundant gene (redundant gene for short in the remainder of this paper) is defined as follows.
Definition 5 (Approximately redundant gene) For two genes $F_i$ and $F_j$, $F_j$ is approximately redundant to $F_i$ iff $ISU_i \geq ISU_j$ and $ISU_i \geq CSU_{i,j}$.
A gene $F_j$ that causes a set of genes to be redundant can itself be redundant to another set of genes. For instance, suppose $F_k$ is redundant to $F_j$ and $F_j$ is redundant to $F_i$. In such a case, if $F_j$ is the only gene in the current gene set that causes $F_k$ to be redundant, then after removing $F_k$ based on $F_j$, further removing $F_j$ based on $F_i$ will make the previous removal of $F_k$ unjustified.
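The quantities above can be illustrated with a minimal Python sketch. It assumes discretized expression values, estimates entropies from observed frequencies, and uses the standard identity $IG(X \mid Y) = H(X) + H(Y) - H(X, Y)$. The function names and the toy vectors are hypothetical and are not taken from the RSVP implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), with IG = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # assumption: two constant variables are treated as uncorrelated
    hxy = entropy(list(zip(x, y)))
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def isu(gene, labels):
    """Individual C-correlation of one gene with the class."""
    return symmetrical_uncertainty(gene, labels)


def csu(gene_i, gene_j, labels):
    """Combined C-correlation: the pair is treated as one gene whose
    domain is the Cartesian product of the two individual domains."""
    return symmetrical_uncertainty(list(zip(gene_i, gene_j)), labels)


def approximately_redundant(gene_i, gene_j, labels):
    """Definition 5: F_j is redundant to F_i iff ISU_i >= ISU_j and ISU_i >= CSU_ij."""
    isu_i = isu(gene_i, labels)
    return isu_i >= isu(gene_j, labels) and isu_i >= csu(gene_i, gene_j, labels)


# Hypothetical toy data: 8 samples, two classes, two discretized genes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f_i = [-1, -1, -1, -1, 1, 1, 1, 1]   # tracks the class exactly
f_j = [-1, -1, -1, 0, 1, 1, 1, 0]    # a noisier copy of f_i

print(isu(f_i, labels), isu(f_j, labels), csu(f_i, f_j, labels))
print(approximately_redundant(f_i, f_j, labels))
```

In this toy case the combined correlation is lower than $ISU_i$ because the pair has a larger entropy without carrying more information about the class, so the noisier gene is flagged as redundant to the cleaner one.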
To guarantee that a gene removed due to its redundancy to other genes in an earlier phase will remain redundant in any later phase when another gene is removed, a gene is removed only when it is found to be redundant to a selected reporter gene, defined as follows.
Definition 6 (Reporter gene) A gene is selected as a reporter gene iff it is not redundant to any other genes in the current gene set.
Reporter genes will not be removed at any stage. Since a gene with the highest ISU value is not redundant to any other genes according to Definition 5, it must be one of the reporter genes and can be used as the starting point to determine the redundancy of the rest of the genes in the current gene set.
Table 1: An Approximation Algorithm for Finding Reporter Genes
1. Order genes based on decreasing ISU values
2. Initialize $F_i$ with the first gene in the list
3. Find and remove all genes redundant to $F_i$
4. Set $F_i$ as the next remaining gene in the list and repeat step 3 until the end of the list
An algorithm for finding reporter genes is summarized in Table 1. The algorithm only considers the correlation between individual genes in redundancy analysis. It is fairly straightforward to extend the algorithm to consider the correlation between two subsets of genes. However, this will not only increase the time complexity of the algorithm, but also might cause an over-searching problem [13] due to the data characteristics of limited samples in a high-dimensional space.
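The four steps of Table 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the RSVP code: the ISU scores are supplied as a dictionary, the combined C-correlation is supplied as a callable, and the gene names and scores in the toy example are hypothetical. Returning the genes removed for each reporter is a convenience added here and is not part of Table 1.

```python
def select_reporter_genes(isu, csu):
    """Sketch of Table 1.

    isu: dict mapping gene -> individual C-correlation (ISU).
    csu: callable (gene_i, gene_j) -> combined C-correlation (CSU).
    Returns a dict mapping each reporter gene to the genes removed
    as redundant to it.
    """
    # Step 1: order genes by decreasing ISU.
    remaining = sorted(isu, key=isu.get, reverse=True)
    reporters = {}
    while remaining:
        # Steps 2 and 4: the first remaining gene becomes the current F_i.
        f_i, rest = remaining[0], remaining[1:]
        # Step 3: every later gene already satisfies ISU_i >= ISU_j because
        # of the ordering, so only ISU_i >= CSU_ij needs to be checked.
        redundant = [f_j for f_j in rest if isu[f_i] >= csu(f_i, f_j)]
        reporters[f_i] = redundant
        remaining = [f_j for f_j in rest if f_j not in redundant]
    return reporters


# Hypothetical toy scores for four genes.
toy_isu = {"g1": 0.9, "g2": 0.8, "g3": 0.6, "g4": 0.5}
toy_csu = {
    frozenset(("g1", "g2")): 0.85,
    frozenset(("g1", "g3")): 0.95,
    frozenset(("g1", "g4")): 0.70,
    frozenset(("g2", "g3")): 0.90,
    frozenset(("g2", "g4")): 0.82,
    frozenset(("g3", "g4")): 0.65,
}

result = select_reporter_genes(toy_isu, lambda a, b: toy_csu[frozenset((a, b))])
print(result)
```

Because a gene is only ever removed relative to a gene that has already been fixed as a reporter, the removal of an earlier gene is never invalidated by a later removal, which is the guarantee Definition 6 is meant to provide.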
3.2 Searching for Surrogate Genes
The effectiveness of reporter genes in terms of discriminative power for cancer classification has been extensively tested in another set of experiments using benchmark data sets in which irrelevant and redundant genes are discarded [27]. Our focus in this work is to identify candidate genes underlying glioma migration behavior. A reporter gene is not necessarily a gene of biological relevance to the differences between the stationary and migratory phenotypes, and a gene of biological relevance is likely to be removed during the process of finding reporter genes. Therefore, it is highly desirable to identify surrogate genes which are highly correlated to a reporter gene and to present these surrogate genes back to biologists in the form of a surrogate list for each reporter gene. With the reporter gene set and its surrogate lists, biologists will have a much reduced set of genes with the desired discriminative power, yet retain the freedom to pick genes from either side as candidates for validation according to biological knowledge. By doing so, the candidate gene set will be statistically and biologically more robust than those produced using only basic statistical methods.
The correlation between two individual genes (their symmetrical uncertainty, denoted $ISU_{i,j}$) and a user-defined threshold $\delta$ are used to determine the genes in each surrogate list. Specifically, let $F'$ denote the whole set of statistically relevant genes and $G$ denote the reporter gene set ($G \subset F'$). For each reporter gene $F_i \in G$, $F_j$ is a surrogate gene to $F_i$ if and only if $F_j \in F' - G$ and $ISU_{i,j} \geq \delta$. The threshold $\delta$ can be tuned by biologists to adjust the number of genes in each surrogate list. Since the ISU value ranges from 0 to 1, the default threshold is set to 0.5.
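The surrogate rule can be sketched in a few lines of Python. The sketch assumes the gene-to-gene symmetrical uncertainty is available as a callable; the function name, gene names, and correlation values are hypothetical.

```python
def surrogate_lists(relevant_genes, reporters, gene_su, delta=0.5):
    """For each reporter F_i, list the non-reporter relevant genes F_j
    whose correlation with F_i is at least delta.

    relevant_genes: iterable of all statistically relevant genes (F').
    reporters: iterable of reporter genes (G, a subset of F').
    gene_su: callable (reporter, gene) -> symmetrical uncertainty in [0, 1].
    """
    reporter_set = set(reporters)
    non_reporters = [g for g in relevant_genes if g not in reporter_set]
    return {
        r: [g for g in non_reporters if gene_su(r, g) >= delta]
        for r in reporters
    }


# Hypothetical toy correlations between two reporters and two other genes.
toy_su = {
    ("g1", "g2"): 0.80, ("g1", "g4"): 0.40,
    ("g3", "g2"): 0.10, ("g3", "g4"): 0.55,
}
lists_default = surrogate_lists(
    ["g1", "g2", "g3", "g4"], ["g1", "g3"], lambda r, g: toy_su[(r, g)]
)
lists_loose = surrogate_lists(
    ["g1", "g2", "g3", "g4"], ["g1", "g3"], lambda r, g: toy_su[(r, g)], delta=0.3
)
print(lists_default)
print(lists_loose)
```

Lowering the threshold lengthens the surrogate lists, and under this rule a non-reporter gene may appear in the list of more than one reporter.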
4 Results
In this section, we study the empirical results obtained by RSVP on a real-world data set: glioma migration microarray data being investigated at the Translational Genomics Research Institute in the study of brain cancers. We compare RSVP with classic statistical methods and verify the effectiveness of exploiting statistical redundancy to foster biological relevance of selected genes for the glioma migration study.
4.1 Glioma Migration Microarray Data
Ten glioma specimens from 7 long-term cell lines and 3 primary tumor explant cultures were manipulated to produce triplicate biological replicates of cores and rims using a radial migration assay. Standard processing techniques for RNA isolation, amplification, and labeling were employed, and the analytes were applied to 40K gene chips according to the manufacturer’s protocols (Agilent) [23]. The data matrix obtained is in the form of 40K genes × 60 samples holding log2 ratios of background-subtracted, LOWESS-normalized fluorescence values of samples versus a universal reference RNA sample (Stratagene). After summarizing the three biological replicates by the median of their expression ratios, the final data set used in this study contains 20 samples (a core and a rim sample for each of the ten specimens) [22].
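The replicate summarization step (60 samples reduced to 20 by taking the median of each triplicate) can be sketched for a single gene as follows. The dictionary layout and the ratio values are hypothetical; only the specimen names G112MS and U87 are taken from the sample labels of this data set.

```python
from statistics import median


def collapse_replicates(ratios_by_sample):
    """Summarize replicate log2 ratios of one gene by their median.

    ratios_by_sample: dict mapping (specimen, phenotype) -> list of
    replicate log2 ratios.
    """
    return {key: median(values) for key, values in ratios_by_sample.items()}


# Hypothetical triplicate log2 ratios for one gene in two specimens.
one_gene = {
    ("G112MS", "core"): [0.2, 0.5, 0.3],
    ("G112MS", "rim"): [-0.4, -0.1, -0.6],
    ("U87", "core"): [1.1, 0.9, 1.0],
    ("U87", "rim"): [0.1, 0.0, 0.4],
}
collapsed = collapse_replicates(one_gene)
print(collapsed)
```

With three replicates the median is simply the middle value, which makes the summary insensitive to a single outlying replicate.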
4.2 Results from Classic Statistical Methods
Hierarchical clustering techniques [1, 9] have been widely employed by biologists to study gene expression profiles from microarray data. In the work of Alon et al. [1], a two-way clustering method was applied to detect groups of correlated genes and cell line samples from tumor and normal colon tissues. The clustering method separated tumor and normal samples into two distinct clusters based on a small set of genes with the most statistically significant difference between tumor and normal samples. Their additional experiments showed that clustering distinguished tumor and normal samples even when the genes used have a small average difference between the two phenotypes. This finding suggested that for many genes there is a systematic difference between tumor and normal samples.
Figure 2: Hierarchical clustering results based on all genes (a) and genes selected using two-sample t-test (p < 0.05) (b).
In this work, hierarchical clustering is adopted to capture and visualize the discriminative power of various gene sets emerging from the gene expression profiles of the two phenotypes. It is worth mentioning that predictive accuracy is another commonly adopted criterion to evaluate the discriminative power of a selected gene set. However, as stated earlier, the focus of our study on glioma migration data is not to select genes to derive a classification model. Figure 2 (a) shows the dendrogram obtained by applying hierarchical clustering to the whole data set. Core and rim samples from the same specimen are uniformly grouped together. This result indicates that the core-to-rim variations are remarkably less significant than specimen-to-specimen variations, making it difficult to discriminate rim from core samples. It also suggests that it may be more difficult to establish distinct genetic profiles between stationary and migratory tumor cells than between tumor and normal cells. Therefore, it is necessary to look for genes that are differentially expressed between the two phenotypes. Two-sample t-tests are commonly used to identify genes of differential expression. Figure 3 depicts the expression patterns of the five genes with the smallest p-values from the two-sample t-test. These five genes display two distinct co-expression patterns: genes NM_013271 and BC004888 are down-regulated, and genes NM_032735, I_1985866, and AB007950 are up-regulated, from the core to the rim sample of the same specimen. These patterns illustrate the statistical redundancy between differentially expressed genes.
[Figure 3 plot. Series: NM_013271, BC004888, NM_032735, I_1985866, AB007950. Vertical axis: Gene Expression (Log2(Ratio)). Horizontal axis: Samples (core and rim sample of each specimen).]
Figure 3: Illustrative expression patterns of top 5 genes selected using two-sample t-test.
Based on p-value thresholds of 0.1, 0.05, and 0.01, three sets of genes (composed of 306, 137, and 22 genes, respectively) were selected using the two-sample t-test. Figure 2 (b) shows the dendrogram of the clustering result for the 137 genes. Core and rim samples from various specimens tend to fall into two clusters; however, matched core and rim samples from three out of the ten specimens (U87EGFR, G120, SF767) are not separated. Dendrograms produced by the 306 genes and the 22 genes look very similar to the one in Figure 2 (b). The similarity between these dendrograms indicates that the resulting structure is not a simple artifact of the clustering procedure. In addition to the two-sample t-test, we also experimented with Golub’s correlation coefficient [11] to select differentially expressed genes and obtained similar results, because Golub’s correlation coefficient and the p-values from the two-sample t-test rank the genes similarly.
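The selection by two-sample t-test can be sketched with the Python standard library. The sketch makes several assumptions that the text does not state: a pooled-variance (Student) test, two-sided p-values, and 10 core versus 10 rim samples, which gives 18 degrees of freedom. Instead of computing p-values, it compares |t| with the standard two-sided critical values of the t distribution for 18 degrees of freedom. Gene names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided critical values of Student's t with 18 degrees of freedom.
T_CRIT_DF18 = {0.10: 1.734, 0.05: 2.101, 0.01: 2.878}


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def select_genes(core, rim, alpha=0.05):
    """Return genes whose |t| exceeds the critical value for alpha.

    core, rim: dicts mapping gene -> list of 10 log2 ratios each.
    """
    t_crit = T_CRIT_DF18[alpha]
    selected = []
    for gene in core:
        if len(core[gene]) + len(rim[gene]) - 2 != 18:
            raise ValueError("critical values assume 10 + 10 samples")
        if abs(pooled_t(core[gene], rim[gene])) > t_crit:
            selected.append(gene)
    return selected


# Hypothetical toy data: one shifted gene and one unchanged gene.
base = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05]
toy_core = {"gene_up": base, "gene_flat": base}
toy_rim = {"gene_up": [v + 1.0 for v in base], "gene_flat": list(reversed(base))}

print(select_genes(toy_core, toy_rim, alpha=0.05))
```

A statistics package would normally be used to obtain exact p-values; the critical-value comparison is used here only to keep the sketch self-contained.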
4.3 Results from RSVP Method
To apply RSVP, continuous expression ratios of each gene were discretized into three values -1, 0, and 1, representing under-expression, baseline, and over-expression of genes, which correspond to $(-\infty, \mu - \sigma/2)$, $[\mu - \sigma/2, \mu + \sigma/2]$, and $(\mu + \sigma/2, +\infty)$, respectively. From the set of 306 genes selected using the two-sample t-test (p < 0.1), RSVP selected a total of 23 reporter genes. Table 2 reports a summary of these reporter genes. RSVP also identified surrogate genes for each reporter gene based on the default correlation threshold of 0.5. Table 2 also reports some of the surrogate genes of biological relevance (refer to the supplementary information for a complete list of surrogate genes). A comparison of the genes in Table 2 and Figure 3 shows that among the top five genes selected by the two-sample t-test, which display two distinct co-expression patterns (shown in Figure 3), only genes BC004888 and AB007950 appear in the reporter gene set, one from each co-expression pattern. This verifies the effectiveness of RSVP in identifying redundancy between differentially expressed genes. Figure 4 (a) shows the clustering dendrogram based on the 23 reporter genes and a heat map of the expression distribution for these genes. In the heat map, log ratios of 0 are colored black, and increasingly positive or negative log ratios are respectively colored red or green with increasing intensity. The 20 samples form two distinct clusters according to the two phenotypes. The 23 genes also fall into two distinct clusters, with 12 genes down-regulated and 11 genes up-regulated from core to rim samples.
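The three-level discretization described at the start of this subsection can be sketched as follows: -1 for ratios below $\mu - \sigma/2$, 0 for ratios inside $[\mu - \sigma/2, \mu + \sigma/2]$, and 1 for ratios above $\mu + \sigma/2$. The sketch assumes that $\mu$ and $\sigma$ are the mean and sample standard deviation of one gene's ratios across samples, which the text does not spell out; the function name is hypothetical.

```python
from statistics import mean, stdev


def discretize(ratios):
    """Map one gene's log2 ratios to -1, 0, or 1 using mu +/- sigma/2 cutoffs."""
    mu, sigma = mean(ratios), stdev(ratios)  # assumption: per-gene, sample std
    low, high = mu - sigma / 2, mu + sigma / 2
    return [-1 if r < low else 1 if r > high else 0 for r in ratios]


print(discretize([-1.0, 0.0, 1.0]))
print(discretize([0.2, 0.25, 0.3, 1.5, -1.2]))
```

A band of half a standard deviation on either side of the mean keeps small fluctuations around the gene's typical level at the baseline value, so that only clearer departures are passed on to the symmetrical uncertainty calculations.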
Figure 4: Hierarchical clustering results based on genes selected using RSVP: (a) shows a dendrogram and expression heat map from 23 reporter genes and (b) shows similar results with 4 reporter genes replaced by surrogate genes.
We now examine the biological relevance of the genes in Table 2. Given the pathological features of cancer (mentioned in the introduction), we particularly look for genes whose functions are known to be related to cell movement, growth, survival/apoptosis, and transcriptional regulation. Among the 23 reporter genes, 7 genes are found to be biologically relevant according to gene ontology information in existing knowledge bases. Among the surrogate lists, 9 such genes are found for 4 reporter genes (NM_014486, THC1422993, NM_030802, NM_003961). Simultaneously replacing each of these reporter genes with one of its biologically relevant surrogate genes produced clustering results very similar to the one shown in Figure 4 (a). Figure 4 (b) illustrates the clustering dendrogram and heat map obtained from one possible way of choosing the surrogate genes (marked by arrows) to replace the reporter genes. In contrast, replacing these reporter genes with randomly picked genes, or simply removing them, reduces the discriminative power to a level like that observed previously in the dendrograms in Figure 2. These experiments verify that exploiting the correlation between reporter genes and other statistically important genes enables us to pick genes according to biological relevance without sacrificing the overall discriminative power of the candidate gene set for biological validation.
5 Conclusion
This work highlights the importance of identifying and exploiting gene redundancy in analyzing gene expression microarray data, and proposes a method for selection of reporter genes and highly correlated surrogate genes. The purpose of this paper is to present a bioinformatics tool which helps biologists focus their effort on a small set of genes with enhanced discriminative power and biological relevance. At present, we are evaluating some of the genes identified by RSVP in additional tumor specimens using laboratory molecular tools and biological pathway analysis. The overall findings from this research may lead to strategies by which to control the spread and metastasis of cancer. As another line of future work, we plan to conduct further study on relations between reporter and surrogate genes.
Acknowledgements
We thank Lance Parsons for discussion on the role of surrogate genes, Seungchan Kim for help with the presentation of clustering results, and Dominique Hoelzinger for feedback on the selected genes. This work is supported by NIH (NS42262; JLR and MEB) and ET-I3 (LY and HL).
References [1] U. Alon, N. Barkai, D. A. Notterman, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96:6745–6750, 1999. [2] K. A. Baggerly, K. R. Coombes, K. R. Hess, et al. Identifying differentially expressed genes in cDNA microarray experiments. Journal of Computational Biology, 8:639–659, 2001. [3] M. E. Berens, M. D. Rief, M. A. Loo, and A. Giese. The role of extracellular matrix in human astrocytoma migration and proliferation studied in a microliter scale assay. Clin Exp Metastasis, 12:405–415, 1994. [4] M. Bittner, P. Meltzer, Y. Chen, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406:536–540, 2000. [5] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. In Proceedings of the Computational Systems Bioinformatics conference (CSB’03), pages 523–529, 2003. [6] S. Draghici, O. Kulaeva, B. Hoff, et al. Noise sampling method: an ANOVA approach allowing robust selection of differentially regulated genes measured by DNA microarrays. Bioinformatics, 19:1348–1359, 2003. [7] S. Dudoit, Y. H. Yang, M. J. Callow, and T. P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12:111–139, 2002. 11
[8] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96:1151–1160, 2001.
[9] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genomewide expression patterns. Proc. Natl Acad. Sci. USA, 95:14863–14868, 1998.
[10] D. Ghosh. Mixture models for assessing differential expression in complex tissues using microarray data. Bioinformatics, 20:1663–1669, 2004.
[11] T. R. Golub, D. K. Slonim, P. Tamayo, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[12] L. Hood, J. R. Heath, M. E. Phelps, and B. Lin. Systems biology and new technologies enable predictive and preventative medicine. Science, 306:640–643, 2004.
[13] D. D. Jensen and P. R. Cohen. Multiple comparisons in induction algorithms. Machine Learning, 38(3):309–338, 2000.
[14] M. K. Kerr, M. Martin, and G. Churchill. Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7:819–837, 2000.
[15] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[16] D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[17] J. G. Liao, Y. Lin, Z. E. Selvanayagam, and W. J. Shih. A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics, 20:2694–2701, 2004.
[18] R. Mansourian, D. M. Mutch, N. Antille, et al. The global error assessment (GEA) model for the selection of differentially expressed genes in microarray data. Bioinformatics, 20:2726–2737, 2004.
[19] M. A. Newton, C. M. Kendziorski, C. S. Richmond, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8:37–52, 2001.
[20] W. Pan. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18:546–554, 2002.
[21] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, Cambridge, 1988.
[22] J. L. Rennert, D. B. Hoelzinger, L. B. Reavie, et al. Steps towards the identification of a global transcriptome for glioma migration and invasion. Manuscript in preparation, 2005.
[23] The Tumor Analysis Best Practices Working Group. Expression profiling - best practices for data generation and interpretation in clinical trials. Nature Reviews Genetics, 5:229–237, 2004.
[24] O. Troyanskaya, M. E. Garber, P. O. Brown, et al. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18:1454–1461, 2002.
[25] M. J. van de Vijver, Y. D. He, L. J. van’t Veer, et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347:1999–2009, 2002.
[26] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.
[27] L. Yu and H. Liu. Redundancy based feature selection for microarray data. In Proceedings of the Tenth ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 737–742, 2004.
Table 2: Reporter and surrogate genes selected by RSVP. The first column presents 23 reporter genes. The second column presents surrogate genes of biological relevance (if any) for each reporter gene. Genes of biological relevance are highlighted in boldface.

     Gene Accession Number
No.  Reporter         Surrogate  Gene Name                                         Gene Functions
1    AB007950                    Cerebral protein 11                               unknown
2    ENST00000330263             unknown                                           unknown
3    AB040898                    KIAA1465 protein                                  unknown
4    NM_001955                   Endothelin 1                                      Regulation of cell proliferation
5    ENST00000329590             unknown                                           unknown
6    BC004888                    Hypothetical protein FLJ10052                     unknown
7    NM_013283                   Methionine adenosyltransferase II, beta           Regulation of cell proliferation
8    NM_005900                   SMAD, mothers against DPP homolog 1 (Drosophila)  Regulation of cell growth, apoptosis
9    NM_014486                   unknown                                           unknown
                      NM_003006  Selectin P ligand                                 Regulation of cell migration
                      NM_033356  Caspase 8, apoptosis-related cysteine protease    Regulation of cell apoptosis
                      BC044891   Myosin IC                                         Regulation of cell migration
                      NM_006193  Paired box gene 4                                 Fetal development and cancer growth
10   NM_001805                   CCAAT/enhancer binding protein (C/EBP), epsilon   Terminal differentiation
11   NM_014453                   Putative breast adenocarcinoma marker (32kD)      unknown
12   NM_013446                   Makorin, ring finger protein, 1                   Transcriptional regulation
13   NM_022104                   Chromosome 20 open reading frame 67               Transcriptional regulation
14   THC1422993                  unknown                                           unknown
                      NM_001554  Cysteine-rich, angiogenic inducer, 61             Regulation of cell growth, migration
                      Z97068     Cysteine-rich, angiogenic inducer, 61             Regulation of cell growth, migration
15   THC1562600                  unknown                                           unknown
16   NM_144608                   Hypothetical protein MGC39389                     unknown
17   NM_018235                   CNDP dipeptidase 2 (metallopeptidase M20 family)  unknown
18   THC1468677                  unknown                                           unknown
19   NM_003961                   Rhomboid, veinlet-like 1 (Drosophila)             Regulation of cell proliferation, migration
                      NM_173587  REST corepressor 2                                Transcriptional regulation
                      NM_016643  Mesenchymal stem cell protein DSC43               Transcriptional regulation
20   NM_030802                   C/EBP-induced protein                             unknown
                      NM_013994  Discoidin domain receptor family, member 1        Regulation of cell growth, differentiation
21   ENST00000296657             unknown                                           unknown
22   AB037798                    KIAA1377 protein                                  unknown
23   NM_014125                   unknown                                           unknown