On integrating multi-platform microarray data

0 downloads 0 Views 453KB Size Report
May 1, 2013 - via public repositories such as Gene Expression Omnibus ... gov/geo/ [4] and ArrayExpress http://www.ebi.ac.uk/arrayexpress [5]. Additionally,.
On integrating multi-platform microarray data Georgia Tsiliki, Dimitrios Vlachakis and Sophia Kossida Bioinformatics and Medical Informatics Group, Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou, 115 27, Greece With the extensive use of microarray technology as a potential prognostic and diagnostic tool, the comparison and reproducibility of results obtained from the use of different platforms is of interest. The integration of those data sets can yield more informative results corresponding to numerous data sets and microarray platforms. We developed a novel integration technique for microarray gene-expression data derived by different studies for the purpose of a two-way Bayesian partition modelling which estimates co-expression profiles under subsets of genes and between samples or experimental conditions. The suggested methodology transforms disparate gene-expression data on a common probability scale to obtain inter-study validated gene signatures. We evaluated the performance of our model using artificial data. Finally, we applied our model to six publicly available cancer gene-expression data sets and compared our results with well-known integrative microarray data methods. Our study shows that the suggested framework can relieve the limited sample size problem whilst reporting high accuracies by integrating multiplatform data. Key words: gene-expression, partition modelling, composite likelihood, integrative genomics.

1. Introduction The growth of high-throughput technologies such as microarrays and next generation sequencing (NGS) has been accompanied by active research in data analysis, producing new analysis tools and statistical methodologies. Microarray technologies in particular have matured rapidly over the past few years and present major challenges in managing, sharing, analysing and re-using the large amount of data generated [1] despite the considerable international efforts in standardizing data format, content and description [2, 3]. The majority of microarray experiments become available upon publication via public repositories such as Gene Expression Omnibus http://www.ncbi.nlm.nih. gov/geo/ [4] and ArrayExpress http://www.ebi.ac.uk/arrayexpress [5]. Additionally, specialized, mainly in cancer genome, databases have been collecting high-throughput data from the same samples. For instance, The Cancer Genome Atlas (TCGA) project [6] provides a platform to download and analyze data sets from twenty different types of cancer. Applications which link bioinformatic tools and databases have emerged, showing the way to easily visualize and analyze biomedical data. For instance, BioMart software [7] and its cancer specialized version IntOGen, are repositories which store readily combined data sets and provide platforms to easily visualize data. The Galaxy Project https://main.g2.bx.psu.edu offers a web-based platform allowing researchers to perform and share analyses. Similarly, the GenePattern platform provides access to

Proc. R. Soc. A 1–11; doi: 10.1098/rspa.00000000 1 May 2013 This journal is © 2011 The Royal Society

2

Tsiliki et al. 2013

more than 180 tools for genomic analysis to enable reproducible in silico research, http://www.broadinstitute.org/cancer/software/genepattern/. An increasingly interesting research topic is how to combine data sets or results from several studies, given the similarity of experimental protocols, in order to readily increase the sample size of the data, and for that increase the reliability and efficiency of biological investigations. When considering multiple studies, the technological, laboratory-based and biological variability of microarray data should be accounted for. Two major difficulties are that the expression measurements can be absolute or relative values, depending on the technology, and also the challenges associated with cross referencing measurements made between different technologies [6, 8]. Despite these difficulties, the results of integrated analysis clearly demonstrate the potential for increased statistical power and novel discovery by combining data from several studies [9]. Although integration methodologies for gene-expression data have the potential to strengthen or extend the results obtained from the individual studies in a comparatively inexpensive manner, there is little consensus as to how to perform such a data transformation effectively. For example, a simple strategy would be to first apply the platform-specific normalisation methods, then standardize the samples and median center all gene values, however other approaches can be considered when combining data sets from different array platforms. Most prior work in large-scale microarray integration has been performed in one of two contexts, namely statistical meta-analyses [10, 11], which treat each gene in each study independently, and transformation methodologies [12, 13, 14, 15], which transform gene-expression measurements from different studies into a common scale before the unification of the data; alternatively there are studies which combine the two methodologies [16]. According to the latter, integration is performed between differentially expressed genes from the available studies, assuming, for instance, that the parameters which capture the relationship between genes and phenotypes are related across studies [16, 9]. Under this perspective the cross-study integration is tailored to the problem of interest, and potentially relies on fewer assumptions than direct data integration. We propose a new framework based on a two-way Bayesian partitioning model, or else biclustering, that simultaneously clusters rows and columns of data matrix to address the challenging task of integrating gene-expression microarray data. The suggested framework overcomes the weakness of conventional meta-analysis by assuming inherently different normal distributions for different groups of genes expressed under various conditions. Furthermore, we do not consider phenotypic or any other functional information, to make use of the full data. A Gaussian model based on the composite likelihood of the data is used to infer patterns in the data under a fully Bayesian framework. Unlike other studies briefly mentioned above, which used clustering algorithms such as k-means and mixture models that require the specification of the number of clusters, our methodology applies a Markov Chain Monte Carlo (MCMC) search to sample from the posterior distribution of the sample parameters [17]. Based on the useful empirical assumption that the genes are expressed via a binary activation pattern across samples, i.e. they are either regulated or non-regulated in each sample [15], our model assumes that in each data set there exist a group of significantly low gene-expression values. We finally report the posterior probability of each gene value being derived by a normal distribution with zero mean value, marginalized over 100 random samples of the data. This is a common probabilistic scale which unifies disparate gene-expression data, and acts as a summary measure of the significance of each gene.

3 2. Methods We developed a Bayesian two-way partition model, to estimate the optimal partition of multi-platform data given the parameters of the underline distributions for each cluster. Similarly to the model suggested by Wang and Chen [15], we made the useful empirical assumption that the genes are expressed via a binary activation pattern across samples, i.e. they are either regulated or non-regulated in each sample. We assume that the nonregulated genes may contain expression values that are normally distributed around zero with a small variance, and that regulated genes are normally distributed with a cluster specific mean value and variance parameters. Finally, we report the posterior probability of each value being derived by a normal distribution with zero mean value given the data partition, a measure which allows us to unify data sets on a probability scale. (a) Bayesian partition modelling Let us consider a data matrix X = {xsij }p×n×S , where i and j index p genes and n samples respectively, and s = 1, ..., S index the number of different microarray studies considered with Xis = (xsi1 , ..., xsin ). In the context of gene-expression microarray analysis, xsij represents the logarithm of the expression level of gene i measured in sample j and study s. Only the common genes across the S studies are considered. Given X, we seek for an unordered collection of non-empty groups of genes which exhibit similar patterns across specific groups of samples or vice versa, whilst the groups themselves are different. Specifically, observations belonging to the same subset are assumed to be exchangeable, and observations from different subsets are assumed to be independent, we refer to these groups as biclusters. Thus biclustering methods search directly for submatrices of the data matrix X, i.e. biclusters, whose entries meet a predefined criterion. We assume that all genes belonging to a bicluster follow a normal distribution where φ(x|µ, σ 2 ) denotes the probability density function of a normal distribution with mean µ and variance σ 2 . Furthermore we assume that there always exists a bicluster consisting of low gene-expression values, i.e. derived from a normal distribution with a zero mean value. Those gene-expression values could belong to non-regulated genes given the phenotype of interest, i.e. in particular samples (e.g. normal or cancerous). We consider that they form a single bicluster, which we denote by C0 , and follow a normal distribution with zero mean value, C0 = {xsij , xsij ∼ N (0, σ02 )}. Overall we assume that data X consists of K + 1 clusters, including C0 . Each of the K remaining clusters, say the k th , consists of gene-expression values that follow normal distributions with bicluster specific mean and variance parameters, xsij ∼ N(µk , σk2 ). In order to define the number and boundaries of the inferred biclusters, we introduce a s ∈ {0, 1}, i = 1, . . . , n, j = 1, . . . , p, s = 1, . . . , S} which binary indicator matrix R = {rij specify the inclusion of the (i, j)th gene-expression value of the s study to each of the K + 1 estimated biclusters [18, 19]. The gene-expression value xsij is assigned to a bicluster s = 1. Alterations between zero and one values indicate the biclusters’ borders. We when rij assume that a bicluster should at least consist of five values. As can be seen in Figure 1 the binary pattern creates different biclusters, in this case five biclusters are formed (step B), and then one is selected at random with equal probability to be the C0 bicluster (step s = 1 values surrounded by r s = 0, or vise versa. C). A cluster is defined by a group of rij ij In that way K + 1 biclusters are formed. Note that each bicluster yields two marginal s indicators are designed to clusters for rows (genes) and columns (samples), thus the rij

4

Tsiliki et al. 2013

identify a set of samples that have a common pattern of responses across a set of genes [20]. Given R, K, the composite likelihood can be expressed as: P (X|R, K, Θ) =

p S Y n Y Y

φ(xsij |0, σ02 )

s=1 i=1 j=1

K Y

φ(xsij |µk , σk2 )

(2.1)

k=1

2 , σ 2 ). The mean values {µ } and the bicluster where Θ = (µ1 , . . . , µK , σ12 , . . . , σK k 0 membership indicators {rij } are designed to identify a set of samples that have a common pattern of responses across a set of genes. We adopt standard conjugate priors for the partition model parameters, i.e.

P (σk2 ) ∼ InverseGamma(αk , βk ) P (σ02 ) ∼ InverseGamma(α0 , β0 ) P (µk |σk2 ) ∼ N(τk , σk2 ) P (rij ) ∼ Bernoulli(πrij ) The prior parameters αk , βk , α0 , β0 , τk , σk2 are specified based on the sample means and variances. We set Bernoulli’s distribution probability of success equal to πrij = 21 . Additionally we set a uniform prior probability distribution to assign which of the (K + 1) 1 will be the C0 bicluster with P (Co ) = K+1 , also imposing the threshold of each bicluster having at least five gene-expression values. Then, the joint posterior distribution is: P (R, K, Θ|X) ∝ P (X|R, K, Θ)P (σ02 )P (C0 )

p S Y n Y Y s=1 i=1 j=1

s P (rij )

K Y

P (µk |σk2 )P (σk2 ).

k=1

The objective function in this case is the marginal posterior distribution P (R, K|X), which is derived by placing priors on all the parameters in the model and marginalizing over all parameters except the indicator variable R and the number of clusters (K + 1), which are the parameters of interest. Note that, unlike mixture modelling methodologies, our model contains the parameters R and K which are directly relevant to the basic partitioning problem. (b) Computationla strategy We use the MCMC sampler [17] to sample R and K from the posterior distribution. Our algorithm cycles through all rows and columns sequentially, flips the indicator variables of each row or column, and then decide whether to accept the change based on the acceptance rate. The detailed procedure of our algorithm is as follows: 1. Initialize the partition structure: s, (a) randomly assign binary indicators for each (i, j) the gene-expression value rij to be either one or zero, which sets the initial number of (K + 1) biclusters,

5 (b) randomly select one of the (K + 1) biclusters to be the C0 . Alterations of one-zero values indicate the borders and length of each bicluster, whilst we impose the threshold of each bicluster having at least five elements. A bicluster is defined by a group of identical rij values encompassed by 1 − rij values. 2. Cycle through all rows and columns sequentially. At each cycle propose which rij s values to flip with probability pm ∈ (0, 1). s indicator with probability p 3. Propose merging two biclusters with the same rij sm ∈ (0, 1), and split a randomly selected cluster with probability 1 − psm . The proposed change is then accepted with probability equal to the Bayes factor in favour of the proposed model over the current one.

4. Repeat the cycle until convergence. (c) Indexing The above partition model is repeated independently for 100 different random samples of individuals taken from X data. Lets assume S = 2, and the total number of individuals considered is the sum of the two studies, i.e. N = n1 + n2 where n1 , n2 denote the number of individuals examined in the two studies. For all the common p genes in the two studies, we consider 100 samples consisting of 1, 000 individuals randomly select with replacement from N . Then, for each of the formed data matrixces we repeat the above biclustering procedure, and for each xsij value we compute the posterior probability pij of this value belonging to C0 bicluster given R and K parameters, i.e. we test model M0 : xsij ∈ C0 against M1 : xsij ∈ / C0 pij =

p(M0 |R)p(M0 ) = p(M0 |R)p(M0 ) + p(M1 |R)p(M1 )

number of times xij value belongs to C0 over all random samples , number of all random samples which is simlified since we consider p(M0 ) = p(M1 ). Thus, each gene-expression value is now transformed to a posterior probability value which depicts how likely it is for this value to belong to C0 bicluster. pij values are used as an indexing technique to summarize values of very large and diverse data sets, under the same pathology or trait. By employing the suggested methodology data from different studies are unified given the above partition scheme and can then be further analyzed by using the posterior probability data matrix which gives a probability measure of how likely each xsij value is to belong to the non-regulated group of gene values. The methodology was implemented in C++ using the gsl library (http://www.gnu.org/software/gsl/). 3. Results The suggested framework enables the integration of gene-expression results performed under diverse microarray study conditions, therefore relieving concerns from the intractable “large p, small n" problem. Results obtained from our model, based on simulated and publicly available data, are shown to be both reliable and accurate for

6

Tsiliki et al. 2013

the top candidate biomarkers identified, therefore showing encouraging potentials for future biomarker discovery applications. (a) Performance assessment for simulated data To evaluate the performance of our methodology, we simulated data by adopting the strategy followed by Wang and Chen [15]. We generated gene-expression levels from two independent studies S1 and S2 . Each data set consists of 5, 000 genes and 200 samples under two conditions, treatment (40 samples) and control (160 samples). 500 genes are simulated to be differentially expressed between the two conditions. We estimated the differentially expressed genes between the two conditions for each data set using Welch t-test p-values with Benjamini-Hochberg correction [21], and for the integrated data set, denoted by S1 + S2 , given the suggested methodology. In Figure 2 (left hand-side) we plot the 500 true differentially expressed genes for S1 , S2 and S1 + S2 , ranked based on their corrected p-values. A random ranking of the genes is also available (black dashed line). Such plots, which are known as correspondence at the top (CAT) plots [22, 15], assess the agreement of the top differentially expressed genes by showing the proportion of true differentially expressed genes found in each data set. As can be seen in Figure 2, the proportion of estimated differentially expressed genes reported for S1 and S2 are similar, nevertheless the agreement is higher for the integrated data which indicates the reliability of the integrated methodology. We also report the integration-driven discovery rate (IDR) as defined in [23], namely we report the percentage of genes identified as differentially expressed in the integrated data that were not identified in any of the separate studies alone. In Figure 2 (right hand-side) we can observe the IDR values for S1 + S2 data set relative to the threshold value imposed to the corrected Welch t-test p-values. It is evident form the plot that the highly differentially expressed genes found in S1 + S2 are mostly novel entries compared to those of S1 and S2 . (b) Breast cancer biomarker case study In this section empirical gene-expression data sets from three breast cancer studies [24, 25, 26] are integrated under the suggested framework. All the data sets and clinical information files were downloaded from the Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/) as well as the relevant published studies. Table ?? summarizes the content of those breast cancer data sets, D1B , D2B and D3B , respectively, and their sources. Data were normalised and log transformed according to platform type so that the mean expression value for each study is zero and the expression values for a given gene are approximately normally distributed under each condition. Within each data set, genes with missing values in > 30% of the samples were removed, and the remaining missing values were imputed using k-nearest neighbors imputation algorithm with k = 10 [27], as implemented in R (http://www.bioconductor.org/packages/release/bioc/html/impute.html). We restrict our analysis to the set of common genes in the available studies recognized by cross-referencing HGNC symbols (Hugo Gene Nomenclature Committee, http:// www.genenames.org/) using the R annotate package (http://www.bioconductor.org/ packages/release/bioc/html/annotate.html), whereas probe IDs mapped to the same gene symbol were averaged. For each data set we run the suggested methodology for 100 samples retaining the proportionality of their ER structure. Each data set is integrated based on

7 the methodology suggested, and the discretised values correspond to the average of the discretised values over the random samples. Particularly, we fit the suggested methodology to the data in Table ??, and report our findings given the Estrogen Receptor (ER) status. ER has proved to be an important risk factor for breast cancer genesis [28], having two categorical values ER+ and ER−. Details about the ER status of each study samples are given in the parentheses of the last column of Table ??. In gene-expression data analysis, the most important goal may be to find a set of genes showing notably similar up-regulation and down-regulation under a set of conditions. Figure 3 shows the IDR plot for the proportion of genes found to be differentially expressed between ER+ and ER− samples in the integrated data, D1B + D2B + D3B , and were not reported as such in none of the individual studies. Similarly to simulated data analysis differentially expressed genes were selected based on corrected Welch t-test p-values. We can observe that the number of common differentially expressed genes increases as the p-value threshold increases, which suggests that our methodology estimated some novel differentially expressed genes when applied to real data as well as simulated data. Additionally, we compared the classification potential of the candidate biomarkers estimated by our framework to gene lists prepared based on three other integrative microarray data methods. We selected a 100 genes’ signature consisting of the most significant differentially expressed genes with respect to ER status, in order to evaluate their significance as candidate biomarkers. We compared our findings to those derived by a meta-analysis method metaMA [11], a cross-study method for the analysis of differentially expressed genes XDE [9], and a transformation method XPN [14]. In all cases we selected the 100 most significant genes with respect to ER. Four classifiers, KNN, support-vector machine (SVM), Naive Bayesian classifier and decision tree were employed to evaluate the estimated gene sets as implemented within the e1071 R package (http://cran.r-project.org/web/packages/e1071/index.html). For each classifier, we performed 10-fold cross-validation tests. Particularly, the data sets were randomly divided into two sets retaining their ER proportionality intact: 90% of samples were assigned to the training set and 10% of the samples were assigned to the validated set. The training set ids were used to train the classifier and the classification performance of the trained classifier was then evaluated on the validated set. We repeated the procedure for 100 times calculating the mean classification accuracy and report in Table ?? the average classification accuracies. We can observe that our methods’ average classification accuracies are higher or comparable to those by metaMA, XDE and XPN methodologies, whereas SVM is appearing to be the most conservative of the four classifiers used. Table 1: Gene-expression data sets. Breast cancer data sets D1B . Sørlie et al. [24] D2B . Sotiriou et al. [25] D3B . Wang et al. [26]

Microarray platforms spotted DNA/cDNA Affymetrix U133A Affymetrix U133A

Accession number GEO: GSE3193 GEO: GSE2990 GEO: GSE2034

Probe sets 9, 216

Num. Samples (ER+, ER−, NA) 58 (44, 13,1)

12, 625

189 (149,34,6)

22, 000

225 (186,69,0)

8

Tsiliki et al. 2013

Table 2: Classification comparison for empirical cancer data sets: the average classification accuracy for ten-fold cross validation over D1B data set for 50 genes selected from the four methodologies. Breast cancer

SVM

Our method metaMA XDE XPN

0.84 ± 0.15 0.87 ± 0.12 0.80 ± 0.15 0.77 ± 0.10

Naive Bayesian classifier 0.94 ± 0.13 0.90 ± 0.10 0.93 ± 0.10 0.93 ± 0.13

Decision tree

10-NN

0.96 ± 0.06 0.90 ± 0.20 0.90 ± 0.09 0.93 ± 0.07

1.00 ± 0.0 1.00 ± 0.0 1.00 ± 0.0 1.00 ± 0.0

4. Discussion Gene expression analysis methods aim mostly at identifying subset of genes that could explain the biological processes underlying microarray experiments. However, as the availability of microarray data increases, there is a growing need for cross evaluation of multiple independent microarray data sets along with careful indexing and summarization of results. Multi-platform transformation methods are applied for that purpose reducing study-specific biases and aiming to yield results which offer improved reliability and validity. Here, we propose a two-way partition model for multi-platform integration that reports the posterior probability of each gene value based on the partition membership and the study effect of each gene, given the assumption that genes are either regulated or non-regulated in each sample. We assumed that expression values that are normally distributed around zero with a small variance are likely to belong to non-regulated genes, whereas values which follow a normal distribution with non-zero mean value and larger variance are more likely to belong to regulated genes. In this context, use of a fully Bayesian model has several desirable features. The model borrows strength across both genes and studies and can thereby provide better estimates of the gene-specific means, variances and effects. The analysis of simulated data showed that our method increased power and reproducibility. Testing the empirical cancer data sets also showed that we can produce comparable results with meta-analysis and other transformation methods, when identifying differentially expressed genes and classifying their samples. One of the major advantages of our method is that we can unify gene-expression data sets based on the estimated partition of the data without taking into account the samples’ phenotype structure, nevertheless that could be imposed to the model. In a similar manner, a possible extension of the model is to filter the data by a particular biological function of interest and in that way resemble other methodologies which aim to integrate only functionally important subset of genes. Further research investigations can be performed towards those directions, since the suggested methodology is conceptually simple to implement.

9 Acknowledgment The authors are members of the BM1006 COST Action, SeqAhead: Next Generation Sequencing Data Analysis Network. We gratefully acknowledge funding from the EU DICODE (Mastering Data-Intensive Collaboration and Decision Making) Collaborative Project (FP7, ICT-2009.4.3, Contract No. 257184), and the 09SYN−13 − 901 EPAN II Cooperation grant (co-funded by the EU and the Greek State).

Data Accessibility The microarray data used for this study were derived from the NCBI Gene Expression Omnibus and ArrayExpress databases with accession numbers: GSE3193, GSE2990, GSE2034.

10

Tsiliki et al. 2013 References

Larsson, O., Sandberg, R., 2006 Lack of correct data format and comparability limits future integrative microarray research, Nat Biotechnol, 24(11): 1322–1323. Whetzel, P.L., Parkinson, H., Causton, H.C., Fan, L., Fostel, J., Fragoso, G., Game, L., Heiskanen, M., Morrison, N., Rocca-Serra, P., et al., 2006 The MGED Ontology: a resource for semantics-based description of microarray experiments, Bioinformatics, 22(7): 866–873. Strauss, E., 2006 Arrays of hope, Cell, 127(4): 657–659. Kim, I.F., Soboleva, A., Tomashevsky, M., Edgar, R., 2007 NCBI GEO: mining tens of millions of expression profilesâĂŞdatabase and tools update, Nucleic Acids Res, D760–765. Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., Holloway, E., Kolesnykov, N., Lilja, P., Lukk, M. et al., 2007 ArrayExpress - a public database of microarray experiments and gene expression profiles, Nucleic Acids Res, D747–750. The Cancer Genome Atlas Network, 2012 Comprehensive molecular portrait of human breast tumours, Nature, 490: 61–70. Kasprzyk, A. 2011 BioMart: driving a paradigm change in biological data management, Database, bar049. The MAQC Consortium 2006 The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements, Nat Biotechnol 24(9): 1151–1161. Scharf, R.B., Tjelmeland, H., Parmigiani, G., Nobe, A.B. 2009 A Bayesian model for cross-study differential gene expression, J Am Stat Assoc 104(488): 1295–1310. Fishel, I., Kaufman, A., Ruppin, E. 2007 Meta-analysis of gene expression data: a predictor-based approach. Bioinformatics 23: 1599âĂŞ-1606. Marot, G., Foulley, J.L., Mayer, C.D., Jaffrezic, F. 2009 Moderated effect size and p-value combinations for microarray meta-analyses. Bioinformatics 25(20): 2692–2699. Huttenhower, C., Hibbs, M., Myers, C., Troyanskaya, O.G. 2006 A scalable method for integration and functional analysis of multiple microarray datasets, Bioinformatics 22(23): 2890–2897. Zilliox, M.J., Irizarry, R.A. 2007 A gene expression bar code for microarray data. Nature, 4(11), 911–913. Shabalin, A.A., Tjelmeland, H., Fan, C., Perou, C.M., Nobel, A.B. 2008 Merging two gene-expression studies via cross-platform normalization, Bioinformatics 24(9): 1154–1160. Wang, M., Chen, J.Y. 2010 A GMM-IG framework for selecting genes as expression panel biomarkers. Artif Intell Med 48 (2-3): 75–82. Conlon, E.M. 2008 A Bayesian mixture model for metaanalysis of microarray studies, Funct Integr Genomics 8: 43–53. Robert, C.P., Casella, G. 1999 Monte Carlo statistical methods. New York: Springer. Booth, J.G., Casella, G., Hobert, J.P. 2008 Clustering using objective functions and stochastic search, J R Statist Soc B 70(1): 119–139. Hu, M., Qin, Z.S. 2009 Query Large Scale Microarray Compendium Datasets Using a Model-Based Bayesian Approach with Variable Selection, PLOS One 4(2): e4495. Zhang, J. 2010 A Bayesian model for biclustering with applications, Appl Statist JRSS C 59(4): 635–656. Benjamini, Y., Hochberg, Y. 1995 Controlling the False Discivery Rate: A Practical and Powerful Approach to Multiple Testing, J R Statist Soc B 57(1): 289–300. Irizarry, R.A., Warren, D., Spencer, F., Kim, I.F., Biswal, S., Frank, B.C., Gabrielson, E., Garcia, J.G.N., Geoghegan, J., Germino, G.et al. 2005Multiple-laboratory comparison of microarray platforms, Nature 2: 345–350. Choi, J.K., Yu, U., Kim, S., Yoo, O.J. 2003 Combining multiple microarray studies and modeling interstudy variation, Bioinformatics 19(1): i84–i90. Sørlie, T., Perou, C.M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S. et al. 2001 Gene expression patterns of breast carcinomas distinguishing tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98(19): 10869–10874. Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Fox, S., Smeds, J., Nordgren, H., Farmer, P., Praz, V., Haibe-Kains, B. et al. 2006 Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 98(4): 262–272. Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Yu, J. et al. 2005 Gene-expression profiles to predict distant metastasis

11 of lymph-node-negative primary breast cancer. Lancet 365(9460): 671–679. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. 2001 Missing value estimation methods for DNA microarrays, Bioinformatics 17: 520–525. vanâĂŹt Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. et al. 2002 Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.

Figure captions Figure 1 (Online version in colour): Illustration of the model-based integration scheme for multi-platform gene-expression data. The unified data X are considered, consisting of the set of common genes and different samples of the same pathology. In step (A) gene-expression values are assigned to biclusters based on the binary indicator matrix R, where yellow values denote 0 and blue values denote 1 say. In step (B) we show how blue biclusters in A surrounded by yellow biclusters, and vise versa, are considered are considered to form five individual biclusters, where the different colour scheme denotes the different normal distributions. In step (C) we select at random the dark blue coloured bicluster to be the C0 bicluster. The remaining biclusters follow different normal distributions denoted by the different colours. Figure 2 (Online version in colour): CAT and IDR plots for the simulated data. On the left hand-side, proportions of genes in common between the reported and the 500 true differentially expressed genes on simulated data are shown. On the right-hand side, the IDR plot for the proportions of genes that were found to be differentially expressed only in the integrated data is shown. Figure 3: IDR plot for the proportions of genes that were found to be differentially expressed only in the breast cancer integrated data. IDR was computed for (D1B + D2B + D3B ) data set relative to the threshold imposed to the corrected Welch t-test p-values.

Suggest Documents