Computational Biology and Chemistry 35 (2011) 126–130


Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Brief communication

Extensive increase of microarray signals in cancers calls for novel normalization assumptions

Dong Wang a,b, Lixin Cheng b, Mingyue Wang b, Ruihong Wu b, Pengfei Li b, Bin Li b, Yuannv Zhang b, Yunyan Gu b, Wenyuan Zhao b, Chenguang Wang a,b,∗, Zheng Guo a,b,∗

a Bioinformatics Centre, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
b College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China

Article info

Article history: Received 22 January 2011; Received in revised form 21 April 2011; Accepted 21 April 2011

Keywords: Differential expression; Deregulation direction; Cancer; Normalization

Abstract

When microarray data are used to study a complex disease such as cancer, it is common practice to normalize the data so that all arrays have the same distribution of probe intensities regardless of the biological groups of the samples. The assumption underlying such normalization is that in a disease the majority of genes are not differentially expressed (DE) and that the numbers of up- and down-regulated genes are roughly equal. However, accumulating evidence suggests that gene expression could be widely altered in cancer, so we need to evaluate how sensitive biological discoveries are to violations of this normalization assumption. Here, we analyzed 7 large Affymetrix datasets of pair-matched normal and cancer samples collected from the NCBI GEO database. In 6 of these 7 datasets, the medians of perfect match (PM) probe intensities increased in the cancer state, and the increases were significant in three datasets, suggesting that the assumption that all arrays have the same median probe intensity regardless of the biological groups of the samples might be misleading. We then evaluated the effects of the three currently most widely used normalization algorithms (RMA, MAS5.0 and dChip) on the selection of DE genes by comparing them with LVS, which relies less on the above-mentioned assumption. The results showed that RMA, MAS5.0 and dChip may produce many false down-regulated DE genes while missing many up-regulated DE genes. At least for cancer studies, normalizing all arrays to have the same distribution of probe intensities regardless of the biological groups of the samples might be misleading. Thus, most current normalizations based on unreliable assumptions may distort the biological differences between normal and cancer samples. The LVS algorithm might perform relatively well because it relies less on the above-mentioned assumption. Our results also indicate that genes may be widely up-regulated in most human cancers.
© 2011 Elsevier Ltd. All rights reserved.

∗ Corresponding authors at: College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China. Tel.: +86 045186615933; fax: +86 045186615933. E-mail addresses: [email protected] (Z. Guo), [email protected] (C. Wang).
1476-9271/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2011.04.006

1. Background

The application of microarrays has had a tremendous influence on modern biological research (Lander, 1999; Mohr et al., 2002; Guo, 2003; Quackenbush, 2006). However, microarray data are subject to multiple sources of technical artifacts, including the preparation of the biological samples, unequal quantities of starting RNA, the hybridization of the samples to the arrays and the quantification of the spot intensities (Zakharkin et al., 2005; Reimers, 2010). To remove the variation introduced by these technical artifacts while preserving the biological signals of interest, various normalization algorithms have been developed (Li and Wong, 2001; Bolstad et al., 2003; Irizarry et al., 2003a,b; Dudley et al., 2009), among which RMA, MAS5.0 and dChip are currently widely applied to Affymetrix chips. For example, the quantile normalization implemented in the RMA algorithm forces the probe intensities into the same distribution across all arrays (Bolstad et al., 2003). MAS5.0 normalizes probe intensities by linear scaling based on a reference array and also brings the intensities of all arrays to the same level (Hubbell et al., 2002). dChip normalizes probe intensities using a rank-invariant set, forcing all arrays to have almost the same distribution of probe intensities as a baseline array (Li and Wong, 2001). In microarray experiments for disease study, which usually compare disease samples with normal samples, it is common practice to use normalization algorithms including the above-mentioned


three to force probe intensities across all arrays to have the same distribution (and thus the same median) regardless of the biological groups of the samples. The underlying assumption is that in a disease the majority of genes are not differentially expressed and the numbers of up-regulated and down-regulated genes are roughly equal (Quackenbush, 2006). However, this assumption is rarely checked, so it is never certain that the data are properly normalized, especially given accumulating evidence that the gene expression pattern could be globally altered in a complex disease (Qiu et al., 2005; Klebanov et al., 2006; Zhang et al., 2008, 2009). To allow for the possibility that a large fraction of genes are differentially expressed and that up- and down-regulation are asymmetric in a disease, several normalization algorithms have been developed recently (Calza et al., 2008; Ni et al., 2008; Dudley et al., 2009; Wu and Aryee, 2010). For example, based on the assumption that a certain fraction of genes have stable expression across samples regardless of the sample states, the LVS (least-variant set) algorithm pre-selects a set of genes with small variation across all arrays and uses a non-linear model to fit the values of these genes from each individual array against those from a reference array (Calza et al., 2008). However, these recently proposed algorithms are rarely used because the global difference of gene expression between normal and cancer states and its effect on downstream analyses have not been fully analyzed until now. In this study, using 7 large Affymetrix datasets of pair-matched normal and cancer samples for 6 cancer types, we showed that the medians of perfect match (PM) probe intensities increased in cancers in 6 of the 7 datasets and that the increases were significant in three datasets, indicating that the above-mentioned assumption might not hold, at least for microarray data from cancers.
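To make concrete why forcing a common distribution removes a global shift, here is a minimal Python sketch of the core idea of quantile normalization (each value is replaced by the mean, across arrays, of the values holding the same rank). It is an illustration only and not the RMA implementation: the function name `quantile_normalize`, the tie handling and the toy intensities are assumptions made for this example.

```python
from statistics import median


def quantile_normalize(arrays):
    """Force every array (a list of probe intensities) onto one distribution.

    The shared reference distribution is the rank-wise mean of the sorted
    arrays. Ties are broken by probe position, which is a simplification.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference value at each rank: mean over arrays of the value at that rank.
    reference = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: the "cancer" array equals the "normal" array with
# every probe raised by 1.5 (a global up-shift on a log scale).
normal = [6.0, 7.0, 8.0, 9.0, 10.0]
cancer = [x + 1.5 for x in normal]

shift_before = median(cancer) - median(normal)
norm_normal, norm_cancer = quantile_normalize([normal, cancer])
shift_after = median(norm_cancer) - median(norm_normal)
print(shift_before, shift_after)
```

In this toy case the two arrays become identical after normalization, so a genuine global increase would be invisible to any downstream comparison.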
We then evaluated the effects of three widely applied algorithms, RMA, MAS5.0 and dChip, on the detection of differentially expressed (DE) genes in cancers, in comparison with the LVS algorithm (Calza et al., 2008), which relies less on the above-mentioned assumption. Our results showed that data normalizations based on the unreliable assumption could greatly distort the biological differences between normal and cancer samples, producing many false down-regulated DE genes while missing many truly up-regulated DE genes. Finally, our results suggested that gene expression may be widely up-regulated in most human cancers. Hence, normalization methods such as LVS, which rely less on the above-mentioned assumption, could be a better choice for cancer studies.

2. Materials and methods

2.1. Microarray datasets

From the NCBI GEO database (Barrett et al., 2007), we collected Affymetrix datasets with paired normal and cancer samples according to the following criterion: each dataset includes at least 30 samples for each state (normal or cancer), with all samples generated on the same microarray platform. After excluding one dataset (GSE19188) with ambiguous pair-matching information, we finally collected a total of 7 pair-matched datasets for 6 cancer types. All the datasets analyzed in this study are described in Table 1.

2.2. Batch effect analysis

To analyze the batch effect of processing date, we computed the correlation between processing dates and sample states using a generalized R² statistic for categorical data (Leek et al., 2010). The correlation ranges from 0% (no confounding) to 100% (completely confounded).


Table 1
Microarray datasets analyzed in this study.

Dataset        Accession number   Normal vs cancer   Confounding
Colon64        GSE8671            32:32              0.0063
ESCC106        GSE23400           53:53              0.0031
Gastric62      GSE13911           31:31              0.034
Lung66         GSE10072           33:33              0.032
Lung88         GSE18842           44:44              0
Pancreatic78   GSE15471           39:39              0
Prostate116    GSE6919            58:58              0.031

2.3. Data preprocessing and normalization

The probe sets were mapped to Entrez genes based on the SOURCE database (downloaded in July 2010) (Diehn et al., 2003). For a gene represented by multiple probe sets, we calculated its signal intensity in a sample as the mean of the intensities of all these probe sets in that sample (Zhang et al., 2009). We used only background-adjusted PM intensities (Irizarry et al., 2003a,b; Calza et al., 2008) because ignoring MM values has been recognized as preferable for background correction (Naef et al., 2002; Irizarry et al., 2003a,b). We analyzed the three most widely used normalization algorithms for Affymetrix GeneChip data: RMA (Irizarry et al., 2003a,b), MAS5.0 (Hubbell et al., 2002) and dChip (Li and Wong, 2001). All of these algorithms force all arrays to have the same distribution of probe intensities. For comparison, we also analyzed a recently proposed algorithm named LVS, which relies less on the above-mentioned assumption (Calza et al., 2008). It first pre-selects a proportion of genes with small variation across all arrays and then uses a non-linear model to fit these genes from individual arrays against those from a reference array. Here, we set this proportion to 60% or 40%, as suggested by the authors. We applied LVS to gene-level data after background correction and summarization by RMA.

2.4. Selection of differentially expressed genes and consistency analysis

SAM (significance analysis of microarrays; samr 1.25 R package) (Tusher et al., 2001) was applied to select differentially expressed (DE) genes in cancers. Then, we evaluated the consistency of the deregulation directions (up-regulated or down-regulated) of two lists of DE genes selected from the same dataset after normalization by two different algorithms. Strictly, we considered a gene to be shared by the two DE gene lists only if it was detected as a DE gene with the same deregulation direction in the dataset when normalized separately by each of the two algorithms.
Suppose k genes are shared between list 1 with length L1 and list 2 with length L2. Then the POG (percentage of overlapping genes) score from list 1 to list 2 is POG12 = k/L1, and the score from list 2 to list 1 is POG21 = k/L2. The average POG score is defined as POG = (POG12 + POG21)/2 (Zhang et al., 2009).

3. Results

3.1. Different distributions of PM probe intensities in normal and cancer samples

As described in Section 2, we collected, without selection bias, 7 large datasets with pair-matched samples for 6 cancer types. Considering the processing date as a potential source of batch effects, we used a linear model to examine for each dataset whether the sample states were associated with the processing dates (see Section 2). The confounding values for the 7 datasets were all very small (none above 0.034; see Table 1), indicating no appreciable batch effect of the processing date on the sample states in these datasets.
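The POG score of Section 2.4, including the rule that a gene is shared only when both lists agree on its deregulation direction, can be sketched in Python as follows. The function names and the toy gene lists are hypothetical and serve only as illustration; the final call uses the escc106 counts reported in Table 3 (8149 shared genes between lists of 8363 and 8659 genes).

```python
def pog_from_counts(k, len1, len2):
    """Return (POG12, POG21, average POG) from the shared count and list lengths."""
    if len1 <= 0 or len2 <= 0:
        raise ValueError("both DE gene lists must be non-empty")
    pog12 = k / len1
    pog21 = k / len2
    return pog12, pog21, (pog12 + pog21) / 2


def pog_scores(list1, list2):
    """Direction-aware POG for two DE gene lists.

    Each list is a dict mapping a gene ID to 'up' or 'down'. A gene counts as
    shared only if both lists report it with the same direction.
    """
    k = sum(1 for gene, direction in list1.items() if list2.get(gene) == direction)
    return pog_from_counts(k, len(list1), len(list2))


# Hypothetical toy lists (gene IDs and directions are made up).
lvs_list = {"G1": "up", "G2": "up", "G3": "down", "G4": "up"}
rma_list = {"G1": "up", "G2": "down", "G3": "down", "G5": "down", "G6": "down"}
# G1 and G3 are shared; G2 is in both lists but with opposite directions.
pog12, pog21, pog = pog_scores(lvs_list, rma_list)

# Counts reported in Table 3 for the escc106 dataset.
table3_pog = round(pog_from_counts(8149, 8363, 8659)[2], 2)
print(pog12, pog21, table3_pog)
```

Because the two directional scores are averaged, a long list that contains most of a short list still receives a moderate score rather than a perfect one.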


Fig. 1. The distributions of PM probe intensities in normal and cancer samples. PM probe intensities were averaged over all samples of each state. In the boxplots, white and grey represent the normal state (N) and the cancer state (D), respectively. Each box stretches from the lower hinge (the 25th percentile) to the upper hinge (the 75th percentile), and the median is shown as a line across the box.
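The comparison behind Fig. 1 in Section 3.1 rests on one median of PM intensities per sample, compared between states with a Wilcoxon rank sum test. The sketch below shows that computation in plain Python under stated assumptions: it uses the normal approximation without tie or continuity correction (crude for groups this small), and the function name `rank_sum_test` and all intensity values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, median


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Ties receive average ranks; no tie or continuity correction is applied,
    so this is a simplification for illustration only.
    Returns (z, p), where z > 0 means group x tends to rank higher than y.
    """
    values = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    w = sum(rank_of[v] for v in x)  # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical PM intensities (log scale): one list of probe values per sample.
normal_samples = [[6.8, 7.9, 9.1], [7.0, 8.1, 9.0], [6.9, 8.0, 9.3],
                  [7.2, 8.2, 9.4], [6.7, 7.8, 8.9]]
cancer_samples = [[7.5, 8.6, 9.8], [7.4, 8.5, 9.9], [7.7, 8.8, 10.0],
                  [7.6, 8.7, 9.7], [7.3, 8.4, 9.6]]

# One median per sample, then compare the two groups of medians.
normal_medians = [median(s) for s in normal_samples]
cancer_medians = [median(s) for s in cancer_samples]
z, p = rank_sum_test(cancer_medians, normal_medians)
print(f"z = {z:.3f}, p = {p:.4f}")
```

For real data with many samples and possible ties, an established statistics package would be the safer choice than this hand-written approximation.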

Then, for each of the 7 datasets, we computed the median intensity of all PM probes in each sample and compared the medians between the normal and cancer samples. As shown in Fig. 1, the median PM probe intensity was higher in cancer samples in 6 of the 7 datasets, and the increases were significant in three datasets (escc106, gastric62, pancreatic78) (p < 0.05, Wilcoxon rank sum test). In some other datasets, such as lung88, the increases of PM probe intensities in the cancer state were also rather large, and the non-significance was likely due to the low statistical power of analyzing data with relatively large technical and/or biological variation (Ein-Dor et al., 2006; Zhang et al., 2008). These results indicated that probe intensities tend to increase in all cancers in this study except colon cancer. Thus, for cancer studies, normalizing all arrays to have the same distribution (and thus the same median) of probe intensities regardless of the biological groups of the samples might be misleading.

3.2. Normalization greatly affects the detection of DE genes in cancers

Using the three datasets (escc106, gastric62, pancreatic78) with a significant increase of probe intensities in cancer samples, we evaluated the effects of data normalization on the selection of DE genes by comparing RMA, MAS5.0 and dChip with LVS. For LVS, we set the pre-defined proportion of genes with small variation across all arrays to 40% and 60%, as suggested by the authors (Calza et al., 2008) (see Section 2). We present only the results for LVS with a proportion of 40% because the results for 60% were similar. After normalizing the data with each algorithm, we selected DE genes using SAM with FDR = 0.05. In each of the three datasets, the lists of DE genes selected after RMA, MAS5.0 and dChip were rather inconsistent with those selected after LVS with a proportion of 40%, with POG (percentage of overlapping genes) scores ranging from 0.48 to 0.86 (see Table 2). In each dataset, for the overlapping DE genes detected by both LVS and another algorithm, the deregulation directions were highly consistent between the two algorithms, indicating that these DE genes could be correctly detected by both. For example, as shown in Fig. 2, in the escc106, gastric62 and pancreatic78 datasets, 99%, 99% and 91% of the overlapping DE genes detected by both LVS and RMA showed the same deregulation directions. Similar results were observed when comparing MAS5.0 and dChip with LVS.

Table 2
Comparison of DE genes detected by LVS and other algorithms.

               LVS (proportion = 40%)   RMA                MAS5.0             dChip
Dataset        DE genes                 DE genes   POG     DE genes   POG     DE genes   POG
Escc106        8363                     9301       0.86    8133       0.79    8411       0.78
Gastric62      10,073                   11,765     0.66    8514       0.61    9762       0.48
Pancreatic78   14,145                   15,696     0.72    11,838     0.72    14,029     0.56

Fig. 2. The deregulation directions of DE genes detected by LVS (proportion = 40%) and RMA. The left and right spheres represent the DE genes selected by using LVS and RMA, respectively. Overlap-consistent: overlapping DE genes with the same deregulation direction. Overlap-inconsistent: overlapping DE genes with different deregulation directions. Non-overlap-UP: up-regulated non-overlapping DE genes. Non-overlap-Down: down-regulated non-overlapping DE genes.

However, in each dataset, a large fraction of genes were selected as down-regulated DE genes by using RMA, MAS5.0 or dChip but not by using LVS. For example, as shown in Fig. 2, in the three datasets, about 18–39% of the genes detected as DE genes by using RMA were not detected as DE genes by using LVS, and above 96% of these genes were detected as down-regulated by using RMA. On the other hand, in each dataset, a large fraction of genes were selected as up-regulated DE genes by using LVS but not detected as DE genes by using RMA, MAS5.0 or dChip. For example, as shown in Fig. 2, in the three datasets, about 9–28% of the genes detected as DE genes by using LVS were not detected as DE genes by using RMA, and above 95% of these genes were detected as up-regulated by using LVS. Notably, the effects of data normalization were not limited to the three datasets with significantly increased medians of PM probe intensities in the cancer state. As shown in Fig. 1, in some other datasets, the increases of the medians of PM probe intensities in the cancer state were also large though not significant, and similar effects of data normalization could be observed in these datasets. Taking the lung88 dataset as an example, as shown in Fig. 2, 1691 genes were detected as DE genes by using RMA but not by using LVS with a proportion of 40%, and 99% of them were detected as down-regulated by using RMA. On the other hand, 2917 genes were detected as DE genes by using LVS but not RMA, and 99% of them were detected as up-regulated. Thus, RMA, MAS5.0 and dChip could produce a large fraction of false down-regulated DE genes and miss another large fraction of up-regulated DE genes in a cancer dataset.

Finally, we compared the results of using LVS with different proportions. As shown in Table 3, in the three datasets, the lists of DE genes selected with a proportion of 40% were highly consistent with those selected with 60%, with POG scores ranging from 0.93 to 0.96. For example, in the escc106 dataset, 8149 of the 8363 DE genes selected with 40% were included in the 8659 DE genes selected with 60%, and all of them showed the same deregulation directions. These results indicated that the detection of DE genes was rather robust to using 40% or 60% for LVS. Notably, the percentages of up-regulated DE genes selected by using LVS with 40% (or 60%) were 55% (52%), 89% (83%) and 71% (68%) in the three datasets, respectively. This result indicated excessive up-regulation of DE genes in cancers, contrary to the assumption that in a disease the numbers of up- and down-regulated DE genes are roughly equal.

Table 3
Comparison of DE genes detected by LVS with different proportions.

Dataset        DE genes (40%)   DE genes (60%)   Overlap genes   POG
Escc106        8363             8659             8149            0.96
Gastric62      10,073           10,065           9401            0.93
Pancreatic78   14,145           14,483           13,501          0.94

4. Discussion

Data normalization based on unreliable assumptions risks distorting the biological signals in high-throughput microarray data, which might do more harm than good for biological conclusions (Calza et al., 2008; Ni et al., 2008; Pelz et al., 2008; van de Wiel et al., 2010). In microarray experiments comparing normal samples with cancer samples, most current normalization approaches are based on the assumption that in a disease only a few genes are differentially expressed and the numbers of up- and down-regulated genes are roughly equal (Li and Wong, 2001; Hubbell et al., 2002; Irizarry et al., 2003a,b). However, in almost all cancer datasets analyzed in this study, the medians of PM probe intensities in cancer samples were higher than those in normal samples, indicating that this assumption may not hold, at least in microarray data for cancers. Our analysis also indicated excessive up-regulation of gene expression in cancers. We found that using algorithms such as RMA, MAS5.0 and dChip to normalize data based on the above-mentioned assumption could greatly affect the detection of DE genes in cancers. Specifically, they may produce many false down-regulated genes while missing many truly up-regulated genes in cancer. This might also greatly affect systems biology research, such as the use of expression correlations between genes to analyze disease-related biological networks (Qiu et al., 2005; Reverter et al., 2005; Guo et al., 2007; Yao et al., 2010). To ensure that the finding of DE genes is robust to any specific choice of analytic methodology, some researchers have suggested using the overlapping signals of various normalizations (Dudley et al., 2009). However, many DE genes truly related to cancer could be missed by this combination approach.

Recently, some oligonucleotide arrays have started to include negative control probes that are designed to have no complementary match to the transcriptome of the targeted samples (Wu and Aryee, 2010). Using the negative control probes, an algorithm named SQN (subset quantile normalization) (Wu and Aryee, 2010) can be applied to normalize data in which the distribution of gene expression is highly asymmetric. For most current microarrays without negative control probes, normalization methods such as LVS, which rely less on the above-mentioned assumption, could be a reasonable choice for cancer studies. Although the LVS algorithm raises the problem of pre-selecting a proportion of genes as a reference (Calza et al., 2008), our results suggested that the detection of DE genes was rather consistent between proportions of 40% and 60%. Considering that gene expression might be widely changed in cancer, we suggest using 40% for LVS. Nevertheless, a more robust and unbiased procedure for microarray data normalization is still needed. One possibility is to perform no data normalization, which retains the biological variation at the cost of leaving large technical variation that reduces the power to find biological signals (Marshall, 2004). In principle, increasing sample sizes can increase the


statistical power of biological discoveries (Klebanov and Yakovlev, 2007; Zhang et al., 2008, 2009). More fundamentally, more effort should go into designing experiments according to statistical principles, using sufficient samples for each biological group and stringently randomizing possible experimental artifacts between the biological groups to avoid batch effects of possible surrogates (Leek et al., 2010).

In other fields that use high-throughput arrays to measure molecular changes, data normalizations based on the traditional assumption of similar signal distributions across samples may also distort real biological signals. For example, recognizing that the total amount of CpG methylation can differ substantially among samples (Laird, 2010), some researchers choose to perform no data normalization for DNA methylation arrays (Christensen et al., 2009; O’Riain et al., 2009). However, others still perform normalization to remove technical variation at the risk of distorting some biological variation (Rakyan et al., 2010; Teschendorff et al., 2010). Similarly, for copy number data, researchers have noticed that most current normalization methods cannot be applied directly because of unbalanced high genomic gain (Staaf et al., 2007; van de Wiel et al., 2010). Evaluating the influence of data normalization for these high-throughput arrays also warrants future study.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant nos. 30970668, 81071646, 91029717), Excellent Youth Foundation of Heilongjiang Province (grant no. JC200808), Natural Science Foundation of Heilongjiang Province of China (grant no. QC2010012) and Scientific Research Fund of Heilongjiang Provincial Education Department (no. 11541156).

References

Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Edgar, R., 2007. NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res. 35, D760–765.
Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P., 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
Calza, S., Valentini, D., Pawitan, Y., 2008. Normalization of oligonucleotide arrays based on the least-variant set of genes. BMC Bioinform. 9, 140.
Christensen, B.C., Houseman, E.A., Marsit, C.J., Zheng, S., Wrensch, M.R., Wiemels, J.L., Nelson, H.H., Karagas, M.R., Padbury, J.F., Bueno, R., Sugarbaker, D.J., Yeh, R.F., Wiencke, J.K., Kelsey, K.T., 2009. Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genet. 5, e1000602.
Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J.C., Hernandez-Boussard, T., Rees, C.A., Cherry, J.M., Botstein, D., Brown, P.O., Alizadeh, A.A., 2003. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res. 31, 219–223.
Dudley, J.T., Tibshirani, R., Deshpande, T., Butte, A.J., 2009. Disease signatures are robust across tissues and experiments. Mol. Syst. Biol. 5, 307.
Ein-Dor, L., Zuk, O., Domany, E., 2006. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. U.S.A. 103, 5923–5928.
Guo, Q.M., 2003. DNA microarray and cancer. Curr. Opin. Oncol. 15, 36–43.
Guo, Z., Wang, L., Li, Y., Gong, X., Yao, C., Ma, W., Wang, D., Li, Y., Zhu, J., Zhang, M., Yang, D., Rao, S., Wang, J., 2007. Edge-based scoring and searching method for identifying condition-responsive protein–protein interaction sub-network. Bioinformatics 23, 2121–2128.
Hubbell, E., Liu, W.M., Mei, R., 2002. Robust estimators for expression analysis. Bioinformatics 18, 1585–1592.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P., 2003a. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, e15.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003b. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.
Klebanov, L., Jordan, C., Yakovlev, A., 2006. A new type of stochastic dependence revealed in gene expression data. Stat. Appl. Genet. Mol. Biol. 5, Article7.
Klebanov, L., Yakovlev, A., 2007. How high is the level of technical noise in microarray data? Biol. Direct. 2, 9.
Laird, P.W., 2010. Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet. 11, 191–203.
Lander, E.S., 1999. Array of hope. Nat. Genet. 21, 3–4.
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A., 2010. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739.
Li, C., Wong, W.H., 2001. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98, 31–36.
Marshall, E., 2004. Getting the noise out of gene arrays. Science 306, 630–631.
Mohr, S., Leikauf, G.D., Keith, G., Rihn, B.H., 2002. Microarrays as cancer keys: an array of possibilities. J. Clin. Oncol. 20, 3165–3175.
Naef, F., Hacker, C.R., Patil, N., Magnasco, M., 2002. Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biol. 3, RESEARCH0018.
Ni, T.T., Lemon, W.J., Shyr, Y., Zhong, T.P., 2008. Use of normalization methods for analysis of microarrays containing a high degree of gene effects. BMC Bioinform. 9, 505.
O’Riain, C., O’Shea, D.M., Yang, Y., Le Dieu, R., Gribben, J.G., Summers, K., Yeboah-Afari, J., Bhaw-Rosun, L., Fleischmann, C., Mein, C.A., Crook, T., Smith, P., Kelly, G., Rosenwald, A., Ott, G., Campo, E., Rimsza, L.M., Smeland, E.B., Chan, W.C., Johnson, N., Gascoyne, R.D., Reimer, S., Braziel, R.M., Wright, G.W., Staudt, L.M., Lister, T.A., Fitzgibbon, J., 2009. Array-based DNA methylation profiling in follicular lymphoma. Leukemia 23, 1858–1866.
Pelz, C.R., Kulesz-Martin, M., Bagby, G., Sears, R.C., 2008. Global rank-invariant set normalization (GRSN) to reduce systematic distortions in microarray data. BMC Bioinform. 9, 520.
Qiu, X., Brooks, A.I., Klebanov, L., Yakovlev, N., 2005. The effects of normalization on the correlation structure of microarray data. BMC Bioinform. 6, 120.
Quackenbush, J., 2006. Microarray analysis and tumor classification. N. Engl. J. Med. 354, 2463–2472.
Rakyan, V.K., Down, T.A., Maslau, S., Andrew, T., Yang, T.P., Beyan, H., Whittaker, P., McCann, O.T., Finer, S., Valdes, A.M., Leslie, R.D., Deloukas, P., Spector, T.D., 2010. Human aging-associated DNA hypermethylation occurs preferentially at bivalent chromatin domains. Genome Res. 20, 434–439.
Reimers, M., 2010. Making informed choices about microarray data analysis. PLoS Comput. Biol. 6, e1000786.
Reverter, A., Barris, W., McWilliam, S., Byrne, K.A., Wang, Y.H., Tan, S.H., Hudson, N., Dalrymple, B.P., 2005. Validation of alternative methods of data normalization in gene co-expression studies. Bioinformatics 21, 1112–1120.
Staaf, J., Jonsson, G., Ringner, M., Vallon-Christersson, J., 2007. Normalization of array-CGH data: influence of copy number imbalances. BMC Genomics 8, 382.
Teschendorff, A.E., Menon, U., Gentry-Maharaj, A., Ramus, S.J., Weisenberger, D.J., Shen, H., Campan, M., Noushmehr, H., Bell, C.G., Maxwell, A.P., Savage, D.A., Mueller-Holzner, E., Marth, C., Kocjan, G., Gayther, S.A., Jones, A., Beck, S., Wagner, W., Laird, P.W., Jacobs, I.J., Widschwendter, M., 2010. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20, 440–446.
Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98, 5116–5121.
van de Wiel, M.A., Picard, F., van Wieringen, W.N., Ylstra, B., 2010. Preprocessing and downstream analysis of microarray DNA copy number profiles. Brief Bioinform.
Wu, Z., Aryee, M.J., 2010. Subset quantile normalization using negative control features. J. Comput. Biol. 17, 1385–1395.
Yao, C., Li, H., Zhou, C., Zhang, L., Zou, J., Guo, Z., 2010. Multi-level reproducibility of signature hubs in human interactome for breast cancer metastasis. BMC Syst. Biol. 4, 151.
Zakharkin, S.O., Kim, K., Mehta, T., Chen, L., Barnes, S., Scheirer, K.E., Parrish, R.S., Allison, D.B., Page, G.P., 2005. Sources of variation in Affymetrix microarray experiments. BMC Bioinform. 6, 214.
Zhang, M., Yao, C., Guo, Z., Zou, J., Zhang, L., Xiao, H., Wang, D., Yang, D., Gong, X., Zhu, J., Li, Y., Li, X., 2008. Apparently low reproducibility of true differential expression discoveries in microarray studies. Bioinformatics 24, 2057–2063.
Zhang, M., Zhang, L., Zou, J., Yao, C., Xiao, H., Liu, Q., Wang, J., Wang, D., Wang, C., Guo, Z., 2009. Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes. Bioinformatics 25, 1662–1668.
