February 3, 2005

NONPARAMETRIC METHODS FOR MICROARRAY DATA BASED ON EXCHANGEABILITY AND BORROWED POWER

Mei-Ling Ting Lee (1,2,3), G.A. Whitmore (4), Harry Björkbacka (5), Mason W. Freeman (2,6)

1: Department of Medicine, Brigham and Women's Hospital, Boston, USA
2: Harvard Medical School, Boston, USA
3: Biostatistics Department, Harvard School of Public Health, Boston, USA
4: McGill University, Montreal, Canada
5: Wallenberg Laboratory, University Hospital MAS, Lund University, Sweden
6: Massachusetts General Hospital, Boston, USA
Abstract: This article proposes nonparametric inference procedures for analyzing microarray gene expression data that are reliable, robust and simple to implement. They are conceptually transparent and require no special-purpose software. The analysis begins by normalizing gene expression data in a unique way. The resulting adjusted observations consist of gene-treatment interaction terms (representing differential expression) and error terms. The error terms are considered to be exchangeable, which is the only substantial assumption. Thus, under a family null hypothesis of no differential expression, the adjusted observations are exchangeable and all permutations of the observations are equally probable. The investigator may use the adjusted observations directly in a distribution-free test method or use their ranks in a rank-based method, where the ranking is taken over the whole data set. For the latter, the essential steps are as follows. (1) Calculate a Wilcoxon rank-sum difference or a corresponding Kruskal-Wallis rank statistic for each gene. (2) Randomly permute the observations and repeat the previous step. (3) Independently repeat the random permutation a suitable number of times. Under the exchangeability assumption, the permutation statistics are independent random draws from a null cumulative distribution function (c.d.f.), which is approximated by the empirical c.d.f. Reference to the empirical c.d.f. tells whether the test statistic for a gene is outlying and, hence, shows differential expression. This feature is judged by using an appropriate rejection region or computing a P-value for each test statistic, taking account of multiple testing. The distribution-free analog of the rank-based approach is also available and has parallel steps which are described in the article. The proposed nonparametric analysis tends to give good results with no additional refinement, although a few refinements are presented that may interest some investigators. The implementation is illustrated with a case application involving differential gene expression in wild-type and knockout mice in response to an E. coli lipopolysaccharide (LPS) endotoxin treatment, relative to a baseline untreated condition.

Keywords: distribution-free, exchangeable random variables, false discovery rate, gene expression, microarray, multiple testing, nonparametric methods, normalization, rank methods, SAM, statistical analysis

Corresponding Author: Mei-Ling Ting Lee, Ph.D., Channing Laboratory, BWH/HMS, 181 Longwood Avenue, Boston, MA, 02115-5804. Tel: (617)-525-2732, Fax: (617)-731-1541, Email: [email protected]
1 Introduction
Statistical methods for the analysis of microarray gene expression data have been evolving since the earliest uses of microarray technology. Growing experience and understanding of the technology and data have brought analytical techniques to an advanced stage of development. Scientists and analysts can now reasonably expect that standardization and simplification will make data analysis more routinely accessible at low cost.

This article describes nonparametric methods for analyzing microarray gene expression data that are reliable, robust and simple to implement. They derive their simplicity and strength from an assumption of exchangeable error terms and draw their power from information pooled across all genes in the array set. They employ randomization principles that are free of strong assumptions, and these lead to straightforward and dependable inferences. The proposed procedures reflect accumulated experience and best practices to date. Implementation is easily carried out with a few standard commands in almost any conventional statistical software package. The approach includes sensible steps in data preparation, takes appropriate account of the experimental design, accommodates multiple testing and requires no major investment in advanced statistical modeling or computer programming.

The analytical components of the proposed system will seem familiar and have recognizable names, such as ‘rank sum statistic’ and ‘permutation test’. These names, however, only describe the classes of techniques; their form of implementation here is new. The assembly of tools, taken together, constitutes a conceptually sound synthesis of ideas and methods. The final result is a useful and streamlined analytical system of practical value to research analysts. Use of the proposed procedures is illustrated with a case application involving differential gene expression in wild-type and knockout mice in response to an E. coli lipopolysaccharide (LPS) endotoxin treatment, relative to a baseline untreated condition.
2 Preliminary Data Preparation
Before describing our nonparametric methods, we first set the stage by describing the data and the preliminary data preparation. The discussion considers a cDNA microarray data set, but the methods can be applied with equal effect and ease to microarray data from other technology platforms. The analysis begins with transformation and normalization of the gene expression data, which prepares them for subsequent analysis. An assortment of steps for data preparation is available. Good techniques are described in Lee et al (2000, 2002a), Tseng et al (2001), Yang et al (2002), Kerr et al (2002) and Lee (2004). Our methods do not require a particular form of data preparation, as long as it is statistically and scientifically sound. To prepare for the discussion of our methods, however, we adopt a simple variant of the most common data preparation strategy.

1. Transformation. The raw intensity readings from the microarray technology are usually subjected to a mathematical transformation of either background-corrected intensity readings or foreground-intensity readings (i.e., readings without background correction). We adopt the logarithmic transformation in our case application, which is a common choice. Missing or inadmissible readings, caused by defective spots or by background intensity exceeding foreground intensity in the case of background-corrected readings, may be omitted or given imputed values, as deemed most appropriate. In the sequel, we assume that genes with missing values are omitted or that their missing values are imputed. Thus, for retained genes, we have a complete data set of gene expression readings. The transformed observations are denoted by Y_{gtj}, where g = 1, ..., G indexes the genes retained for study, t = 1, ..., T indexes the treatments, and j = 1, ..., J indexes the replicated readings on each gene-treatment combination. Thus, as missing values for retained genes are assumed to be imputed, there are a total of N = GTJ observations available for analysis.

2. Normalization. Normalization is intended to remove sources of extraneous variability from the transformed intensities. Our approach to normalization uses a special adaptation of two-stage ANOVA normalization. It must be emphasized that the reference to ANOVA here is simply in terms of correcting gene expression readings for systematic effects using least squares fitting; it entails no reliance on ANOVA-based inference. The least squares criterion can also be replaced by fitting based on other criteria, such as least absolute deviations. The two-stage approach simplifies computation, as we demonstrate shortly. This approach was first introduced by Wolfinger et al (2001) and is discussed in Lee (2004). The first stage of the two-stage technique fits an ANOVA model that contains main effects and interaction terms for all experimental factors except the gene factor, as follows.

    Y_{gtj} = (all main effects and interaction terms not involving gene g) + U^{(1)}_{gtj}    (1)

The model also includes a constant term. Here the U^{(1)}_{gtj} are the residuals of the first-stage model. These residuals correspond to transformed intensities that are centered (i.e., averaged to zero) for all combinations of experimental factors in the model except those involving genes. The factors taken into account should include array slide, dye and experimental treatment (i.e., specimen type or experimental condition), as well as others such as pin tip, operator, and other potential sources of experimental variability.
The second stage of the two-stage technique is fitted to the residuals U^{(1)}_{gtj} from the first stage. The fitting of the second-stage model is done one gene at a time. This strategy eases the computational burden and is the main benefit of using a two-stage approach. The effects usually included in the second-stage ANOVA for each gene are the main effect for that gene and all gene-factor interaction terms that are considered relevant. The gene-factor interactions taken into account in the second stage might include, for example, gene-dye, gene-array, gene-treatment and similar interactions. We modify the two-stage approach with one important exception: the estimated gene-treatment interaction terms, denoted here by Î_{gt}, t = 1, ..., T, are not included in the fitted model. Thus, the following modified model is fitted for each gene g.

    U^{(1)}_{gtj} = (gene main effect and all gene interactions except Î_{gt}) + U^{(2)}_{gtj}    (2)

Here the U^{(2)}_{gtj} are the residuals of the modified second-stage model. These residuals are transformed intensities that have been centered for the levels of all experimental factors except the gene-treatment interaction. We have not included the gene-treatment interaction terms Î_{gt} in the second stage of normalization (the exception mentioned earlier) for a reason that is unique to our new analytical approach and will be explained later. Of course, the gene-treatment interactions are the quantities of central scientific interest in a microarray study and will be taken into consideration subsequently. To simplify the subsequent notation, we will write the second-stage residuals U^{(2)}_{gtj} as U_{gtj}. The U_{gtj} are the normalized observations that enter our nonparametric analytical methods.
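To make the two-stage computation concrete, the following sketch shows one way it could be scripted. It is an illustration under stated assumptions only: the transformed readings are assumed to sit in a long-format table with illustrative column names (Y for log-intensity, gene, treatment, dye, array); the first stage fits main effects only (interaction terms among these factors can be added as appropriate); and the second stage removes only a gene main effect and a gene-dye interaction, as in the case application described later. It is not the implementation used in this article, which was written in Stata.

```python
import pandas as pd
import statsmodels.formula.api as smf

def two_stage_normalize(df):
    """Two-stage normalization sketch.  df is a long-format table with one row
    per reading and (illustrative) columns Y, gene, treatment, dye, array."""
    # Stage 1: remove effects that do not involve the gene factor; the model
    # here has main effects only, but interactions among these factors could
    # be added to the formula if the design calls for them.
    stage1 = smf.ols("Y ~ C(treatment) + C(dye) + C(array)", data=df).fit()
    df = df.assign(U1=stage1.resid)

    # Stage 2: gene by gene, remove the gene main effect and the relevant
    # gene-factor interactions EXCEPT the gene-treatment interaction.  Fitting
    # "U1 ~ C(dye)" within a gene removes the gene main effect (the intercept)
    # and the gene-dye interaction (the within-gene dye effect).
    def per_gene(sub):
        fit = smf.ols("U1 ~ C(dye)", data=sub).fit()
        return sub.assign(U=fit.resid)

    return df.groupby("gene", group_keys=False).apply(per_gene)

# normalized = two_stage_normalize(df)   # column 'U' holds the U_{gtj} of equation (3)
```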
3 Exchangeable Error Model
The normalized observations U_{gtj}, coming out of the modified second-stage ANOVA, have the following statistical model.

    U_{gtj} = I_{gt} + ε_{gtj}    (3)

The only systematic effects that may be present are the gene-treatment interactions I_{gt}. The ε_{gtj} are error terms. The key assumption we make is that the error terms ε_{gtj} are exchangeable across all genes g, treatments t and replicates j. Exchangeability means that the error terms have a joint distribution which is invariant under any permutation of its arguments. The assumption implies that the error terms are probabilistic replicas of each other and any rearrangement of them is uninformative. This assumption is a new feature of our approach and has significant consequences. We return subsequently to examine the reasonableness of this assumption more closely, to consider diagnostic tests for the property and to explore potential remedies where the assumption might not hold.

The null hypothesis of interest for each gene g is

    H_0: I_{gt} = 0,   t = 1, ..., T.    (4)
The hypothesis states that the mean expression intensity for gene g does not differ across the T treatments. Subsequently, we denote the full gene set by G and the subset for which H_0 holds by G_0, with cardinalities G and G_0 ≤ G, respectively. We refer to the set of hypotheses for which H_0 is true as the family of true null hypotheses.
As the error terms in model (3) are assumed to be exchangeable, it follows that the observations U_{gtj} are realizations of exchangeable random variables for all genes g for which H_0 is true. Although the U_{gtj} will be exchangeable under these conditions, they are not generally independent, because normalization forces them to adhere to mathematical constraints. For example, normalization at the second stage implies that the U_{gtj} must sum to zero across each gene and across the factor levels of each gene-factor interaction. Because all observations U_{gtj} are exchangeable for g ∈ G_0, it follows that all permutations of these observations across the set G_0 are equally probable.
4 Choice of Test Statistic
We adopt a nonparametric approach to inference and offer the analyst a choice of the type of nonparametric statistic to be used for testing H_0. The choice is between using the U_{gtj} observations directly in a distribution-free fashion or using their ranks. Both choices are described next. With either choice, we denote the test statistics for the genes by V_g^*, g = 1, ..., G.

1. Distribution-free Method Without Ranking: If there are two treatment levels (T = 2), we calculate a difference of treatment sums or means, as follows.

    V_g^* = Σ_{j=1}^{J} U_{g2j} − Σ_{j=1}^{J} U_{g1j} = J (Ū_{g2} − Ū_{g1})    (5)

Here the averages Ū_{gt}, t = 1, 2, are taken over subscript j. If there are multiple treatment levels (T ≥ 2), we calculate a treatment mean square statistic (MST) in the standard fashion, as follows.

    V_g^* = MST_g = [J/(T − 1)] Σ_{t=1}^{T} (Ū_{gt} − Ū_{g})²    (6)

Here the averages Ū_{gt} and Ū_{g} are calculated over the subscript j and the subscripts (t, j), respectively.

2. Rank-based Method: With this method, we first obtain the ranks R_{gtj} = rank(U_{gtj}). It is important to note that the ranking is over the whole data set. Thus, the R_{gtj} take on the natural numbers {1, 2, ..., N}, where, as defined earlier, N = GTJ is the total number of observations for the gene set. We assume the ranking is from smallest (most negative) to largest (most positive). The ranks are used to calculate a test statistic V_g^* for each gene g. If there are two treatment levels (T = 2), we calculate a Wilcoxon difference of rank sums, using the ranks R_{gtj} for each gene. If there are multiple treatment levels (T ≥ 2),
we calculate a Kruskal-Wallis rank statistic, using the ranks R_{gtj} for each gene. In either case, the computational formulas correspond to the test statistics in (5) and (6), but now with the ranks R_{gtj} used in lieu of the normalized observations U_{gtj}. As the observations U_{gtj} are realizations of exchangeable random variables for all genes g for which H_0 is true, it follows that all test statistics V_g^*, g = 1, ..., G_0, are exchangeable, whether they are derived under the distribution-free method or the rank-based method. This fact allows us to proceed with permutation testing. Ranking over the whole data set may seem novel, even odd, but the technique gives us the borrowing power that we need for sensitive results, especially when there is little redundancy or replication at the gene level. As we have already pointed out, the global ranking is justified under the exchangeability assumption.
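As an illustration of the statistics just described, the following sketch computes V_g^* for a complete, balanced data set. The array shape and function name are ours, not part of the method's specification; the rank-based branch ranks all N = GTJ values jointly, as required above.

```python
import numpy as np
from scipy.stats import rankdata

def test_statistics(U, use_ranks=True):
    """U: array of shape (G, T, J) holding the normalized observations U_gtj.
    Returns one statistic V*_g per gene: a difference of sums (equation (5),
    Wilcoxon-type when ranked) for T = 2, and the treatment mean square
    (equation (6), Kruskal-Wallis analogue when ranked) for T > 2."""
    G, T, J = U.shape
    # Global ranks 1..N when requested, otherwise the raw normalized values.
    X = rankdata(U.reshape(-1)).reshape(G, T, J) if use_ranks else U
    if T == 2:
        return X[:, 1, :].sum(axis=1) - X[:, 0, :].sum(axis=1)
    treatment_means = X.mean(axis=2)          # averages over j
    gene_means = X.mean(axis=(1, 2))          # averages over (t, j)
    return (J / (T - 1)) * ((treatment_means - gene_means[:, None]) ** 2).sum(axis=1)
```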
5 Permutation Tests
We now apply permutation testing to our differential gene expression statistics V_g^*. For general discussions of permutation tests, refer to Westfall and Young (1993) and Good (2000). For issues relating to permutation tests for microarray data, see Dudoit et al (2002) and Lee (2004). The permutation tests of the statistics V_g^* proceed in the same basic way whether these are based on the normalized observations U_{gtj} or their ranks R_{gtj}. In anticipation of multiple testing, we postulate initially that H_0 holds true for all genes, so that the set G_0 coincides with the full gene set and, therefore, the cardinalities are equal (G_0 = G). We subsequently describe how this requirement can be relaxed in implementing the testing methodology. The permutation test approach involves three simple steps:

1. Randomly permute the full set of observations U_{gtj} or their ranks R_{gtj}, as the case may be. Re-calculate the test statistic for each gene g. We denote this re-calculated statistic by V_g^{(b)} for the bth permutation. Independently repeat the permutation B times. The repetition gives the sets of statistics {V_g^{(b)}, g = 1, ..., G} for b = 1, ..., B. Under the exchangeability assumption for the family of null hypotheses, the statistics {V_g^{(b)}, g = 1, ..., G} are exchangeable and the B sets of these statistics are independent random samples from a null cumulative distribution function (c.d.f.) F_0(v). The c.d.f. F_0(v) is approximated by the empirical c.d.f. F̄_0(v) of the permutation statistics V_g^{(b)}, g = 1, ..., G, b = 1, ..., B.

2. By reference to the empirical c.d.f. F̄_0(v) computed in step (1), we decide whether the test statistic V_g^* for gene g is outlying and, hence, signals a differentially expressed gene (rejecting H_0). This decision may be taken by using an appropriate rejection region or by computing a P-value for each test statistic V_g^*. This step gives a list of differentially expressed genes.

3. Errors of inference and the false discovery rate (FDR) are evaluated and controlled by counting the frequency of occurrence of outlying test statistics V_g^{(b)} among the randomly permuted data sets. This step takes proper account of the multiple testing that is inherent in the methodology.

The false discovery rate can be estimated directly from the permutation test results by comparing the count of permuted test statistics V_g^{(b)} that lie in the rejection region (step 3) with the count of genes that are declared as differentially expressed (step 2). The case application shows how the three steps are implemented, as well as the calculation of the FDR. See Benjamini and Hochberg (1995) for an explanation of the FDR.
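The three steps, together with the FDR estimate just described, can be sketched as follows. The function below is an illustration only: it accepts any per-gene statistic function (for example, the test_statistics sketch of Section 4) and a rejection cutoff, and the variable names are ours rather than part of the method.

```python
import numpy as np

def permutation_analysis(U, stat_fn, cutoff, B=300, seed=1):
    """Permutation-test sketch.  U has shape (G, T, J); stat_fn maps such an
    array to one statistic per gene; genes with |V*_g| > cutoff are declared."""
    rng = np.random.default_rng(seed)
    G, T, J = U.shape
    v_obs = stat_fn(U)                                   # observed V*_g
    declared = np.flatnonzero(np.abs(v_obs) > cutoff)    # step 2: declared genes

    false_positive_counts = np.empty(B)
    for b in range(B):                                   # step 1: B global permutations
        perm = rng.permutation(U.reshape(-1)).reshape(G, T, J)
        v_perm = stat_fn(perm)                           # V_g^(b), exchangeable under H0
        false_positive_counts[b] = np.sum(np.abs(v_perm) > cutoff)

    # Step 3: expected false positives per list and the estimated FDR
    # (in the case application, 1749/300 = 5.83 and 5.83/112 = 0.052).
    expected_false_positives = false_positive_counts.mean()
    fdr = expected_false_positives / max(len(declared), 1)
    return declared, expected_false_positives, fdr
```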
6 Improved Error Control Through Test Iteration
The analytical system described in the preceding section is sufficiently powerful to give a reliable list of the most dominant differentially expressed genes with no additional computation or analysis. For those investigators who wish to refine the results, however, the following test iteration might be beneficial.

The basic methodology assumes that no gene is differentially expressed under the family of null hypotheses. As some genes are generally expected to be differentially expressed (so G_0 < G), a step-wise testing algorithm can be employed to make the methodology less conservative. The essential logic of such an algorithm is as follows. If one or more genes are declared to be differentially expressed when the basic test methodology described earlier is applied, then these can be removed from the gene set and the permutation test repeated. In essence, if H_0 is rejected for a gene g, then the inference implies that its observations U_{gtj} are no longer exchangeable with those for which H_0 is true. The empirical c.d.f. F̄_0(v) that is estimated when such declared genes are removed will generally have a little less spread in the tail regions and, hence, some additional genes may have P-values that are smaller than the specified threshold. Several iterations of this step-wise approach may be applied until the list of genes that are declared as differentially expressed has a stable membership. Our experience shows that one iterative step is usually enough to achieve a refined final list. As would be expected, we generally find that the iteration adds only a few genes to the initial list of declared genes and these few are marginally significant.
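A sketch of this step-wise refinement is given below. It is illustrative only: genes are declared by a permutation P-value threshold (the default threshold shown is arbitrary), the empirical null is rebuilt from the genes not yet declared, and the loop stops once no new genes are added. Here stat_fn is again any per-gene statistic function, such as the one sketched in Section 4.

```python
import numpy as np

def iterative_declarations(U, stat_fn, p_threshold=0.001, B=300, seed=1, max_iter=5):
    """Step-wise refinement sketch: rebuild the empirical null from the genes
    not yet declared and re-test until the declared list stops growing."""
    rng = np.random.default_rng(seed)
    G, T, J = U.shape
    v_obs = np.abs(stat_fn(U))
    declared = set()
    for _ in range(max_iter):
        keep = np.array(sorted(set(range(G)) - declared))
        null_values = []
        for _ in range(B):                 # permutations over the retained genes only
            perm = rng.permutation(U[keep].reshape(-1)).reshape(len(keep), T, J)
            null_values.append(np.abs(stat_fn(perm)))
        null_values = np.concatenate(null_values)
        # Permutation P-values of the retained genes against the refined null.
        p_values = np.array([(null_values >= v_obs[g]).mean() for g in keep])
        newly_declared = set(keep[p_values <= p_threshold])
        if not newly_declared:             # membership has stabilized
            break
        declared |= newly_declared
    return sorted(declared)
```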
7 Case Application
We now demonstrate the proposed analytical system using a case application. The application considers differential gene expression in wild-type and knockout mice in response to an E. coli lipopolysaccharide (LPS) endotoxin treatment, relative to a baseline untreated condition (control). Thus, differential expression here means that the extent of up- or down-regulation of a gene because of LPS treatment differs between wild-type and knockout mice. The LPS experiment was done by treating macrophages isolated from the two genotypes.

Design and preliminary data preparation. The experimental design for this cDNA study is a 2 × 2 reverse-color Latin square, as shown in Table 1. Two arrays were prepared for each mouse. The first array had the treatment sample (LPS) read on the red channel and the control sample (None) read on the green channel. The second array had the
color channels for treatment and control reversed. Only the 3498 genes with a complete set of readings are considered in the analysis reported here; thus, G = 3498. Four wild-type mice and four knockout mice were used. A complete sample for a gene therefore consists of 8(4) = 32 intensity readings. The intensity readings were reduced to 16 intensity ratios by taking the treatment intensity over the control intensity on the same spot. The intensity ratios were transformed to logarithms with base 2. The analysis was done with both background-adjusted and unadjusted intensities. In the latter case, the intensity ratio is the ratio of foreground intensities. The results reported here are for foreground-intensity ratios.

The log-intensity ratios were normalized across all genes by centering their values for each level of the treatment, dye and slide factors (the first-stage ANOVA step). The readings were also centered for each gene and gene-dye combination (the second-stage ANOVA step). The gene-dye interaction removes the dye-order effect from the pair of intensity ratios calculated for each mouse. The resulting normalized log-intensities correspond to the U_{gtj} in (3).

It is widely recognized that it is important to distinguish between biological and technical replicates, and our analysis here has taken this distinction into account. The two reverse-color arrays prepared for each mouse are the technical replicates. These two arrays yielded four intensity readings for each gene, which we converted to two intensity ratios, one for the spot on each array. The two ratios are comparable except for dye-order. Our second-stage ANOVA normalization included a gene-dye interaction, and this interaction takes account of the dye-order effect at the gene level. These preliminary adjustments absorb effects attributable to technical and biological variability and leave residuals that have an exchangeable error structure. For situations where questions may remain about the exchangeability assumption, we discuss checking procedures and remedies that can be applied in the next section.

Computing test statistics, permutation testing and study findings. For the rank-based method, a rejection region defined by differences of rank sums outside of the interval ±200,000 was used for declaring a gene as differentially expressed. We explain the reason for choosing this cutoff momentarily. A list of 112 genes (not presented here) was declared as differentially expressed based on this statistical test. Negative values of the test statistic (82 genes) indicate genes that are up-regulated in wild-type mice while positive values (30 genes) indicate up-regulation in knockout mice. For the distribution-free method, the rejection region was defined by differences of normalized log-intensity sums outside the interval ±4.5. With this method a list of 70 genes (also not presented here) was declared as differentially expressed. The sign of the difference statistic is interpreted as in the case of the rank-sum difference. This list shows 52 genes up-regulated in wild-type mice and 18 up-regulated in knockout mice.

We now assess the control of inferential errors that has been achieved in this analysis. The permutation procedure described in the basic analytical system was repeated B = 300 times. For the rank-based method, the 300 repeats generated 1749 false positives, which
implies that the list of 112 declared genes has an expected count of 1749/300 = 5.83 false positives. The corresponding estimate of the FDR is 5.83/112 = 0.052. For the distribution-free method, the 300 repeats generated 1003 false positives, which implies that the list of 70 declared genes has an expected count of 1003/300 = 3.3 false positives. The corresponding estimate of the FDR is 3.3/70 = 0.048. Observe that the chosen rejection regions for the two methods give roughly equivalent controls on the FDR, namely, about 5 percent for each method.

An analyst may wish to examine the response of the number of positives and the FDR to changes in the cutoff level. Table 2 illustrates this responsiveness for the rank-based method in this case application. The table shows the average count of false positives, the count of positives and the FDR for a selection of cutoff levels ranging from ±160,000 to ±300,000. The results reported above for the cutoff of ±200,000 appear in the third row of the table. As expected, as the rejection criterion becomes more stringent (i.e., the cutoff level is increased) the false positive rate declines until almost all positives are likely to be true positives. Table 2 also shows the division of positives between genes up-regulated in wild-type and knockout mice. The false positives are not divided accordingly but easily could be; they would divide almost equally between the two types of mice. If this were done, the FDR could be separated into components representing false discoveries of up-regulation in each type of mouse. We choose not to show this detail.

A comparison of the gene lists generated from the distribution-free and rank-based statistics shows considerable correspondence but also a few differences. Clearly the lengths of the lists differ but, in addition, the lists vary in the top positions. Table 3 shows the top ten genes from the list for the rank-based method, selected using the absolute value of their test statistics. All of these ten happen to have negative rank-sum differences, indicating up-regulation in the wild-type mouse. Table 4 shows the top ten genes from the list for the distribution-free method. A comparison of the lists in Tables 3 and 4 shows that the top ten positions have seven genes in common, which illustrates that the two methods yield reasonably consistent but not identical results. A similar look at the top 10 genes in the knockout up-regulated group (the largest positive statistics) shows the two methods have six genes in common (not displayed here). Some variability in such lists is to be expected because the ranks R_{gtj} reflect only the rank order of the residuals U_{gtj} and not their actual values. Top-ten lists like those shown in Tables 3 and 4 may suggest greater inconsistency between the methods than is actually the case. Here, for example, one need only look down the rank-based gene list to the 11th, 12th and 17th positions to find the three genes which appear in Table 4 but not in Table 3.

Figure 1 presents descriptions and functions for the thirteen distinct genes listed in Tables 3 and 4. Most of the genes mediate host inflammatory responses in innate immunity. They include several well-characterized inflammatory cytokines, such as IL-1beta, IL-1alpha and TNF-alpha, as well as several important chemokines, such as KC, MIP2, JE and MIP-1alpha, all of which are known to be critical for host defense against
invading pathogens; see Abbas et al (2000). For M15131 (IL-1beta) and X02611 (TNF-alpha), qPCR and ELISA data confirm the differential up-regulation of expression in the wild-type mouse, relative to the knockout mouse, in response to LPS treatment.

The discovery of differentially expressed genes suggests that these might now be removed from the family of genes for which H_0 is assumed to hold. If this is done, the permutation analysis can be repeated (as described in Section 6) to obtain a more refined estimate of the empirical c.d.f. F̄_0(v). This re-analysis is not fully reported here, but we note for the rank-sum method that the re-analysis using cutoff ±200,000 shrinks the estimate of the expected count of false positives in the list of 112 declared genes to 4.71 from 5.83. The change is not dramatic but may be considered worth capturing for a final refined analysis. Note that the rejection region was left unchanged for this re-analysis, so the list of declared genes is unaltered.

Comparison with the SAM software. To give the reader a sense of how our methodology compares with other approaches that have been proposed for microarray data, we have processed the data set using the SAM (Significance Analysis of Microarrays) software package (Tusher et al, 2001). We used the two-class data format option in SAM, with eight observations for each type of mouse (16 in total). The number of permutations option was set at 300. A fragment of the SAM output appears in Table 5. The table shows the gene identifier, score statistic d, numerator r and denominator s + s0 for the ten genes having the highest absolute scores. Note that d = r/(s + s0). The top-scoring genes are all up-regulated in the wild-type mouse (as was the case in Tables 3 and 4). Not unexpectedly, the results in Table 5 match those in Tables 3 and 4 closely. Nine of the 10 genes in Table 3 are found in the top 10 list from SAM. The 10th gene (X61800) in Table 3 appears in the 11th position in the SAM output (not shown). Eight of the 10 genes in Table 4 are found in the top 10 list from SAM. The 8th and 9th genes (X70058 and M19681) appear in the 12th and 28th positions in the SAM output.

Our method based on the difference of normalized log-intensity sums corresponds exactly to selecting genes according to their r values in the SAM output. In fact, the log-intensity sums in Table 4 are simply eight times the values of r for the matching genes in Table 5; ‘eight’ here is the number of log-intensity ratios for each type of mouse. For gene M15131, for example, r = −4.50049225 and the matching log-intensity sum in Table 4 is 8(−4.50049225) = −36.00394. Gene M19681, which appears at the 28th position in the SAM output, receives a relatively lower d score because its standard deviation s in the denominator happens to be comparatively large. We note that the score statistics d in SAM will give an identical gene list to our method for differences of normalized log-intensity sums if the exchangeability factor s0 in SAM is set to a very large value. Interestingly, our rank sum statistics can also be computed using SAM by replacing the normalized intensities by their ranks, where the ranking is over the entire data set. Our rank sums in Table 3 are eight times the r statistics that would come out of SAM if the input data were ranks.
Although SAM output can be used to compute our test statistics, our inference methodology, including the estimation of the number of false positives and the false discovery rate, does not correspond to that in SAM. The intrinsic logic of our permutation procedure differs from that in SAM, and the test statistics that we permute also differ.
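The numerical relationships noted in the SAM comparison are easy to verify from the published tables; the following lines simply re-do that arithmetic for gene M15131 using the values in Tables 4 and 5.

```python
# Gene M15131: SAM numerator r and denominator s + s0 from Table 5.
r, s_plus_s0 = -4.50049225, 0.416217188

print(r / s_plus_s0)   # -10.8128...  matches the SAM score d in Table 5
print(8 * r)           # -36.0039...  matches the log-intensity sum in Table 4
```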
8 Additional Remarks
The discussion of preliminary data preparation has stated that the analyst must choose between using foreground intensities or background-corrected intensities from cDNA arrays. The case application uses foreground intensities. There is some evidence that background correction may, in fact, be counterproductive. Background intensities may retain some signal and, hence, some linear combination of foreground and background intensities may be more informative than either measure taken on its own. A similar issue arises with data from oligonucleotide arrays. See Lee (2004: 63-64) and references cited there for more information on this point. Our methodology may be employed with any combination of foreground and background intensities that is chosen as the response measure. Also, in relation to preliminary data preparation, we chose not to carry out any intensity-dependent normalization because of findings reported in Lee and Whitmore (2004), who argue that it is not suitable.

The assumption of exchangeability of the error terms ε_{gtj} is the only substantive assumption required by this proposed analytical system. The assumption can be checked by examining the transformed and normalized readings U_{gtj} for those genes that are concluded to be not differentially expressed, i.e., those genes g declared to be in set G_0. As exchangeability requires that these readings be probabilistic replicas, the check can take the form of comparisons of empirical distributions of the U_{gtj} for different cross-classifications of the data. For example, a cross-classification by array should show the same distribution pattern for each array. As the readings are centered on zero, the absolute values |U_{gtj}| should also be exchangeable for all g ∈ G_0. A comparison of the empirical distributions of these absolute values under different cross-classifications would highlight potential differences in variability of the ε_{gtj}. Remedial actions based on transformations and corrective adjustments can be applied if the assumption of exchangeability is found wanting.

Our presentation has stressed the benefits of borrowing information strength from other genes when making an inference about any single gene. These benefits are widely recognized in the field of microarray data analysis. The benefit is especially clear when considering the possible use of t- and F-test statistics at the individual gene level. These statistics have gene-level mean square errors (MSE) in their denominators. In microarray studies, these MSEs often have few or no degrees of freedom and, hence, may be unstable or undefined. Moreover, some inferential methods also require related assumptions of normality and independent error terms, which may be untenable. Our error exchangeability assumption implies that the entire collection of residuals provides information about the inherent variability that all genes have in common. Exchangeability is a weaker assumption than independence, and normality is not required. The methodology in the SAM software and our proposed methodology both rely on borrowing power from other genes and, in part, do this borrowing in the form of a shrinkage estimator for the MSE. In SAM, the exchangeability factor s0 is determined from observations for all genes while, in our method, the empirical c.d.f. F̄_0(v) provides the basis for judging significance and is derived from the whole family of test statistics V_g^*. See Lee (2004: 130-134) for a more detailed discussion of these issues.

The analytical system here has been presented only in terms of conducting significance tests for the presence or absence of differential gene expression. Where an investigator wishes to know the extent of differential expression of a gene, expressed as a log-difference or fold-change, such an enquiry is easily accommodated. The query is basically one of checking whether a particular value of I_{gt} is plausible. The methods presented here can be adapted easily to construct confidence intervals for such parameter estimates. The analyst need only adjust the initial values of U_{gtj} obtained for gene g by subtracting anticipated values of I_{gt}, i.e., revise U_{gtj} to U_{gtj} − I_{gt}, and then redo the analysis. The interval of values L ≤ I_{gt} ≤ U in this adjustment that do not lead to a conclusion of differential expression for gene g constitutes the desired confidence interval.

Our implementation of the methodology in this case application has employed the Stata software package. Our Stata program for normalizing the data, computing the test statistics (of either type) and conducting the permutation testing has fewer than 100 command lines (a 3k text file). The execution time is also very fast. Missing values (including their imputation) need to be handled separately, in advance of applying our methodology. The reader is referred to Lee (2004: 85-91) for some suggested approaches to this issue. An important advantage of our slim computational package is that it allows the user to incorporate customized analytical features with only a small investment of effort and modest programming expertise.

In summary, our proposed analytical system offers several advantages. First, it is easy to apply. The proposed procedures can be implemented using any standard commercial statistical software package with no special subroutines. Second, its intrinsic logic is transparent. Third, it has a flexible structure that accommodates a great range of microarray study designs and technologies. Fourth, the underlying inferences, being nonparametric, do not rely on strong distributional assumptions. Hence, they are relatively robust while remaining quite efficient.

Acknowledgements: This research was supported in part by NIH grants HG02510, DK63665, HL72358 (Lee), HL66678, HL72358 (Freeman), the Swedish Research Council (Björkbacka) and a grant from the Natural Sciences and Engineering Research Council of Canada (Whitmore). The work was partially completed while the authors (Lee, Whitmore) were visiting the Institute for Mathematical Sciences, National University of Singapore in 2004. The visit was supported by the Institute and by a grant (No. 01/1/21/19/217) from BMRC-A*STAR of Singapore.

Cited References

1. Abbas, A. K., Lichtman, A. H. and Pober, J. S. (2000). Cellular and Molecular Immunology, 4th ed., W. B. Saunders Company, New York, USA. ISBN 0-7216-8233-2.

2. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, B, 57, 289-300.
3. Dudoit, S., Yang, Y. H., Callow, M. J. and Speed, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica, 12, 111-139.

4. Good, P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, 2nd edition, Springer-Verlag, New York.

5. Kerr, M. K., Afshari, C. A., Bennett, L., Bushel, P., Martinez, J., Walker, N. J. and Churchill, G. A. (2002). Statistical analysis of a gene expression microarray experiment with replication, Statistica Sinica, 12, 203-218.

6. Lee, M.-L. T., Kuo, F. C., Whitmore, G. A. and Sklar, J. L. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations, Proceedings of the National Academy of Sciences, 97, 9834-9839.

7. Lee, M.-L. T., Lu, W., Whitmore, G. A. and Beier, D. (2002a). Models for microarray gene expression data, Journal of Biopharmaceutical Statistics, 12(1), 1-19.

8. Lee, M.-L. T. and Whitmore, G. A. (2002b). Power and sample size for DNA microarray studies, Statistics in Medicine, 21, 3543-3570.

9. Lee, M.-L. T. (2004). Analysis of Microarray Gene Expression Data, Kluwer Academic Publishers. ISBN 0-7923-7087-2.

10. Lee, M.-L. T. and Whitmore, G. A. (2004). Intensity dependent normalization in microarray analysis: a note of concern, Bernoulli, 10, 1-7.

11. Tseng, G. C., Oh, M.-K., Rohlin, L., Liao, J. C. and Wong, W. H. (2001). Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variation and assessment of gene effects, Nucleic Acids Research, 29, 2549-2557.

12. Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences, USA, 98, 5116-5121.

13. Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment, John Wiley & Sons, New York.

14. Wolfinger, R. D., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C. and Paules, R. S. (2001). Assessing gene significance from cDNA microarray gene expression data via mixed models, Journal of Computational Biology, 8, 625-637.

15. Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J. and Speed, T. P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Research, 30(4), e15.
Table 1. The 2 × 2 reverse-color latin square design used in the case study. The design is repeated for four wild-type mice, four knockout mice and G = 3498 genes. ‘LPS’ refers to treatment; ‘None’ refers to control.
                 Array
  Dye        1st        2nd
  Red        LPS        None
  Green      None       LPS
Table 2. Estimated FDR as a function of the cutoff level used for the rank-sum difference in the case application. Average counts of false positives per permutation are derived from B = 300 permutations. The total counts of positives are divided between those genes that are up-regulated in wild-type (WT) and knockout (KO) mice. FDR equals column 2 divided by column 5.

  Test Statistic   Average Count of   Positives:   Positives:   Positives:   Estimated
  Cutoff           False Positives    Up-reg. WT   Up-reg. KO   Total        FDR
  (1)              (2)                (3)          (4)          (5)          (6)
  ±160,000         43.75              128          100          228          0.192
  ±180,000         16.53               97           47          144          0.115
  ±200,000          5.83               82           30          112          0.052
  ±220,000          1.78               65           19           84          0.021
  ±240,000          0.48               53           12           65          0.007
  ±260,000          0.11               42            7           49          0.002
  ±280,000          0.03               34            5           39          0.001
  ±300,000          0.01               31            5           36          0.000
Table 3. Ten genes with greatest differential expression based on the difference of rank sums. All ten genes are up-regulated in the wild-type mouse.
       Gene Identifier   Rank Sum
   1   M15131            -447422
   2   M73061            -446426
   3   AB047391          -445740
   4   X53798            -441115
   5   NM_026594         -440162
   6   NM_011336         -432329
   7   X62701            -428365
   8   J04596            -425155
   9   X01450            -420369
  10   X61800            -412433
Table 4. Ten genes with greatest differential expression based on the difference of normalized log-intensity sums. All ten genes are up-regulated in the wild-type mouse.
       Gene Identifier   Log-intensity Sum
   1   M15131            -36.00394
   2   NM_026594         -34.13284
   3   M73061            -25.51745
   4   X53798            -23.13819
   5   X01450            -19.57546
   6   AB047391          -19.25631
   7   J04596            -18.32873
   8   X70058            -17.86110
   9   M19681            -17.60596
  10   X02611            -16.00998
Table 5. Ten genes with the largest d scores from the SAM software package. All ten genes are up-regulated in the wild-type mouse.
       Gene Identifier   Score d         Numerator r     Denominator s + s0
   1   AB047391          -12.49664609    -2.4070384      0.192614753
   2   M15131            -10.81284575    -4.50049225     0.416217188
   3   M73061            -10.04084346    -3.189681525    0.317670676
   4   X53798            -7.955424066    -2.892273425    0.363559931
   5   X62701            -6.573814936    -1.569125613    0.238693305
   6   NM_011336         -6.493496579    -1.512345513    0.232901565
   7   NM_026594         -6.452448687    -4.266604963    0.661238108
   8   J04596            -5.878272546    -2.291091513    0.389755918
   9   X01450            -5.816948536    -2.446932338    0.420655662
  10   X02611            -5.649918582    -2.001247638    0.354208226
Figure 1: Descriptions and functions of the thirteen genes listed in Tables 3 and 4.