Resampling-based Multiple Testing with Applications to Microarray Data Analysis

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Dongmei Li, B.A., M.S.

The Ohio State University
2009

Dissertation Committee:

Dr. Jason C. Hsu, Adviser
Dr. Elizabeth Stasny
Dr. William Notz
Dr. Steve MacEachern

Approved by

Adviser
Graduate Program in Biostatistics
The Ohio State University

© Copyright by Dongmei Li, 2009

ABSTRACT

In microarray data analysis, resampling methods are widely used to discover significantly differentially expressed genes under different biological conditions when the distributions of test statistics are unknown. When the sample size is small, however, simultaneous testing of thousands, or even millions, of null hypotheses in microarray data analysis brings challenges to the multiple hypothesis testing field. We study the small sample behavior of three commonly used resampling methods in multiple hypothesis testing: permutation tests, post-pivot resampling methods, and pre-pivot resampling methods. We show that the model-based pre-pivot resampling methods have the largest maximum number of unique resampled test statistic values, and therefore tend to produce more reliable P-values than the other two resampling methods. To avoid problems in the practical application of the three resampling methods, we propose new conditions, based on the Partitioning Principle, for controlling multiple testing error rates in fixed-effects general linear models. Meanwhile, from both theoretical results and simulation studies, we show the discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation in the Significance Analysis of Microarrays (SAM) procedure. Moreover, we give conditions for the permutation-based SAM procedure to control the expected number of false rejections. We also propose a more powerful adaptive two-step procedure that controls the expected number of false rejections with larger critical values than the Bonferroni procedure.

This is dedicated to my dear husband Zidian Xie, my cute daughter Catherine Xie, my cute son Matthew Xie, and my dear parents.


ACKNOWLEDGMENTS

I would like to express my heartfelt gratitude to my adviser, Professor Jason C. Hsu, for his encouragement, constant guidance, and extreme patience. Without his advice, it would have been impossible for me to finish this dissertation. A special thanks goes to Professor Elizabeth Stasny, Graduate Studies Chair in Statistics, who carefully proofread my papers and gave me a great deal of help during my Ph.D. study. I would also like to thank my other committee members, Professor William Notz and Professor Steve MacEachern, for their thoughtful questions and advice. I am enormously grateful to my parents, my husband, and my kids for their support and love, especially my husband Zidian Xie, who always supports me whenever I need him.


VITA

1998 . . . . . . . . . . . . . . . B.A. Pomology, Laiyang Agriculture College, China
2001 . . . . . . . . . . . . . . . M.S. Biophysics, China Agriculture University, China
2006 . . . . . . . . . . . . . . . M.S. Statistics, The Ohio State University, U.S.A.
2001-present . . . . . . . . . . . Graduate Teaching and Research Associate, The Ohio State University

PUBLICATIONS

Research Publications

Violeta Calian, Dongmei Li, and Jason C. Hsu. Partitioning to Uncover Conditions for Permutation Tests to Control Multiple Testing Error Rates. Biometrical Journal, 50(5): 756-766, 2008. DOI: 10.1002/bimj.200710471.

FIELDS OF STUDY

Major Field: Biostatistics


TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Multiple hypotheses testing and resampling methods
   1.1 Multiple hypotheses testing
       1.1.1 Introduction
       1.1.2 Two definitions of Type I error rate
       1.1.3 Familywise Error Rate (FWER)
       1.1.4 False Discovery Rate (FDR)
       1.1.5 Multiple testing principles
   1.2 Resampling methods
       1.2.1 Permutation tests
       1.2.2 Bootstrap methods

2. Small sample behavior of resampling methods
   2.1 Tomato microarray example
   2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method
       2.2.1 Conditions for getting adjusted P-values of zero with a sample size of two
       2.2.2 Conditions for getting adjusted P-values of zero with a sample size of three
   2.3 Conditions for getting adjusted P-values of zero using the pre-pivot resampling method
   2.4 Discreteness of resampled test statistics' distributions
       2.4.1 Paired samples
       2.4.2 Two independent samples
       2.4.3 Multiple independent samples
       2.4.4 General linear mixed-effects models

3. Conditions for resampling methods to control multiple testing error rates
   3.1 Two-group comparison
       3.1.1 Permutation tests
       3.1.2 Post-pivot resampling method
       3.1.3 Pre-pivot resampling method
   3.2 Fixed-effects general linear model
   3.3 Estimating the test statistic's null distribution
       3.3.1 Permutation tests
       3.3.2 Pre-pivot resampling method
       3.3.3 Post-pivot resampling method
   3.4 Estimating critical values for strong control of FWER
       3.4.1 Permutation tests
       3.4.2 Pre-pivot resampling method
       3.4.3 Post-pivot resampling method
   3.5 Shortcuts of partitioning tests using resampling methods
       3.5.1 Permutation tests
       3.5.2 Pre-pivot resampling method
       3.5.3 Post-pivot resampling method

4. Conditions for Significance Analysis of Microarrays (SAM) to control the empirical FDR
   4.1 Introduction to the Significance Analysis of Microarrays (SAM) method
   4.2 Discrepancies between true expected values of order statistics and expected values estimated by permutation
       4.2.1 Effect of unequal variance-covariance matrices and sample sizes
       4.2.2 Effect of higher order cumulants with equal sample sizes
   4.3 Conditions for controlling the expected number of false rejections in SAM
   4.4 An adaptive two-step procedure controlling the expected number of false rejections
   4.5 Discussion

5. Concluding remarks

References

LIST OF TABLES

1.1 Summary of possible outcomes from testing k null hypotheses

2.1 Adjusted P-values calculated from formula (2.1) for the permutation test, post-pivot resampling method and pre-pivot resampling method

2.2 Maximum number of unique resampled test statistic values for the permutation test, post-pivot resampling method and pre-pivot resampling method

LIST OF FIGURES

2.1 Null distribution of max(i=1,2,3) |Ti| for k = 3 and n = 3. Observed test statistics and resampled test statistics from the permutation test, post-pivot resampling and pre-pivot resampling methods.

4.1 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal variances and sample sizes. The dashed line in the Q-Q plot is the 45 degree diagonal line.

4.2 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal correlations and sample sizes. The dashed line in the Q-Q plot is the 45 degree diagonal line.

4.3 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal skewness. The dashed line in the Q-Q plot is the 45 degree diagonal line.

4.4 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal third order cross cumulants. The dashed line in the Q-Q plot is the 45 degree diagonal line.

CHAPTER 1

MULTIPLE HYPOTHESES TESTING AND RESAMPLING METHODS

1.1 Multiple hypotheses testing

1.1.1 Introduction

With the rapid development of biotechnology, microarray technology has become widely used in biomedical and biological fields to identify differentially expressed genes and transcription factor binding sites, and to map complex traits using single nucleotide polymorphisms (SNPs) (Kulesh et al. (1987), Schena et al. (1995), Lashkari et al. (1997), Pollack et al. (1999), Buck and Lieb (2004), Mei et al. (2000), Hehir-Kwa et al. (2007)). Having thousands, or even millions, of genes on a small array makes multiple comparisons a central topic in statistics, because thousands, or even millions, of hypotheses need to be tested simultaneously. Without multiplicity adjustment, if each hypothesis is tested at level α, the probability of rejecting at least one true null hypothesis grows rapidly with the number of hypotheses tested. If, for example, 20 hypotheses are tested simultaneously, each at level 5%, and all null hypotheses are true, the probability of rejecting at least one true null hypothesis is about 64%, assuming all the test statistics are independent.

Therefore, in order to make the multiplicity adjustment, a multiple hypotheses testing procedure needs to control a certain type of error rate at level α. A popular multiple testing error rate controlled by many multiple hypotheses testing procedures is the family-wise error rate (FWER) (Hochberg and Tamhane (1987), Shaffer (1995)), which is defined as the probability of at least one false rejection. Another, less stringent, multiple testing error rate in common use is the false discovery rate (FDR) (Benjamini and Hochberg (1995)), which is defined as the expected proportion of falsely rejected null hypotheses among all rejections.
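As a quick numeric check of the 64% figure above, the following minimal sketch (assuming independent test statistics and all null hypotheses true; the values of α and k are the illustrative ones from the text) computes 1 − (1 − α)^k:

```python
# Probability of at least one false rejection among k independent level-alpha
# tests when all null hypotheses are true: 1 - (1 - alpha)^k.
alpha, k = 0.05, 20
fwer = 1 - (1 - alpha) ** k
print(f"per-test level {alpha}, k = {k}: unadjusted FWER = {fwer:.2f}")  # ~0.64
```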

1.1.2 Two definitions of Type I error rate

Suppose k genes are probed to compare expression levels between high risk and low risk patients. Let µHi, µLi, i = 1, . . . , k, denote the expected (logarithms of) expression levels of the ith gene of a randomly sampled patient from the high risk and low risk groups respectively. Let θi = µHi − µLi denote the difference of expected (logarithm of) expression levels of the ith gene between the high risk group and the low risk group. To determine which of the genes are differentially expressed in expectation between the high risk and low risk patients, we need to test the following null hypotheses:

H0i : θi = 0,  i = 1, . . . , k.  (1.1)

There are two different ways to define the Type I error rate when testing a single null hypothesis. Let θ = (θ1, θ2, . . . , θk), and let Σ denote generically all nuisance parameters that the observed expression levels depend on, such as the covariance of the expression levels within each of the high risk and low risk groups. Let θ^0 = (θ1^0, . . . , θk^0) and Σ^0 be the collection of all (unknown) true parameter values. A traditional definition of the Type I error rate, given by Casella and Berger (1990) or Berger (1993), is

sup{θi = 0} P{θ,Σ} {Reject H0i},

where the supremum is taken over all possible θ and Σ subject to θi = 0. Another definition of the Type I error rate, given by Pollard and van der Laan (2005), is

P{θ^0,Σ^0} {Reject H0i},

where θi^0 = 0. The first definition of the Type I error rate is more widely used than the second. The second definition can only be controlled asymptotically, since the true parameter values are unknown in microarray data analysis.

1.1.3 Familywise Error Rate (FWER)

When we are testing k null hypotheses simultaneously, the summary of possible outcomes is shown in Table 1.1.

Table 1.1: Summary of possible outcomes from testing k null hypotheses

                           Number not rejected   Number rejected   Total
True null hypotheses                U                   V           k0
Non-true null hypotheses            T                   S           k − k0
Total                             k − R                 R           k

In Table 1.1, V denotes the number of incorrectly rejected true null hypotheses when testing k null hypotheses; R denotes the number of hypotheses rejected among those k null hypotheses; k0 denotes the number of true null hypotheses; and k − k0 denotes the number of false null hypotheses. FWER is defined as the probability of rejecting at least one true null hypothesis (at least one false rejection):

FWER = P{V ≥ 1}.  (1.2)

There are two kinds of control of FWER. One is strong control of FWER, which controls the probability of at least one false rejection under any combination of true and false null hypotheses (that is, it controls the supremum). The other is weak control of FWER, which controls the probability of at least one false rejection only under the complete null hypothesis H0C : ∩(i=1..k) H0i with k0 = k (Westfall and Young (1993), Lehmann and Romano (2005)). In microarray experiments, since it is rare that no gene is differentially expressed, controlling FWER strongly is more appropriate than controlling it weakly. Strong control of FWER is desired to minimize the number of false rejections in some settings, such as selecting genes to build diagnostic or prognostic chips for diseases. An example is the MammaPrint chip developed by Agendia, which is based on the well-known Amsterdam 70-gene breast cancer gene signature (van 't Veer et al. (2002), van de Vijver et al. (2002), Buyse et al. (2006), Glas et al. (2006)). MammaPrint is used to predict whether an existing breast cancer will metastasize (spread to other parts of a patient's body). The multiple testing procedure proposed by Pollard and van der Laan (2005) has strong asymptotic control of FWER: it controls the error rate αn for a sample of size n, with limsup(n→∞) αn ≤ α under the true data generating distribution as the sample size n goes to infinity.

1.1.4 False Discovery Rate (FDR)

The concept of the false discovery rate (FDR) was first proposed by Benjamini and Hochberg (1995) to reduce the stringency of strong FWER control. FDR is more widely used than FWER in bioinformatics studies because investigators are often more interested in finding all potentially differentially expressed genes, even if some genes could be falsely identified (Benjamini and Yekutieli (2001), Storey (2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a), Benjamini et al. (2006), Strimmer (2008)). FDR is defined as the expected proportion of erroneously rejected null hypotheses among all rejected null hypotheses:

FDR = E(V/R | R > 0) Pr(R > 0).

Benjamini and Hochberg (1995) also presented four alternative formulations of FDR:

(1) Positive FDR:

pFDR = E(V/R | R > 0).

The pFDR is recommended by Storey (2002), who argued that pFDR is a more appropriate error measure than FDR.

(2) Conditional FDR:

cFDR = E(V/R | R = r),

where r is the observed number of rejected null hypotheses.

(3) Marginal FDR:

mFDR = E(V)/E(R).

(4) Empirical FDR:

Fdr = E(V)/r.

Benjamini and Hochberg (1995) argued that none of these FDRs can be controlled when all null hypotheses are true (k0 = k): if k0 = k and even a single null hypothesis is rejected, then V/R = 1 and FDR cannot be controlled. Controlling pFDR, cFDR, mFDR, and Fdr has the same problem: they are identically 1 when k0 = k. Tsai et al. (2003) showed that pFDR, cFDR, and mFDR are equivalent under the Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. The Significance Analysis of Microarrays (SAM) method that will be discussed in Chapter 4 estimates the empirical FDR.
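To make the distinctions among these quantities concrete, the following minimal Monte Carlo sketch contrasts FDR, pFDR, and mFDR in a toy multiple testing experiment. All settings (k, k0, the mean shift, and the per-test cutoff) are hypothetical illustration values, not taken from the dissertation:

```python
import numpy as np

# Simulate k z-tests per replicate, k0 of them true nulls, and tabulate
# V (false rejections) and R (total rejections) to estimate the FDR variants.
rng = np.random.default_rng(0)
k, k0, reps = 100, 80, 2000
vr, any_rej, V_tot, R_tot = [], 0, 0, 0
for _ in range(reps):
    z = rng.normal(size=k)
    z[k0:] += 3.0                      # non-null genes get a mean shift
    reject = np.abs(z) > 1.96          # per-test level-0.05 z-test
    V = reject[:k0].sum()              # falsely rejected true nulls
    R = reject.sum()                   # all rejections
    V_tot += V
    R_tot += R
    if R > 0:
        vr.append(V / R)
        any_rej += 1
pFDR = np.mean(vr)                     # E(V/R | R > 0)
FDR = pFDR * any_rej / reps            # E(V/R | R > 0) * Pr(R > 0)
mFDR = V_tot / R_tot                   # E(V)/E(R)
print(f"FDR ~ {FDR:.3f}, pFDR ~ {pFDR:.3f}, mFDR ~ {mFDR:.3f}")
```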

1.1.5 Multiple testing principles

A general principle of multiple testing is the Partitioning Principle proposed by Stefansson et al. (1988), and further refined by Finner and Strassburger (2002). Both Holm (1979)'s step-down method and Hochberg (1988)'s step-up method are special cases of partition testing (Huang and Hsu (2007)). The principle of partition testing is to partition the parameter space into disjoint subspaces, test each partitioning null hypothesis at level α, and collate the results across the subspaces, as follows. Let P = {1, . . . , k}, and consider testing H0i : θi = 0, i = 1, . . . , k. To control FWER strongly, the Partitioning Principle states:

P1: For each I ⊆ {1, . . . , k}, I ≠ ∅, form H*0I : θi = 0 for all i ∈ I and θj ≠ 0 for j ∉ I. In total, there are 2^k parameter subspaces and 2^k − 1 null hypotheses to be tested.

P2: Test each H*0I at level α. Since all the null hypotheses are disjoint, at most one null hypothesis is true. Therefore, no multiplicity adjustment is required for each H*0I.

P3: For each i, infer θi ≠ 0 if and only if all H*0I with i ∈ I are rejected, since H0i is the union of the H*0I with i ∈ I.

Taking k = 3 as an example, the parameter space Θ = {θ1, θ2, θ3} will be partitioned into eight disjoint subspaces:

Θ1 = {θ1 = 0 and θ2 = 0 and θ3 = 0}
Θ2 = {θ1 = 0 and θ2 = 0 and θ3 ≠ 0}
Θ3 = {θ1 = 0 and θ2 ≠ 0 and θ3 = 0}
···
Θ7 = {θ1 ≠ 0 and θ2 ≠ 0 and θ3 = 0}
Θ8 = {θ1 ≠ 0 and θ2 ≠ 0 and θ3 ≠ 0}

Next, we test each of the following H*0I's at level α:

H*0{123} : θ1 = 0 and θ2 = 0 and θ3 = 0
H*0{12} : θ1 = 0 and θ2 = 0 and θ3 ≠ 0
H*0{13} : θ1 = 0 and θ2 ≠ 0 and θ3 = 0
···
H*0{2} : θ1 ≠ 0 and θ2 = 0 and θ3 ≠ 0
H*0{3} : θ1 ≠ 0 and θ2 ≠ 0 and θ3 = 0

Finally, infer θi ≠ 0 if and only if all H*0I involving θi = 0 are rejected.
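The bookkeeping in steps P1-P3 is easy to mechanize. The sketch below enumerates the 2^k − 1 partitioning hypotheses for k = 3 and applies the P3 inference rule; the `rejected` dictionary is a hypothetical stand-in for the outcomes of the level-α tests of each H*0I:

```python
from itertools import combinations

# Enumerate the partitioning null hypotheses H*_{0I} for k = 3 and apply P3.
k = 3
genes = range(1, k + 1)
subsets = [set(c) for r in range(1, k + 1) for c in combinations(genes, r)]
print(f"{len(subsets)} partitioning null hypotheses (2^{k} - 1 = {2 ** k - 1})")

def infer_nonzero(rejected):
    # P3: infer theta_i != 0 iff every H*_{0I} with i in I is rejected.
    return [i for i in genes
            if all(rejected[frozenset(I)] for I in subsets if i in I)]

# Hypothetical test outcomes: only the hypothesis with I = {3} fails to reject.
rejected = {frozenset(I): (I != {3}) for I in subsets}
print("inferred differentially expressed:", infer_nonzero(rejected))  # [1, 2]
```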

Another multiple testing principle, similar to the Partitioning Principle, is the closed testing principle (Marcus et al. (1976)). The closed testing principle states:

C1: For each I ⊆ {1, . . . , k}, I ≠ ∅, form the intersection null hypothesis H0I : θi = 0 for all i ∈ I.

C2: Test each H0I at level α.

C3: For each i, infer θi ≠ 0 if and only if all H0I with i ∈ I are rejected.

Compared to the partition testing procedure, the closed testing procedure tests less restrictive hypotheses. However, the closed testing procedure still controls FWER strongly, because a level-α test for H0I is also a level-α test for H*0I.

To test H0i : θi = 0 (i = 1, . . . , k) using the test statistics Ti = θ̂i (i = 1, . . . , k), we test 2^k − 1 null hypotheses in accordance with the Partitioning Principle. A typical partitioning null hypothesis is

H*0{12···t} : θ1 = 0 and ··· and θt = 0 and θt+1 ≠ 0 and ··· and θk ≠ 0  (1 ≤ t ≤ k).

According to the closed testing principle, the above null hypothesis can be simplified as

H0{12···t} : θ1 = 0 and θ2 = 0 and ··· and θt = 0  (1 ≤ t ≤ k).

This simplification still controls FWER strongly, because a level-α test for H0{12···t} is also a level-α test for H*0{12···t}.

The test statistic for testing H0{12···t} is max(i=1,...,t) |Ti| = max(i=1,...,t) |θ̂i|, because H0{12···t} is tested with a union-intersection test (Casella and Berger (1990)), and the rejection region for a union-intersection test is ∪(i∈{1,...,t}) {|Ti| > c} = {max(i=1,...,t) |Ti| > c}, where c is the critical value for testing H0{12···t}.

1.2 Resampling methods

Resampling methods can be used to estimate the precision of sample statistics (means, medians, percentiles), perform significance tests, and validate models (Westfall and Young (1993), Efron and Tibshirani (1994), Davison and Hinkley (1997), Good (2005)). The commonly used resampling techniques include permutation tests and bootstrap methods. Two different bootstrap methods, the post-pivot resampling method and the pre-pivot resampling method, will be introduced in this section. Westfall and Young (1993) introduced procedures that use resampling to adjust P-values in multiple testing to control multiple testing error rates.

1.2.1 Permutation tests

A permutation test is a type of non-parametric statistical significance test in which a reference distribution is constructed by calculating all possible values of the test statistic from permuted observations under a null hypothesis. The theory of permutation tests is based on the work of Fisher and Pitman in the 1930s (Good (2005)). Compared to parametric testing procedures, the weaker distributional assumptions and simpler procedures make permutation tests attractive to many researchers and statisticians. For example, when comparing the means of two populations, a two-sample t-test assumes that the sampling distribution of the difference between sample averages is normal, which need not hold in practice; the t-test is exactly valid only when both populations are normally distributed. In contrast, the permutation test is distribution-free, so it can give exact P-values even when the sample size is small. The permutation test permutes the labels of observations between the two groups, and obtains the P-value as the proportion of test statistic values from the resamples that are as extreme as or more extreme than the observed test statistic value. In microarray data analysis, when the correlations between genes are taken into account in the joint distribution of the test statistics, the parametric form of a multivariate t distribution becomes very complex and difficult to calculate. In contrast, the permutation test is easy to conduct and avoids complex calculations. To carry out a permutation test based on a test statistic that measures the size of an effect of interest, we proceed as follows:

1. Compute the test statistic for the observed data set.

2. Permute the original data in a way that matches the null hypothesis to get permuted resamples, and construct the reference distribution using the test statistics calculated from the permuted resamples.

3. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of permutation test statistics that are as extreme as or more extreme than the observed test statistic.
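A minimal sketch of steps 1-3 for a two-sample mean comparison is given below, using complete enumeration of the group relabelings; the data values are hypothetical:

```python
import numpy as np
from itertools import combinations

x = np.array([4.2, 5.1, 6.3])          # hypothetical group 1
y = np.array([3.0, 3.8, 4.4])          # hypothetical group 2
data = np.concatenate([x, y])
m, n = len(x), len(y)

t_obs = x.mean() - y.mean()            # step 1: observed test statistic

# step 2: every way of relabeling m of the m + n observations as "group 1"
t_perm = np.array([data[list(idx)].mean() - np.delete(data, list(idx)).mean()
                   for idx in combinations(range(m + n), m)])

# step 3: two-sided P-value = proportion of permuted statistics at least
# as extreme as the observed one
p_value = np.mean(np.abs(t_perm) >= np.abs(t_obs))
print(f"observed difference {t_obs:.2f}, permutation P-value {p_value:.3f}")
```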

Permutation tests can be used in a wide variety of settings. For example, Fisher's exact test (a permutation test) is used to detect the association between a row variable and a column variable for small, sparse, or unbalanced data sets. Ein-Dor et al. (2005) used a permutation test to select genes whose expression profiles are significantly correlated with breast cancer survival status. Based on random permutations of time points, Ptitsyn et al. (2006) applied the permutation test to identify periodic patterns in relatively short time series measured with microarray technology; such periodic processes are important for modulating and coordinating the transcription of genes governing key metabolic pathways. Churchill and Doerge (1994) used a permutation test based on permuting the observed quantitative traits to determine quantitative trait loci. To identify significant changes in gene expression in microarray experiments, Tusher et al. (2001) used permutations of the repeated measurements in the significance analysis of microarrays (SAM) procedure.

For two-group comparisons, permuting the labels of observations between the two groups requires the assumption that the two populations are identical when the null hypothesis is true: not only are their means the same, but also their spreads and shapes. Pollard and van der Laan (2005) demonstrated that, if both the correlation structures and the sample sizes differ between the two populations, then a permutation test does not control the Type I error rate at its nominal significance level for detecting differentially expressed genes. The conditions for permutation tests to control multiple testing error rates, when comparing two groups and when finding significant predictor variables in fixed-effects general linear models, will be further discussed in Chapter 3. For testing hypotheses about a single population, comparing populations that differ even under the null hypothesis, or testing general relationships, permutation tests cannot be used, because we do not know how to resample in a way that matches the null hypothesis in these settings. Hence, bootstrap methods should be used instead.

1.2.2 Bootstrap methods

The bootstrap method was first introduced by Efron (1979) and further discussed by Efron and Tibshirani (1994).


The bootstrap method is a way of approximating the sampling distribution from just one sample. Instead of taking many simple random samples from the population to find the sampling distribution of a sample statistic, the bootstrap method repeatedly resamples with replacement from one random sample. The bootstrap distribution of a statistic collects the values of the statistic from many resamples, and gives information about the sampling distribution of the statistic. For example, the bootstrap distribution of a sample mean is obtained from the resampled means calculated from hundreds of resamples drawn with replacement from a single original sample. The bootstrap distribution of a sample mean has the following mean and standard error:

meanboot = (1/B) · Σ X̄*,

SEboot = √[ (1/(B − 1)) · Σ (X̄* − meanboot)² ],

where X̄* is the sample mean of each bootstrap resample and B is the number of resamples. Since a bootstrap distribution of a statistic is generated from a single original sample, it is centered at the value of the sample statistic rather than at the parameter value. Bootstrap distributions include two sources of random variation: one is from choosing the original sample at random from the population, and the other is from choosing bootstrap resamples at random from the original sample, which introduces little additional variation.
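The two formulas above translate directly into code. The sketch below computes the bootstrap mean and standard error of a sample mean for a hypothetical sample:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = np.array([2.1, 3.4, 2.9, 4.0, 3.2, 2.7])   # hypothetical data
B = 1000

# B resampled means, each from a resample drawn with replacement
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(B)])
mean_boot = boot_means.mean()        # (1/B) * sum of resampled means
se_boot = boot_means.std(ddof=1)     # sqrt((1/(B-1)) * sum (xbar* - mean_boot)^2)
print(f"bootstrap mean {mean_boot:.3f}, bootstrap SE {se_boot:.3f}")
```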

Bootstrap methods are asymptotically valid (as the original sample size goes to ∞). Efron (1979) showed that the bootstrap method can (asymptotically) correctly estimate the variance of a sample median, and the error rates in a linear discrimination problem (outperforming cross-validation). Freedman (1981) showed that the bootstrap approximation to the distribution of least squares estimates is valid. Hall (1986) showed that the bootstrap reduces the coverage error probability from O(n^(−1/2)) to O(n^(−1)), which makes the bootstrap method one order more accurate than the delta method. Bootstrap methods are widely used in all kinds of data analysis. Davison and Hinkley (1997) illustrated the application of bootstrap methods to stratified data; finite populations; censored and missing data; linear, nonlinear, and smooth regression models; classification; and time series and spatial problems. For example, using Efron's bootstrap resampling method, Liu et al. (2004) analyzed the performance of artificial neural networks (ANNs) in feature classification for the analysis of mammographic masses to achieve more accurate results. Feature classification in mammography is used to discover the salient information that can discriminate benign from malignant masses. In microarray data analysis, there are two commonly used bootstrap methods: the post-pivot resampling method and the pre-pivot resampling method. Both methods can control FWER asymptotically and give similar results in a fixed-effects general linear model with i.i.d. errors. In two-group comparisons, the null distribution estimated by the pre-pivot resampling method has more resampled test statistic values than that estimated by the post-pivot resampling method, under an assumption that is reasonable for microarray data (the distributions of the errors are exchangeable).


Post-pivot resampling method

The post-pivot resampling method was introduced by Pollard and van der Laan (2005) to estimate the null distribution of the test statistics in multiple hypotheses testing, in order to achieve asymptotic multiple testing error rate control. The post-pivot resampling method obtains the asymptotically correct null distribution of the test statistic (based on the true data generating distribution) from centered and/or scaled resampled test statistics. In microarray data analysis with two or more treatment groups, the post-pivot resampling method resamples the observed data within each group, calculates the resampled test statistics from each resample, centers and/or scales the resampled test statistics (subtracts the average of the resampled test statistics and/or divides by their standard deviation), and estimates the test statistic's null distribution from the centered and/or scaled resampled test statistics. To carry out a hypothesis test based on a test statistic that measures the location difference between two populations, the post-pivot resampling method proceeds as follows:

1. Compute the test statistic for the observed data set.

2. Resample the data with replacement within each group to obtain bootstrap resamples, compute the test statistic for each resampled data set, and construct the reference distribution using the centered and/or scaled resampled test statistics.

3. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of bootstrapped test statistics that are as extreme as or more extreme than the observed test statistic.
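A minimal sketch of this procedure for a single gene, with centering only (no scaling) and hypothetical data, is:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([4.2, 5.1, 6.3])          # control group (hypothetical)
y = np.array([3.0, 3.8, 4.4])          # treatment group (hypothetical)
B = 2000

t_obs = x.mean() - y.mean()            # step 1: observed test statistic

# step 2: resample WITHIN each group, then center the resampled statistics
t_boot = np.array([rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
                   for _ in range(B)])
z = t_boot - t_boot.mean()             # centered reference distribution

p_value = np.mean(np.abs(z) >= np.abs(t_obs))   # step 3
print(f"post-pivot P-value: {p_value:.3f}")
```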

Pre-pivot resampling method

The pre-pivot resampling method fits a model to the observed data first and then estimates the test statistic's null distribution by bootstrapping the centered residuals (subtracting the sample mean of the residuals) (Freedman (1981)). Under the assumption that the model fits the data well, the pre-pivot resampling method provides asymptotically valid results, i.e., it controls multiple testing error rates asymptotically when testing multiple null hypotheses. In microarray data analysis, the pre-pivot resampling method estimates the null distributions of the test statistics by bootstrapping residuals from a probe-level or gene-level model with treatment effects. The way the residuals are resampled with replacement (bootstrapped) depends on the assumptions about the residuals. The residuals can be resampled across treatments under the assumption of identical distributions across treatments, but not across genes. If the distributions are the same across genes as well, then residuals across treatments and genes can be pooled together for resampling with replacement. To carry out a hypothesis test based on a test statistic that measures the location difference of two populations, the pre-pivot resampling method proceeds as follows:

1. Compute the test statistic for the observed data set.

2. Fit a one-way model to the observed data, and compute the residuals from the one-way model (subtract the sample mean from each observation within each group).

3. Combine the residuals of the two groups under the assumption that the distributions of the residuals are the same for these two groups.

4. Resample the pooled residuals with replacement to get bootstrapped residuals, and center the bootstrapped residuals at their average (subtract the average of the bootstrapped residuals) if that average is not zero.

5. Add the centered bootstrapped residuals from each resample back to the one-way model, and recompute the test statistic for each resample. The test statistics from all resamples form the reference distribution.

6. Calculate the critical value of a level-α test based on the upper α percentile of the reference distribution, or obtain the P-value by computing the proportion of bootstrapped test statistics as extreme as or more extreme than the observed test statistic.
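A minimal sketch of steps 1-6 for a single gene is given below. It follows one reading of step 5, in which the resampled statistic is computed under the fitted null model (no group difference), so that the reference distribution is centered at zero; the data values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([4.2, 5.1, 6.3])                          # hypothetical group 1
y = np.array([3.0, 3.8, 4.4])                          # hypothetical group 2
B = 2000

t_obs = x.mean() - y.mean()                            # step 1
resid = np.concatenate([x - x.mean(), y - y.mean()])   # steps 2-3: pooled residuals

t_boot = np.empty(B)
for b in range(B):
    r = rng.choice(resid, resid.size)                  # step 4: bootstrap residuals
    r -= r.mean()                                      # re-center if needed
    # step 5: null statistic = difference of the resampled residual means
    t_boot[b] = r[:x.size].mean() - r[x.size:].mean()

p_value = np.mean(np.abs(t_boot) >= np.abs(t_obs))     # step 6
print(f"pre-pivot P-value: {p_value:.3f}")
```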

CHAPTER 2

SMALL SAMPLE BEHAVIOR OF RESAMPLING METHODS

Resampling techniques are popular in microarray data analysis. In this chapter, we discuss the small sample behavior of three popular resampling techniques for multiple testing: the permutation test, the post-pivot resampling method, and the pre-pivot resampling method. We show that when the sample size is small, for matched pairs, a permutation test is unlikely to give small P-values, while both post-pivot and pre-pivot resampling methods might give P-values of zero for the same data, even after adjusting for multiplicity. The discreteness of the test statistics' null distributions estimated by the three resampling methods is then compared based on the maximum number of unique test statistic values.

2.1 Tomato microarray example

A biology professor in the Department of Horticulture and Crop Science at The Ohio State University wishes to identify differentially expressed genes between control tomato plants and mutant tomato plants at different tomato fruit developmental stages (flower bud, flower, and fruit). Lee et al. (2000) recommended that at least three replicates be used when designing cDNA microarray experiments, particularly when gene expression data from single specimens will be analyzed. In the tomato microarray experiment, there are three paired samples at each stage (three plants in the control group and three plants in the mutant group). Suppose we only have three genes at the fruit stage and wish to learn which genes are differentially expressed between the mutant group and the control group using the single-step maxT method, a method based on resampling techniques, for the multiplicity adjustment.

Let Xij (i = 1, 2, 3, and j = 1, 2, 3) denote the gene expression level for the ith gene, jth sample in the control group, and Yij (i = 1, 2, 3, and j = 1, 2, 3) denote the gene expression level for the ith gene, jth sample in the treatment group. For the ith gene, Xij ~ FXi i.i.d., and Yij ~ FYi i.i.d. Let dij = xij − yij denote the observed paired difference for the ith gene, jth paired sample, and θi denote the true paired difference between the paired samples. To identify the differentially expressed genes among these three genes, we test the null hypotheses H0i : θi = 0 (i = 1, 2, 3) using the test statistics Ti = d̄i (i = 1, 2, 3). The raw P-values are calculated from resampling methods according to the following formula:

Raw Pi = #{b : |Ti,b| ≥ |Ti|} / B,  for i = 1, . . . , k.

The single-step maxT method based on resampling techniques will be used to calculate the adjusted P-values when testing the three null hypotheses simultaneously. The formula for calculating maxT adjusted P-values with monotonicity enforced is (cf. Westfall and Young (1993)):

Adjusted Pi = #{b : max(i=1,2,3) |Ti,b| ≥ |Ti|} / B,  for i = 1, 2, 3,  (2.1)

where Ti,b denotes the resampled test statistic for the ith gene, bth resampling, and B is the total number of resamplings (b = 1, . . . , B).

[Figure 2.1: Null distribution of max(i=1,2,3) |Ti| for k = 3 and n = 3. Observed test statistics and resampled test statistics from the permutation test, post-pivot resampling and pre-pivot resampling methods.]

Figure 2.1 shows the absolute values of the observed test statistics |Ti| and the maxima of the absolute values of the resampled test statistics max(i=1,2,3) |Ti,b| from the three resampling methods. The dots denote the observed test statistics; the rectangles denote the maxima of the resampled test statistics from the permutation test; the diamonds denote those from the post-pivot resampling method; and the triangles denote those from the pre-pivot resampling method. As shown in Figure 2.1, the permutation test always produces some permuted test statistics that are greater than or equal to the observed test statistic. Thus, it is unlikely that the permutation test gives zero adjusted P-values. In contrast, for either the pre-pivot or the post-pivot resampling method, there is a high probability that the observed test statistic is far from the resampled test statistics. Therefore, we might get zero adjusted P-values using these two resampling methods.

Based on formula (2.1) for the single-step maxT method, we can compute the adjusted P-values for all three genes. Table 2.1 summarizes the adjusted P-values obtained from the permutation test, the post-pivot resampling method, and the pre-pivot resampling method for the three tomato fruit genes, based on Figure 2.1. From the null distribution of max|T| estimated by the permutation test (the rectangles), the adjusted P-value for gene 1 is 0.75, since 6 out of 8 max|T| values are greater than or equal to |T1|. Similarly, the adjusted P-values for gene 2 and gene 3 are both 0.25 based on the permutation test. Using the post-pivot resampling method, the adjusted P-value for gene 1 is 0.30, since 3 out of 10 max|T| values (diamonds in Figure 2.1) are greater than or equal to |T1|. For gene 2 and gene 3, however, no resampled max|T| value from the post-pivot resampling method is greater than or equal to either |T2| or |T3|. Thus, the adjusted P-values for gene 2 and gene 3 are both zero using the post-pivot resampling method. We obtain the same adjusted P-values from the pre-pivot resampling method as from the post-pivot resampling method for all three fruit genes.

Table 2.1: Adjusted P-values calculated from formula (2.1) for the permutation test, post-pivot resampling method and pre-pivot resampling method

          Permutation    Post-pivot resampling    Pre-pivot resampling
gene 1    6/8 = 0.75     3/10 = 0.30              3/10 = 0.30
gene 2    2/8 = 0.25     0/10 = 0                 0/10 = 0
gene 3    2/8 = 0.25     0/10 = 0                 0/10 = 0
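Formula (2.1) is a one-line computation once the resampled statistics are in hand. The sketch below implements it for a k × B matrix of resampled test statistics; the observed statistics and the resampled matrix here are hypothetical random values, not the tomato data:

```python
import numpy as np

def maxt_adjusted_p(t_obs, t_resamp):
    """Single-step maxT adjusted P-values, formula (2.1).
    t_obs: length-k observed statistics; t_resamp: k x B resampled statistics."""
    max_abs = np.abs(t_resamp).max(axis=0)          # max_i |T_{i,b}| per resample b
    return np.array([np.mean(max_abs >= abs(t)) for t in t_obs])

rng = np.random.default_rng(4)
t_obs = np.array([0.8, 2.5, 2.7])                   # hypothetical observed T_i
t_resamp = rng.normal(size=(3, 1000))               # hypothetical resampled T_{i,b}
print(maxt_adjusted_p(t_obs, t_resamp))
```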

Strikingly, for matched pairs, the permuted test statistics (unstandardized or standardized) under complete enumeration always have a mean of zero. The reason is that one sample in each pair can be assigned either zero or one as its group label. When the labels are switched, the signs of the test statistics are also switched. Thus, the positive signs and negative signs cancel each other out, so that the mean of all permuted test statistics equals zero. For standardized test statistics, since the MSEs are the same for paired permuted samples when labels are switched, the mean of all permuted test statistics is also zero.

2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method

The tomato microarray example suggests that P-values of zero may occur often even after multiplicity adjustment. Therefore, we need to explore the conditions for getting an adjusted P-value of zero using the post-pivot and pre-pivot resampling methods for paired samples with small sample sizes (2 or 3 each).

2.2.1 Conditions for getting adjusted P-values of zero with a sample size of two

To expand the three genes in our tomato microarray example to k genes, let Xij (i = 1, 2, . . . , k; j = 1, 2, . . . , n) denote the gene expression level for the ith gene, jth sample in the control group, and Yij (i = 1, 2, . . . , k; j = 1, 2, . . . , n) denote the gene expression level for the ith gene, jth sample in the mutant group. For the ith gene, Xij ~ FXi i.i.d., and Yij ~ FYi i.i.d. Assume dij = xij − yij are the observed paired differences for the ith gene in the jth paired sample. We wish to determine which genes are differentially expressed among those k genes by testing the k null hypotheses H0i : θi = 0 (i = 1, . . . , k) using the test statistics Ti = d̄i.

When the sample size n is two, the observed differences are dij = xij − yij (i = 1, 2, . . . , k and j = 1, 2). For the first two genes, we have the following observation matrix:

( d11  d12 )
( d21  d22 ).

d11 d21

d11 +d12 2 d21 +d22 2

d11 +d12 2 d21 +d22 2

 d12 . d22

We can get the following matrix after subtracting the average in each row:  d11 −d12

2 d21 −d22 2

0 0 0 0

d12 −d11 2 d22 −d21 2



.

To get a raw P-value of zero for the first gene, we need to have 

12 12 | > | d11 −d | | d11 +d 2 2 d11 +d12 | 2 | > 0.

Similarly, we need to have the following relationship to have a raw P-value of zero for the second gene: 

22 22 | d21 +d | > | d21 −d | 2 2 d21 +d22 | 2 | > 0.

Therefore, the necessary and sufficient conditions for getting a raw P-value of zero for the ith gene are: either 

di1 > 0 di2 > 0

22

or 

di1 < 0 di2 < 0

for i=1, 2. Using the single step maxT method, we can get the necessary and sufficient conditions for getting an adjusted P-value of zero for the first gene as follows: either

or

  d11 > 0 d12 > 0  d11 + d12 > |d21 − d22 |

  d11 < 0 d12 < 0  d11 + d12 < −|d21 − d22 |.

Similarly, we can get the necessary and sufficient conditions for getting an adjusted P-value of zero for the second gene as follows: either

or

  d21 > 0 d22 > 0  d21 + d22 > |d11 − d12 |

  d21 < 0 d22 < 0  d21 + d22 < −|d11 − d12 |.

In other words, to have both raw P-values of zero and adjusted P-values of zero with a sample size of two for two genes, the conditions are: 1. To have raw P-values of zero, the necessary and sufficient condition is that both observations are in the same direction (either both are bigger than zero or both are smaller than zero). 2. To have adjusted P-values of zero, the necessary and sufficient conditions that need to be satisfied are: (a) Both observations for the same gene are in the same direction. 23

(b) The sum of two observations for one gene is either bigger than the absolute difference of two observations of the other gene (in the positive direction) or smaller than the negative value of the absolute difference of two observations of the other gene (in the negative direction). If k genes are considered, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero with a sample size of two are: either 

di1 > 0 di2 > 0



di1 < 0 di2 < 0,

or

for i = 1, 2, . . . , k. For getting an adjusted P-value of zero for the ith gene with a sample size of two, the necessary and sufficient conditions are: either

or

  di1 > 0 di2 > 0  di1 + di2 > maxj6=i,j=1,2,...,n |dj1 − dj2|

  di1 < 0 di2 < 0  di1 + di2 < −maxj6=i,j=1,2,...,n |dj1 − dj2|,

for i = 1, 2, . . . , k.

2.2.2

Conditions for getting adjusted P-values of zero with a sample size of three

When the sample size increases from two to three for each group, the observed differences are dij = xij − yij (i = 1, 2, . . . , k and j = 1, 2 and 3). The observed

24

difference matrix for the first two genes is: 

d11 d12 d13 d21 d22 d23



.

T1 = (d11 + d12 + d13 )/3 and T2 = (d21 + d22 + d23 )/3 will be our observed test statistics for the first two genes when the sample size is three, and there will be 3 × 3 × 3 = 27 complete bootstrap resampled test statistics. The ten bootstrap resamples that will give ten unique test  1 1  1  1  1  1  2  2  2 3

statistic values are:  1 1 1 2  1 3  2 2  2 3 , 3 3  2 2  2 3  3 3 3 3

where 1 is the label for the first paired difference, 2 is the label for the second paired difference, and 3 is the label for the third paired difference. If the bootstrap resamplings all come from the first paired difference, then we will have the following resampled difference matrix for the first two genes: 

d11 d11 d11 d21 d21 d21



.

The resampled test statistics computed from the above difference matrix are T1,b=1 = d11 and T2,b=1 = d21 . If the bootstrap resamplings include the first paired difference twice and the second paired difference once, then the resampled difference matrix is: 

d11 d11 d12 d21 d21 d22 25



.

The resampled test statistics computed from the above difference matrix are T1,b=2 = (2d11 + d12 )/3 and T2,b=2 = (2d21 + d22 )/3. In the post-pivot resampling method, we subtract the average of all resampled test statistics, which is T1 = (d11 +d12 +d13 )/3 for the first gene and T2 = (d21 +d22 +d23 )/3 for the second gene respectively, from each resampled test statistic to get the reference distribution Z b for both genes: 

2d11 −d12 −d13 3 2d21 −d22 −d23 3

d11 −d13 3 d21 −d23 3

··· 0 ··· 0

2d12 −d11 −d13 3 2d22 −d21 −d23 3

d13 −d12 3 d23 −d22 3

··· ···

2d13 −d11 −d12 3 2d23 −d21 −d22 3



.

According to the formula for calculating raw P-values, if all |Z1,b| < |T1|, the raw P-value of the first gene is equal to zero. To have |Z1,b| < |T1|, the following relationships need to be satisfied:

|(d11 − d13)/3| < |(d11 + d12 + d13)/3|
|(d11 − d12)/3| < |(d11 + d12 + d13)/3|
|(d12 − d13)/3| < |(d11 + d12 + d13)/3|
|(2d11 − d12 − d13)/3| < |(d11 + d12 + d13)/3|
|(2d12 − d11 − d13)/3| < |(d11 + d12 + d13)/3|
|(2d13 − d11 − d12)/3| < |(d11 + d12 + d13)/3|
0 < |(d11 + d12 + d13)/3|

From the above equations, we derive the following necessary and sufficient conditions for the first gene to have a raw P-value of zero: either

d11 > max(−2d12, −2d13)
d12 > max(−2d11, −2d13)
d13 > max(−2d11, −2d12)
d11 + d12 > d13/2
d11 + d13 > d12/2
d12 + d13 > d11/2

or

d11 < 0, d12 < 0, d13 < 0.

 d11 > max(−2d12 , −2d13 )     d12 > max(−2d11 , −2d13 )    d13 > max(−2d11 , −2d12 ) d11 + d12 > d213     d + d13 > d212    11 d11 + d13 > d212   d11 < 0 d12 < 0  d13 < 0, 26

For the second gene, to have |Z2,b| < |T2 |, the following relationships need to be satisfied:

 |(d21 − d23 )/3| < |(d21 + d22 + d23 )/3|     |(d21 − d22 )/3| < |(d21 + d22 + d23 )/3|      |(d22 − d23 )/3| < |(d21 + d22 + d23 )/3| |(2d21 − d22 − d23 )/3| < |(d21 + d22 + d23 )/3|   |(2d22 − d21 − d23 )/3| < |(d21 + d22 + d23 )/3|     |(2d23 − d21 − d22 )/3| < |(d21 + d22 + d23 )/3|    0 < |(d21 + d22 + d23 )/3|

From the above equations, the necessary and sufficient conditions for the second gene to have a raw P-value of zero are: either

or

 d21 > max(−2d22 , −2d23 )     d22 > max(−2d21 , −2d23 )    d23 > max(−2d21 , −2d22 ) d21 + d22 > d223     d + d23 > d222    21 d21 + d23 > d222   d21 < 0 d22 < 0  d23 < 0,

If we expand the two-genes case to k-genes case, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero are shown as follows using the post-pivot resampling method: either

or

for i = 1, 2, . . . , k.

 di1 > maxi=1,2,...,k (−2di2 , −2di3 )     di2 > maxi=1,2,...,k (−2di1 , −2di3 )    d > max i3 i=1,2,...,k (−2di1 , −2di2 ) di1 + di2 > d2i3    di2    di1 + di3 > d2  di1 + di3 > 2i2   di1 < 0 di2 < 0  di3 < 0, 27

To have an adjusted P-value of zero for the first gene when we only have two genes, the following relationships need to be satisfied:

max(|(d11 − d13)/3|, |(d21 − d23)/3|) < |(d11 + d12 + d13)/3|
max(|(d11 − d12)/3|, |(d21 − d22)/3|) < |(d11 + d12 + d13)/3|
max(|(d12 − d13)/3|, |(d22 − d23)/3|) < |(d11 + d12 + d13)/3|
max(|(2d11 − d12 − d13)/3|, |(2d21 − d22 − d23)/3|) < |(d11 + d12 + d13)/3|
max(|(2d12 − d11 − d13)/3|, |(2d22 − d21 − d23)/3|) < |(d11 + d12 + d13)/3|
max(|(2d13 − d11 − d12)/3|, |(2d23 − d21 − d22)/3|) < |(d11 + d12 + d13)/3|
0 < |(d11 + d12 + d13)/3|

The above equations give us the following necessary and sufficient conditions for getting an adjusted P-value of zero for the ith gene in the k-gene case: either

di1 > max(−2di2, −2di3)
di2 > max(−2di1, −2di3)
di3 > max(−2di1, −2di2)
di1 + di2 > di3/2
di1 + di3 > di2/2
di2 + di3 > di1/2
di1 + di2 + di3 > max{l≠i, l=1,2,...,k} (|dl1 − dl3| + |dl1 − dl2|, |dl1 − dl2| + |dl2 − dl3|, |dl1 − dl3| + |dl2 − dl3|)

or

di1 < 0, di2 < 0, di3 < 0, and
di1 + di2 + di3 < −max{l≠i, l=1,2,...,k} (|dl1 − dl3| + |dl1 − dl2|, |dl1 − dl2| + |dl2 − dl3|, |dl1 − dl3| + |dl2 − dl3|).

47

Repeat the above steps P times, and compute the upper α quantiles of the distrip p p butions of the maxi=1,...,k |T[i]p |, maxi=1,...,k−1|T[i]p |, . . . , max(|T[1] |, |T[2] |) and |T[1] |. Set

them as critical values ck , ck−1 , . . . , c1 respectively. To control FWER strongly, the permutation distribution of the test statistics T (T = (T1 , . . . , Tk )) needs to be the same as the true distribution of the test statistics T under the null hypothesis. Let ka (FX ) and ka (FY ), a = 1, 2, 3, . . . , denote the cumulants of FX and FY respectively (assuming they exist). Huang et al. (2006) showed that the permutation distribution and the true distribution of the test statistics can be expressed in terms of ka (FX ) and ka (FY ). ¯ − Y¯ has cumuTheorem 3.1. (1) The true distribution of the test statistics T = X lants T ) = m1−a ka (FX ) + (−1)a n1−a ka (FY ). ka (T

(3.1)

(2) For a given permutation with r elements relabeled, the distribution (Pr ) of the ¯ r − Y¯ r obtained by a permutation has cumulants test statistics T r = X T r ) = ka (T T ) − r( ka (T

1 (−1)a − )(ka (FX ) − ka (FY )). ma na

(3.2)

Comparing the cumulants in the permutation distribution with the true distribution, we obtain the following results immediately. Corollary 3.2. (1) If m = n, i.e., two groups have the same sample size, then the true and permutation distributions of the test statistics T have the same even order cumulants. (2) The true and permutation distributions of the test statistics do not necessarily have the same odd order cumulants even if m = n, unless ka (FX ) = ka (FY ) for all odd a’s. 48

Therefore, despite whether the sample sizes for the high risk group and the low risk group are equal or not, the data distributions must have matching cumulants at the least favorable configuration (LFC) to make the true and permutation distribution of the test statistics the same. The LFC is the configuration where the supremum is taken. A sufficient condition for this to occur is the marginal-determine-the-joint (MDJ) condition proposed by Xu and Hsu (2007). The MDJ condition is used to connect the marginal distributions with the joint distribution. In partition testing, the null hypotheses are the equivalence of two marginal distributions as follows: P H0I : FXi = FYi

f or

i∈I

and FXj 6= FYj

f or

j∈ /I

for each I ⊆ {1, . . . , k}. The null hypotheses being tested by permutation testing are, however perm H0I : FXI = FY I ,

where FXI and FY I are the joint distributions of the expression levels of genes with indices in I from the low risk and high risk groups, respectively. perm P Thus, a level-α test for H0I would be a level-α test for H0I only if the following

marginal-determine-the-joint (MDJ) condition holds: MDJ let Ij , j = 1, . . . , n, be any collection of disjointed subsets of {1, . . . , k}, Ij ⊆ {1, . . . , k}. If the marginal distributions of the observations are identical for two groups, FXIj = FY Ij for all j = 1, . . . , n, then the joint distributions are identical as well, FXI U = FY I U where I U = ∪j=1,...,n Ij .

49

3.1.2

Post-pivot resampling method

The post-pivot resampling method resamples (with replacement) data within each treatment group instead of across treatment groups, and estimates the null distribution of test statistics by centering (and scaling) the resampled test statistics. A step-down FWER-controlling multiple test procedure proceeds as follows: Suppose the test statistics are ordered such that |T[1] | ≤ |T[2] | ≤ · · · ≤ |T[k]|. Step 1. If |T[k] | > ck , then infer θ[k] 6= 0 and go to step 2; otherwise stop. Step 2. If |T[k−1] | > ck−1 , then infer θ[k−1] 6= 0 and go to step 3; otherwise stop. ··· Step k: If |T[1] | > c1 , then infer θ[1] 6= 0 and stop; otherwise stop. The critical values ck , ck−1 , . . . , c1 can be obtained by the following steps. For the bth bootstrap (resample with replacement), b = 1, . . . , B: 1. Resample the data vectors X1 , · · · , Xm and Y1 , · · · , Yn with replacement within the low risk group and high risk group respectively. 2. Compute the test statistics T1b , . . . , Tkb based on the resampled data. 3. Repeat step 1 and step 2 B (B ≤ mm nn ) times and get all resampled test statistics. 4. Center the resampled test statistics T1b , . . . , Tkb to get centered test statistics P b Z1b , . . . , Zkb (where Zib = Tib − B b=1 (Ti )/B for i = 1, . . . , k).

b b b b b | for |, . . . , max(|Z[1] |, |Z[2] |) and |Z[1] |, maxi=1,...,k−1|Z[i] 5. Find maxi=1,...,k |Z[i]

each resample. b 6. Compute the upper α quantiles of the distributions of the maxi=1,...,k |Z[i] |, b b b b |. Set them as the critical values of |) and |Z[1] |, |Z[2] |, . . . , max(|Z[1] maxi=1,...,k−1|Z[i]

ck , ck−1, . . . , c1 respectively. 50

The post-pivot step-down FWER-controlling multiple testing procedure controls FWER asymptotically at level α. It means that, for a sample of size n, the error rate αn has the property limsupn→∞ αn ≤ α under the true data generating distribution. Let S0 = {j : θj (P ) = 0} denote a set of true null hypotheses (P denotes the true data generating distribution), Q denote the true distribution of test statistics (Qn is an estimate of Q), and Q0 denote the null distribution of test statistics. Given a vector of cutoff values c, the random variables V (c|Q) and R(c|Q) are defined as: V (c|Q) =

X

j∈S0

R(c|Q) =

k X j=1

I(|Tjn | > cj ) I(|Tjn | > cj ),

where Tn ∼ Q.

Let Vn = V (c|Qn (P )) denote the number of false positives of the multiple testing procedure and Rn = R(c|Qn (P )) denote the total number of rejected null hypotheses in the same multiple testing procedure. For a discrete distribution F on {0, . . . , k}, we define a real valued parameter η(F ) ∈ (0, 1) to represent a particular multiple testing error rate, where F represents a candidate for the distribution of Vn . Let FVn denote the cumulative distribution of the random variable Vn . We wish to have η(FVn ) ≤ α at least asymptotically. FWER can be written as a function of the distribution of FVn , shown below: η(FVn ) = 1 − FVn (0) = P r(Vn > 0). The distance measure we will use between two cumulative distribution functions F1 and F2 on {0, . . . , k} is defined as d(F1 , F2 ) = maxj∈{0,...,k}|F1 ({j}) − F2 ({j})|. 51

Similarly, we will test the null hypotheses
$$H_{0i}: \theta_i = \mu_{X_i} - \mu_{Y_i} = 0, \qquad i = 1, \ldots, k,$$
using the test statistics
$$T_i = \bar{X}_i - \bar{Y}_i, \qquad i = 1, \ldots, k.$$

Pollard and van der Laan (2005) proved that the post-pivot step-down multiple testing procedure controls FWER asymptotically at level $\alpha$ if the following assumptions are satisfied:

1. Uniform continuity: if $d(F_n, G_n) \to 0$, then $\eta(F_n) - \eta(G_n) \to 0$;
2. The centered test statistics converge in distribution, $Z_n \overset{D}{\to} Z$, with limiting distribution $Q_0 \equiv N(0, \Sigma(P))$;
3. Let $Q_{0n}$ be an estimate of $Q_0$, and define $c_{0n} \equiv c(Q_{0n}, \alpha)$ and $c_0 \equiv c(Q_0, \alpha)$; then $c_{0n} \to c_0$ in probability as $n \to \infty$.

3.1.3 Pre-pivot resampling method

Under the assumption that the residuals from the two groups have the same distribution, the pre-pivot resampling method resamples (with replacement) the residuals of a model across treatment groups. We add the centered residuals back to obtain resampled observations, and recompute the test statistics for each set of resampled observations to estimate the test statistic's null distribution. Based on the pre-pivot resampling method, a step-down FWER-controlling multiple test procedure proceeds as follows. Suppose that the test statistics are ordered such that $|T_{[1]}| \le |T_{[2]}| \le \cdots \le |T_{[k]}|$.

Step 1. If $|T_{[k]}| > c_k$, then infer $\theta_{[k]} \ne 0$ and go to step 2; otherwise stop.
Step 2. If $|T_{[k-1]}| > c_{k-1}$, then infer $\theta_{[k-1]} \ne 0$ and go to step 3; otherwise stop.
$\cdots$
Step k. If $|T_{[1]}| > c_1$, then infer $\theta_{[1]} \ne 0$ and stop; otherwise stop.

The critical values $c_k, c_{k-1}, \ldots, c_1$ can be obtained by the following steps (a code sketch follows the list). For the $b$th bootstrap resample (drawn with replacement), $b = 1, \ldots, B$:

1. Calculate the residuals for both groups by subtracting the group average from each observation within each group for each gene: $(X_{i1} - \bar{X}_i, X_{i2} - \bar{X}_i, \ldots, X_{im} - \bar{X}_i, Y_{i1} - \bar{Y}_i, Y_{i2} - \bar{Y}_i, \ldots, Y_{in} - \bar{Y}_i)$ for $i = 1, \ldots, k$.
2. Resample the above $m + n$ residual vectors with replacement $B$ ($B \le m^m n^n$) times.
3. For each gene, check whether the $B(m+n)$ resampled residuals have an estimated mean (average) of 0. If not, center the resampled residuals at the estimated mean.
4. For each gene, add the centered residuals back to the original observations to get resampled observations for each of the $B$ resamples.
5. Calculate the test statistics $|T_i^{b*}|$ ($i = 1, \ldots, k$ and $b = 1, \ldots, B$) for each gene within each set of resampled observations.
6. Find $\max_{i=1,\ldots,k} |T_{[i]}^{b*}|$, $\max_{i=1,\ldots,k-1} |T_{[i]}^{b*}|$, $\ldots$, $\max(|T_{[1]}^{b*}|, |T_{[2]}^{b*}|)$ and $|T_{[1]}^{b*}|$ for each resample.
7. Compute the upper $\alpha$ quantiles of the distributions of $\max_{i=1,\ldots,k} |T_{[i]}^{b*}|$, $\max_{i=1,\ldots,k-1} |T_{[i]}^{b*}|$, $\ldots$, $\max(|T_{[1]}^{b*}|, |T_{[2]}^{b*}|)$ and $|T_{[1]}^{b*}|$. Set them as the critical values $c_k, c_{k-1}, \ldots, c_1$, respectively.
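A minimal Python sketch of steps 1-7 appears below. As an assumption for the sketch, the centered resampled residuals are added back to the fitted group means (the form $y^* = X\hat{\beta} + \epsilon^*$ used for the general linear model in Section 3.3.2) rather than to the raw observations; names and defaults are illustrative.

```python
import numpy as np

def prepivot_critical_values(x, y, B=1000, alpha=0.05, seed=0):
    """Sketch of the pre-pivot bootstrap: pool centered within-group residual
    vectors, resample them across groups, rebuild observations, and take
    upper-alpha quantiles of the successive max-|T*| distributions."""
    rng = np.random.default_rng(seed)
    m, k = x.shape
    n, _ = y.shape
    t = x.mean(axis=0) - y.mean(axis=0)
    order = np.argsort(np.abs(t))
    resid = np.vstack([x - x.mean(axis=0), y - y.mean(axis=0)])   # step 1
    tb = np.empty((B, k))
    for b in range(B):
        idx = rng.integers(0, m + n, m + n)          # step 2: resample residuals
        eps = resid[idx] - resid[idx].mean(axis=0)   # step 3: re-center per gene
        xb = x.mean(axis=0) + eps[:m]                # step 4: add back to fits
        yb = y.mean(axis=0) + eps[m:]
        tb[b] = xb.mean(axis=0) - yb.mean(axis=0)    # step 5
    tb_abs = np.abs(tb)[:, order]
    c = np.array([np.quantile(tb_abs[:, :j].max(axis=1), 1 - alpha)
                  for j in range(1, k + 1)])         # steps 6-7
    return t, order, c
```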


The pre-pivot step-down FWER-controlling multiple testing procedure also controls FWER asymptotically at level $\alpha$ under certain conditions. Two-group comparisons are special cases of fixed-effects general linear models; therefore, the conditions for controlling FWER asymptotically at level $\alpha$ in two-group comparisons are exactly those for fixed-effects general linear models.

3.2 Fixed-effects general linear model

Consider a fixed-effects general linear model of the form
$$y = X\beta + \epsilon. \tag{3.3}$$



In this equation, $y = (Y_1, \ldots, Y_n)'$ is an $n \times 1$ vector of observed responses, and $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ is a $(k+1) \times 1$ vector of unknown parameters that can be estimated from the data. In addition, $X$ is an $n \times (k+1)$ matrix of known predictors with full rank $(k+1) \le n$, while $\epsilon = (\epsilon_1, \ldots, \epsilon_n)'$ is an $n \times 1$ random vector of unobserved errors. The matrix $X$ can be written as
$$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}.$$
Here, the first column of $X$ is the vector $1_n = (1, \ldots, 1)'$, so the first coefficient $\beta_0$ is the intercept. We can write $X = (1_n, X^*)$, where $X^*$ is an $n \times k$ matrix. There are $k$ coefficients $\beta_1, \ldots, \beta_k$ that correspond to the explanatory variables $X_1, \ldots, X_k$. We assume $E(\epsilon_i) = 0$, $i = 1, \ldots, n$.

Using the least squares approach, when $r(X'X) = k+1$, the unique ordinary least squares estimator of $\beta$ is
$$\hat{\beta} = (X'X)^{-1}X'y,$$
with mean $\beta$ and variance-covariance matrix $\sigma^2(X'X)^{-1}$. To determine whether the predictors have significant effects on the response variable, we need to test the multiple null hypotheses $H_{0i}: \beta_i = 0$ ($i = 1, \ldots, k$). The test statistics are $|T_i| = |\hat{\beta}_i|$ ($i = 1, \ldots, k$).
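As a concrete illustration, here is a minimal Python sketch that computes the OLS estimate and the statistics $|T_i| = |\hat{\beta}_i|$ for model (3.3); the function name and array shapes are illustrative assumptions, not part of the original text.

```python
import numpy as np

def ols_test_statistics(X_star, y):
    """Minimal sketch: fit model (3.3) with an intercept column and
    return the OLS estimate and the statistics |T_i| = |beta_hat_i|."""
    n = len(y)
    X = np.column_stack([np.ones(n), X_star])          # X = (1_n, X*)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # (X'X)^{-1} X'y
    return beta_hat, np.abs(beta_hat[1:])              # skip intercept beta_0
```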

3.3 Estimating the test statistic's null distribution

To control the Type I error rate for each partitioning null hypothesis, we need to know the test statistic's null distribution (the distribution of the test statistic under the null hypothesis). However, the test statistic's null distribution is unknown in most cases. Resampling techniques are commonly used to estimate the test statistic's null distribution in multiple hypothesis testing.

3.3.1 Permutation tests

Permutation tests are popular resampling techniques for hypothesis testing when the distribution of the test statistic is unknown. The conditions for permutation tests to be valid for two-group comparisons were discussed by Xu and Hsu (2007). Compared to two-group comparisons, fixed-effects general linear models are more widely used in microarray data analysis. We (Calian et al. (2008)) showed, for the first time, that the test statistic's distribution estimated by permutation tests is identical to the test statistic's true distribution under a typical partitioning null hypothesis.

For the fixed-effects general linear model (3.3), the permutation test estimates the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma| = \max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$ by the following steps (a code sketch of these steps appears after the discussion of Theorem 3.3):

Step 1. Permute the data vector $y = (Y_1, \ldots, Y_n)'$ to get a permuted response vector $y^p = (Y_1^p, \ldots, Y_n^p)'$;
Step 2. Calculate the permuted test statistics $\hat{\beta}_I^p = (\hat{\beta}_1^p, \ldots, \hat{\beta}_t^p)'$ based on $(X'X)^{-1}X'y^p$;
Step 3. Repeat steps 1 and 2 for all $n!$ possible permutations;
Step 4. Get the joint permutation distribution of $\hat{\beta}$ by combining all $n!$ joint permutation distributions of $\hat{\beta}_I^p$ for $p = 1, \ldots, n!$;
Step 5. Find the permutation distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma| = \max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$ based on the joint permutation distribution of $\{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$.

As shown in Theorem 3.3, if the errors in model (3.3) are i.i.d. and the test statistics are the ordinary least squares estimates, the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$ estimated by permutation tests is identical to its true distribution under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

Theorem 3.3. In the fixed-effects general linear model (3.3), assume
(1) the errors $\epsilon_1, \ldots, \epsilon_n$ are i.i.d., and
(2) the test statistics are simply the ordinary least squares (OLS) estimates, $T_i = \hat{\beta}_i$, $i = 1, \ldots, k$.
Then,
(a) averaging over all permutations of $Y$, one has $\hat{\beta}_0 = \bar{Y}$ and $\hat{\beta}_i = 0$, $i = 1, \ldots, k$;
(b) the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$ estimated by permutation tests is identical to its true distribution under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

Proof. First we show that the average of all permuted $Y$ is $\bar{Y}\mathbf{1}$, where $\bar{Y} = (1/n)\sum_{i=1}^{n} Y_i$ and $\mathbf{1} = (1, \ldots, 1)'$.

Among all the permuted $Y$, those with the $i$-th element equal to $y_1$ appear $(n-1)!$ times, those with the $i$-th element equal to $y_2$ appear $(n-1)!$ times, and so on. Thus, the sum of all permutations of $Y$ is
$$(n-1)!\left(\sum_{i=1}^{n} y_i\right)\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}.$$

Since there are $n!$ permutations, the average of the permuted $Y$ is $\bar{Y}\mathbf{1}$.

For any $Y$, $\hat{\beta}$ is the projection of $Y$ onto the subspace spanned by the columns of the design matrix $X$. Because $X$ remains fixed, the average of this projection over the permuted $Y$ equals the projection of the average of the permuted $Y$. From the above results, the average of the permuted $Y$ is $\bar{Y}\mathbf{1}$. For such a data vector, with an intercept term in the model, $\hat{\beta}_0 = \bar{Y}$ and $\hat{\beta}_i = 0$, $i = 1, \ldots, k$, proving (a).

For any data vector, the variance-covariance matrix (second-order cumulant) of $\hat{\beta}$ is $\sigma^2(X'X)^{-1}$. Therefore, the true distribution and the permutation distribution have the same variance-covariance matrix.

For higher-order cumulants, the following three properties of cumulants (p1, p2 and p3) will be used in the proof of Theorem 3.3.

(p1) Multilinearity of cumulants: for any two random vectors $Z$, $W$ and any matrix $M$ such that $Z = MW$, the components of the cumulants $k_a(Z)$ and $k_a(W)$ (which are, in general, tensors of order $a$) are related by
$$(k_a(Z))_{j_1 j_2 \ldots j_a} = \sum_{l_1 l_2 \ldots l_a} M_{j_1 l_1} M_{j_2 l_2} \cdots M_{j_a l_a} (k_a(W))_{l_1 l_2 \ldots l_a}.$$

(p2) Any permutation matrix $A^{(p)}$ satisfies $A^{(p)}(A^{(p)})^T = (A^{(p)})^T A^{(p)} = I$.

(p3) When the errors are i.i.d., the joint distribution of $(Y_1, \ldots, Y_n)$ has higher-order ($a \ge 2$) cumulants that are diagonal tensors, $k_a(\mathbf{Y})_{j_1 j_2 \ldots j_a} \equiv k_a(Y_{j_1}, \ldots, Y_{j_a})$, with non-zero elements $k_a(\mathbf{Y})_{jj\ldots j} \equiv k_a(y_j)$. Only when the errors are also identically distributed do we write $k_a(\mathbf{Y})_{jj\ldots j} = k_a(\epsilon) \equiv k_a(y)$.

Since $\hat{\beta} = \Omega Y$ with $\Omega \equiv (X'X)^{-1}X'$, using (p1) we find that the true distribution of $\hat{\beta}$ has the cumulants
$$(k_a(\hat{\beta}))_{i_1 i_2 \ldots i_a} = \sum_{j_1 j_2 \ldots j_a} \Omega_{i_1 j_1}\Omega_{i_2 j_2}\cdots\Omega_{i_a j_a}(k_a(\mathbf{Y}))_{j_1 j_2 \ldots j_a},$$
which become, for independent errors,
$$(k_a(\hat{\beta}))_{i_1 i_2 \ldots i_a} = \sum_{j} \Omega_{i_1 j}\Omega_{i_2 j}\cdots\Omega_{i_a j}\,k_a(y_j),$$
and
$$(k_a(\hat{\beta}))_{i_1 i_2 \ldots i_a} = k_a(y)\sum_{j} \Omega_{i_1 j}\Omega_{i_2 j}\cdots\Omega_{i_a j}$$
if the errors are also identically distributed.

Let $\hat{\beta}^{(p)}_{permute}$ denote the permuted test statistic for a given permutation $p$, defined in terms of a permutation matrix $A^{(p)}$:
$$\hat{\beta}^{(p)}_{permute} = \Omega(A^{(p)}Y).$$
Using (p1), one can write
$$(k_a(\hat{\beta}^{(p)}_{permute}))_{i_1 i_2 \ldots i_a} = \sum_{j_1 j_2 \ldots j_a} \Omega_{i_1 j_1}\cdots\Omega_{i_a j_a}\sum_{l_1 l_2 \ldots l_a} A^{(p)}_{j_1 l_1}\cdots A^{(p)}_{j_a l_a}(k_a(\mathbf{Y}))_{l_1 l_2 \ldots l_a}.$$
When the errors are independent, this gives
$$(k_a(\hat{\beta}^{(p)}_{permute}))_{i_1 i_2 \ldots i_a} = \sum_{(j(p),\,l(p))} \Omega_{i_1 j(p)}\Omega_{i_2 j(p)}\cdots\Omega_{i_a j(p)}\,k_a(y_{l(p)}),$$
where $(j(p), l(p))$ are all pairs corresponding to $A^{(p)}_{jl} \ne 0$ for the permutation matrix $A^{(p)}$. As shown in the following, we obtain common higher-order cumulants only when the errors are also identically distributed.

From (p3), we have
$$(k_a(\hat{\beta}^{(p)}_{permute}))_{i_1 i_2 \ldots i_a} = \sum_{j_1 j_2 \ldots j_a} \Omega_{i_1 j_1}\cdots\Omega_{i_a j_a}\sum_{l}\big(A^{(p)}_{j_1 l}\cdots A^{(p)}_{j_a l}\big)\,k_a(y),$$
and by the structure of the permutation matrix we obtain
$$(k_a(\hat{\beta}^{(p)}_{permute}))_{i_1 i_2 \ldots i_a} = \sum_{j}\Omega_{i_1 j}\cdots\Omega_{i_a j}\,k_a(y) = (k_a(\hat{\beta}))_{i_1 i_2 \ldots i_a}.$$

Therefore, the joint distribution of $\{\hat{\beta}_1^p, \ldots, \hat{\beta}_t^p\}$ estimated by the permutation test is exactly the same as the true joint distribution of $\{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$. Since the distribution of $\max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$ is a one-to-one function of the joint distribution of $\{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$, the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma| = \max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$ under the null hypothesis $H_{0\{12\cdots t\}}$ is exactly the same as the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma^p| = \max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma^p|$.

To control the Type I error rate for testing $H^*_{0I}$ at level $\alpha$, we need
$$\sup_{H_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c\} \le \alpha.$$
Since the joint distribution of $\{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$ depends only on $\{\beta_1, \ldots, \beta_t\}$,
$$\sup_{H_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c_p\} \le \alpha,$$
where $c_p$ (the critical value for testing $H^*_{0I}$) is just the upper $\alpha$ quantile of $\max_{\gamma=1,\ldots,t}|T_\gamma^p|$ estimated by the permutation test.

Therefore, a permutation test that rejects $H^*_{0I}$ when $\max_{i \in I}|\hat{\beta}_i| > c_p$ correctly estimates the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.
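For illustration, here is a hedged Python sketch of Steps 1-5 above for model (3.3); it uses random permutations in place of the full $n!$ enumeration, and the function name and defaults are assumptions rather than part of the original procedure.

```python
import numpy as np

def permutation_maxT_critical_value(X_star, y, alpha=0.05, n_perm=5000, seed=0):
    """Sketch: permute y, refit by OLS, and take the (1 - alpha) quantile of
    max_gamma |beta_hat_gamma^p| as the critical value c_p."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), X_star])      # design matrix with intercept

    def max_abs_beta(v):
        b, *_ = np.linalg.lstsq(X, v, rcond=None)  # (X'X)^{-1} X'v
        return np.abs(b[1:]).max()                 # max_gamma |beta_hat_gamma|

    perm_stats = [max_abs_beta(y[rng.permutation(n)]) for _ in range(n_perm)]
    return max_abs_beta(y), np.quantile(perm_stats, 1 - alpha)
```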

An example of testing $H_{0i}: \beta_i = 0$ ($i = 1, \ldots, k$) using permutation tests is the analysis of quantitative trait loci (QTL) by Churchill and Doerge (1994). In the QTL setting, $X_{ri}$ is a categorical predictor indicating whether the allele linked with the $i$th marker is in a recurrent or non-recurrent state for the $r$th observation. In Churchill and Doerge (1994), the test statistic used in the permutation test for QTL effects is the LOD score of Lander and Botstein (1989). Lander and Botstein (1989) gave the LOD score for testing QTL effects based on a single marker. The regression equation involving only a single marker is simply
$$\phi_i = a + b g_i + \epsilon_i, \tag{3.4}$$
where $\phi_i$ is the phenotype; $g_i$ is an indicator variable (with values 1 and 0) for allele status; and the $\epsilon_i$ are i.i.d. normal random variables with mean 0 and variance $\sigma^2$. Here $a$, $b$ and $\sigma^2$ are unknown parameters, and $b$ denotes the phenotypic effect of a single allele substitution at a putative QTL. To test whether there is a QTL effect, they tested $H_0: b = 0$ using the following LOD score:
$$\mathrm{LOD} = \log_{10}\left(\frac{L(\hat{a}, \hat{b}, \hat{\sigma}^2)}{L(\hat{\mu}_A, 0, \hat{\sigma}^2_{B1})}\right), \tag{3.5}$$
where
$$L(a, b, \sigma^2) = \prod_i z\big(\phi_i - (a + b g_i), \sigma^2\big).$$
Here, $z$ is the probability density of a normal distribution with mean 0 and variance $\sigma^2$; $(\hat{a}, \hat{b}, \hat{\sigma}^2)$ are maximum likelihood estimates (MLEs); and $(\hat{\mu}_A, 0, \hat{\sigma}^2_{B1})$ are the constrained MLEs under the null hypothesis $H_0: b = 0$. The MLEs for $a$, $b$, $\sigma^2$, $\mu_A$ and $\sigma^2_{B1}$ are
$$\hat{a} = \bar{\phi} - \hat{b}\bar{g}, \qquad \hat{b} = \frac{\sum_{i=1}^{n}(\phi_i - \bar{\phi})(g_i - \bar{g})}{\sum_{i=1}^{n}(g_i - \bar{g})^2}, \qquad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(\phi_i - \hat{a} - \hat{b}g_i)^2}{n},$$
$$\hat{\mu}_A = \bar{\phi}, \qquad \hat{\sigma}^2_{B1} = \frac{\sum_{i=1}^{n}(\phi_i - \bar{\phi})^2}{n}.$$
The form of the LOD score shows that the test used by Lander and Botstein (1989) is a likelihood ratio test, which is equivalent to the t test for testing a single null hypothesis. When the sample size $n$ is large enough, the LOD score is asymptotically distributed as $\frac{1}{2}(\log_{10}e)\chi^2$, where $\chi^2$ denotes the $\chi^2$ distribution with 1 degree of freedom. Therefore, the conclusion of Theorem 3.3 still holds when the LOD score is used as the test statistic for testing the QTL effect based on a single marker.

It is interesting to observe that the treatment sum of squares (SSTreatment, denoted by $\hat{\sigma}^2_{exp}$ in Lander and Botstein's paper) happens to be proportional to $\hat{b}^2$ for


the regression model involving only a single marker:
$$\hat{\sigma}^2_{exp} = \hat{\sigma}^2_{B1} - \hat{\sigma}^2_{res} = \frac{\sum_{i=1}^{n}(\phi_i - \bar{\phi})^2}{n} - \frac{\sum_{i=1}^{n}(\phi_i - \hat{a} - \hat{b}g_i)^2}{n}$$
$$= \frac{\sum_{i=1}^{n}(\phi_i - \bar{\phi} - \phi_i + \hat{a} + \hat{b}g_i)(\phi_i - \bar{\phi} + \phi_i - \hat{a} - \hat{b}g_i)}{n}$$
$$= \frac{2\hat{b}\sum_{i=1}^{n}(\phi_i - \bar{\phi})(g_i - \bar{g}) - \hat{b}^2\sum_{i=1}^{n}(g_i - \bar{g})^2}{n}$$
$$= \frac{2\hat{b}^2\sum_{i=1}^{n}(g_i - \bar{g})^2 - \hat{b}^2\sum_{i=1}^{n}(g_i - \bar{g})^2}{n} = \frac{\hat{b}^2\sum_{i=1}^{n}(g_i - \bar{g})^2}{n} = \frac{\hat{b}^2}{4}.$$

The last equality holds because in the backcross (B1) population there are equal numbers of recurrent and non-recurrent states and $(g_i - \bar{g})^2 = (\pm\frac{1}{2})^2 = \frac{1}{4}$. Under the assumption of complete co-dominance and no epistasis, we have $\hat{b} = \frac{1}{2}\hat{\delta}$, where $\delta$ denotes the phenotypic effect of a QTL. Therefore, the variance explained by the QTL can be written as $\sigma^2_{exp} = \delta^2/16$, as presented by Lander and Botstein (1989), since $\hat{\sigma}^2_{exp}$ and $\hat{b}$ are unbiased estimators of $\sigma^2_{exp}$ and $b$.
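For concreteness, the following Python sketch evaluates the single-marker LOD score (3.5) by plugging the MLEs above into the normal likelihood ratio; the closed form in the comment follows from the normal MLEs, and the function name is illustrative.

```python
import numpy as np

def lod_single_marker(phi, g):
    """Sketch of the LOD score (3.5): phi is the phenotype vector, g the 0/1
    allele indicator. With normal errors and MLE variances, the LOD reduces
    to (n/2) * log10(sigma_hat_B1^2 / sigma_hat^2)."""
    n = len(phi)
    b_hat = np.sum((phi - phi.mean()) * (g - g.mean())) / np.sum((g - g.mean())**2)
    a_hat = phi.mean() - b_hat * g.mean()
    s2_full = np.mean((phi - a_hat - b_hat * g)**2)   # sigma_hat^2
    s2_null = np.mean((phi - phi.mean())**2)          # sigma_hat_B1^2
    return 0.5 * n * np.log10(s2_null / s2_full)
```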

3.3.2 Pre-pivot resampling method

The pre-pivot resampling method is a bootstrap method. The bootstrap, first introduced by Efron (1979), is another resampling technique commonly used to estimate unknown distributions of test statistics. It approximates the data-generating distribution by the empirical distribution of the data, and then resamples the data with replacement to obtain an estimate of the test statistic's distribution that is asymptotically consistent.

The pre-pivot resampling method resamples the residuals of the fixed-effects general linear model with replacement, and estimates the joint distribution of the test statistics $\{T_1, \ldots, T_t\} = \{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$ by the following steps:

Step 1. Get the estimated residuals $\hat{\epsilon}$ from the fixed-effects general linear model, i.e., $\hat{\epsilon} = y - X\hat{\beta}$;
Step 2. Check whether $E(\hat{\epsilon}) = 0$. If the mean is not equal to 0, center the residuals at the mean, i.e., compute $\hat{\epsilon}_i - \hat{\mu}_n$, where $\hat{\mu}_n = (1/n)\sum_{i=1}^{n}\hat{\epsilon}_i$;
Step 3. Resample the $n$ centered residuals with replacement to get the bootstrap residuals $\epsilon^*_1, \ldots, \epsilon^*_n$;
Step 4. Get the resampled $y^*$ values by $y^* = X\hat{\beta} + \epsilon^*$, and calculate the resampled ordinary least squares estimates $\hat{\beta}^* = (X'X)^{-1}X'y^*$;
Step 5. Repeat Step 3 and Step 4 $B$ ($B \le n^n$) times.

Then the joint distribution of $\sqrt{n}(\hat{\beta}^* - \hat{\beta})$ is asymptotically consistent for the joint distribution of $\sqrt{n}(\hat{\beta} - \beta)$, provided $n$ is large and $n \cdot \mathrm{trace}(X'X)^{-1}$ is small, as shown in Theorem 3.4. Suppose
$$\frac{1}{n}X'X \to V,$$
which is positive definite. If we also assume that the elements of $X$ are uniformly small compared to $\sqrt{n}$, then $\sqrt{n}(\hat{\beta} - \beta)$ is asymptotically normal with mean 0 and variance-covariance matrix $\sigma^2 V^{-1}$. To satisfy this condition, the sample sizes $n_\gamma$ ($\gamma = 1, \ldots, r$) in the treatment groups should go to infinity at the same rate; i.e., for a sequence of sample sizes $(n_{\nu 1}, \ldots, n_{\nu r})$, $\nu = 1, 2, \ldots$, with total sample size $N_\nu = n_{\nu 1} + \cdots + n_{\nu r}$, we require
$$n_{\nu\gamma} \to \lambda_\gamma N_\nu \quad \text{as } \nu \to \infty,$$
where $\sum_\gamma \lambda_\gamma = 1$ and $\lambda_\gamma > 0$ (Lehmann, 1999).

Theorem 3.4. In the fixed-effects general linear model (3.3), assume
(a) the errors are i.i.d., and
(b) $\frac{1}{n}X'X \to V$, which is positive definite.
Then the joint distribution of the test statistics $\{T_1, \ldots, T_t\}$, estimated by the pre-pivot resampling method, converges to the true joint distribution of the test statistics asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

Proof. Let $\tilde{\rho}_r$ be the Mallows metric on $F_{r,s} = \{G \in F_{R^s} : \int \|x\|^r\,dG(x) < \infty\}$. The Mallows distance between two distributions $H$ and $G$ in $F_{r,s}$ is defined as
$$\tilde{\rho}_r(H, G) = \inf_{\tau_{X,Y}}\big(E\|X - Y\|^r\big)^{1/r},$$
where $\tau_{X,Y}$ is the collection of all possible joint distributions of the pairs $(X, Y)$ whose marginal distributions are $H$ and $G$, respectively. For random variables $U$ and $V$ with distributions $H$ and $G$ in $F_{r,s}$, we define $\tilde{\rho}_r(U, V) = \tilde{\rho}_r(H, G)$. Convergence in $\tilde{\rho}_r$ is stronger than convergence in distribution. Bickel and Freedman (1981) proved that the Mallows distance between a distribution $F \in F_{r,p}$ and its empirical distribution $F_n$ converges to 0 a.s., i.e.,
$$\tilde{\rho}_r(F_n, F) \overset{a.s.}{\to} 0. \tag{3.6}$$
Freedman (1981) also proved that
$$\tilde{\rho}_2(\hat{F}_n, F_n) \overset{a.s.}{\to} 0, \tag{3.7}$$

where $\hat{F}_n$ is the empirical distribution estimated by the bootstrap method. In the fixed-effects general linear model, $\hat{F}_n$ is the estimated empirical distribution of the centered residuals $\hat{\epsilon}_1, \ldots, \hat{\epsilon}_n$.

Next, we express $\sqrt{n}(\hat{\beta}^* - \hat{\beta})$ and $\sqrt{n}(\hat{\beta} - \beta)$ in terms of the residuals:
$$\sqrt{n}(\hat{\beta}^* - \hat{\beta}) = \sqrt{n}\big((X'X)^{-1}X'Y^* - \hat{\beta}\big) = \sqrt{n}\big((X'X)^{-1}X'(X\hat{\beta} + \epsilon^*) - \hat{\beta}\big) = \sqrt{n}\big(\hat{\beta} + (X'X)^{-1}X'\epsilon^* - \hat{\beta}\big) = \sqrt{n}\,(X'X)^{-1}X'\epsilon^*.$$
Similarly, we have
$$\sqrt{n}(\hat{\beta} - \beta) = \sqrt{n}\,(X'X)^{-1}X'\epsilon.$$
Therefore,
$$\tilde{\rho}_2\big(\sqrt{n}(\hat{\beta}^* - \hat{\beta}),\, \sqrt{n}(\hat{\beta} - \beta)\big) = \tilde{\rho}_2\big(\sqrt{n}(X'X)^{-1}X'\epsilon^*,\, \sqrt{n}(X'X)^{-1}X'\epsilon\big) \le \sqrt{n \cdot \mathrm{trace}\{X'X\}^{-1} \cdot \tilde{\rho}_2(\hat{F}_n, F)^2} \le \sqrt{2n \cdot \mathrm{trace}\{X'X\}^{-1} \cdot \big(\tilde{\rho}_2(\hat{F}_n, F_n)^2 + \tilde{\rho}_2(F_n, F)^2\big)} = o(1) \;\; a.s. \tag{3.8}$$
The first inequality holds by Theorem 2.1 in Freedman (1981). The factor $n \cdot \mathrm{trace}\{X'X\}^{-1} = O(1)$ by assumption (b); the terms $\tilde{\rho}_2(\hat{F}_n, F_n)^2$ and $\tilde{\rho}_2(F_n, F)^2$ go to zero a.s. by (3.7) and (3.6), respectively. Therefore, $\tilde{\rho}_2(\sqrt{n}(\hat{\beta}^* - \hat{\beta}), \sqrt{n}(\hat{\beta} - \beta))$ is $o(1)$ a.s., showing the strong $\tilde{\rho}_2$-consistency of $\sqrt{n}(\hat{\beta}^* - \hat{\beta})$ for $T_n = \hat{\beta}$.

Thus, the joint distribution of the test statistics $\{T_1, \ldots, T_t\}$, estimated by the pre-pivot resampling method, converges asymptotically to the true joint distribution of the test statistics.

The test statistic for testing $H_{0\{12\cdots t\}}$ is $\max_{\gamma=1,\ldots,t}|T_\gamma|$. Next, we show that the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the pre-pivot resampling method, converges asymptotically to the true distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$.

Theorem 3.5. In the fixed-effects general linear model (3.3), assume
(i) the errors are i.i.d., and
(ii) $\frac{1}{n}X'X \to V$, which is positive definite.
Then the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the pre-pivot resampling method, converges to the true distribution of the test statistic asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

Proof. Under assumption (i), $\max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$ is clearly a continuous function of $\{\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_t\}$. By Theorem 3.4, the joint distribution of the test statistics $\{T_1, \ldots, T_t\}$ estimated by the pre-pivot resampling method has strong $\tilde{\rho}_2$-consistency under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$. Thus, by the continuous mapping theorem, we have
$$\tilde{\rho}_2\big(\sqrt{n}(\hat{M}^* - \hat{M}),\, \sqrt{n}(\hat{M} - M)\big) \overset{a.s.}{\to} 0, \tag{3.9}$$
where $M = \max_{\gamma=1,\ldots,t}|\beta_\gamma|$, $\hat{M} = \max_{\gamma=1,\ldots,t}|\hat{\beta}_\gamma|$, and $\hat{M}^* = \max_{\gamma=1,\ldots,t}|\hat{\beta}^*_\gamma|$. Therefore, the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the pre-pivot resampling method, converges to the true distribution of the test statistic asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

3.3.3 Post-pivot resampling method

Similar to the permutation test, the post-pivot resampling method also resamples the observed data. However, it resamples with replacement and centers and/or

scales the resampled test statistics to estimate the test statistic's null distribution. To estimate the joint distribution of the test statistics $\{T_1, \ldots, T_t\} = \{\hat{\beta}_1, \ldots, \hat{\beta}_t\}$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$, the post-pivot resampling method proceeds as follows:

Step 1. Resample the data vector $y = (Y_1, \ldots, Y_n)'$ with replacement to get a bootstrap response vector $y^{\sharp} = (Y_1^{\sharp}, \ldots, Y_n^{\sharp})'$;
Step 2. Calculate the resampled test statistics $\hat{\beta}^{\sharp} = (\hat{\beta}_1^{\sharp}, \ldots, \hat{\beta}_k^{\sharp})$ based on $(X'X)^{-1}X'y^{\sharp}$;
Step 3. Repeat step 1 and step 2 $B$ ($B \le n^n$) times;
Step 4. Center the resampled test statistics at the sample average, i.e., $\hat{\beta}_c^{\sharp} = \hat{\beta}^{\sharp} - E_{G_n}(\hat{\beta}^{\sharp})$, where $G_n$ denotes the empirical distribution of $y$ and $G$ denotes the true distribution of $y$.

Then the joint distribution of $\sqrt{n}(\hat{\beta}^{\sharp} - E_{G_n}(\hat{\beta}^{\sharp}))$ converges asymptotically to the joint distribution of $\sqrt{n}(\hat{\beta} - \beta)$, provided $n$ is large and $n \cdot \mathrm{trace}(X'X)^{-1}$ is small. The limiting distribution of $\sqrt{n}(\hat{\beta} - \beta)$ is normal with mean 0 and variance-covariance matrix $\sigma^2 V^{-1}$.

Theorem 3.6. In the fixed-effects general linear model (3.3), assume
(a) the errors are i.i.d., and
(b) $\frac{1}{n}X'X \to V$, which is positive definite.
Then the joint distribution of the test statistics $\{T_1, \ldots, T_t\}$, estimated by the post-pivot resampling method, converges to the true joint distribution of the test statistics asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

Proof. First, we introduce two properties of the Mallows distance.




(P1) Let $U$ and $V$ be random vectors with $E(U_i) = E(V_i)$ ($i = 1, \ldots, n$), and let $A$ be an $m \times n$ matrix of scalars. Then
$$\tilde{\rho}_2(AU, AV)^2 \le \mathrm{trace}(AA') \cdot \tilde{\rho}_2(U_i, V_i)^2.$$
(P2) If $E\|U\|^2 < \infty$ and $E\|V\|^2 < \infty$, then
$$[\tilde{\rho}_2(U, V)]^2 = [\tilde{\rho}_2(U - EU, V - EV)]^2 + \|EU - EV\|^2.$$
In the post-pivot resampling method,
$$AU = \hat{\beta}^{\sharp} = (X'X)^{-1}X'Y^{\sharp}, \qquad AV = \hat{\beta} = (X'X)^{-1}X'Y,$$
$$EU = E(Y^{\sharp}) = X\beta, \qquad EV = E(Y) = X\beta.$$
By (3.6), $E_{G_n}(\hat{\beta}^{\sharp}) \overset{a.s.}{\to} E_G(\hat{\beta}^{\sharp}) = \beta$, and $E\|Y^{\sharp}\|^2 < \infty$ and $E\|Y\|^2 < \infty$. Therefore,
$$\tilde{\rho}_2\big(\sqrt{n}(\hat{\beta}^{\sharp} - E_{G_n}(\hat{\beta}^{\sharp})),\, \sqrt{n}(\hat{\beta} - \beta)\big) = \tilde{\rho}_2\big(\sqrt{n}(\hat{\beta}^{\sharp} - E_G(\hat{\beta}^{\sharp})),\, \sqrt{n}(\hat{\beta} - \beta)\big) = \tilde{\rho}_2\big(\sqrt{n}(\hat{\beta}^{\sharp} - \beta),\, \sqrt{n}(\hat{\beta} - \beta)\big) \le \sqrt{[\tilde{\rho}_2(\sqrt{n}\hat{\beta}^{\sharp}, \sqrt{n}\hat{\beta})]^2} = \sqrt{[\tilde{\rho}_2(\sqrt{n}(X'X)^{-1}X'Y^{\sharp}, \sqrt{n}(X'X)^{-1}X'Y)]^2} \le \sqrt{n \cdot \mathrm{trace}\{X'X\}^{-1} \cdot \tilde{\rho}_2(G_n, G)^2} = o(1) \;\; a.s. \tag{3.10}$$
By (P2), the first inequality holds. By (P1), the second inequality holds. From assumption (b) and (3.6), we get the last equality. Therefore, we have shown the strong $\tilde{\rho}_2$-consistency of $\sqrt{n}(\hat{\beta}^{\sharp} - E_{G_n}(\hat{\beta}^{\sharp}))$ for $\sqrt{n}(\hat{\beta} - \beta)$.

Thus, the joint distribution of the test statistics $\{T_1, \ldots, T_t\}$, estimated by the post-pivot resampling method, converges to the true joint distribution of the test statistics asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$. Using a similar proof to that for the pre-pivot resampling method, we can show that the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the post-pivot resampling method, converges to the true distribution of the test statistic asymptotically under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$.

3.4 Estimating critical values for strong control of FWER

In the previous section (3.3), we discussed procedures that use the Partitioning Principle to achieve strong control of FWER for testing multiple null hypotheses. According to the partitioning principle, each partitioning null hypothesis is tested at level $\alpha$. For a typical partitioning null hypothesis $H^*_{0\{12\cdots t\}}$, we need $\sup_{H_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c\} \le \alpha$ to control FWER strongly.

We also proved that the null distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$ can be estimated either exactly or asymptotically using the resampling methods. In this section, we focus on estimating the critical value for a typical partitioning null hypothesis based on the estimated null distribution of the test statistic.

3.4.1 Permutation tests

Let $G$ denote the cumulative distribution function of $\max_{\gamma=1,\ldots,t}|T_\gamma|$ and $\hat{G}$ denote the cumulative distribution function of $\max_{\gamma=1,\ldots,t}|T^p_\gamma|$. The following theorem gives the critical value estimated by the permutation test for testing $H^*_{0\{12\cdots t\}}$.

Theorem 3.7. The critical value determined by the permutation test for testing $H^*_{0\{1\cdots t\}}$ at level $\alpha$ is given by
$$c_p = \hat{G}^{-1}(1 - \alpha). \tag{3.11}$$

Proof. In previous sections, we showed that the test statistic's distribution estimated by the permutation test is exactly the same as the true test statistic's distribution under the null hypothesis $H_{0\{1\cdots t\}}$. Therefore, $\hat{G} = G$ under the null hypothesis $H^*_{0\{1\cdots t\}}$.

For the fixed-effects general linear model (3.3) with i.i.d. errors, when $H^*_{0\{1\cdots t\}}$ is true, the joint distribution of $\hat{\beta}_i$ ($i \in \{1\cdots t\}$) depends only on $\beta_i$ ($i \in \{1\cdots t\}$), not on $\beta_j$ ($j \notin \{1\cdots t\}$). Thus, we can estimate the critical value from the test statistic's permutation distribution as follows:
$$\sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c_p\} = \sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T^p_\gamma| > c_p\} = P_{H_{0\{12\cdots t\}}}\{\max_{\gamma=1,\ldots,t}|T^p_\gamma| > c_p\} \le \alpha$$
$$\Leftrightarrow 1 - \hat{G}(c_p) \le \alpha \;\Leftrightarrow\; \hat{G}(c_p) \ge 1 - \alpha \;\Leftrightarrow\; c_p = \hat{G}^{-1}(1 - \alpha). \tag{3.12}$$

Therefore, the critical value determined by the permutation test is just the $(1-\alpha)$ quantile of the $\max_{\gamma=1,\ldots,t}|T^p_\gamma|$ distribution (the $\max_{\gamma=1,\ldots,t}|T_\gamma|$ distribution under the null hypothesis $H^*_{0\{1\cdots t\}}$).


3.4.2 Pre-pivot resampling method

The pre-pivot resampling method controls FWER asymptotically. We showed that the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the pre-pivot resampling method, converges asymptotically to the true distribution of the test statistic under a typical partitioning null hypothesis $H^*_{0\{1\cdots t\}}$.

Let $G^*$ denote the cumulative distribution function of $\max_{\gamma=1,\ldots,t}|T^*_\gamma| = \max_{\gamma=1,\ldots,t}|\hat{\beta}^*_\gamma|$. The critical value estimated by the pre-pivot resampling method for testing $H^*_{0\{1\cdots t\}}$ at level $\alpha$ is given in Theorem 3.8.

Theorem 3.8. The critical value determined by the pre-pivot resampling method for asymptotic control of the Type I error rate at level $\alpha$ for testing $H^*_{0\{1\cdots t\}}$ is given by
$$c^* = G^{*-1}(1 - \alpha). \tag{3.13}$$

Proof.
$$\sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c^*\} \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty}\sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T^*_\gamma| > c^*\} \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty} P_{H_{0\{12\cdots t\}}}\{\max_{\gamma=1,\ldots,t}|T^*_\gamma| > c^*\} \le \alpha$$
$$\Leftrightarrow \lim_{n\to\infty}\big(1 - G^*(c^*)\big) \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty} G^*(c^*) \ge 1 - \alpha \;\Rightarrow\; c^* = G^{*-1}(1 - \alpha) \text{ for asymptotic control.} \tag{3.14}$$

The first equivalence holds due to the asymptotic convergence of the $\max_{\gamma=1,\ldots,t}|T^*_\gamma|$ distribution estimated by the pre-pivot resampling method to the true $\max_{\gamma=1,\ldots,t}|T_\gamma|$ distribution. The joint distribution of $\hat{\beta}_i$ ($i \in \{1\cdots t\}$) depends only on $\beta_i$ ($i \in \{1\cdots t\}$), not on $\beta_j$ ($j \notin \{1\cdots t\}$), which allows the supremum to be replaced by $P_{H_{0\{12\cdots t\}}}$. Thus, the critical value determined by the pre-pivot resampling method is just the $(1-\alpha)$ quantile of the $\max_{\gamma=1,\ldots,t}|T^*_\gamma|$ distribution.

3.4.3 Post-pivot resampling method

Similar to the pre-pivot resampling method, the post-pivot resampling method controls FWER only asymptotically. We have shown that the distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the post-pivot resampling method, converges asymptotically to the true distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$ under a typical partitioning null hypothesis $H^*_{0\{1\cdots t\}}$.

Let $G^{\sharp}$ denote the cumulative distribution function of $\max_{\gamma=1,\ldots,t}|T^{\sharp}_\gamma| = \max_{\gamma=1,\ldots,t}|\hat{\beta}^{\sharp}_\gamma|$. The theorem below gives the critical value estimated by the post-pivot resampling method for asymptotic Type I error rate control.

Theorem 3.9. The critical value determined by the post-pivot resampling method for controlling the Type I error rate asymptotically at level $\alpha$ for testing $H^*_{0\{1\cdots t\}}$ is given by
$$c^{\sharp} = G^{\sharp -1}(1 - \alpha). \tag{3.15}$$

Proof.
$$\sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T_\gamma| > c^{\sharp}\} \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty}\sup_{H^*_{0\{12\cdots t\}}} P\{\max_{\gamma=1,\ldots,t}|T^{\sharp}_\gamma| > c^{\sharp}\} \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty} P_{H_{0\{12\cdots t\}}}\{\max_{\gamma=1,\ldots,t}|T^{\sharp}_\gamma| > c^{\sharp}\} \le \alpha$$
$$\Leftrightarrow \lim_{n\to\infty}\big(1 - G^{\sharp}(c^{\sharp})\big) \le \alpha \;\Leftrightarrow\; \lim_{n\to\infty} G^{\sharp}(c^{\sharp}) \ge 1 - \alpha \;\Rightarrow\; c^{\sharp} = G^{\sharp-1}(1 - \alpha) \text{ for asymptotic control.} \tag{3.16}$$

Therefore, the critical value determined by the post-pivot resampling method is the $(1-\alpha)$ quantile of the $\max_{\gamma=1,\ldots,t}|T^{\sharp}_\gamma|$ distribution for controlling the Type I error rate at level $\alpha$ asymptotically.

3.5 Shortcuts of partitioning tests using resampling methods

According to the Partitioning Principle, for testing $k$ null hypotheses we need to test $2^k - 1$ partitioning null hypotheses. When $k$ is very large, such as thousands or even millions in microarray data analysis, the partitioning tests become computationally impracticable.

Under certain conditions, there are shortcuts for partitioning tests that reduce the number of tests from $2^k - 1$ to at most $k$. The reason partitioning tests have shortcuts is that the rejection of some $H^*_{0I}$'s can cause the rejection of many other $H^*_{0I}$'s without actual testing.

The shortcuts are usually in the form of stepwise tests (step-down or step-up tests). For step-down shortcuts, the test statistic for each partitioning hypothesis $H^*_{0I}$ needs to be of maxT type, and the decision rule is: reject $H^*_{0I}$ if $\max_{i\in I} T_i > c_I$. Xu and Hsu (2007) gave three conditions for step-down shortcuts of partitioning tests with gFWER control. A more precise set of sufficient conditions for a step-down shortcut to be valid is as follows (Calian et al., 2008):

D1: The test for $H^*_{0I}$ is of the form: reject $H^*_{0I}$ if $\max_{i\in I} T_i > c_I$;
D2: $\sup_{H^*_{0I}} P\{\max_{i\in I} T_i > c_I\} \le \alpha$;
D3: The values of the test statistics $T_i$, $i = 1, \ldots, k$, are not re-computed for different $H^*_{0I}$;
D4: The critical values $c_I$ have the property that if $J \subset I$ then $c_J \le c_I$.

Condition D2 controls the FWER strongly for shortcuts of partitioning tests. Conditions D1, D3 and D4 make the partitioning tests computationally practicable.

Let $[1], [2], \ldots, [k]$ be random indices for $1, 2, \ldots, k$ such that the test statistics $T_1, T_2, \ldots, T_k$ are ordered $T_{[1]} \le T_{[2]} \le \cdots \le T_{[k]}$.

If $T_{[k]} > c_{\{[1],\ldots,[k]\}}$, all partitioning null hypotheses $H^*_{0J}$ with $[k] \in J \subseteq \{[1], \ldots, [k]\}$ are rejected. There are $2^{k-1}$ such null hypotheses $H^*_{0J}$. Thus, $H_{0[k]}$ can be rejected. Otherwise, if $T_{[k]} \le c_{\{[1],\ldots,[k]\}}$, one can accept $H_{0[k]}$ and stop.

If $T_{[k-1]} > c_{\{[1],\ldots,[k-1]\}}$, all partitioning null hypotheses $H^*_{0J}$ with $[k-1] \in J \subseteq \{[1], \ldots, [k-1]\}$ are rejected. There are $2^{k-2}$ such null hypotheses $H^*_{0J}$. Thus, $H_{0[k-1]}$ can be rejected. Otherwise, if $T_{[k-1]} \le c_{\{[1],\ldots,[k-1]\}}$, one can accept $H_{0[k-1]}$ and stop.

We continue this process until $T_{[1]}$ is compared with $c_{\{[1]\}}$ for testing $H_{0[1]}$. There is only $2^0 = 1$ null hypothesis in this step. Altogether, over the $k$ steps there are $2^0 + 2^1 + 2^2 + \cdots + 2^{k-1} = 2^k - 1$ null hypotheses in the step-down shortcut, which is exactly the complete set of partitioning null hypotheses.

For example, suppose we want to test four null hypotheses $H_{0i}: \theta_i = 0$ ($i = 1, 2, 3, 4$) using the test statistic values $T_1, T_2, T_3$ and $T_4$ calculated from the data; we need to test $2^4 - 1 = 15$ partitioning null hypotheses in total. Suppose $T_2 < T_3 < T_4 < T_1$; then $[1] = 2$, $[2] = 3$, $[3] = 4$, and $[4] = 1$. With conditions D1-D4 satisfied, if $T_{[4]} = T_1 > c_{\{1,2,3,4\}}$, then $H^*_{0\{1,2,3,4\}}$ is rejected. We do not need to test $H^*_{0\{1,2,3\}}$, $H^*_{0\{1,2,4\}}$, $H^*_{0\{1,3,4\}}$, $H^*_{0\{1,2\}}$, $H^*_{0\{1,3\}}$, $H^*_{0\{1,4\}}$, and $H^*_{0\{1\}}$, because all these null hypotheses are automatically rejected. There are $2^3 = 8$ partitioning null hypotheses rejected once $T_{[4]} = T_1 > c_{\{1,2,3,4\}}$. All of these rejections lead to the rejection of the null hypothesis $H_{01}: \theta_1 = 0$.

Once the above eight null hypotheses are rejected, we compare $T_{[3]} = T_4$ with $c_{\{[1],[2],[3]\}} = c_{\{2,3,4\}}$. If $T_4 > c_{\{2,3,4\}}$, then $2^2 = 4$ partitioning null hypotheses ($H^*_{0\{2,3,4\}}$, $H^*_{0\{3,4\}}$, $H^*_{0\{2,4\}}$ and $H^*_{0\{4\}}$) are rejected. These rejections lead to the final rejection of the null hypothesis $H_{04}: \theta_4 = 0$.

In step 3, we reject the $2^1 = 2$ partitioning null hypotheses $H^*_{0\{2,3\}}$ and $H^*_{0\{3\}}$ if $T_3 > c_{\{2,3\}}$. This leads to the final rejection of the null hypothesis $H_{03}: \theta_3 = 0$.

In the last step, we only have the $2^0 = 1$ partitioning null hypothesis $H^*_{0\{2\}}$. If $T_2 > c_{\{2\}}$, we reject $H^*_{0\{2\}}$ and finally reject the null hypothesis $H_{02}: \theta_2 = 0$.

By making the above four comparisons, we have actually tested all $8 + 4 + 2 + 1 = 15$ partitioning null hypotheses. Thus, the partitioning tests have the following step-down shortcut when conditions D1-D4 are satisfied (a code sketch follows):

Step 1: If $T_{[k]} > c_{\{[1],\ldots,[k]\}}$, then infer $\theta_{[k]} \notin \Theta_{[k]}$ and go to step 2; otherwise stop.
Step 2: If $T_{[k-1]} > c_{\{[1],\ldots,[k-1]\}}$, then infer $\theta_{[k-1]} \notin \Theta_{[k-1]}$ and go to step 3; otherwise stop.
$\cdots$
Step k: If $T_{[1]} > c_{\{[1]\}}$, then infer $\theta_{[1]} \notin \Theta_{[1]}$ and stop; otherwise stop.

For the fixed-effects general linear model (3.3) with i.i.d. errors, conditions D1-D3 are all satisfied when the test statistics are $T_i = \hat{\beta}_i$ for testing $H_{0i}: \beta_i = 0$ ($i = 1, \ldots, k$). Thus, we will show whether condition D4 is satisfied for each resampling method, and give the corresponding shortcut for each.
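A minimal Python sketch of this step-down shortcut, assuming the critical values have already been computed (for instance, by one of the resampling algorithms below) and satisfy condition D4; names are illustrative.

```python
import numpy as np

def stepdown_shortcut(t_abs, crit):
    """Sketch of the step-down shortcut above. t_abs holds |T_1|, ..., |T_k|;
    crit[j-1] is the critical value c_{[1],...,[j]} for the j smallest ordered
    statistics, so crit must be nondecreasing (condition D4). Returns the
    0-based indices of the rejected hypotheses."""
    order = np.argsort(t_abs)                  # [1], ..., [k]
    rejected = []
    for j in range(len(t_abs) - 1, -1, -1):    # start from T_[k], step down
        if t_abs[order[j]] > crit[j]:
            rejected.append(order[j])          # reject, continue to next step
        else:
            break                              # first acceptance stops the test
    return rejected
```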

3.5.1 Permutation tests

In Section 3.3, for the fixed-effects general linear model (3.3) with i.i.d. errors, we showed that the permutation distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$ is identical to its true distribution under a typical partitioning null hypothesis $H^*_{0\{12\cdots t\}}$.

Next, we show that the critical values estimated by the permutation test satisfy condition D4. Suppose $J \subseteq I$; then
$$\max(|T^p_\gamma|, \gamma \in I) \ge \max(|T^p_\gamma|, \gamma \in J).$$
Therefore, the $(1-\alpha)$ quantile of $\max(|T^p_\gamma|, \gamma \in I)$ is at least as large as the $(1-\alpha)$ quantile of $\max(|T^p_\gamma|, \gamma \in J)$, i.e., $c_J \le c_I$.

In the step-down shortcut, we need to determine $k$ critical values from $k$ permutation distributions of maxT-type test statistics.

In step 1, we determine $c_{\{[1],\ldots,[k]\}}$ from the permutation distribution of $\max_{i=1,\ldots,k}|T_i|$ ($\max_{i=1,\ldots,k}|\hat{\beta}_i|$ in the setting of model (3.3)). Let $\hat{G}_k$ denote the distribution of $\max_{i=1,\ldots,k}|T_i|$ under the partitioning null hypothesis $H^*_{0\{12\cdots k\}}$; then $\hat{c}_{\{[1],\ldots,[k]\}} = \hat{G}_k^{-1}(1-\alpha)$.

In step 2, $c_{\{[1],\ldots,[k-1]\}}$ can be determined from the permutation distribution of $\max_{i=[1],\ldots,[k-1]}|T_i|$ ($\max_{i=[1],\ldots,[k-1]}|\hat{\beta}_i|$). Let $\hat{G}_{k-1}$ denote the distribution of $\max_{i=[1],\ldots,[k-1]}|T_i|$ under the partitioning null hypothesis $H^*_{0\{[1][2]\cdots[k-1]\}}$; then $\hat{c}_{\{[1],\ldots,[k-1]\}} = \hat{G}_{k-1}^{-1}(1-\alpha)$.

Similarly, we can determine the critical values for all $k$ steps. In step $k$, $\hat{c}_{\{[1]\}} = \hat{G}_1^{-1}(1-\alpha)$, where $\hat{G}_1$ denotes the permutation distribution of $|T_{[1]}| = |\hat{\beta}_{[1]}|$.

The above process can be summarized in the following algorithm.
1. Calculate the test statistic values $T_1, T_2, \ldots, T_k$ from the data.
2. Order the test statistic values $|T_i|$ ($i = 1, \ldots, k$) such that $|T_{[1]}| \le |T_{[2]}| \le \cdots \le |T_{[k]}|$.
3. Permute (resample without replacement) the observed data to get the permuted data.
4. Calculate the test statistics $|T^p_{[i]}|$ ($i = 1, \ldots, k$) based on the permuted data.
5. Compute $\max_{i=1,\ldots,k}|T^p_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^p_{[i]}|$, $\ldots$, $\max(|T^p_{[1]}|, |T^p_{[2]}|)$ and $|T^p_{[1]}|$.
6. Repeat Step 3 to Step 5 $n!$ times.
7. The critical values ($c_{\{[1],\ldots,[k]\}}$, $c_{\{[1],\ldots,[k-1]\}}$, $\ldots$, $c_{\{[1]\}}$) determined from the permutation distributions are the $(1-\alpha)$ quantiles of the distributions of $\max_{i=1,\ldots,k}|T^p_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^p_{[i]}|$, $\ldots$, and $|T^p_{[1]}|$ ($\hat{G}_k^{-1}(1-\alpha)$, $\hat{G}_{k-1}^{-1}(1-\alpha)$, $\ldots$, $\hat{G}_1^{-1}(1-\alpha)$), respectively.


3.5.2 Pre-pivot resampling method

For the fixed-effects general linear model, we showed that the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the pre-pivot resampling method, converges asymptotically to its true distribution under a typical partitioning null hypothesis $H^*_{0\{12\cdots t\}}$.

Next, we show that the critical values determined by the pre-pivot resampling method also satisfy condition D4. Suppose $J \subseteq I$; then
$$\max(|T^*_\gamma|, \gamma \in I) \ge \max(|T^*_\gamma|, \gamma \in J).$$
Therefore, the $(1-\alpha)$ quantile of $\max(|T^*_\gamma|, \gamma \in I)$ is at least as large as the $(1-\alpha)$ quantile of $\max(|T^*_\gamma|, \gamma \in J)$, i.e., $c_J \le c_I$.

The critical values in the step-down shortcut can be determined by the pre-pivot resampling method according to the following algorithm.
1. Calculate the test statistics $T_1, T_2, \ldots, T_k$.
2. Order the test statistic values $|T_i|$ ($i = 1, \ldots, k$) such that $|T_{[1]}| \le |T_{[2]}| \le \cdots \le |T_{[k]}|$.
3. Calculate the test statistic values $|T^*_{[i]}|$ ($i = 1, \ldots, k$) using the pre-pivot resampling method according to steps 1-5 in Section 3.3.2.
4. Compute $\max_{i=1,\ldots,k}|T^*_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^*_{[i]}|$, $\ldots$, $\max(|T^*_{[1]}|, |T^*_{[2]}|)$ and $|T^*_{[1]}|$.
5. Repeat Step 3 and Step 4 $B$ ($B \le n^n$) times.
6. The critical values ($c_{\{[1],\ldots,[k]\}}$, $c_{\{[1],\ldots,[k-1]\}}$, $\ldots$, $c_{\{[1]\}}$) determined by the pre-pivot resampling method are the $(1-\alpha)$ quantiles of the distributions of $\max_{i=1,\ldots,k}|T^*_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^*_{[i]}|$, $\ldots$, and $|T^*_{[1]}|$ ($G^{*-1}_k(1-\alpha)$, $G^{*-1}_{k-1}(1-\alpha)$, $\ldots$, $G^{*-1}_1(1-\alpha)$), respectively.

3.5.3 Post-pivot resampling method

In the previous section, we proved the asymptotic convergence of the distribution of the test statistic $\max_{\gamma=1,\ldots,t}|T_\gamma|$, estimated by the post-pivot resampling method, to the true distribution of $\max_{\gamma=1,\ldots,t}|T_\gamma|$ under a typical partitioning null hypothesis $H_{0\{12\cdots t\}}$. Similar to the pre-pivot resampling method, we have $c_J \le c_I$ when $J \subseteq I$ for the post-pivot resampling method. The following algorithm shows the steps for estimating the critical values in the step-down shortcut by the post-pivot resampling method.
1. Calculate the test statistics $T_1, T_2, \ldots, T_k$.
2. Order the test statistic values $|T_i|$ ($i = 1, \ldots, k$) such that $|T_{[1]}| \le |T_{[2]}| \le \cdots \le |T_{[k]}|$.
3. Calculate the test statistics $|T^{\sharp}_{[i]}|$ ($i = 1, \ldots, k$) using the post-pivot resampling method according to steps 1-4 in Section 3.3.3.
4. Compute $\max_{i=1,\ldots,k}|T^{\sharp}_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^{\sharp}_{[i]}|$, $\ldots$, $\max(|T^{\sharp}_{[1]}|, |T^{\sharp}_{[2]}|)$ and $|T^{\sharp}_{[1]}|$.
5. Repeat Step 3 and Step 4 $B$ ($B \le n^n$) times.
6. The critical values ($c_{\{[1],\ldots,[k]\}}$, $c_{\{[1],\ldots,[k-1]\}}$, $\ldots$, $c_{\{[1]\}}$) determined by the post-pivot resampling method are the $(1-\alpha)$ quantiles of the distributions of $\max_{i=1,\ldots,k}|T^{\sharp}_{[i]}|$, $\max_{i=1,\ldots,k-1}|T^{\sharp}_{[i]}|$, $\ldots$, and $|T^{\sharp}_{[1]}|$ ($G^{\sharp-1}_k(1-\alpha)$, $G^{\sharp-1}_{k-1}(1-\alpha)$, $\ldots$, $G^{\sharp-1}_1(1-\alpha)$), respectively.


CHAPTER 4

CONDITIONS FOR SIGNIFICANCE ANALYSIS OF MICROARRAYS (SAM) TO CONTROL THE EMPIRICAL FDR

SAM (Tusher et al. (2001)) is frequently used in the biological sciences to identify genes whose expression levels have changed significantly between different biological states in microarray experiments. The SAM software can be freely downloaded from Stanford University (Chu et al. (2000); http://www-stat.stanford.edu/~tibs/SAM/), which makes SAM a popular method for identifying significantly expressed genes. In SAM, Tusher et al. (2001) used permutation to estimate the empirical FDR (Fdr), which is defined as the ratio of the expected number of falsely rejected null hypotheses to the observed number of rejected null hypotheses. The SAM software aims to control the Fdr at a desired nominal level $\alpha$, such as 5% or 10%.

To identify genes with significantly changed expression between two different biological states, the test statistics used in SAM (Tusher et al. (2001)) are
$$t_i = \frac{\bar{x}_i - \bar{y}_i}{s_i + s_0}, \qquad i = 1, \ldots, k, \tag{4.1}$$

where $\bar{x}_i$ and $\bar{y}_i$ are the average expression levels of the $i$th gene in the two biological states, respectively. The standard error $s_i$ of $\bar{x}_i - \bar{y}_i$ is
$$s_i = \sqrt{\frac{1/m + 1/n}{m + n - 2}\left\{\sum_{j=1}^{m}(x_{ij} - \bar{x}_i)^2 + \sum_{l=1}^{n}(y_{il} - \bar{y}_i)^2\right\}}, \tag{4.2}$$

where $x_{ij}$ is the expression level of the $i$th gene in the $j$th sample of the first biological state ($j = 1, \ldots, m$), and $y_{il}$ is the expression level of the $i$th gene in the $l$th sample of the second biological state. $s_0$ is a small positive constant, chosen to minimize the coefficient of variation.

It was reported in the recent literature, however, that the Fdr is not well controlled by SAM. Pan (2003) found that the permutation-based SAM over-estimates the number of false positives. The simulation studies by Dudoit et al. (2003) showed that the nominal Fdr estimated by SAM was much smaller than the actual Fdr in some simulations, and greater than 1 in others. Xie et al. (2005) also pointed out that SAM tends to overestimate the Fdr. Larsson et al. (2005) used real data examples to argue for caution when using SAM. Zhang (2007) evaluated the SAM R-package SAM 2.20, showed its poor estimation of the Fdr, and pointed out that SAM 2.20 may produce erroneous and conflicting results in certain situations.

There are some misconceptions about the role of the permutation method in SAM's poor performance. For example, Zhang (2007) suggested that one reason for SAM's poor estimation of the Fdr is that the test statistic's null distribution estimated from permutation may not have a mean of zero, which leads to over-dispersed null scores. However, for the SAM procedure with two independent samples of equal size, the permuted test statistics (unstandardized or standardized like the SAM test statistics) under complete enumeration have a mean of zero.

Theorem 4.1. For comparing two independent samples with equal sample sizes using either unstandardized or standardized t-test statistics, permuting the labels of the two groups with a complete enumeration makes the mean of the permuted test statistics equal to zero.

Proof. The reason is intuitive. For a complete enumeration with equal numbers of the two group labels (zeros and ones), each permutation with one set of labels always has its opposite set of labels (with zeros and ones switched). Thus, the positive permuted test statistics and the corresponding negative permuted test statistics cancel each other out (the two label sets always give the same MSE for standardized test statistics), resulting in a zero mean of all permuted test statistics over a complete enumeration.
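A small numerical check of Theorem 4.1 (illustrative only, with hypothetical inputs): enumerating all relabelings for two equal-sized groups and averaging the unstandardized permuted statistics returns zero up to floating-point error.

```python
import numpy as np
from itertools import combinations

def mean_of_permuted_stats(x, y):
    """With equal group sizes, enumerate every relabeling of the pooled
    sample and average the unstandardized statistics xbar - ybar."""
    pooled = np.concatenate([x, y])
    m, total = len(x), len(x) + len(y)
    stats = []
    for grp in combinations(range(total), m):   # complete enumeration of labels
        mask = np.zeros(total, dtype=bool)
        mask[list(grp)] = True
        stats.append(pooled[mask].mean() - pooled[~mask].mean())
    return np.mean(stats)                       # ~0 when len(x) == len(y)

# e.g. mean_of_permuted_stats(np.array([1.2, 0.7, 3.1]), np.array([0.4, 2.2, 1.9]))
```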

In this chapter, we first show, through simulation studies, the discrepancies between the true expected values of order statistics and the expected values estimated by permutation in SAM. Then, we give conditions for the equivalence of the expected number of false rejections estimated by permutation to the true expected number of false rejections in SAM. For simplicity, we use the unstandardized test statistics $t_i = \bar{x}_i - \bar{y}_i$ in SAM. Another reason we choose unstandardized test statistics is to avoid the possibility of near-zero standard errors calculated from some permutation resamples, since such standard errors drive the test statistics toward infinity and lead to invalid results. Finally, we propose a more powerful adaptive two-step procedure that controls the expected number of false rejections at a pre-determined constant $\mu$.

4.1 Introduction to the Significance Analysis of Microarrays (SAM) method

The Significance Analysis of Microarrays (SAM) was first introduced by Tusher et al. (2001) for identifying genes with statistically significant changes in expression by assimilating a set of gene-specific t tests. In SAM, each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Then, a scatter plot of the observed relative difference versus the expected relative difference estimated by permutation is used to select statistically significant genes based on a fixed threshold. The SAM procedure can be summarized as follows, based on the description in Tusher et al. (2001) (a code sketch follows the list).

1. Compute a test statistic $t_i$ for each gene $i$ ($i = 1, \ldots, g$).
2. Compute the order statistics $t_{(i)}$ such that $t_{(1)} \le t_{(2)} \le \cdots \le t_{(g)}$.
3. Perform $B$ permutations of the responses/covariates $y_1, \ldots, y_n$. For each permutation $b$, compute the permuted test statistics $t_{i,b}$ and the corresponding order statistics $t_{(1),b} \le t_{(2),b} \le \cdots \le t_{(g),b}$.
4. From the $B$ permutations, estimate the expected values of the order statistics by $\bar{t}_{(i)} = (1/B)\sum_b t_{(i),b}$.
5. Form a quantile-quantile (Q-Q) plot (SAM plot) of the observed $t_{(i)}$ versus the expected $\bar{t}_{(i)}$.
6. For a given threshold $\Delta$, start at the origin and move up to find the first $i = i_1$ such that $t_{(i)} - \bar{t}_{(i)} > \Delta$. All genes past $i_1$ are called significant positive. Similarly, start at the origin, move down to the left, and find the first $i = i_2$ such that $\bar{t}_{(i)} - t_{(i)} > \Delta$. All genes past $i_2$ are called significant negative. Define the upper cut point $cut_{up}(\Delta) = \min\{t_{(i)} : i \ge i_1\} = t_{(i_1)}$ and the lower cut point $cut_{low}(\Delta) = \max\{t_{(i)} : i \le i_2\} = t_{(i_2)}$.
7. For the given threshold $\Delta$, the expected number of false rejections $E(V)$ is estimated by computing the number of genes with $t_{i,b}$ above $cut_{up}(\Delta)$ or below $cut_{low}(\Delta)$ for each of the $B$ permutations, and averaging these numbers over the $B$ permutations.
8. The threshold $\Delta$ is chosen to control the Fdr ($Fdr = E(V)/r$, where $r$ is the observed number of rejections) under the complete null hypothesis at an acceptable nominal level.
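The following Python sketch mirrors steps 1-7 with the unstandardized statistics $t_i = \bar{x}_i - \bar{y}_i$ used later in this chapter. It simplifies the first-crossing rule of step 6 by taking all genes beyond $\Delta$, and all names and defaults are illustrative assumptions.

```python
import numpy as np

def sam_sketch(x, y, delta, B=500, seed=0):
    """Sketch of SAM: permute group labels, average permuted order statistics
    to get tbar_(i), pick cut points from the threshold delta, and estimate
    E(V) by counting permuted statistics beyond the cuts."""
    rng = np.random.default_rng(seed)
    m, k = x.shape
    pooled = np.vstack([x, y])
    t = np.sort(x.mean(axis=0) - y.mean(axis=0))          # order statistics t_(i)
    perm = np.empty((B, k))
    for b in range(B):
        idx = rng.permutation(len(pooled))
        perm[b] = np.sort(pooled[idx[:m]].mean(axis=0)
                          - pooled[idx[m:]].mean(axis=0))
    tbar = perm.mean(axis=0)                              # expected order stats
    up = t[t - tbar > delta]                              # significant positive
    low = t[tbar - t > delta]                             # significant negative
    cut_up = up.min() if up.size else np.inf
    cut_low = low.max() if low.size else -np.inf
    ev = np.mean(((perm > cut_up) | (perm < cut_low)).sum(axis=1))  # E(V)
    r = up.size + low.size
    return {"E_V": ev, "rejections": r, "Fdr": ev / r if r else 0.0}
```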

4.2 Discrepancies between true expected values of order statistics and expected values estimated by permutation

In this section, we show the situations in which the joint distribution of the test statistics estimated by permutation differs from the true joint distribution in SAM. The difference in joint distributions leads to discrepancies between the true expected values of order statistics and the expected values estimated by permutation.

4.2.1 Effect of unequal variance-covariance matrices and sample sizes

Suppose we want to observe the expression of $k$ genes for $m$ samples in the first biological condition and $n$ samples in the second biological condition. Let $X_l = (X_{l1}, \ldots, X_{lk})$, $l = 1, \ldots, m$, denote the observations of the $k$ genes for the $l$th sample from the first biological condition, and $Y_j = (Y_{j1}, \ldots, Y_{jk})$, $j = 1, \ldots, n$, denote the observations of the same $k$ genes for the $j$th sample from the second biological condition. We assume that both $X_i$ and $Y_j$ follow multivariate normal distributions, i.e., $X_i \overset{i.i.d.}{\sim} MVN_k(\mu_X, \Sigma_X)$ and $Y_j \overset{i.i.d.}{\sim} MVN_k(\mu_Y, \Sigma_Y)$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. Consider testing the null


hypotheses $H_{0\mu}: \mu_X = \mu_Y$ using the test statistics $T = \bar{X} - \bar{Y}$, where
$$T = (T_1, \ldots, T_k) = (\bar{X}_1 - \bar{Y}_1, \ldots, \bar{X}_k - \bar{Y}_k).$$
Under the null hypotheses $H_{0\mu}: \mu_X = \mu_Y$, the distribution of $T$ is
$$T \sim MVN_k\!\left(\mathbf{0},\; \frac{\Sigma_X}{m} + \frac{\Sigma_Y}{n}\right).$$
The permutation distribution of $T$ under the null hypotheses $H_{0\mu}: \mu_X = \mu_Y$ is
$$\sum_{r=0}^{\min(m,n)} \frac{\binom{m}{r}\binom{n}{r}}{\binom{m+n}{m}}\, MVN_k\!\left(\mathbf{0},\; \frac{(m-r)\Sigma_X + r\Sigma_Y}{m^2} + \frac{r\Sigma_X + (n-r)\Sigma_Y}{n^2}\right). \tag{4.3}$$

Thus, if $m \ne n$ and $\Sigma_X \ne \Sigma_Y$, the true and permutation distributions of $T$ under the null hypotheses $H_{0\mu}: \mu_X = \mu_Y$ are different, which leads to the difference between the true expected values of the order statistics and the expected values of the order statistics estimated by permutation.

We conduct simulation studies to show the effect of unequal variance-covariance matrices and sample sizes on the SAM procedure. In the first simulation study, 10,000 sets of random samples are drawn independently from $MVN_{50}(\mu_X = \mathbf{0}, \Sigma_X = I_{50})$ with $m = 3$ for the first biological condition and from $MVN_{50}(\mu_Y = \mathbf{0}, \Sigma_Y = 4 \cdot I_{50})$ with $n = 6$ for the second biological condition. Figure 4.1 shows the Q-Q plot of the true expected values of the order statistics against the expected values of the order statistics estimated by permutation. The Q-Q plot shows the departure of the permutation-estimated expected values of the order statistics from the true expected values.

Let $J$ denote a $k \times k$ matrix with all elements equal to 1. In the second simulation study, 10,000 sets of random samples are drawn from $MVN_{50}(\mu_X = \mathbf{0}, \Sigma_X = 0.1 J_{50} + 0.9 I_{50})$ with $m = 3$ for the first biological condition and from

[Figure 4.1 appears here: a Q-Q plot; x-axis: expectation of ordered test statistics estimated by permutation; y-axis: true expectation of ordered test statistics.]

Figure 4.1: Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal variances and sample sizes. The dashed line in the Q-Q plot is the 45-degree diagonal line.


[Figure 4.2 appears here: a Q-Q plot; x-axis: expectation of ordered test statistics estimated by permutation; y-axis: true expectation of ordered test statistics.]

Figure 4.2: Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal correlations and sample sizes. The dashed line in the Q-Q plot is the 45-degree diagonal line.

$MVN_{50}(\mu_Y = \mathbf{0}, \Sigma_Y = 0.9 J_{50} + 0.1 I_{50})$ with $n = 6$ for the second biological condition, independently. Figure 4.2 shows the Q-Q plot of the true expected values of the order statistics against the expected values estimated by permutation. There is a clear difference between the true expected values of the order statistics and the permutation-estimated expected values, as shown in Figure 4.2.
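A hedged Python sketch of the first simulation's comparison appears below, with smaller replicate counts than the 10,000 used here so that it runs quickly; names and defaults are illustrative.

```python
import numpy as np

def expected_order_stats(m, n, cov_x, cov_y, k=50, reps=1000, B=100, seed=0):
    """Compare the true expected order statistics of T = Xbar - Ybar (Monte
    Carlo over independent datasets) with the permutation-based estimate."""
    rng = np.random.default_rng(seed)
    true_sum = np.zeros(k)
    perm_sum = np.zeros(k)
    for _ in range(reps):
        x = rng.multivariate_normal(np.zeros(k), cov_x, size=m)
        y = rng.multivariate_normal(np.zeros(k), cov_y, size=n)
        true_sum += np.sort(x.mean(axis=0) - y.mean(axis=0))
        pooled = np.vstack([x, y])
        for _ in range(B):                       # permutation estimate per dataset
            idx = rng.permutation(m + n)
            perm_sum += np.sort(pooled[idx[:m]].mean(axis=0)
                                - pooled[idx[m:]].mean(axis=0))
    return true_sum / reps, perm_sum / (reps * B)

# e.g. expected_order_stats(3, 6, np.eye(50), 4 * np.eye(50))  # Figure 4.1 setting
```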


4.2.2 Effect of higher-order cumulants with equal sample sizes

In the previous section (4.2.1), we showed the discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation when two independent multivariate normal distributions have unequal variance-covariance matrices and sample sizes. In this section, we show the effect of skewness and third-order cross cumulants on the SAM procedure when the observations in both biological conditions have independent multivariate lognormal distributions.

Let $X_i = (X_{i1}, \ldots, X_{ik}) \overset{i.i.d.}{\sim} MVLognormal_k(\mu_X, \Sigma_X)$ and $Y_j = (Y_{j1}, \ldots, Y_{jk}) \overset{i.i.d.}{\sim} MVLognormal_k(\mu_Y, \Sigma_Y)$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, where $MVLognormal_k(\mu, \Sigma)$ denotes a $k$-dimensional multivariate lognormal distribution with log-mean vector $\mu$ and log variance-covariance matrix $\Sigma$. To discover the genes whose expression levels are significantly changed between the two biological conditions, one can test the null hypotheses $H_{0\mu}: E(X) = E(Y)$. Consider the following test statistics:
$$T = \bar{X} - \bar{Y}, \tag{4.4}$$
where $T = \{T_1, T_2, \ldots, T_k\}$, $\bar{X} = \{\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k\}$, $\bar{Y} = \{\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k\}$, and
$$T_l = \bar{X}_l - \bar{Y}_l, \qquad l = 1, \ldots, k,$$
where $\bar{X}_l = \sum_{i=1}^{m} X_{il}/m$ and $\bar{Y}_l = \sum_{j=1}^{n} Y_{jl}/n$.

Suppose $\Sigma_X = (\sigma^X_{lp})$ and $\Sigma_Y = (\sigma^Y_{lp})$ ($l = 1, \ldots, k$ and $p = 1, \ldots, k$). Consider the following case:
$$\sigma^X_{lp} = \sigma^Y_{lp} = \sigma_{ll} \;\;\text{if } l = p, \qquad \sigma^X_{lp} \ne \sigma^Y_{lp} \;\;\text{if } l \ne p.$$

The mean vectors of X and Y are respectively X

X

X

X

E(X) = {E(X1 ), . . . , E(Xk )} = {eµ1 +(1/2)σ11 , . . . , eµk +(1/2)σkk } Y

Y

Y

Y

E(Y ) = {E(Y1 ), . . . , E(Yk )} = {eµ1 +(1/2)σ11 , . . . , eµk +(1/2)σkk }. In addition, the variance, covariance, skewness and third order cross cumulants of Xil and Yjl (l = 1, . . . , k) are X +σ X ll

V ar(Xil ) = e2µl

Y

V ar(Yjl ) = e2µl

Y +σll

X

(eσll − 1) Y

(eσll − 1)

X +µX +(1/2)(σ X +σ X ) pp p ll

Cov(Xil , Xip ) = eµl

Y

Cov(Yjl, Yjp ) = eµl

Y Y +µY p +(1/2)(σll +σpp )

X +(2/3)(σ X )2 ll

κ3 (Xil ) = e3µl

Y

κ3 (Yjl ) = e3µl

Y )2 +(2/3)(σll

X

(eσlp − 1) Y

(eσlp − 1)

X 2

X 2

Y 2

Y 2

(e3(σll ) − 3e(σll ) + 2)

(e3(σll ) − 3e(σll ) + 2)

X +µX +µX +(1/2)(σ X +σ X +σ X ) pp ss p s ll

Cum(Xil , Xip , Xis ) = eµl

Y

Cum(Yjl , Yjp, Yjs ) = eµl

Y Y Y Y +µY p +µs +(1/2)(σll +σpp +σss )

X

X

X

X

X

X

(eσlp +σls +σps − eσlp − eσls − eσps + 2) Y

Y

Y

Y

Y

Y

(eσlp +σls +σps − eσlp − eσls − eσps + 2).

When σllX = σllY for l = 1, . . . , k, the mean vectors defined above can be written as X

X

Y

Y

E(X) = {eµ1 +(1/2)σ11 , . . . , eµk +(1/2)σkk } E(Y ) = {eµ1 +(1/2)σ11 , . . . , eµk +(1/2)σkk }. 89

Thus, only when $\sigma^X_{ll} = \sigma^Y_{ll}$ for $l = 1, \ldots, k$ is testing the null hypothesis $H_{0\mu}: E(X) = E(Y)$ equivalent to testing the null hypothesis $H_{0\mu}: \mu_X = \mu_Y$.

Under the null hypothesis $H_{0\mu}: E(X) = E(Y)$, the test statistics $T_l = \bar{X}_l - \bar{Y}_l$, $l = 1, \ldots, k$, have the following first- and second-order cumulants:
$$E(T_l) = E(\bar{X}_l - \bar{Y}_l) = e^{\mu^X_l + (1/2)\sigma^X_{ll}} - e^{\mu^Y_l + (1/2)\sigma^Y_{ll}} = 0,$$
$$Var(T_l) = \frac{1}{m}Var(X_{il}) + \frac{1}{n}Var(Y_{jl}) = \frac{1}{m}e^{2\mu^X_l + \sigma^X_{ll}}(e^{\sigma^X_{ll}} - 1) + \frac{1}{n}e^{2\mu^Y_l + \sigma^Y_{ll}}(e^{\sigma^Y_{ll}} - 1),$$
$$Cov(T_l, T_p) = \frac{1}{m}Cov(X_{il}, X_{ip}) + \frac{1}{n}Cov(Y_{jl}, Y_{jp}) = \frac{1}{m}e^{\mu^X_l + \mu^X_p + (1/2)(\sigma^X_{ll} + \sigma^X_{pp})}(e^{\sigma^X_{lp}} - 1) + \frac{1}{n}e^{\mu^Y_l + \mu^Y_p + (1/2)(\sigma^Y_{ll} + \sigma^Y_{pp})}(e^{\sigma^Y_{lp}} - 1).$$
The third-order cumulants are
$$\kappa_3(T_l) = \frac{1}{m^2}\kappa_3(X_{il}) - \frac{1}{n^2}\kappa_3(Y_{jl}) = \frac{1}{m^2}e^{3\mu^X_l + (3/2)\sigma^X_{ll}}\big(e^{3\sigma^X_{ll}} - 3e^{\sigma^X_{ll}} + 2\big) - \frac{1}{n^2}e^{3\mu^Y_l + (3/2)\sigma^Y_{ll}}\big(e^{3\sigma^Y_{ll}} - 3e^{\sigma^Y_{ll}} + 2\big),$$
$$Cum(T_l, T_p, T_s) = \frac{1}{m^2}Cum(X_{il}, X_{ip}, X_{is}) - \frac{1}{n^2}Cum(Y_{jl}, Y_{jp}, Y_{js}),$$
with the two cross cumulants on the right given by the lognormal expressions above.


If we resample $X_i$, $i = 1, \ldots, m$, and $Y_j$, $j = 1, \ldots, n$, from the pooled sample $\{X_1, \ldots, X_m, Y_1, \ldots, Y_n\}$ and recompute $T = \{T_1, \ldots, T_k\}$ each time from the resampled observation vectors, then, for a given permutation with $r$ ($r \le \min(m, n)$) vectors relabeled, the test statistic $T^r_l = \bar{X}^r_l - \bar{Y}^r_l$ has the following first-, second- and third-order cumulants under the null hypothesis $H_{0\mu}: E(X) = E(Y)$:
$$E(T^r_l) = \frac{E(X_{l1} + \cdots + X_{l(m-r)} + Y_{l1} + \cdots + Y_{lr})}{m} - \frac{E(Y_{l1} + \cdots + Y_{l(n-r)} + X_{l1} + \cdots + X_{lr})}{n} = \frac{m\,e^{\mu^X_l + (1/2)\sigma^X_{ll}}}{m} - \frac{n\,e^{\mu^Y_l + (1/2)\sigma^Y_{ll}}}{n} = 0,$$
$$Var(T^r_l) = \left(\frac{m-r}{m^2} + \frac{r}{n^2}\right)Var(X_{il}) + \left(\frac{r}{m^2} + \frac{n-r}{n^2}\right)Var(Y_{jl}) = \left(\frac{m-r}{m^2} + \frac{r}{n^2}\right)e^{2\mu^X_l + \sigma^X_{ll}}(e^{\sigma^X_{ll}} - 1) + \left(\frac{r}{m^2} + \frac{n-r}{n^2}\right)e^{2\mu^Y_l + \sigma^Y_{ll}}(e^{\sigma^Y_{ll}} - 1),$$
$$Cov(T^r_l, T^r_p) = \left(\frac{m-r}{m^2} + \frac{r}{n^2}\right)e^{\mu^X_l + \mu^X_p + (1/2)(\sigma^X_{ll} + \sigma^X_{pp})}(e^{\sigma^X_{lp}} - 1) + \left(\frac{r}{m^2} + \frac{n-r}{n^2}\right)e^{\mu^Y_l + \mu^Y_p + (1/2)(\sigma^Y_{ll} + \sigma^Y_{pp})}(e^{\sigma^Y_{lp}} - 1),$$
$$\kappa_3(T^r_l) = \left(\frac{m-r}{m^3} - \frac{r}{n^3}\right)\kappa_3(X_{il}) + \left(\frac{r}{m^3} - \frac{n-r}{n^3}\right)\kappa_3(Y_{jl}),$$
$$Cum(T^r_l, T^r_p, T^r_s) = Cum(T_l, T_p, T_s) - r\left(\frac{1}{m^3} + \frac{1}{n^3}\right)\big(Cum(X_{il}, X_{ip}, X_{is}) - Cum(Y_{jl}, Y_{jp}, Y_{js})\big)$$
$$= Cum(T_l, T_p, T_s) - r\left(\frac{1}{m^3} + \frac{1}{n^3}\right)\Big[e^{\mu^X_l + \mu^X_p + \mu^X_s + (1/2)(\sigma^X_{ll} + \sigma^X_{pp} + \sigma^X_{ss})}\big(e^{\sigma^X_{lp} + \sigma^X_{ls} + \sigma^X_{ps}} - e^{\sigma^X_{lp}} - e^{\sigma^X_{ls}} - e^{\sigma^X_{ps}} + 2\big) - e^{\mu^Y_l + \mu^Y_p + \mu^Y_s + (1/2)(\sigma^Y_{ll} + \sigma^Y_{pp} + \sigma^Y_{ss})}\big(e^{\sigma^Y_{lp} + \sigma^Y_{ls} + \sigma^Y_{ps}} - e^{\sigma^Y_{lp}} - e^{\sigma^Y_{ls}} - e^{\sigma^Y_{ps}} + 2\big)\Big].$$

When $\sigma^X_{ll} = \sigma^Y_{ll} = \sigma_{ll}$ for $l = 1, \ldots, k$ (so that $\mu^X_l = \mu^Y_l = \mu_l$ under the null hypothesis $H_{0\mu}: E(X) = E(Y)$), the cumulants of the test statistics $T_l = \bar{X}_l - \bar{Y}_l$, $l = 1, \ldots, k$, simplify to
$$E(T_l) = e^{\mu_l + (1/2)\sigma_{ll}} - e^{\mu_l + (1/2)\sigma_{ll}} = 0,$$
$$Var(T_l) = \left(\frac{1}{m} + \frac{1}{n}\right)e^{2\mu_l + \sigma_{ll}}(e^{\sigma_{ll}} - 1),$$
$$Cov(T_l, T_p) = e^{\mu_l + \mu_p + (1/2)(\sigma_{ll} + \sigma_{pp})}\left[\frac{1}{m}(e^{\sigma^X_{lp}} - 1) + \frac{1}{n}(e^{\sigma^Y_{lp}} - 1)\right],$$
$$\kappa_3(T_l) = \left(\frac{1}{m^2} - \frac{1}{n^2}\right)e^{3\mu_l + (3/2)\sigma_{ll}}\big(e^{3\sigma_{ll}} - 3e^{\sigma_{ll}} + 2\big),$$
$$Cum(T_l, T_p, T_s) = e^{\mu_l + \mu_p + \mu_s + (1/2)(\sigma_{ll} + \sigma_{pp} + \sigma_{ss})}\cdot\left[\frac{1}{m^2}\big(e^{\sigma^X_{lp} + \sigma^X_{ls} + \sigma^X_{ps}} - e^{\sigma^X_{lp}} - e^{\sigma^X_{ls}} - e^{\sigma^X_{ps}} + 2\big) - \frac{1}{n^2}\big(e^{\sigma^Y_{lp} + \sigma^Y_{ls} + \sigma^Y_{ps}} - e^{\sigma^Y_{lp}} - e^{\sigma^Y_{ls}} - e^{\sigma^Y_{ps}} + 2\big)\right].$$

¯ r − Y¯ r , l = 1, . . . , k, have the following The permuted test statistics Tlr = X l l simplified cumulants when σllX = σllY for l = 1, . . . , k under the null hypothesis H0µ : 92

E(X) = E(Y ): E(Xl1 + · · · + Xl(m−r) + Yl1 + · · · + Ylr ) m E(Yl1 + · · · + Yl(n−r) + Xl1 + · · · + Xlr ) − n meµl +(1/2)σll neµl +(1/2)σll = − m n

E(Tlr ) =

=0

Xl1 + · · · + Xl(m−r) + Yl1 + · · · + Ylr m Yl1 + · · · + Yl(n−r) + Xl1 + · · · + Xlr − ) n 1 1 = ( + )e2µl +σll (eσll − 1) m n m−r X r Y r n − r σlp Cov(Tlr , Tpr ) = eµl +µp +(1/2)(σll +σpp ) [( + 2 )(eσlp − 1) + ( 2 + )(e − 1)] 2 2 m n m n V ar(Tlr ) = var(

1 1 2 2 2 − 2 )e3µl +(3/2)σll (e3σll − 3eσll + 2) 2 m n 1 1 Cum(Tlr , Tpr , Tsr ) = Cum(Tl , Tp , Ts ) − r( 3 + 3 ) · [Cum(Xil , Xip , Xis ) m n κ3 (Tlr ) = (

− Cum(Yjl , Yjp, Yjs)] = Cum(Tl , Tp , Ts ) − r( X

X

X

X

1 1 + 3 )eµl +µp +µs +(1/2)(σll +σpp +σss ) 3 m n X

X

Y

Y

Y

Y

Y

Y

· [(eσlp +σls +σps − eσlp − eσls − eσps ) − (eσlp +σls +σps − eσlp − eσls − eσps )]. Direct consequences from the above calculations are: (1) V ar(Tl ) = V ar(Tlr ) when m = n or σllX = σllY . (2) If m = n and σllX = σllY , then Cov(Tl , Tp ) = Cov(Tlr , Tpr ). (3) If m = n and σllX = σllY , then κ3 (Tl ) = κ3 (Tlr ) = 0; If m 6= n, but σllX = σllY , then κ3 (Tl ) = κ3 (Tlr ) 6= 0; If m = n and σllX 6= σllY , then κ3 (Tl ) 6= κ3 (Tlr ). 93

(4) Even when $m = n$ and $\sigma_{ll}^X = \sigma_{ll}^Y$, $Cum(T_l, T_p, T_s) \neq Cum(T_l^r, T_p^r, T_s^r)$ as a consequence of $Cum(X_{il}, X_{ip}, X_{is}) \neq Cum(Y_{jl}, Y_{jp}, Y_{js})$.

Thus, if the two joint distributions have different skewness or different third order cross cumulants, the permutation and true distributions of $\boldsymbol{T}$ differ under the null hypothesis $H_0^{\mu}: E(\boldsymbol{X}) = E(\boldsymbol{Y})$. We will conduct simulation studies to show the discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation when two independent multivariate lognormal distributions have unequal skewness or unequal third order cross cumulants.

In the first simulation study, the two groups have unequal skewness and equal sample sizes. We generate 10,000 sets of random samples for both biological conditions (Group $X$ and Group $Y$) from multivariate lognormal distributions with the following mean vectors and variance-covariance matrices on the logarithm scale:
\[
\boldsymbol{\mu}^X = -0.4 \cdot \boldsymbol{1}_{50}, \quad \boldsymbol{\mu}^Y = 0.2 \cdot \boldsymbol{1}_{50}, \quad
\boldsymbol{\Sigma}^X = 1.6\,\boldsymbol{I}_{50}, \quad \boldsymbol{\Sigma}^Y = 0.4\,\boldsymbol{I}_{50}, \quad
m = 5, \; n = 5.
\]
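The comparison itself can be set up as in the following sketch (our own Python/NumPy illustration of this simulation design, not the original dissertation code; the number of data sets and permutations is reduced from 10,000 for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
k, m, n = 50, 5, 5
mu_x, mu_y = -0.4, 0.2          # log-scale means (chosen so E(X) = E(Y))
s2_x, s2_y = 1.6, 0.4           # log-scale variances: unequal skewness

def t_stats(x, y):
    """T_l = Xbar_l - Ybar_l for each of the k genes; rows are samples."""
    return x.mean(axis=0) - y.mean(axis=0)

n_sets, n_perm = 500, 100       # reduced for speed
true_ord = np.zeros(k)          # true expectation of ordered statistics
perm_ord = np.zeros(k)          # permutation estimate of the same
for _ in range(n_sets):
    x = rng.lognormal(mu_x, np.sqrt(s2_x), size=(m, k))
    y = rng.lognormal(mu_y, np.sqrt(s2_y), size=(n, k))
    true_ord += np.sort(t_stats(x, y))
    pooled = np.vstack([x, y])
    acc = np.zeros(k)
    for _ in range(n_perm):     # relabel rows of the pooled sample
        idx = rng.permutation(m + n)
        acc += np.sort(t_stats(pooled[idx[:m]], pooled[idx[m:]]))
    perm_ord += acc / n_perm
true_ord /= n_sets
perm_ord /= n_sets
# Q-Q style comparison: equal distributions would put these on the diagonal.
print(np.column_stack([perm_ord, true_ord])[[0, k // 2, k - 1]])
```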

Figure 4.3 shows the Q-Q plot of the true expected values of the order statistics versus the expected values of the order statistics estimated by permutation. Clearly, the true expected values of the order statistics differ from the expected values estimated by permutation.

[Figure 4.3: Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal skewness. The dashed line in the Q-Q plot is the 45 degree diagonal line.]

In the second simulation study, we generate 10,000 sets of random samples for both biological conditions (Group $X$ and Group $Y$) from multivariate lognormal distributions with the following mean vectors and variance-covariance matrices on the logarithm scale:
\[
\boldsymbol{\mu}^X = \boldsymbol{1}_{50}, \quad \boldsymbol{\Sigma}^X = 0.1\,\boldsymbol{I}_{50} + 0.9\,\boldsymbol{J}_{50}, \quad
\boldsymbol{\mu}^Y = \boldsymbol{1}_{50}, \quad \boldsymbol{\Sigma}^Y = \boldsymbol{I}_{50}, \quad
m = 5, \; n = 5.
\]
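Drawing from a multivariate lognormal with this exchangeable covariance structure amounts to exponentiating a multivariate normal draw; a minimal sketch (assuming NumPy) is:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m = 50, 5
Sigma_X = 0.1 * np.eye(k) + 0.9 * np.ones((k, k))   # 0.1 I_50 + 0.9 J_50
Sigma_Y = np.eye(k)
mu = np.ones(k)                                     # mu_X = mu_Y = 1_50

# Both covariance matrices have unit diagonal on the log scale, so the
# marginals match; only the cross-covariances (hence the third order
# cross cumulants) differ between the two groups.
X = np.exp(rng.multivariate_normal(mu, Sigma_X, size=m))   # m x k, group X
Y = np.exp(rng.multivariate_normal(mu, Sigma_Y, size=m))   # group Y
```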

The two multivariate lognormal distributions have unequal third order cross cumulants. Figure 4.4 shows the Q-Q plot of the true expected values of the order statistics versus the expected values of the order statistics estimated by permutation. There is a difference between the true expected values of order statistics and the expected values estimated by permutation.

4.3 Conditions for controlling the expected number of false rejections in SAM

We have shown situations in which the expected values of order statistics estimated by permutation differ from the true expected values of order statistics; SAM gives invalid results in those situations. In this section, we derive the conditions under which SAM controls the expected number of false rejections.

To identify significantly differentially expressed genes, the SAM procedure proceeds as follows. Consider testing $H_1, H_2, \ldots, H_k$ using the corresponding test statistics $T_1, T_2, \ldots, T_k$ ($T_i = \bar{X}_i - \bar{Y}_i$). Let $T_{(1)} \le T_{(2)} \le \cdots \le T_{(k)}$ be the ordered test statistics, and let $H_{(i)}$ denote the null hypothesis corresponding to $T_{(i)}$. For a given threshold $\Delta$, the procedure for finding significant positives is: let $i_1$ be the smallest $i$ for which $T_{(i)} - E(T_{(i)}) > \Delta$; then reject all $H_{(i)}$, $i = i_1, i_1 + 1, \ldots, k$.

[Figure 4.4: Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal third order cross cumulants. The dashed line in the Q-Q plot is the 45 degree diagonal line.]

Similarly, the procedure for finding significant negatives is: let $i_2$ be the largest $i$ for which $E(T_{(i)}) - T_{(i)} > \Delta$; then reject all $H_{(i)}$, $i = 1, 2, \ldots, i_2$.

For a given data set, the upper and lower cut-points according to Tusher et al. (2001) and Chu et al. (2000) are defined as
\[
Cut_{up}(\Delta) = \min(t_{(i_1)}, t_{(i_1+1)}, \ldots, t_{(k)}) = t_{(i_1)}
\]
and
\[
Cut_{low}(\Delta) = \max(t_{(1)}, t_{(2)}, \ldots, t_{(i_2)}) = t_{(i_2)},
\]
where $\{t_{(1)}, \ldots, t_{(i_2)}, t_{(i_2+1)}, \ldots, t_{(i_1-1)}, t_{(i_1)}, \ldots, t_{(k)}\}$ are the observed order statistics. Thus, for a given threshold, rejecting $H_{(i)}$ if $T_{(i)} \ge Cut_{up}(\Delta) = t_{(i_1)}$ or $T_{(i)} \le Cut_{low}(\Delta) = t_{(i_2)}$ is equivalent to rejecting $H_{(i)}$ if $T_{(i)} > E(T_{(i_1)}) + \Delta$ or $T_{(i)} < E(T_{(i_2)}) - \Delta$.

Let $S_0 = \{i : H_{(i)} \text{ is true}\}$, so that there are $k_0$ true null hypotheses. Then the number of false rejections $V$ for a given threshold $\Delta$ can be expressed as
\begin{align}
V = \sum_{i \in S_0} I\{T_{(i)} > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)} < E(T_{(i_2)}) - \Delta\}. \tag{4.5}
\end{align}
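The thresholding rule above is easy to state in code. The sketch below is our own Python rendering of the rule as described, not the SAM software's implementation; `sam_cutpoints` is a hypothetical name:

```python
import numpy as np

def sam_cutpoints(t_obs, t_expected, delta):
    """Given observed ordered statistics t_obs (sorted ascending) and the
    null expected values t_expected of the order statistics (e.g. estimated
    by permutation), return the rejections for threshold delta."""
    k = len(t_obs)
    diff = np.asarray(t_obs) - np.asarray(t_expected)
    pos = np.where(diff > delta)[0]          # candidates for significant positives
    i1 = pos.min() if pos.size else None     # smallest i with T_(i) - E(T_(i)) > delta
    neg = np.where(-diff > delta)[0]
    i2 = neg.max() if neg.size else None     # largest i with E(T_(i)) - T_(i) > delta
    reject = np.zeros(k, dtype=bool)
    if i1 is not None:
        reject[i1:] = True                   # reject H_(i), i = i1, ..., k
    if i2 is not None:
        reject[: i2 + 1] = True              # reject H_(i), i = 1, ..., i2
    return reject, i1, i2
```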

For the $b$th permutation, the number of false rejections $V_b$ for the same given threshold $\Delta$ has the following expression:
\begin{align}
V_b = \sum_{i=1}^{k} I\{T_{(i)}^b > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)}^b < E(T_{(i_2)}) - \Delta\}. \tag{4.6}
\end{align}

To simplify the expressions, let
\begin{align*}
p_i &= Pr\{T_{(i)} > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)} < E(T_{(i_2)}) - \Delta\}, \\
p_i^b &= Pr\{T_{(i)}^b > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)}^b < E(T_{(i_2)}) - \Delta\}, \\
\bar{p} &= \frac{\sum_{i \in S_0} p_i}{k_0}, \qquad \bar{p}^b = \frac{\sum_{i=1}^{k} p_i^b}{k}.
\end{align*}
Then
\begin{align}
E(V) &= E\left( \sum_{i \in S_0} I\{T_{(i)} > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)} < E(T_{(i_2)}) - \Delta\} \right) \notag \\
&= \sum_{i \in S_0} Pr\{T_{(i)} > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)} < E(T_{(i_2)}) - \Delta\} = k_0 \bar{p}, \tag{4.7}
\end{align}
and
\begin{align}
E(V_b) &= E\left( \sum_{i=1}^{k} I\{T_{(i)}^b > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)}^b < E(T_{(i_2)}) - \Delta\} \right) \notag \\
&= \sum_{i=1}^{k} Pr\{T_{(i)}^b > E(T_{(i_1)}) + \Delta \text{ or } T_{(i)}^b < E(T_{(i_2)}) - \Delta\} = k \bar{p}^b. \tag{4.8}
\end{align}

Assume all test statistics $T_1, T_2, \ldots, T_k$ are independent and identically distributed with pdf $f_T$ and CDF $F_T$. Let $c_1 = E(T_{(i_1)}) + \Delta$ and $c_2 = E(T_{(i_2)}) - \Delta$. Then the number of false rejections $V$ has mean
\begin{align}
E(V) &= \sum_{i \in S_0} Pr\{T_{(i)} > c_1 \text{ or } T_{(i)} < c_2\} \notag \\
&= \sum_{i \in S_0} \left\{ 1 - \sum_{l=i}^{k_0} \binom{k_0}{l} \left[ (F_T(c_1))^l (1 - F_T(c_1))^{k_0 - l} - (F_T(c_2))^l (1 - F_T(c_2))^{k_0 - l} \right] \right\}. \tag{4.9}
\end{align}

The number of false rejections from the $b$th permutation, $V_b$, has expected value
\begin{align}
E(V_b) &= \sum_{i=1}^{k} Pr\{T_{(i)}^b > c_1 \text{ or } T_{(i)}^b < c_2\} \notag \\
&= \sum_{i=1}^{k} \left\{ 1 - \sum_{l=i}^{k} \binom{k}{l} \left[ (F_{T^b}(c_1))^l (1 - F_{T^b}(c_1))^{k - l} - (F_{T^b}(c_2))^l (1 - F_{T^b}(c_2))^{k - l} \right] \right\}. \tag{4.10}
\end{align}
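For independent statistics, formula (4.9) is straightforward to evaluate numerically. A minimal sketch (assuming NumPy/SciPy; `expected_false_rejections` is our own name) is:

```python
import numpy as np
from math import comb
from scipy.stats import norm

def expected_false_rejections(k0, F_c1, F_c2):
    """E(V) from the order-statistic formula (4.9), for k0 i.i.d. true-null
    test statistics with CDF values F_c1 = F_T(c1) and F_c2 = F_T(c2)."""
    ev = 0.0
    for i in range(1, k0 + 1):               # i indexes the order statistic T_(i)
        s = sum(comb(k0, l) * (F_c1**l * (1 - F_c1)**(k0 - l)
                               - F_c2**l * (1 - F_c2)**(k0 - l))
                for l in range(i, k0 + 1))
        ev += 1.0 - s
    return ev

# Example: k0 = 20 standard normal statistics, cut-offs c1 = 2, c2 = -2;
# the result should equal 20 * P(|T| > 2), about 0.91.
print(expected_false_rejections(20, norm.cdf(2.0), norm.cdf(-2.0)))
```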

If the true and permutation distributions of the test statistics $T$ can be described in terms of their cumulants $\kappa_a(T)$ and $\kappa_a(T^b)$, then the following result can be established based on Theorem 2.2 of Huang et al. (2006):
\begin{align}
\kappa_a(T^b) = \kappa_a(T) - b\left( \frac{1}{m^a} - \frac{(-1)^a}{n^a} \right) \left( \kappa_a(F_X) - \kappa_a(F_Y) \right). \tag{4.11}
\end{align}
Thus, $E(V) = E(V_b)$ only when the following conditions are satisfied: (1) $k_0 = k$ (all the null hypotheses are true), and (2) for the even order cumulants, $m = n$; for the odd order cumulants, $\kappa_a(F_X) = \kappa_a(F_Y)$. The counterexamples given in the previous sections do not satisfy these conditions. Therefore, when these conditions are not satisfied, $E(V)$ cannot be controlled by the SAM procedure.

4.4 An adaptive two-step procedure controlling the expected number of false rejections

Gordon et al. (2007) showed that the Bonferroni procedure can control $E(V)$ at a pre-specified number of false rejections $\gamma$: the procedure compares each observed P-value with $\gamma/k$ and rejects the null hypothesis $H_i$ if $P_i \le \gamma/k$. The simulation studies conducted by Gordon et al. (2007) showed that the Bonferroni

procedure has power equivalent to that of the Benjamini-Hochberg procedure (Benjamini and Hochberg (1995)), with a smaller variance of both the number of true rejections and the total number of rejections. The proof that the Bonferroni procedure controls $E(V)$ is straightforward:
\begin{align}
E(V) = E\left( \sum_{i \in S} I\{P_i \le \gamma/k\} \right) = \sum_{i \in S} Pr\{P_i \le \gamma/k\} \le k_0 \frac{\gamma}{k} \le \gamma, \tag{4.12}
\end{align}
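As a quick illustration of rule (4.12) (a minimal sketch in Python/NumPy; `bonferroni_ev` is our own name), under the complete null with independent uniform P-values the realized average number of rejections stays near $\gamma$:

```python
import numpy as np

def bonferroni_ev(pvalues, gamma):
    """Reject H_i when P_i <= gamma/k, which by (4.12) keeps the expected
    number of false rejections E(V) at or below gamma."""
    p = np.asarray(pvalues)
    return p <= gamma / p.size      # boolean rejection vector

rng = np.random.default_rng(4)
# Average rejections over repeated complete-null data sets: close to gamma = 2.
print(np.mean([bonferroni_ev(rng.uniform(size=500), 2.0).sum()
               for _ in range(4000)]))
```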

where $S$ is the set of indices of the true null hypotheses and the $P_i$, $i \in S$, are P-values uniformly distributed on $[0, 1]$.

A sharp bound is obtained from the Bonferroni procedure described above when all null hypotheses are true ($k_0 = k$). When some null hypotheses are false, the Bonferroni procedure is conservative by a factor of $k_0/k$. Similarly, FDR controlling procedures such as the Benjamini-Hochberg procedure are conservative by a factor of $k_0/k$ when $k_0 < k$. Extensive efforts have been made to estimate $k_0$ from the data in order to sharpen the bound in FDR controlling procedures (Schweder and Spjotvoll (1982), Benjamini and Hochberg (2000), Efron et al. (2001), Storey (2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a), Storey et al. (2004)). Benjamini et al. (2006) reviewed the previous adaptive FDR controlling procedures (which estimate $k_0$ from the data) and further proposed a two-stage linear step-up procedure (TST) as follows:

Step 1. Use the Benjamini-Hochberg linear step-up procedure (Benjamini and Hochberg (1995)) at level $q' = q/(1+q)$. Let $r_1$ be the number of rejected hypotheses. If $r_1 = 0$, do not reject any hypothesis and stop; if $r_1 = k$, reject all $k$ hypotheses and stop; otherwise continue.

Step 2. Let $\hat{k}_0 = k - r_1$, and use the Benjamini-Hochberg linear step-up procedure at level $q^* = q' k / \hat{k}_0$.
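A minimal sketch of the two-stage procedure (our own Python rendering of the steps above; `bh_stepup` and `tst` are hypothetical names) is:

```python
import numpy as np

def bh_stepup(pvalues, q):
    """Benjamini-Hochberg linear step-up at level q; boolean rejections."""
    p = np.asarray(pvalues)
    k = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, k + 1) / k
    reject = np.zeros(k, dtype=bool)
    if below.any():
        r = np.max(np.nonzero(below)[0]) + 1   # largest i with p_(i) <= iq/k
        reject[order[:r]] = True
    return reject

def tst(pvalues, q):
    """Two-stage linear step-up (TST) of Benjamini et al. (2006)."""
    k = len(pvalues)
    q1 = q / (1.0 + q)                  # Step 1: BH at level q' = q/(1+q)
    stage1 = bh_stepup(pvalues, q1)
    r1 = stage1.sum()
    if r1 == 0 or r1 == k:
        return stage1
    k0_hat = k - r1                     # Step 2: BH at level q' k / k0_hat
    return bh_stepup(pvalues, q1 * k / k0_hat)
```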


Benjamini et al. (2006) proved that when the test statistics are independent, the above two-stage linear step-up procedure (TST) controls FDR at level $q$. Their proof is summarized as follows. According to Benjamini and Yekutieli (2001), the FDR of any multiple testing procedure can be expressed as
\begin{align}
FDR &= \sum_{i=1}^{k_0} \sum_{l=1}^{k} \frac{1}{l} Pr\{l \text{ hypotheses are rejected, one of which is } H_{0i}\} \notag \\
&= k_0 \sum_{l=1}^{k} \frac{1}{l} Pr\{l \text{ hypotheses are rejected, one of which is } H_{01}\} \notag \\
&= k_0 E_{P^{(1)}} Q(P^{(1)}). \tag{4.13}
\end{align}

Here $P^{(1)}$ is the vector of P-values corresponding to the $k - 1$ hypotheses excluding $H_{01}$, and, conditioning on $P^{(1)}$, $Q(P^{(1)})$ is defined as
\[
Q(P^{(1)}) = \sum_{l=1}^{k} \frac{1}{l} Pr_{P_{01} | P^{(1)}}\{l \text{ hypotheses are rejected, one of which is } H_{01}\},
\]

where $P_{01}$ is the P-value associated with $H_{01}$. For each value of $P^{(1)}$, let $r(P_{01})$ denote the number of hypotheses that are rejected, as a function of $P_{01}$, and let $l(P_{01})$ be the indicator that $H_{01}$ is rejected, as a function of $P_{01}$. When $r(p)$ is non-increasing and $l(p)$ takes the form $1_{[0, p^*]}$, where $p^* \equiv p^*(P^{(1)})$ satisfies $p^* \le q\, r(p^*) / \hat{k}_0(p^*)$, $Q(P^{(1)})$ can be expressed as
\begin{align}
Q(P^{(1)}) = E\left( \frac{l(P_{01})}{r(P_{01})} \,\Big|\, P^{(1)} \right) = \int \frac{l(p)}{r(p)} \, d\mu_{01}(p) \le \frac{p^*}{r(p^*)} \le \frac{q}{\hat{k}_0(p^*)}, \tag{4.14}
\end{align}

where, by the assumed independence, $\mu_{01}$ is the marginal distribution of $P_{01}$. In the continuous case, $\mu_{01}$ is the uniform distribution on $[0, 1]$; in the discrete case, it is stochastically larger than the uniform. Thus, FDR has the upper bound
\begin{align}
FDR \le q\, E_{P^{(1)}}\left( \frac{k_0}{\hat{k}_0(p^*)} \right). \tag{4.15}
\end{align}

In the two-stage procedure, $\hat{k}_0$ can take only one of two values, $\hat{k}_0(0)$ or $\hat{k}_0(1)$. For $P_{01} \le r_1 q'/k$, $H_{01}$ is rejected at both stages of the two-stage procedure, and $\hat{k}_0 = \hat{k}_0(1)$. For $P_{01} > r_1 q'/k$, $H_{01}$ is not rejected at the first stage, and hence $\hat{k}_0 = \hat{k}_0(0)$. As long as $P_{01} \le r(P_{01})\, q' / \hat{k}_0(1)$, however, $H_{01}$ is rejected at the second stage, and $\hat{k}_0(p^*) = \hat{k}_0(1)$. We then have
\begin{align}
Q(P^{(1)}) \le \frac{q'}{\hat{k}_0(1)}. \tag{4.16}
\end{align}

If $\hat{k}_0(1) = k$, then for $P_{01} > r_1 q'/k$ the second stage of the testing procedure is identical to the first stage. Thus $H_{01}$ is no longer rejected, and $\hat{k}_0 = \hat{k}_0(0)$. Since $p^* = r_1 q'/k$ and $r_1 \le r(p^*)$, however, we have
\begin{align}
Q(P^{(1)}) \le \frac{p^*}{r(p^*)} \le \frac{r_1 q'/k}{r_1} = \frac{q'}{k} = \frac{q'}{\hat{k}_0(1)}. \tag{4.17}
\end{align}

Since $\hat{k}_0(1)$ is stochastically larger than $Y + 1$, where $Y \sim Binomial(k_0 - 1,\, 1 - q/(1+q))$ and $E\{1/(Y+1)\} < 1/(k_0/(1+q))$, the above inequalities yield
\begin{align}
FDR \le k_0 E_{P^{(1)}} Q(P^{(1)}) \le \frac{q}{1+q}\, E_{P^{(1)}} \frac{k_0}{Y+1} \le \frac{q}{1+q} \cdot \frac{k_0}{k_0/(1+q)} = q. \tag{4.18}
\end{align}

A simulation study conducted by Benjamini et al. (2006) compared various adaptive FDR controlling procedures in terms of FDR control and power. The results showed that, in the independent case, FDR cannot be controlled at the nominal level by the adaptive linear step-up procedure incorporated in recent versions of the SAM software (Storey (2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a)) or by the median adaptive linear step-up procedure (Storey (2002)). When the test statistics are positively correlated, among all the proposed adaptive procedures only the two-stage linear step-up procedure (TST) controlled FDR at the nominal level. Their simulation results also showed that the variance of $\hat{k}_0$ in the two-stage linear step-up procedure was smaller than that in Benjamini and Hochberg (2000) and that in Storey et al. (2004).

To control $E(V)$, $k_0$ can likewise be estimated from the data to adjust for the conservativeness factor of the Bonferroni procedure. We propose the following adaptive two-step procedure to control the expected number of false positives at a pre-specified number $\mu$ $(0 < \mu < k)$.

Step 1. Compare each P-value with $\mu'/k = (\mu/k)(1 - \mu/k)$. Let $r$ be the number of rejected hypotheses. If $r = 0$, do not reject any null hypothesis and stop; if $r = k$, reject all $k$ null hypotheses and stop; otherwise, go to the second step.

Step 2. Let $\hat{k}_0 = k - r$, and compare each P-value with $\mu'/\hat{k}_0 = \mu(1 - \mu/k)/(k - r)$.
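The two steps translate directly into code. The following is a minimal sketch (our own Python illustration; `adaptive_two_step` is a hypothetical name), together with a Monte Carlo check of the claimed $E(V)$ control under the complete null:

```python
import numpy as np

def adaptive_two_step(pvalues, mu):
    """The adaptive two-step E(V)-controlling procedure described above.
    mu is the pre-specified expected number of false rejections (0 < mu < k).
    Returns a boolean rejection vector."""
    p = np.asarray(pvalues)
    k = p.size
    mu_prime = mu * (1.0 - mu / k)        # mu' = mu(1 - mu/k)
    step1 = p <= mu_prime / k             # Step 1: threshold mu'/k
    r = step1.sum()
    if r == 0 or r == k:
        return step1
    k0_hat = k - r                        # Step 2: threshold mu'/k0_hat
    return p <= mu_prime / k0_hat

# Complete null (all k hypotheses true, independent uniform P-values);
# the average number of rejections should stay at or below mu.
rng = np.random.default_rng(3)
k, mu, reps = 1000, 5.0, 2000
print(np.mean([adaptive_two_step(rng.uniform(size=k), mu).sum()
               for _ in range(reps)]))
```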

Theorem 4.2. The adaptive two-step procedure described above controls $E(V)$ at the pre-specified number $\mu$.

Proof. Let $B_v$ denote the event that the procedure rejects $v$ true null hypotheses. Then $E(V)$ can be written as
\[
E(V) = \sum_{v=1}^{k_0} v \, Pr(B_v).
\]

For a fixed $v$, let $s$ denote a subset of $\{1, \ldots, k_0\}$ of size $v$, and let $B_{vs}$ denote the event in $B_v$ that the $v$ true null hypotheses rejected are exactly those in $s$. Then we have
\begin{align}
\sum_{i=1}^{k_0} Pr\left( \left\{P_i \le \frac{\mu}{k}\right\} \cap B_v \right) &= \sum_{i=1}^{k_0} \sum_{s} Pr\left( \left\{P_i \le \frac{\mu}{k}\right\} \cap B_{vs} \right) \notag \\
&= \sum_{s} \sum_{i=1}^{k_0} I\{i \in s\} Pr(B_{vs}) = v \sum_{s} Pr(B_{vs}) = v \, Pr(B_v). \tag{4.19}
\end{align}

Thus,
\[
Pr(B_v) = \frac{1}{v} \sum_{i=1}^{k_0} Pr\left( \left\{P_i \le \frac{\mu}{k}\right\} \cap B_v \right).
\]

Therefore,
\begin{align*}
E(V) &= \sum_{v=1}^{k_0} \sum_{i=1}^{k_0} Pr\left( \left\{P_i \le \frac{\mu}{k}\right\} \cap B_v \right) \\
&= \sum_{v=1}^{k_0} \sum_{i=1}^{k_0} Pr(v \text{ true null hypotheses are rejected, one of which is } H_{0i}) \\
&= k_0 \sum_{v=1}^{k_0} Pr(v \text{ true null hypotheses are rejected, one of which is } H_{01}).
\end{align*}
The last equality follows from the exchangeability of the P-values corresponding to the $k_0$ true null hypotheses. Let $P_{01}$ be the P-value associated with $H_{01}$; we may assume $k_0 \ge 1$, since otherwise $E(V) = 0$. Let $P^{(01)}$ be the vector of P-values corresponding to the remaining $k_0 - 1$ true null hypotheses, excluding $H_{01}$. Conditioning on $P^{(01)}$, $E(V)$ can be expressed as
\begin{align}
E(V) = k_0 E_{P^{(01)}} W(P^{(01)}), \tag{4.20}
\end{align}

where $W(P^{(01)})$ is defined as
\begin{align}
W(P^{(01)}) = \sum_{v=1}^{k_0} Pr_{P_{01} | P^{(01)}}(v \text{ true null hypotheses are rejected, one of which is } H_{01}). \tag{4.21}
\end{align}

For each value of $P^{(01)}$, let $l(P_{01})$ be the indicator that $H_{01}$ is rejected, as a function of $P_{01}$. Then
\[
W(P^{(01)}) = E(l(P_{01}) \,|\, P^{(01)}) = \int l(p) \, dx_{01}(p),
\]
where $x_{01}$ is the marginal distribution of $P_{01}$, which is $Uniform(0, 1)$ in the continuous case and stochastically larger than the uniform in the discrete case. Here $l(p)$ takes the form $1_{[0, p^*]}$, where $p^* \equiv p^*(P^{(01)})$ satisfies $p^* \le \mu'/\hat{k}_0(p^*)$; note that $l(P_{01}) = 1$ as long as $P_{01} \le \mu'/\hat{k}_0(P_{01})$. Then we have
\begin{align}
W(P^{(01)}) = \int l(p) \, dx_{01}(p) \le p^* \le \frac{\mu'}{\hat{k}_0(p^*)} \tag{4.22}
\end{align}

and the following upper bound for $E(V)$:
\begin{align}
E(V) \le \mu' E_{P^{(01)}}\left( \frac{k_0}{\hat{k}_0(p^*)} \right). \tag{4.23}
\end{align}

In our two-step procedure, each P-value is compared with $\mu'/k = (\mu/k)(1 - \mu/k)$ at the first step, $r$ is the number of null hypotheses rejected at that step, and $\hat{k}_0 = k - r$. Hence $\hat{k}_0$ can take only one of two values, $\hat{k}_0(0)$ or $\hat{k}_0(1)$. For $P_{01} \le \mu'/k$, $H_{01}$ is rejected at both steps of the two-step procedure, and $\hat{k}_0 = \hat{k}_0(1)$. For $P_{01} > \mu'/k$, $H_{01}$ is not rejected at the first step, and hence $\hat{k}_0 = \hat{k}_0(0)$. As long as $P_{01} \le \mu'/\hat{k}_0(1)$, however, $H_{01}$ is rejected at the second step, and thus $\hat{k}_0(p^*) = \hat{k}_0(1)$. We then have
\begin{align}
W(P^{(01)}) \le \frac{\mu'}{\hat{k}_0(1)}. \tag{4.24}
\end{align}

If $\hat{k}_0(1) = k$, then for $P_{01} > \mu'/k$ the second step of the testing procedure is identical to the first step. Thus $H_{01}$ is no longer rejected, and $\hat{k}_0 = \hat{k}_0(0)$. Since $p^* = \mu'/k$, we have
\begin{align}
W(P^{(01)}) \le p^* = \frac{\mu'}{k} = \frac{\mu'}{\hat{k}_0(1)}. \tag{4.25}
\end{align}

Since $\hat{k}_0(1)$ is stochastically larger than $D + 1$, where $D \sim Binomial(k_0 - 1,\, 1 - \mu'/k)$ and $E\{1/(D+1)\} < 1/(k_0(1 - \mu'/k))$, the above inequalities yield
\[
E(V) = k_0 E_{P^{(01)}} W(P^{(01)}) \le \mu' E_{P^{(01)}} \frac{k_0}{\hat{k}_0(1)} \le \mu' E_{P^{(01)}} \frac{k_0}{D+1} \le \mu\left(1 - \frac{\mu}{k}\right) \frac{k_0}{k_0\left(1 - \frac{\mu}{k}\left(1 - \frac{\mu}{k}\right)\right)} \le \mu.
\]

In the adaptive two-step procedure, the critical value in the second step is $\mu(1 - \mu/k)/(k - r)$, which is greater than $\mu/k$ if $\mu < r$. Since $\mu$ is commonly chosen as $r \cdot \alpha$ $(0 < \alpha < 1)$, $\mu < r$ is always satisfied. Hence, the adaptive two-step procedure is less conservative than the Bonferroni procedure.

If the joint distribution of the test statistics can be estimated, more powerful multiple testing procedures can be established when the test statistics are correlated. Resampling methods can be used to estimate the distribution of the test statistics. Yekutieli and Benjamini (1999) reported that, for correlated test statistics, resampling-based multiple testing procedures not only control FDR at the nominal level but also have higher power than other FDR controlling procedures. Yekutieli and Benjamini (1999) proposed two resampling-based FDR local estimators, a point estimator and a $1 - \beta$ FDR upper limit, based on the number of rejections $R^*$ from each resample, the number of rejections $r$ from the original data, and $r_\beta^*$, the $1 - \beta$ quantile of $R^*$:
\[
\widehat{FDR}^*(p) =
\begin{cases}
E_{R^*} \dfrac{R^*(p)}{R^*(p) + r(p) - p \cdot k}, & \text{if } r(p) - r_\beta^*(p) \ge p \cdot k, \\[6pt]
Pr_{R^*}(R^*(p) \ge 1), & \text{otherwise},
\end{cases}
\]
is the point estimator of FDR, and
\[
\widehat{FDR}_\beta^*(p) = \sup_{x \in [0,p]}
\begin{cases}
E_{R^*} \dfrac{R^*(x)}{R^*(x) + r(x) - r_\beta^*(x)}, & \text{if } r(x) - r_\beta^*(x) > 0, \\[6pt]
Pr_{R^*}(R^*(x) \ge 1), & \text{otherwise},
\end{cases}
\]
is the $1 - \beta$ FDR upper limit. The resampling-based FDR multiple testing procedure is then: find $k_q = \max\{k : \widehat{FDR}^*(p_{(k)}) \le q\}$, and reject $H_{0(1)}, \ldots, H_{0(k_q)}$. Similarly, multiple testing procedures controlling $E(V)$ can be explored based on resampling methods.
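The point estimator, for example, can be computed as in the sketch below (our own Python rendering of the formula above, not Yekutieli and Benjamini's code; all names are hypothetical):

```python
import numpy as np

def fdr_point_estimate(p, r_obs, R_star, k, r_beta):
    """Sketch of the Yekutieli-Benjamini (1999) resampling-based FDR point
    estimator at threshold p. R_star holds the rejection counts R*(p) from
    each resample; r_obs = r(p) is the rejection count from the original
    data and r_beta is the 1 - beta quantile of R_star."""
    R_star = np.asarray(R_star, dtype=float)
    if r_obs - r_beta >= p * k:
        denom = R_star + r_obs - p * k
        ratio = np.where(denom > 0, R_star / denom, 0.0)
        return ratio.mean()              # E_{R*}[ R*/(R* + r - pk) ]
    return np.mean(R_star >= 1)          # Pr_{R*}( R* >= 1 ) otherwise
```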

4.5 Discussion

By showing the difference between the true joint distribution of the test statistics and the joint distribution estimated by permutation, and by conducting simulation studies, we explored the discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation. We then derived formulas for both the true expected number of false rejections and the permutation-based expected number of false rejections, including explicit expressions for independent test statistics. The formulas clearly show why SAM overestimates the Fdr. Based on these formulas, conditions for SAM to control the expected number of false rejections were given, to guide researchers using SAM for their microarray data analysis. In SAM 2.20, an estimate of the proportion of true null hypotheses ($\hat{\pi}_0 = \hat{k}_0/k$) was used to improve SAM's Fdr estimation; however, the Fdr still exceeds the nominal level according to the simulation studies of Benjamini et al. (2006).

The formulas for calculating the expected number of false rejections show that the expected number of false rejections is a function of the joint distribution of the test statistics. As Efron (2000) pointed out, the correlations among test statistics can considerably widen or narrow the null distribution of the test statistics. At the same time, there is a strong dependence of the true false discovery proportion on the dispersion variable, which is a function of the correlations between test statistics. How the variability of the number of false rejections ($V$) is affected by the correlations between genes needs to be further investigated.

We proposed an adaptive two-step procedure to control $E(V)$ at a predetermined value $\mu$; it has larger critical values than the Bonferroni procedure proposed by Gordon et al. (2007), and is therefore more powerful. Yekutieli and Benjamini (1999) proposed improving the power of FDR controlling procedures using resampling methods. Likewise, exploring resampling-based $E(V)$ controlling procedures is an interesting topic for future research.


CHAPTER 5

CONCLUDING REMARKS

In microarray studies with two-group comparisons, the sample size of each group is typically small and the distributions of the test statistics are usually unknown. Thus, resampling methods are widely used in microarray data analysis. For the same paired data set with a small sample size of two or three, permutation tests are very unlikely to give small P-values; by contrast, the post-pivot and pre-pivot resampling methods are likely to give small P-values even after adjustment for multiplicity. This contradictory result was addressed in chapter 2. First, for paired samples with a sample size of two or three in each group, the necessary and sufficient conditions for obtaining zero adjusted P-values were derived for the post-pivot and pre-pivot resampling methods. In addition, the test statistic's null distribution estimated by the post-pivot resampling method was shown to be the same as that estimated by the pre-pivot resampling method for paired samples.

The discreteness of the test statistic's null distribution, as estimated by the resampling methods, was further explored by calculating the maximum number of unique resampled test statistic values. Mathematical formulas for this maximum number were derived for two-group comparisons, fixed-effects general linear models, and general linear mixed-effects models

using three resampling methods. According to the formulas, the pre-pivot resampling method always gives the largest maximum number of unique resampled test statistic values among the three resampling methods. Therefore, the P-values computed by the pre-pivot resampling method are more reliable than those computed by the permutation test and the post-pivot resampling method.

To estimate the test statistic's null distribution, resampling methods are widely used in various hypothesis testing procedures, but researchers tend to ignore the conditions under which resampling methods control the multiple testing error rates when testing multiple null hypotheses simultaneously, resulting in inflated type I error rates. The conditions for controlling FWER at a desired level $\alpha$ in fixed-effects general linear models were explored in chapter 3 for the permutation test, the post-pivot resampling method, and the pre-pivot resampling method. In two-group comparisons, Xu and Hsu (2007) showed that, to control the multiple testing error rates at a desired level $\alpha$, permutation tests need to satisfy the Marginals-equalities-Determine-Joint-equalities (MDJ) condition to connect the marginal distributions (the null hypotheses) and the joint distributions (the assumption made by permutation tests). In fixed-effects general linear models, the conditions for permutation tests to control FWER were derived for the first time based on the Partitioning Principle, a basic multiple testing principle. The conditions are: (1) the errors of the fixed-effects general linear model are i.i.d., and (2) the test statistics are simply the ordinary least squares (OLS) estimates. The conditions for the post-pivot and pre-pivot resampling methods to control FWER asymptotically were also derived based on the Partitioning Principle for fixed-effects general linear models. The conditions are: (1) the errors of the fixed-effects general linear model are i.i.d., and (2) $\frac{1}{n}\boldsymbol{X}'\boldsymbol{X} \to \boldsymbol{V}$, where $\boldsymbol{X}$ is the design matrix and $\boldsymbol{V}$ is a positive definite matrix.

The Significance Analysis of Microarrays (SAM) procedure was first proposed by Tusher et al. (2001) to identify genes with statistically significant changes in expression; it is based on the Q-Q plot of the observed relative differences versus the expected relative differences estimated from permutation. SAM is frequently used in the biological sciences to identify differentially expressed genes in microarray experiments. Much of the literature (Pan (2003), Dudoit et al. (2003), Xie et al. (2005), Larsson et al. (2005), Zhang (2007)), however, has shown that SAM cannot control the Fdr at desired nominal levels. The reason for this lack of control was explored by showing the discrepancies between the true expected values of order statistics and the expected values estimated by permutation. To provide a reference for researchers who wish to use SAM to find differentially expressed genes in microarray experiments, the situations in which SAM gives invalid results were explored in chapter 4. SAM cannot give correct reference distributions in the following cases:

Case 1: Two independent data generating distributions have unequal variances and unequal sample sizes.

Case 2: Two independent data generating distributions have unequal correlations and unequal sample sizes.

Case 3: Two independent data generating distributions have unequal marginal skewness.

Case 4: Two independent data generating distributions have unequal third order cross cumulants.

The SAM procedure should be avoided in these cases.

The conditions for SAM to control the expected number of false rejections are: (1) $k_0 = k$ (all the null hypotheses are true), and (2) $m = n$ (equal sample sizes) for the even order cumulants, and $\kappa_a(F_X) = \kappa_a(F_Y)$ for the odd order cumulants.

Gordon et al. (2007) showed that the Bonferroni procedure can control the expected number of false rejections at a pre-specified number. Based on the Bonferroni procedure, an adaptive two-step procedure was proposed to control the expected number of false rejections at a predetermined number $\mu$; it has larger critical values than the Bonferroni procedure, resulting in a more powerful test. The adaptive two-step procedure is as follows:

Step 1. Compare each P-value with $\mu'/k = (\mu/k)(1 - \mu/k)$. Let $r$ be the number of rejected hypotheses. If $r = 0$, do not reject any hypothesis and stop; if $r = k$, reject all $k$ hypotheses and stop; otherwise, go to the second step.

Step 2. Let $\hat{k}_0 = k - r$ and compare each P-value with $\mu'/\hat{k}_0$.

With more probes being put on microarrays, ChIP-chip experiments (a technique combining chromatin immunoprecipitation (ChIP) with microarray technology (chip)) are being used to find transcription factor target genes in genomes and to build regulatory networks in plants, animals, and humans. In contrast to microarray data analysis, one-sided testing (looking for enrichment) is used in ChIP-chip data analysis, and the distribution of probe intensities is usually skewed to the right. Buck and Lieb (2004), Buck et al. (2005), Hong et al. (2005), Smith et al. (2005), and Keles et al. (2006) have done exciting work on ChIP-chip data analysis, but in their studies the methods for controlling the type I error rate are very conservative and the power is very low. An interesting future research topic is to apply the Partitioning Principle and resampling techniques to ChIP-chip data sets to find more powerful testing procedures.


REFERENCES

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289 – 300.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60 – 83.

Benjamini, Y., Krieger, A. M., and Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3):491 – 507.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165 – 1188.

Berger, J. O. (1993). Statistical decision theory and Bayesian analysis. Springer, 2nd edition.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196 – 1217.

Buck, M. J. and Lieb, J. D. (2004). ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349 – 360.

Buck, M. J., Nobel, A. B., and Lieb, J. D. (2005). ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biology, 6:R97, doi:10.1186/gb-2005-6-11-r97.

Buyse, M., Loi, S., van't Veer, L., Viale, G., Delorenzi, M., Glas, A. M., d'Assignies, M. S., Bergh, J., Lidereau, R., Ellis, P., Harris, A., Bogaerts, J., Therasse, P., Floore, A., Amakrane, M., Piette, F., Rutgers, E., Sotiriou, C., Cardoso, F., and Piccart, M. J. (2006). Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute, 98(17):1183 – 1192.

Calian, V., Li, D., and Hsu, J. C. (2008). Partitioning to uncover conditions for permutation tests to control multiple testing error rates. Biometrical Journal, 50(5):756 – 766.

Casella, G. and Berger, R. L. (1990). Statistical inference. Wadsworth, Pacific Grove, California.

Chu, G., Goss, V., Narasimhan, B., and Tibshirani, R. (2000). SAM "Significance Analysis of Microarrays" - users guide and technical document. Technical report, Stanford University.

Churchill, G. A. and Doerge, R. W. (1994). Empirical threshold values for quantitative trait mapping. Genetics, 138:963 – 971.

Davison, A. C. and Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge University Press.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71 – 103.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1 – 26.

Efron, B. (2000). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102(477):93 – 103.

Efron, B. and Tibshirani, R. J. (1994). An introduction to the bootstrap. Chapman & Hall/CRC.

Efron, B., Tibshirani, R. J., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96:1151 – 1160.

Ein-Dor, L., Kela, I., Getz, G., Givol, D., and Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics, 21:171 – 178.

Finner, H. and Strassburger, K. (2002). The partitioning principle: A powerful tool in multiple decision theory. The Annals of Statistics, 30:1194 – 1213.

Freedman, D. A. (1981). Bootstrapping regression models. The Annals of Statistics, 9(6):1218 – 1228.

Glas, A. M., Floore, A., Delahaye, L. J., Witteveen, A. T., Pover, R. C., Bakx, N., Lahti-Domenici, J. S., Bruinsma, T. J., Warmoes, M. O., Bernards, R., Wessels, L. F., and Van 't Veer, L. J. (2006). Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics, 7:278.

Good, P. I. (2005). Permutation, parametric and bootstrap tests of hypotheses. Springer, 3rd edition.


Gordon, A., Glazko, G., Qiu, X., and Yakovlev, A. (2007). Control of the mean number of false discoveries, Bonferroni and stability of multiple testing. The Annals of Applied Statistics, 1(1):179 – 190.

Hall, P. (1986). On the bootstrap and confidence intervals. The Annals of Statistics, 14(4):1431 – 1452.

Hehir-Kwa, J., Egmont-Petersen, M., Janssen, I., Smeets, D., Geurts van Kessel, A., and Veltman, J. (2007). Genome-wide copy number profiling on high-density bacterial artificial chromosomes, single-nucleotide polymorphisms, and oligonucleotide microarrays: A platform comparison based on statistical power analysis. DNA Research, 14:1 – 11.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800 – 802.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple comparison procedures. New York: Wiley.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65 – 70.

Hong, P., Liu, X. S., Zhou, Q., Lu, X., Liu, J. S., and Wong, W. H. (2005). A boosting approach for motif modeling using ChIP-chip data. Bioinformatics, 21(11):2636 – 2643.

Huang, Y. and Hsu, J. C. (2007). Hochberg's step-up method: Cutting corners off Holm's step-down method. Biometrika, 94:965 – 975.

Huang, Y., Xu, H., Calian, V., and Hsu, J. C. (2006). To permute or not to permute. Bioinformatics, 22:2244 – 2248.

Keles, S., Van Der Laan, M. J., Dudoit, S., and Cawley, S. E. (2006). Multiple testing methods for ChIP-chip high density oligonucleotide array data. Journal of Computational Biology, 13(3):579 – 613.

Kulesh, D. A., Clive, D. R., Zarlenga, D. S., and Greene, J. J. (1987). Identification of interferon-modulated proliferation-related cDNA sequences. Proceedings of the National Academy of Sciences, 84:8453 – 8457.

Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121(1):185 – 199.

Larsson, O., Wahlestedt, C., and Timmons, J. A. (2005). Considerations when using the significance analysis of microarrays (SAM) algorithm. BMC Bioinformatics, 6:129 – 134.

Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F., Gentile, C., Hwang, S. Y., Brown, P. O., and Davis, R. W. (1997). Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proceedings of the National Academy of Sciences, 94:13057 – 13062.

Lee, M. T., Kuo, F. C., Whitmore, G. A., and Sklar, J. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences, 97(18):9834 – 9839.

Lehmann, E. L. (1999). Elements of large-sample theory. New York: Springer.

Lehmann, E. L. and Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3):1138 – 1154.

Liu, Y., Smith, M. R., and Rangayyan, R. M. (2004). The application of Efron's bootstrap methods in validating feature classification using artificial neural networks for the analysis of mammographic masses. Engineering in Medicine and Biology Society, 1:1553 – 1556.

Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655 – 660.

Mei, R., Galipeau, P. C., Prass, C., Berno, A., Ghandour, G., Patil, N., Wolff, R. K., Chee, M. S., Reid, B. J., and Lockhart, D. J. (2000). Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays. Genome Research, 10:1126 – 1137.

Pan, W. (2003). On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19(11):1333 – 1340.

Pan, W., Lin, J., and Le, C. T. (2003). A mixture model approach to detecting differentially expressed genes with microarray data. Functional & Integrative Genomics, 3:117 – 124.

Pollack, J. R., Perou, C. M., Alizadeh, A. A., Eisen, M. B., Pergamenschikov, A., Williams, C. F., Jeffrey, S. S., Botstein, D., and Brown, P. O. (1999). Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics, 23:41 – 46.

Pollard, K. S. and van der Laan, M. J. (2005). Resampling-based multiple testing: Asymptotic control of type I error and applications to gene expression data. Journal of Statistical Planning and Inference, 125:85 – 100.

Ptitsyn, A., Zvonic, S., and Gimble, J. (2006). Permutation test for periodicity in short time series data. BMC Bioinformatics, 7(2):S10.

Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467 – 470.

Schweder, T. and Spjotvoll, E. (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika, 69:493 – 502.

Shaffer, J. P. (1995). Multiple hypothesis testing: A review. Annual Review of Psychology, 46:561 – 584.

Smith, A. D., Sumazin, P., Das, D., and Zhang, M. Q. (2005). Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics, 21:i403 – i412.

Stefansson, G., Kim, W., and Hsu, J. C. (1988). On confidence sets in multiple comparisons. In Gupta, S. S. and Berger, J. O., editors, Statistical Decision Theory and Related Topics IV, 2:89 – 104.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Methodological), 64(3):479 – 498.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society. Series B (Methodological), 66:187 – 205.

Storey, J. D. and Tibshirani, R. (2003a). SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L. (eds.), The Analysis of Gene Expression Data: Methods and Software. Springer, New York.

Storey, J. D. and Tibshirani, R. (2003b). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100:9440 – 9445.

Strimmer, K. (2008). A unified approach to false discovery rate estimation. BMC Bioinformatics, 9:303.

Tsai, C., Hsueh, H., and Chen, J. J. (2003). Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics, 59(4):1071 – 1081.

Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116 – 5121.

van de Vijver, M. J., He, Y. D., van't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E. T., Friend, S. H., and Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999 – 2009.

van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530 – 536.

Westfall, P. H. and Young, S. S. (1993). Resampling-based multiple testing: examples and methods for P-value adjustment. New York: Wiley.

Xie, Y., Pan, W., and Khodursky, A. B. (2005). A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics, 21(23):4280 – 4288.

Xu, H. and Hsu, J. C. (2007). Applying the generalized partitioning principle to control the generalized familywise error rate. Biometrical Journal, 49:52 – 67.

Yekutieli, D. and Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82:171 – 196.

Zhang, S. (2007). A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC Bioinformatics, 8:230 – 241.
