A simple procedure to detect non-central observations from a sample

Mylène DUVAL∗, Céline DELMAS∗, Béatrice LAURENT†, Christèle ROBERT-GRANIÉ∗

∗ SAGA, INRA Toulouse, France
[email protected], [email protected], [email protected]

† Département de Mathématiques, INSA Toulouse, France
[email protected]

Summary
In this paper we propose a simple procedure to select the subset of non-central observations from a sample. This problem is motivated by the detection of genes that are differentially expressed between conditions in microarray experiments. We prove, under mild conditions, the consistency of the selected subset of observations. We compare our procedure by simulations with the Benjamini and Hochberg procedure, a procedure based on mixture models and a procedure based on model selection.

Some key words: microarray data; multiple testing procedure; ordered statistics; partial sums.


1. Introduction
The aim of this paper is to propose a simple procedure to detect non-central observations from a sample. We will focus on the study of two samples, (χ_i)_{i=1,...,n} and (F_i)_{i=1,...,n}. We write

χ_i = Σ_{k=1}^{ν1} (γ_i N_ik + δ_ik)²,  1 ≤ i ≤ n,   (1)

F_i = ( (1/ν1) Σ_{k=1}^{ν1} (γ_i N_ik + δ_ik)² ) / ( (1/ν2) Σ_{j=1}^{ν2} D_ij² ),  1 ≤ i ≤ n,   (2)

where ν1 and ν2 are two known integers; the (δ_ik), k = 1, ..., ν1, i = 1, ..., n, are unknown parameters; the (N_ik), k = 1, ..., ν1, i = 1, ..., n, are independent identically distributed centered variables with unit variance; and the (D_ij), j = 1, ..., ν2, i = 1, ..., n, are independent identically distributed variables independent of the (N_ik). We define the non-centrality parameter η_i = Σ_{k=1}^{ν1} δ_ik².
We assume that γ_i = 1 for all i such that η_i = 0, and that γ_i ≥ 1 is unknown otherwise. Our goal is to identify the set J_n = {i : η_i ≠ 0}. Note that by taking the squares of the observations of a Gaussian sample we obtain a chi-square sample with one degree of freedom, which is a particular case of model (1) with ν1 = 1, γ_i = 1 and N_i1 independent identically distributed standard Gaussian variables for i ∈ {1, ..., n}. Similarly, a Student sample is a particular case of model (2).

Our work is motivated by the problem of the detection of differentially expressed genes between conditions in microarray experiments. For this purpose a test statistic is built for each gene. In general it is a T or an F statistic comparing two or more conditions, but it can also be a Gaussian or a χ² statistic when the variance of the gene expression is assumed to be known. Under the null hypothesis that the gene is not differentially expressed between the conditions, the non-centrality parameter of the test statistic is null, whereas it is non-null under the alternative hypothesis that the gene is differentially expressed. We observe the sample of all the test statistics corresponding to all the genes. To detect the genes that are differentially expressed we have to separate the central from the non-central observations; this is the object of the paper.

To address this problem the simplest procedure would be the Bonferroni correction. This procedure controls the familywise error rate (FWER), which is the probability of accumulating one or more false positives over all the tests. This criterion is very stringent and may affect the power when the number of tests is large. An alternative procedure consists in controlling the false discovery rate (FDR). Benjamini and Hochberg (1995) introduced an FDR controlling procedure and proved that it controls the FDR for independent test statistics. Adaptive FDR controlling procedures have been proposed to increase the power while controlling the FDR (Benjamini and Hochberg, 2000; Benjamini, Krieger and Yekutieli, 2005, Adaptive linear step-up procedures that control the false discovery rate, Research paper 01-03). Genovese and Wasserman (2002, 2004) have proposed a method to minimize the false negative rate (FNR) while controlling the FDR. Mixture models have been studied to separate the central observations from the others (Delmas, 2006, On mixture models for multiple testing in the differential analysis of gene expression data, submitted paper; Bordes et al., 2006). In the Gaussian case some procedures based on model selection have been proposed (Birgé and Massart, 2001; Huet, 2006). Some authors have proposed procedures based on the partial sums of the absolute or squared ordered observations (Hoh et al., 2001; Lavielle and Ludeña, 2006, Random thresholds for linear model selection, submitted paper). Zaykin et al. (2002) and Dudbridge and Koeleman (2003, 2004) have proposed methods based on the partial products of the ordered p-values. Meinshausen and Rice (2004) have proposed a method to estimate the proportion of false null hypotheses among a large number of independently tested hypotheses; this method is based on the distribution of the p-values of the hypothesis tests, which is uniform on [0, 1] under the null hypothesis.

The aim of our paper is to propose a simple procedure that offers good asymptotic properties. The paper is organized as follows. In Section 2 we introduce the proposed procedure. In Section 3 we state the consistency of the selected subset of observations for the two kinds of samples (1) and (2). In Section 4 we compare our procedure by simulations with different methods: the Benjamini and Hochberg procedure, a procedure based on mixture models and the Birgé and Massart procedure. Section 5 is devoted to the proofs.

2. The procedure
Assume we observe the sample (1) or (2). We denote by X_i the observation: X_i = χ_i or X_i = F_i, and by k_n the number of non-central observations. Our procedure to separate the central from the non-central observations can be stated as follows:

(i) We order the X_i's: X_σ(1) ≥ X_σ(2) ≥ ... ≥ X_σ(n), and define, for 0 ≤ k < n,

τ̂_k = (1/(n−k)) Σ_{i=k+1}^{n} X_σ(i).

(ii) We estimate k_n by k̂_n = min{k, 0 ≤ k < n : τ̂_k ≤ τ}. If τ̂_k > τ for all 0 ≤ k < n, we set k̂_n = n.

(iii) We decide that X_σ(1), ..., X_σ(k̂_n) are non-central observations.

This procedure is based on the idea that if the two populations of central and non-central observations are well separated, then τ̂_k is a good estimator of τ only for k = k_n. For k < k_n the expression of τ̂_k includes non-central variables, hence τ̂_k tends to overestimate τ. For k > k_n the expression of τ̂_k excludes the largest observations of a sample of independent identically distributed variables with mean τ, so τ̂_k tends to underestimate τ.
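As a concrete illustration, the three steps above can be sketched in a few lines of code. This is an illustrative implementation, not the authors' software; the function name and the toy data are ours, and τ is assumed known (the mean of a central observation). A running tail sum keeps the cost at one sort plus one pass.

```python
def ddlr(x, tau):
    """Steps (i)-(iii): order the sample, find the smallest k with
    tau_hat_k <= tau, and return that k together with the indices of
    the observations declared non-central."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i], reverse=True)  # X_sigma(1) >= ... >= X_sigma(n)
    xs = [x[i] for i in order]
    tail = sum(xs)                 # sum_{i=k+1}^{n} X_sigma(i) for k = 0
    for k in range(n):
        if tail / (n - k) <= tau:  # tau_hat_k <= tau: stop, k_hat = k
            return k, order[:k]
        tail -= xs[k]              # move on to tau_hat_{k+1}
    return n, order                # tau_hat_k > tau for every k

# toy sample: three clearly non-central observations among eight,
# with tau = 1 (the mean of a central chi^2(1) observation)
k_hat, selected = ddlr([0.5, 9.0, 1.2, 12.0, 0.8, 10.5, 0.3, 1.0], tau=1.0)
```

On this toy sample the estimator stops at k̂_n = 3 and selects the three large observations, in agreement with step (iii).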

3. Results
We first introduce some notation. We write J_n = {i : η_i ≠ 0} and Ĵ_n = {σ(1), σ(2), ..., σ(k̂_n)}. Let V and T denote respectively the number of false positives and the number of false negatives, defined as

V = Card(Ĵ_n ∩ J̄_n),  T = Card(Ĵ̄_n ∩ J_n),

where, for any set J, J̄ denotes the complementary set of J. Before giving our main theorem, we state two lemmas showing that, under suitable assumptions, the variables satisfying η_i ≠ 0 are "well separated" from the others. We introduce the following hypotheses.

H1: Let φ_(1) be the cumulative distribution function of L_(1) = max_{i=1,..,n} Σ_{k=1}^{ν1} N_ik². We assume that there is a sequence (a_n, b_n)_{n≥0}, with b_n > 0 for all n ∈ ℕ, such that φ_(1)(a_n + b_n x) → F(x) as n → ∞, where F is a cumulative distribution function. We also assume that

min_{i∈J_n} η_i/(2γ_i²) ≥ α_n,

where the sequence (α_n)_{n∈ℕ} satisfies (α_n/2 − a_n)/b_n → +∞ as n → ∞.

Example: Assume that the N_ik's are independent identically distributed centered Gaussian random variables with unit variance. Then for 1 ≤ i ≤ n, Σ_{k=1}^{ν1} N_ik² ∼ χ²(ν1). In that case we can prove that F is the cumulative distribution function of a Gumbel variable, with

a_n = 2 log(n) + (ν1 − 2) log(log(n)) − 2 log(2^{ν1/2} Γ(ν1/2)) + ν1 log(2) and b_n = 2 for all n

(see Resnick (1987), Section 1.5, for similar calculations). We then have to choose α_n such that (α_n/2 − a_n)/b_n → +∞, for example α_n = 2a_n + ε_n for all n, where ε_n → +∞.
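As a quick numerical check of these constants (an illustrative computation; the function name is ours), one can verify that for ν1 = 1 the expression of a_n reduces, since Γ(1/2) = √π, to 2 log(n) − log(log(n)) − log(π), the form used again in Section 4.2:

```python
import math

def gumbel_location(nu1, n):
    # a_n = 2 log n + (nu1 - 2) log log n - 2 log(2^{nu1/2} Gamma(nu1/2)) + nu1 log 2
    return (2.0 * math.log(n)
            + (nu1 - 2) * math.log(math.log(n))
            - 2.0 * math.log(2.0 ** (nu1 / 2.0) * math.gamma(nu1 / 2.0))
            + nu1 * math.log(2.0))

n = 5000
a_n = gumbel_location(1, n)
reduced = 2.0 * math.log(n) - math.log(math.log(n)) - math.log(math.pi)
```

The two expressions agree to floating-point precision, and a_n grows with n as expected of a Gumbel location sequence.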

H2: Let ψ_(1) and ξ_(1) denote respectively the cumulative distribution functions of

D_(1) = max_{i=1,..,n} (ν1/ν2) Σ_{l=1}^{ν2} D_il²

and

L̃_(1) = max_{i=1,..,n} (ν2 Σ_{k=1}^{ν1} N_ik²) / (ν1 Σ_{l=1}^{ν2} D_il²).

We assume that there are two sequences (c_n, d_n)_{n≥0} and (e_n, f_n)_{n≥0}, with d_n > 0 and f_n > 0 for all n ∈ ℕ, such that ψ_(1)(c_n + d_n x) → G(x) and ξ_(1)(e_n + f_n x) → H(x) as n → ∞, where G and H are cumulative distribution functions. Let (x_n)_{n∈ℕ} satisfy (x_n − c_n)/d_n → +∞. We assume that

min_{1≤i≤k_n} η_i/γ_i² ≥ 4 x_n β_n,

where the sequence (β_n)_{n∈ℕ} satisfies (β_n − e_n)/f_n → +∞.

Example: Assume that the D_il's and the N_ik's are independent identically distributed centered Gaussian random variables with unit variance. Then for 1 ≤ i ≤ n, Σ_{l=1}^{ν2} D_il² ∼ χ²(ν2), and in that case we can prove that G is the cumulative distribution function of a Gumbel variable, with

c_n = (ν1/ν2) [ 2 log(n) + (ν2 − 2) log(log(n)) − 2 log(2^{ν2/2} Γ(ν2/2)) + ν2 log(2) ] and d_n = 2 ν1/ν2 for all n

(see Resnick (1987), Section 1.5, for similar calculations). We choose x_n such that (x_n − c_n)/d_n → +∞, for example x_n = c_n + ε_n for all n, where ε_n → +∞.

Moreover, each term (ν2 Σ_{k=1}^{ν1} N_ik²)/(ν1 Σ_{l=1}^{ν2} D_il²) of L̃_(1) follows an F(ν1, ν2) distribution. We can prove that H is the cumulative distribution function of a Fréchet variable, with

f_n = (Kn)^{2/ν2} and e_n = 0 for all n, where K = 2 (Γ((ν1+ν2)/2) / (Γ(ν1/2)Γ(ν2/2))) ν2^{ν2/2−1} ν1^{−ν2/2}

(see Resnick (1987), Section 1.5, for similar calculations). We choose β_n such that β_n/f_n → +∞, for example β_n = f_n^{1+ε} where ε > 0.

Lemma 1 Let U_1, ..., U_n be n independent random variables such that

U_i = Σ_{k=1}^{ν1} (γ_i N_ik + δ_ik)² for 1 ≤ i ≤ k_n and U_i = Σ_{k=1}^{ν1} N_ik² for k_n + 1 ≤ i ≤ n,

where for all 1 ≤ i ≤ k_n, η_i = Σ_{k=1}^{ν1} δ_ik² ≠ 0 and γ_i ≥ 1 are unknown parameters, and the N_ik, 1 ≤ i ≤ n, 1 ≤ k ≤ ν1, are independent identically distributed centered variables with unit variance. We assume that assumption H1 holds. We define

Ω_n = { min_{1≤i≤k_n} U_i ≥ max_{k_n+1≤i≤n} U_i }.

Then pr(Ω_nᶜ) → 0 as n → ∞.

Lemma 2 Let Ũ_1, ..., Ũ_n be n independent random variables such that

Ũ_i = (ν2 Σ_{k=1}^{ν1} (γ_i N_ik + δ_ik)²) / (ν1 Σ_{l=1}^{ν2} D_il²) for 1 ≤ i ≤ k_n

and

Ũ_i = (ν2 Σ_{k=1}^{ν1} N_ik²) / (ν1 Σ_{l=1}^{ν2} D_il²) for k_n + 1 ≤ i ≤ n,

where for all 1 ≤ i ≤ k_n, η_i = Σ_{k=1}^{ν1} δ_ik² ≠ 0 and γ_i ≥ 1 are unknown parameters; the N_ik, 1 ≤ i ≤ n, 1 ≤ k ≤ ν1, are independent identically distributed centered variables with unit variance; and the D_il, 1 ≤ i ≤ n, 1 ≤ l ≤ ν2, are independent identically distributed variables independent of the N_ik's. We assume that assumption H2 holds. We define

Ω̃_n = { min_{1≤i≤k_n} Ũ_i ≥ max_{k_n+1≤i≤n} Ũ_i }.

Then pr(Ω̃_nᶜ) → 0 as n → ∞.

Theorem 1 We consider the procedure described in Section 2. We assume that the cardinality of J_n equals k_n = λn with 0 < λ < 1. Let W denote the cumulative distribution function of the variables X_i/τ under the assumption that η_i = 0, and assume that W(2) < 1. Then, under assumption H1 for sample (1) or assumption H2 for sample (2),

pr(V/n > u_n) → 0 and pr(T/n > u_n) → 0 as n → +∞,

where (u_n) is any sequence such that u_n → 0 and √n u_n → +∞ as n tends to infinity.


4. Application and simulations
4.1 Application to microarray data
This procedure may be convenient for detecting differentially expressed genes in microarray experiments. DNA microarrays enable molecular biologists to measure simultaneously the expression level of thousands of genes (Brown and Botstein, 1999). Thousands of gene probes made of cDNA or oligonucleotides are spotted on a small glass slide or a nylon membrane in a regular matrix pattern. A basic experiment consists in comparing the expression levels under two different conditions; more generally, we can study several conditions with one or more repetitions. The intensity level of each spot on the microarray measures the concentration of the corresponding mRNA in the biological sample. The detection of differentially expressed genes in DNA microarray experiments is an important question asked by biologists to statisticians. In what follows we assume that the intensity levels of the genes are correctly normalized.

Assume we study microarray data from an experiment including n genes, J conditions and R repetitions for each gene in each condition. We denote by A_ijr the r-th repetition of the expression level of gene i in condition j. We assume that the A_ijr's are independent and that for all i, j, r, A_ijr ∼ N(m_ij, σ_i²), where m_ij ∈ ℝ and σ_i ∈ ℝ₊ are unknown. We write A_ij. = (1/R) Σ_{r=1}^{R} A_ijr and A_i.. = (1/(JR)) Σ_{j=1}^{J} Σ_{r=1}^{R} A_ijr.

If R is large enough, we can estimate σ_i² by σ̂_i² = (1/(J(R−1))) Σ_{j=1}^{J} Σ_{r=1}^{R} (A_ijr − A_ij.)². In that case, for all 1 ≤ i ≤ n,

X_i = (1/((J−1)σ̂_i²)) Σ_{j=1}^{J} R(A_ij. − A_i..)² ∼ F(η_i; J−1, J(R−1)),

where η_i is the non-centrality parameter: η_i = 0 if gene i is not differentially expressed between the J conditions, and η_i > 0 otherwise. This model is similar to model (2).

When R is small we cannot estimate σ_i² by the same estimator σ̂_i². We then assume that σ_i = σ for all the genes which are not differentially expressed and that σ is known. Write X_i = (1/σ²) Σ_{j=1}^{J} R(A_ij. − A_i..)². If gene i is not differentially expressed between the J conditions, X_i ∼ χ²(J−1); otherwise (σ²/σ_i²) X_i ∼ χ²(η_i; J−1), where η_i > 0 is the non-centrality parameter. This model is similar to model (1) with γ_i = σ_i/σ.

In the particular case where we compare only two conditions, X_i = R(A_i1. − A_i2.)²/(2σ²). If gene i is not differentially expressed between the two conditions, X_i ∼ χ²(1); otherwise (σ²/σ_i²) X_i ∼ χ²(η_i; 1), where η_i > 0 is the non-centrality parameter. This model corresponds to model (1) with ν1 = 1.

Write J_n = {1 ≤ i ≤ n : η_i ≠ 0}; J_n is the set of the genes which are differentially expressed between the J conditions, and it contains k_n elements. Our aim is to estimate the number k_n and the set J_n with the procedure presented in Section 2.
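To fix ideas, the statistic X_i of the case "R large" can be computed for one gene as follows. This is an illustrative sketch under the Gaussian model above; the function name is ours.

```python
def gene_f_statistic(a):
    """Statistic X_i of Section 4.1 for one gene (R large).

    a : list of J lists, a[j][r] = A_ijr, the r-th repetition of the
        expression level of the gene in condition j.
    Returns the F statistic with J - 1 and J(R - 1) degrees of freedom."""
    J, R = len(a), len(a[0])
    cond_means = [sum(row) / R for row in a]          # A_ij.
    grand_mean = sum(cond_means) / J                  # A_i.. (equal R per condition)
    between = R * sum((m - grand_mean) ** 2 for m in cond_means)
    within = sum((x - m) ** 2 for row, m in zip(a, cond_means) for x in row)
    sigma2_hat = within / (J * (R - 1))               # estimator of sigma_i^2
    return between / ((J - 1) * sigma2_hat)
```

For example, with J = 2 conditions and R = 2 repetitions, a = [[1.0, 3.0], [5.0, 7.0]] gives a between-condition sum of squares of 16, σ̂_i² = 2, and X_i = 8.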

4.2 Simulations
To validate our procedure, denoted by DDLR, we present several simulation results and some comparisons with three other methods. We want to determine the number k_n of genes effectively differentially expressed between two conditions, with only one repetition for each gene in each condition. We suppose that all the genes have the same variance σ². We simulated n independent observations A_i as follows: k_n from a normal distribution N(m_i1 − m_i2, 2σ²) and n − k_n from a normal distribution N(0, 2σ²). A_i represents the difference of expression of gene i between the two conditions, and we write X_i = A_i²/(2σ²).

Generally in applications the standard deviation σ is unknown. We write, for all i, s_i² = s² = var(A_i) = 2σ². We propose to use the estimator of s presented by Haaland and O'Connell (1995).

Estimation of the variance
This estimator is defined as follows:

ŝ = 1.5 × median{|A_i| : |A_i| ≤ 2.5 s_0}, where s_0 = 1.5 × median{|A_i|, i = 1, ..., n}.
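In code, the two-stage estimator reads as follows (an illustrative sketch; the function names are ours):

```python
def median(v):
    s = sorted(v)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

def haaland_oconnell_scale(a):
    """Robust estimate of s = sd(A_i) from Section 4.2.

    s0 is a first median-based scale estimate; observations with large
    |A_i| (likely non-centered) are then trimmed before re-estimating."""
    abs_a = [abs(x) for x in a]
    s0 = 1.5 * median(abs_a)
    kept = [x for x in abs_a if x <= 2.5 * s0]
    return 1.5 * median(kept)
```

On a sample contaminated by one large observation, e.g. [0.5, -0.5, 0.6, -0.6, 100.0], the trimming step discards the value 100 before the final median is taken.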

Intuitively, this estimator may be consistent if the proportion of variables A_i with non-null mean is small enough. Here are some heuristic ideas about the construction of this estimator. Write t_n = median(|A_i|) for the empirical median. We know that t_n → m_n almost surely as n → ∞, where m_n is the theoretical median. Moreover, if we assume that A_i ∼ N(0, s²) for all i ∈ {1, ..., n}, then

pr(|A_i| ≤ m_n) = 1/2, that is, ∫_0^{m_n} (1/(s√(2π))) exp(−x²/(2s²)) dx = 1/4,

so that

∫_0^{m_n/s} (1/√(2π)) exp(−x²/2) dx = 1/4.

From the tables of the standard Gaussian distribution, m_n/s ≈ 0.6745, that is to say s ≈ 1.4826 m_n ≈ 1.5 median{|A_i|, i = 1, ..., n}. However some of the random variables A_i are not centered. In order to separate the variables A_i which are centered from those which have non-zero mean, Haaland and O'Connell (1995) approximated the first set by {|A_i|, i = 1, ..., n : |A_i| ≤ 2.5 s_0}. This explains why they propose to estimate s by ŝ = 1.5 × median{|A_i| : |A_i| ≤ 2.5 s_0}. With large probability s_0 ≥ s, so that if A_i ∼ N(0, s²),

pr(|A_i| > 2.5 s_0) ≤ pr(|A_i| > 2.5 s) ≈ 2 × (1 − 0.9938) = 0.0124.

That is to say, if A_i is centered, the probability of not using it to estimate the standard deviation is about 1%. In this application we have supposed that the random variables A_i are Gaussian (Kerr et al., 2000). The estimation of the variance can also be generalized to the case where the A_i's have a known symmetric density.

Simulations
We have considered different values of n, k_n and µ, where (m_i1 − m_i2)/s = µ for all i ∈ J_n. For each value of n = 5000 and n = 10000, we have considered k_n = 0.1 × n and µ ∈ {3, 5, 8}. We recall that X_i = A_i²/s² ∼ χ²(µ²; 1) for all i ∈ J_n, and X_i ∼ χ²(1) otherwise. In the expression of X_i we replace s by ŝ. We use our procedure presented in Section 2 with the observations {X_i}_{i=1,...,n} and τ = 1. Hypothesis H1 is satisfied with a_n = 2 log(n) − log(log(n)) − 2 log(√(2π)) + log(2) and b_n = 2 for all n, and, for example, α_n = 2a_n + log(log(n)) for all n. Theorem 1 is proved under the assumption that µ > √(2α_n). We now present briefly the three methods to which we compare the performance of the DDLR procedure.

All these methods assume that the standard deviation s is known, so we estimated it with the threshold estimator ŝ presented in Section 4.2.

BM method: Birgé and Massart (2001) provided a general approach based on model selection via penalization for Gaussian random vectors with a known variance. The penalty function presented by Birgé and Massart is

pen(k) = λσ²k(1 + √(2L_k))², k ∈ {1, ..., n},

where λ > 1 and (L_k)_{k≥1} is a sequence of positive real numbers. In our practical setting, the procedure proposed by Birgé and Massart leads to define

k̂_n = argmin_{k=1,...,n} { −Σ_{i=1}^{k} X_σ(i) + pen(k) },

where X_σ(1) ≥ ... ≥ X_σ(n). We have chosen pen(k) = Mk, where M is a constant calibrated at M = 8 to obtain good results when µ = 5.

BH method: The second method is the testing procedure presented by Benjamini and Hochberg (1995). It controls the expected proportion of errors among the rejected hypotheses, named the false discovery rate (FDR). In our application, since we want to select the differentially expressed genes, there are as many tests as genes standing on the microarray. The situation is summarized in Table 1. The false discovery proportion V/k̂_n is the proportion of rejected null hypotheses which are erroneously rejected. The FDR is then defined as FDR = E[V/max(1, k̂_n)], and we define the false negative rate FNR = E[T/max(1, n − k̂_n)]. When the number n of tests is large, false discoveries accumulate over the tests; as a result, in microarray experiments the error on the estimation of the set of genes effectively differentially expressed can be very important. This explains why Benjamini and Hochberg (1995) presented a method which controls the FDR. The adaptive BH procedure (adapted from the classical procedure presented by Benjamini and Hochberg, 1995) is as follows, assuming that n − k_n null hypotheses are true:
- write p_1, p_2, ..., p_n the p-values corresponding to the n tests, sorted in increasing order: p_(1) ≤ p_(2) ≤ ... ≤ p_(n);
- write H_0(1), H_0(2), ..., H_0(n) the corresponding null hypotheses;
- fix π_0 > 0 and let k̂_n be the biggest integer k ∈ {1, ..., n} such that p_(k) ≤ kα/(nπ_0), where α ∈ ]0, 1[;
- H_0(i) is rejected for i = 1, ..., k̂_n.
Benjamini and Hochberg proved that for this procedure, FDR ≤ ((n − k_n)/(nπ_0)) α.
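The step-up rule can be sketched as follows (an illustrative implementation; with π_0 = 1 it reduces to the classical Benjamini and Hochberg (1995) procedure):

```python
def bh_stepup(pvalues, alpha=0.05, pi0=1.0):
    """Step-up rule: reject the hypotheses whose p-values are among the
    k_hat smallest, where k_hat is the largest k with
    p_(k) <= k * alpha / (n * pi0).
    Returns the indices of the rejected null hypotheses."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])   # increasing p-values
    k_hat = 0
    for k in range(1, n + 1):
        if pvalues[order[k - 1]] <= k * alpha / (n * pi0):
            k_hat = k
    return order[:k_hat]
```

For example, with p-values [0.001, 0.9, 0.015, 0.04, 0.5] and α = 0.05, the two smallest p-values pass their thresholds (0.01 and 0.02) while the third does not, so the first and third hypotheses are rejected.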

In the application to microarray data we fixed the coefficient α = 0.05 and calibrated π_0 by simulations at π_0 = 2.9, so that the estimate of k_n was as close as possible to k_n in the case n = 5000, µ = 5 and k_n = 500. As a consequence this procedure controls the FDR at the level 0.016.

MIXT method: The third method is a classical method based on the mixture of two normal distributions pN(µ, s²) + (1 − p)N(0, s²), where p and µ are unknown parameters and s is known. Write α_i for the conditional probability that gene i is differentially expressed, given the observation A_i and the parameters µ and p; A_i is Gaussian with variance s² and null mean if gene i is not differentially expressed. The estimates of p, µ and of the α_i can be obtained by the EM algorithm (Titterington et al., 1985). Writing iter for the number of iterations of the EM algorithm, we estimate k_n by

k̂_n = Σ_{i=1}^{n} 1{α_i^[iter] > 0.5}.
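A minimal EM sketch for this two-component mixture (illustrative; the initialization p = 1/2, µ = max A_i and the fixed iteration count are our ad hoc choices, not those of Titterington et al. (1985)):

```python
import math

def em_mixture(a, s, iters=200):
    """EM iterations for p N(mu, s^2) + (1 - p) N(0, s^2), s known.
    Returns (p, mu, alphas), where alphas[i] estimates the conditional
    probability that gene i is differentially expressed given A_i."""
    def phi(x, m):
        # Gaussian density with mean m and standard deviation s
        return math.exp(-(x - m) ** 2 / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))
    p, mu = 0.5, max(a)
    for _ in range(iters):
        # E step: posterior weight of the non-central component
        alphas = [p * phi(x, mu) / (p * phi(x, mu) + (1.0 - p) * phi(x, 0.0)) for x in a]
        # M step: update the mixture proportion and the non-central mean
        w = sum(alphas)
        p = w / len(a)
        mu = sum(al * x for al, x in zip(alphas, a)) / w
    return p, mu, alphas

def k_hat_mixt(a, s):
    # estimate k_n by the number of posterior probabilities above 1/2
    return sum(1 for al in em_mixture(a, s)[2] if al > 0.5)
```

On a well-separated toy sample such as [0.1, -0.2, 0.05, -0.1, 5.0, 5.2, 4.8] with s = 1, the three observations near 5 receive posterior probabilities close to 1 and k̂_n = 3.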

4.3 Results and discussion
Tables 2 and 3 present the results of the simulations for n = 5000 and n = 10000 respectively, for the four methods. The following notations are used:
1. k̂_n denotes the estimate of the number of differentially expressed genes;
2. F̂DR denotes the estimate (in percentage) of the false discovery rate;
3. F̂NR denotes the estimate (in percentage) of the false negative rate;
4. the rate RDR, in percentage, is defined as RDR = S/k_n, where S is the number of true positives (Table 1).
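For one simulation, these estimated criteria can be computed as follows (an illustrative helper; the names are ours):

```python
def error_rates(selected, truly_noncentral, n):
    """Estimates of FDR, FNR and RDR (in percent) for one simulation.
    selected        : indices declared differentially expressed
    truly_noncentral: indices simulated differentially expressed
    n               : total number of genes"""
    sel, truth = set(selected), set(truly_noncentral)
    v = len(sel - truth)               # false positives V
    t = len(truth - sel)               # false negatives T
    s = len(sel & truth)               # true positives S
    fdr = 100.0 * v / max(1, len(sel))
    fnr = 100.0 * t / max(1, n - len(sel))
    rdr = 100.0 * s / len(truth)
    return fdr, fnr, rdr
```

For instance, selecting {1, 2, 3, 4} when the truly non-central set is {2, 3, 4, 5} among n = 10 genes gives F̂DR = 25%, RDR = 75%.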

The quantities 1, 2, 3 and 4 are calculated on the basis of 1000 simulations. For example, in Table 2 we simulated a microarray experiment in which n = 5000 genes were tested and only k_n = 500 genes were simulated differentially expressed, with three levels of difference of expression for those genes: µ = 3, 5 or 8. In the case µ = 8, the BM method estimates k_n by k̂_n = 519.6. Among these k̂_n = 519.6 genes, F̂DR = 3.8% were not simulated differentially expressed; among the n − k̂_n = 4480.4 genes declared non-differentially expressed, F̂NR = 0% were simulated differentially expressed; and RDR = 100% of the genes simulated differentially expressed are found.

The results given in Tables 2 and 3 suggest the following remarks.
• When we compare Tables 2 and 3, the proportions RDR, F̂NR and F̂DR are in general the same. The four methods thus seem to depend only on the proportion k_n/n, which is an important parameter in the estimator of the variance s². Therefore we only discuss the results presented in Table 3.
• When µ = 8, all the methods give a good estimate of k_n, and all the genes simulated differentially expressed are found: RDR = 100% for almost all the methods. The BM method tends to overestimate k_n: the criterion does not penalize the high dimensions enough. The BH method also tends to overestimate k_n. We could change the constant π_0 to obtain a better estimate, but if we did so the method would tend to underestimate k_n in the case µ = 5 (see the description of the BH method); this does not seem to be an appropriate way to improve the BH method. The MIXT and DDLR methods give good results for the four criteria k̂_n, F̂DR, F̂NR and RDR.
• Consider the case µ = 5. All the methods find more than 96% of the genes simulated differentially expressed. Among the genes found differentially expressed, F̂DR = 3.8% were not simulated differentially expressed for the BM method, against 1.4% for the DDLR and BH methods and 1.5% for the MIXT method. The choice of the method depends on the objective of the user: a user who prefers to find more differentially expressed genes in spite of a high F̂DR (F̂DR ≥ 3.8%) but a better RDR (RDR > 98%) can choose the BM method; a user who prefers to control the error level F̂DR may choose among the BH, DDLR and MIXT methods. It is worth noticing that the BM method does not return a good RDR in the case µ = 3, while in most applications to microarray data µ and σ are not known and the ratio µ is often close to 3 or below.
• In the case µ = 5 the MIXT method seems to be the best method in terms of F̂DR (1.5%) and RDR (98%). However, in the case µ = 3 the F̂DR of the MIXT method is higher (10%), essentially because the mean µ of the genes differentially expressed is very weak, so the method cannot easily separate the genes which are not differentially expressed from the others. On the other hand, in the case µ = 3 the DDLR method finds 57% of the genes simulated differentially expressed, and its F̂DR remains quite weak: 7%.

In real microarray data analysis the mean of the differences of gene expression levels between two conditions is generally small (µ ≤ 3), so the DDLR method seems well adapted to transcriptomic data. As said before, the BM method is difficult to use because some constants must be calibrated in the penalty function. Since the ratio µ is generally close to 3 in microarray data analysis, we should calibrate the constant in the BM penalty function by taking this into account; unfortunately we could not find such a constant, because the method is unstable and does not give equivalent estimates at each simulation. We therefore kept the constant M = 8 in the BM penalty function, calibrated in the case n = 5000, k_n = 500 and µ = 5.

To conclude, for any level of µ our method finds a high proportion of the genes simulated differentially expressed (RDR ≥ 57%) while the F̂DR (≤ 7%) stays quite low. The other methods tend to favour only one of the criteria F̂DR, F̂NR or RDR. Moreover, the DDLR method is very easy to implement and no constant needs to be calibrated. In this approach we assumed that the variance was known; it would be interesting to extend our method to the case where the variance is unknown, as proposed by Huet (2006), who generalized the BM method.


5. Proofs
5.1 Proof of Lemma 1
The variables (U_i)_{k_n+1≤i≤n} are denoted (V_i)_{1≤i≤n−k_n}. We write U_(1) ≥ U_(2) ≥ ... ≥ U_(k_n) and V_(1) ≥ V_(2) ≥ ... ≥ V_(n−k_n). The complementary set of Ω_n is the event {V_(1) > U_(k_n)}. Since U_i = Σ_{k=1}^{ν1} (γ_i N_ik + δ_ik)², we use the inequality 2ab ≥ −a²/2 − 2b², which holds for all a, b ∈ ℝ, to obtain that for all 1 ≤ i ≤ k_n,

U_i/γ_i² ≥ η_i/(2γ_i²) − Σ_{k=1}^{ν1} N_ik².

Recalling that L_(1) = max_{1≤i≤n} Σ_{k=1}^{ν1} N_ik², this implies that

U_(k_n)/γ_(k_n)² ≥ min_{1≤i≤k_n} η_i/(2γ_i²) − L_(1).

Then, since γ_i ≥ 1 for 1 ≤ i ≤ k_n,

pr(Ω_nᶜ) = pr(V_(1) > U_(k_n)) ≤ pr(V_(1)/γ_(k_n)² + L_(1) > min_{1≤i≤k_n} η_i/(2γ_i²)) ≤ pr(V_(1) + L_(1) > min_{1≤i≤k_n} η_i/(2γ_i²)).

Moreover V_(1) ≤ L_(1), hence pr(Ω_nᶜ) ≤ pr(2L_(1) > α_n). Since (L_(1) − a_n)/b_n converges in distribution to F and (α_n/2 − a_n)/b_n → +∞, with H1 we obtain pr(Ω_nᶜ) → 0 as n → ∞. This concludes the proof of Lemma 1.

5.2 Proof of Lemma 2
As in the proof of Lemma 1, we obtain

pr(Ω̃_nᶜ) ≤ pr(2L̃_(1) > (min_{1≤i≤k_n} η_i/(2γ_i²)) (1/D_(1))).

By H2, pr(D_(1) ≤ x_n) → 1 as n → ∞. This implies that

pr(Ω̃_nᶜ) ≤ pr(L̃_(1) ≥ min_{1≤i≤k_n} η_i/(4x_nγ_i²)) + pr(D_(1) > x_n),

which tends to 0 as n → ∞ by assumption H2.

5.3 Proof of Theorem 1
Let us first prove that the function k ↦ τ̂_k is non-increasing. For 1 ≤ k ≤ n − 1,

τ̂_{k−1} = (1/(n−k+1)) Σ_{i=k}^{n} X_(i) = X_(k)/(n−k+1) + (1/(n−k+1)) Σ_{i=k+1}^{n} X_(i) = X_(k)/(n−k+1) + ((n−k)/(n−k+1)) τ̂_k.

Since X_(k) ≥ X_(i) for i ≥ k, we get X_(k) ≥ (1/(n−k)) Σ_{i=k+1}^{n} X_(i) = τ̂_k. This implies that

τ̂_{k−1} ≥ τ̂_k/(n−k+1) + ((n−k)/(n−k+1)) τ̂_k = τ̂_k.

Hence, the function k ↦ τ̂_k is non-increasing.

Let us now prove that pr(T/n > u_n) → 0 as n → +∞. First,

pr(T/n > u_n) ≤ pr({k̂_n/n − k_n/n < −u_n} ∩ Ω_n) + pr(Ω_nᶜ),

and

pr({k̂_n < k_n − nu_n} ∩ Ω_n) ≤ pr({τ̂_{k_n−nu_n} ≤ τ} ∩ Ω_n)
≤ pr({(1/(n−k_n+nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ + (1/(n−k_n+nu_n)) Σ_{i=k_n+1}^{n} X_(i)/τ ≤ 1} ∩ Ω_n).

We define

P_1 = pr({(1/(n−k_n+nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ + (1/(n−k_n+nu_n)) Σ_{i=k_n+1}^{n} X_(i)/τ ≤ 1} ∩ Ω_n).

Then

P_1 ≤ pr({(1/(n−k_n+nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ ≤ 2nu_n/(n−k_n+nu_n)} ∩ Ω_n) + pr((1/(n−k_n+nu_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2nu_n/(n−k_n+nu_n)),

where (Z_i)_{i=1,..,n−k_n} is a sample of independent identically distributed random variables with distribution W, the distribution of X_i/τ under the assumption η_i = 0. Note that E(Z_i) = 1; we denote by v the standard deviation of the Z_i's. We have

pr((1/(n−k_n+nu_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2nu_n/(n−k_n+nu_n)) = pr((Σ_{i=1}^{n−k_n} Z_i − (n−k_n))/(v√(n−k_n)) ≤ −nu_n/(v√(n−k_n))).

The central limit theorem implies that (Σ_{i=1}^{n−k_n} Z_i − (n−k_n))/(v√(n−k_n)) converges in distribution to N(0, 1), and since √n u_n → +∞ we obtain −nu_n/(v√(n−k_n)) → −∞. This implies that

pr((1/(n−k_n+nu_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2nu_n/(n−k_n+nu_n)) → 0 as n → ∞.

Let us now control the other term appearing in the upper bound for P_1:

pr({(1/(n−k_n+nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ ≤ 2nu_n/(n−k_n+nu_n)} ∩ Ω_n) = pr({(1/(nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ ≤ 2} ∩ Ω_n).

On the event Ω_n,

(1/(nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ ≥ Z_(1),

where (Z_1, ..., Z_{n−k_n}) are independent identically distributed with distribution W and Z_(1) = max{Z_i, 1 ≤ i ≤ n−k_n}. Hence

pr({(1/(nu_n)) Σ_{i=k_n−nu_n+1}^{k_n} X_(i)/τ ≤ 2} ∩ Ω_n) ≤ pr(Z_(1) ≤ 2) = (W(2))^{n−k_n},

which tends to 0 as n tends to infinity since we assumed that W(2) < 1. We have proved that P_1 → 0. Since pr(Ω_nᶜ) → 0, by Lemma 1 for sample (1) and by Lemma 2 for sample (2), we obtain that pr(T/n > u_n) → 0 as n → +∞.

It remains to show that pr(V/n > u_n) → 0 as n → +∞. We have

pr(V/n > u_n) ≤ pr({k̂_n/n − k_n/n > u_n} ∩ Ω_n) + pr(Ω_nᶜ),

hence we shall prove that pr({k̂_n/n − k_n/n > u_n} ∩ Ω_n) → 0 as n → ∞. Now

pr({k̂_n − k_n > nu_n} ∩ Ω_n) = pr({τ̂_{k_n+nu_n} > τ} ∩ Ω_n) = pr({(1/(n−(k_n+nu_n))) Σ_{i=k_n+nu_n+1}^{n} X_(i)/τ > 1} ∩ Ω_n).

Let (Z_i)_{i=1,..,n−k_n} be a sample of independent random variables with common cumulative distribution function W, and write Z_(1) ≥ Z_(2) ≥ ... ≥ Z_(n−k_n). Then

pr({k̂_n − k_n > nu_n} ∩ Ω_n) = pr({(1/(n−(k_n+nu_n))) Σ_{i=nu_n+1}^{n−k_n} Z_(i) > 1} ∩ Ω_n)
≤ pr(Σ_{i=1}^{n−k_n} Z_(i) − Σ_{i=1}^{nu_n} Z_(i) > n − (k_n+nu_n))
≤ pr(Σ_{i=1}^{n−k_n} Z_(i) > n − k_n + √(n−k_n) √n u_n) + pr(−Σ_{i=1}^{nu_n} Z_(i) > −nu_n − √(n−k_n) √n u_n).

We set

P_2 = pr(Σ_{i=1}^{n−k_n} Z_(i) > n − k_n + √(n−k_n) √n u_n),
P_3 = pr(−Σ_{i=1}^{nu_n} Z_(i) > −nu_n − √(n−k_n) √n u_n).

Since Σ_{i=1}^{n−k_n} Z_(i) = Σ_{i=1}^{n−k_n} Z_i,

P_2 = pr((Σ_{i=1}^{n−k_n} Z_i − (n−k_n))/(v√(n−k_n)) > √n u_n / v).

The central limit theorem implies that (Σ_{i=1}^{n−k_n} Z_i − (n−k_n))/(v√(n−k_n)) converges in distribution to N(0, 1), and since √n u_n → +∞ we obtain P_2 → 0.

Moreover,

P_3 = pr((1/(nu_n)) Σ_{i=1}^{nu_n} Z_(i) < 1 + √(n−k_n)/√n),

which implies that

P_3 ≤ pr((1/(nu_n)) Σ_{i=1}^{nu_n} Z_(i) ≤ 2) ≤ pr(Z_(nu_n) ≤ 2) ≤ pr(Z_(n−k_n) ≤ ... ≤ Z_(nu_n) ≤ 2) ≤ pr(Σ_{i=1}^{n−k_n} (1_{Z_i≤2} − p) ≥ n − k_n − nu_n − (n−k_n)p),

where p = pr(Z_i ≤ 2) ∈ ]0, 1[. We define ζ_i = 1_{Z_i≤2} − p; then E(ζ_i) = 0 and |ζ_i| ≤ 1, and

P_3 ≤ pr(Σ_{i=1}^{n−k_n} ζ_i ≥ (n−k_n)(1−p) − nu_n) = pr((1/(n−k_n)) Σ_{i=1}^{n−k_n} ζ_i ≥ 1 − p − nu_n/(n−k_n)).

We use Hoeffding's inequality, which we first recall.

Lemma 3 Let (W_i)_{i=1,..,n} be independent identically distributed random variables such that a ≤ W_i ≤ b for all i and E(W_i) = 0. Then for x > 0,

pr((1/n) Σ_{i=1}^{n} W_i ≥ x) ≤ exp(−2nx²/(b−a)²).

We apply Hoeffding's inequality to the sample (ζ_i)_{i=1,..,n−k_n} with x = x_n = 1 − p − nu_n/(n−k_n). Since u_n → 0 and k_n = λn, there exists n_0 ∈ ℕ such that for all n ≥ n_0, x_n ≥ (1−p)/2 > 0. For n ≥ n_0,

P_3 ≤ exp(−((n−k_n)/2) ((1−p)/2)²).

Then P_3 → 0 as n → ∞. This concludes the proof of Theorem 1.

References
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Statist. Soc. Ser. B, 57, 289–300.
Benjamini, Y. & Hochberg, Y. (2000). The adaptive control of the false discovery rate in multiple comparison problems. The Journal of Educational and Behavioral Statistics, 25, 1, 60–83.
Birgé, L. & Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc., 3, 203–268.
Bordes, L., Delmas, C. & Vandekerkhove, P. (2006). Semiparametric estimation of a two-component mixture model when a component is known. Scandinavian Journal of Statistics, to appear.
Brown, P. & Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nat. Genet. Suppl., 21, 33–37.
Dudbridge, F. & Koeleman, B. P. (2003). Rank truncated product of P-values, with application to genomewide association scans. Genet. Epidemiol., 25, 360–366.
Dudbridge, F. & Koeleman, B. P. (2004). Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies. Am. J. Hum. Genet., 75, 424–435.
Genovese, C. R. & Wasserman, L. (2002). Operating characteristics and extensions of the FDR procedure. J. Royal Statist. Soc. Ser. B, 64, 499–518.
Genovese, C. R. & Wasserman, L. (2004). A stochastic process approach to false discovery control. Annals of Statistics, 32, 1035–1061.
Haaland, D. & O'Connell, M. A. (1995). Inference for effect-saturated fractional factorials. Technometrics, 37, 1.
Hoh, J., Wille, A. & Ott, J. (2001). Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res., 11, 2115–2119.
Huet, S. (2006). Model selection for estimating the non zero components of a Gaussian vector. ESAIM: Probability and Statistics, to appear.
Kerr, M. K., Martin, M. & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Comput. Biol., 7, 819–837.
Meinshausen, N. & Rice, J. (2004). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Annals of Statistics, to appear.
Resnick, S. I. (1987). Extreme Values, Regular Variation, and Point Processes. Springer-Verlag, Applied Probability Trust.
Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Chichester, UK: John Wiley and Sons.


Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H. & Weir, B. S. (2002). Truncated product method for combining P-values. Genet. Epidemiol., 22, 170–185.


                                      decision
                           H_0 accepted   H_0 rejected     total
true null hypotheses            U              V          n − k_n
non-true null hypotheses        T              S            k_n
total                        n − k̂_n          k̂_n            n

Table 1: Number of errors when testing n hypotheses.


method   µ     k̂_n     F̂DR    F̂NR     RDR
BM       3    261.9    4.7    5.3     49.9
         5    511.1    3.8    0.2     98.3
         8    519.6    3.8    0.0    100.0
BH       3    110.7    0.7    8.0     22.0
         5    490.4    1.5    0.4     96.6
         8    507.4    1.5    0.0    100.0
MIXT     3    408.3    9.9    2.9     73.5
         5    497.3    1.5    0.2     98.0
         8    500.0    0.1    0.0     99.9
DDLR     3    306.2    7.2    4.6     56.8
         5    487.0    1.5    0.5     95.9
         8    500.3    0.6    0.1     99.5

Table 2: Comparison between BH, BM, MIXT and DDLR methods for different values of µ, in the case n = 5000 genes and k_n = 500 genes simulated differentially expressed.


method   µ     k̂_n      F̂DR    F̂NR     RDR
BM       3    521.5     4.6    5.3     49.8
         5   1023.0     3.8    0.2     98.4
         8   1039.7     3.8    0.0    100.0
BH       3    220.2     0.6    8.0     21.9
         5    980.2     1.4    0.4     96.6
         8   1015.0     1.5    0.0    100.0
MIXT     3    820.2    10.0    2.9     73.8
         5    994.9     1.5    0.2     98.0
         8    999.9     0.0    0.0    100.0
DDLR     3    613.2     7.0    4.6     57.0
         5    973.5     1.4    0.4     96.0
         8    997.6     0.3    0.1     99.5

Table 3: Comparison between BH, BM, MIXT and DDLR methods for different values of µ, in the case n = 10000 genes and k_n = 1000 genes simulated differentially expressed.

