Significance testing in non-sparse high-dimensional linear models

Yinchu Zhu
Lundquist College of Business, University of Oregon, Eugene, OR 97403, USA
e-mail: [email protected]

Jelena Bradic
Department of Mathematics, University of California, San Diego, La Jolla, CA 92093, USA
e-mail: [email protected]

Abstract: In high-dimensional linear models, the sparsity assumption is typically made, stating that most of the parameters are equal to zero. Under the sparsity assumption, estimation and, more recently, inference have been well studied. However, in practice, the sparsity assumption is not verifiable and, more importantly, is often violated, with a large number of covariates expected to be associated with the response, indicating that possibly all, rather than just a few, parameters are non-zero. A natural example is genome-wide gene expression profiling, where all genes are believed to affect a common disease marker. We show that existing inferential methods are sensitive to the sparsity assumption and may, in turn, suffer a severe lack of Type I error control. In this article, we propose a new inferential method, named CorrT, which is robust to model misspecification and adaptive to the sparsity assumption. CorrT is shown to have Type I error approaching the nominal level for any model and Type II error approaching zero for sparse and many dense models. In fact, CorrT is also shown to be optimal in a variety of frameworks: sparse, non-sparse and hybrid models, where sparse and dense signals are mixed. Numerical experiments show a favorable performance of the CorrT test compared to state-of-the-art methods.
1. Introduction

Rapid advances in information technology have permanently changed the data collection and data analysis processes. New scientific discoveries are increasingly based on massive amounts of observations, aggregated from many disparate origins. Such datasets are common throughout many disciplines in modern science and are typically of extremely high dimension. In response to the high dimensionality and large sample size, many modeling assumptions have been imposed. One of the fundamental assumptions that underlie most of the existing work is the sparsity assumption on the model parameter. Popular methods, such as the LASSO (Tibshirani (1996)) and SCAD (Fan and Li (2001)), have benefited greatly from sparsity principles and have led to substantial improvements in prediction accuracy and model interpretation; see, e.g., the review article by
Fan and Lv (2010) and the monographs by Bühlmann and van de Geer (2011), Dudoit and van der Laan (2007), Efron (2010) and Hastie et al. (2009). Penalized methods in general have been applied very successfully in numerous areas, including genomics, genetics, engineering, neuroscience and economics. The sparsity condition is also the cornerstone of high-dimensional inferential methods, e.g., Fan and Li (2001), Wasserman and Roeder (2009), Chatterjee et al. (2013), Bühlmann (2013); see Section 1.3 for more details on recent developments.

However, the sparsity assumption could be unrealistic in contemporary large-scale datasets (Hall et al., 2014; Ward, 2009; Jin and Ke, 2014; Pritchard, 2001). Indeed, Dicker (2014), Donoho and Jin (2015), Kraft and Hunter (2009) and Janson et al. (2015) provide evidence that such an ideal assumption might not be valid. Additionally, despite its critical importance, the sparsity condition cannot be verified in practice. As there are no formal procedures for assessing its validity, a natural question arises – can adaptive inferential procedures be developed in extremely high dimensions that rely on no sparsity assumption? One of the most difficult high-dimensional models is perhaps the one with extremely dense parameters. In this article, a high-dimensional vector is said to be extremely dense if the number of non-zero elements is proportional to its dimension, so that the vector cannot be approximated in $\ell_2$-norm by a sparse vector. In Section 1.2, we illustrate the challenges of this setting and show that naively interpreting the results of methods that heavily rely on the sparsity assumption can lead to misleading or even spurious conclusions if the sparsity condition fails.

1.1. Goals of this article

Although our method applies more broadly, our focus will be on one of the most widely used models in high dimensions, the linear regression model,
$$Y = X\gamma^* + Z\beta^* + \varepsilon,$$
(1)
where $Z \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n\times p}$ are the design matrices and $\varepsilon \in \mathbb{R}^n$ is the error term. In what follows, we refrain from assuming a Gaussian distribution on the design and/or errors, or homoscedasticity of the errors ε. Observe that $\gamma^* \in \mathbb{R}^p$ and $\beta^* \in \mathbb{R}$ are unknown model parameters, and we are interested in the p ≥ n case with $\log p = o(n^\tau)$ for some τ > 0. In this article, we focus on the fundamental problem of significance testing of single entries of the model parameter, i.e.,
$$H_0: \beta^* = \beta_0 \quad \text{versus} \quad H_1: \beta^* \neq \beta_0. \qquad (2)$$
Here, we do not make any sparsity assumption on the nuisance parameter γ∗ – it need not be sparse at all. Testing individual components of the parameter has important applications in practice and is a prerequisite in statistical inference; see, e.g., Shao (2003) and Lehmann and Romano (2006). For example, β∗ may represent the effect of a therapy/drug Z on the response after controlling for the impact of high-dimensional and
non-sparse genetic markers X. As this problem is extremely difficult, it is the cornerstone of developing inferential methods for general hypotheses in high-dimensional and extremely dense models, an area of statistical inference that has not yet been addressed in the existing literature. The primary goal of this paper is to tackle this problem by developing sparsity-robust tests for the hypothesis (2) that are adaptive to the existence of sparsity, or lack thereof, as well as to the existence of heteroscedasticity. We say that a test is sparsity-robust if its Type I error is asymptotically exact regardless of whether or not γ∗ is sparse, it is optimal when γ∗ is sparse, and it has vanishing Type II error for a large class of models, namely those of hybrid nature where the signal has a small number of strong elements and a large number of weak elements (whose aggregate size is not small).

We construct a new procedure for testing the hypothesis (2) while assuming no sparsity of γ∗ and allowing for p ≫ n. The proposed test is shown to have Type I error asymptotically equal to the nominal level while allowing γ∗ to be a fully dense vector of size p such that n/p → 0 as n → ∞. To the best of our knowledge, our work provides the first test of the hypothesis (2) in such generality. Moreover, we show that, in comparison to existing work, our robustness to the lack of sparsity does not come at the expense of efficiency. Specifically, whenever the sparsity condition holds, CorrT is shown to be optimal and matches existing methods developed for sparse linear models in terms of Type II errors. Hence, the proposed test is sparsity-adaptive in that it automatically adapts to the unknown sparsity of the vector γ∗. Additionally, we show that the CorrT test achieves $\sqrt{n}$ optimality under a much wider class of models where γ∗ is allowed to be an extremely dense nuisance parameter with small individual entries. In this way, our work makes a unique contribution to the literature and opens the door to many open problems.

1.2. Motivating example: a challenge of the lack of sparsity

To better understand whether discoveries based on current state-of-the-art methods in high-dimensional inference can be spurious in models that lack sparsity, we consider the problem of testing (2) in the model (1) with β0 = 0. Assume the simple setup of orthogonal designs and known noise level: the entries of Z, X and ε are known to be independent standard normal random variables. Moreover, assume that log p = o(n). Let β∗ = 0 and $\gamma^* = ap^{-1/2}\mathbf{1}_p$, where a ∈ [−10, 10] is a fixed constant and $\mathbf{1}_p = (1, \ldots, 1)^\top \in \mathbb{R}^p$. It is easy to see that γ∗ is not sparse if a ≠ 0. Let $\pi^* = (\beta^*, \gamma^{*\top})^\top \in \mathbb{R}^{p+1}$ and $W = (Z, X) \in \mathbb{R}^{n\times(p+1)}$. With $\lambda = 16\sqrt{n^{-1}\log p}$, we define the Lasso estimator
$$\hat\pi = \arg\min_{\pi\in\mathbb{R}^{p+1}} \frac{1}{2n}\|Y - W\pi\|_2^2 + \lambda\|\pi\|_1.$$
The de-biased estimator is then defined as $\tilde\pi = \hat\pi + \hat\Theta W^\top(Y - W\hat\pi)/n$, where $\hat\Theta$ is the estimated precision matrix of the design.1 Let $\tilde\beta$ be the first entry of $\tilde\pi$.

1 Since the true precision matrix is $I_{p+1}$, the $(p+1)\times(p+1)$ identity matrix, we set $\hat\Theta = I_{p+1}$ for simplicity; our result still holds if we use the nodewise Lasso estimator in Equation (8) of Van de Geer et al. (2014). Moreover, observe that the method of Javanmard and Montanari (2014b) is equivalent to that of Van de Geer et al. (2014) in this particular setting. By Theorem 2.2 therein, $\sqrt{n}(\tilde\beta - \beta^*) = \xi + o_P(1)$, where ξ conditional on W has a normal distribution with mean zero and variance $Z^\top Z/n$.
A Wald test for the hypothesis β∗ = 0 with nominal size α ∈ (0, 1) can then be defined as a decision to reject the hypothesis whenever $|\tilde\beta| > \Phi^{-1}(1-\alpha/2)/\sqrt{n}$, where Φ(·) is the cumulative distribution function of the standard normal distribution. The Type I error of this test is characterized in the following result.

Theorem 1. In the above setup, the high-dimensional Wald test satisfies
$$\lim_{n\to\infty} P\left(|\tilde\beta| > \Phi^{-1}(1-\alpha/2)/\sqrt{n}\right) = F(\alpha, a), \quad \text{where } F(\alpha, a) = 2 - 2\Phi\left(\frac{\Phi^{-1}(1-\alpha/2)}{\sqrt{1+a^2}}\right).$$
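The size distortion can be evaluated directly from the closed form of F(α, a). The short sketch below, which assumes only NumPy and SciPy and uses a function name of our choosing, reproduces the values plotted in Figure 1.

```python
import numpy as np
from scipy.stats import norm

def wald_type_I_error(alpha, a):
    """Asymptotic Type I error F(alpha, a) of the high-dimensional Wald test
    when gamma* = a * p^{-1/2} * 1_p (Theorem 1)."""
    z = norm.ppf(1 - alpha / 2)
    return 2 - 2 * norm.cdf(z / np.sqrt(1 + a ** 2))

print(wald_type_I_error(0.01, 0))   # 0.01: nominal level recovered when a = 0 (sparse)
print(wald_type_I_error(0.01, 10))  # ~0.80: severe size distortion when a = 10 (dense)
```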
It is immediately obvious that when the model is sparse, i.e., a = 0, we have F(α, a) = α, and thus the Type I error of the test is asymptotically equal to the nominal level. However, when the model is not sparse, i.e., a ≠ 0, we have F(α, a) > α (see Figure 1), meaning that the Type I error asymptotically exceeds the nominal level. In fact, in the non-sparse case, the Type I error of the Wald test can be close to one. In Figure 1, we see that the state-of-the-art Wald test with nominal size 1% can reject, in large samples, a true null hypothesis with probability as high as 80%. In the proof of Theorem 1, we show that $\hat\pi = 0$ with high probability; this turns out to be a minimax optimal estimator in $\ell_2$-balls, see Dicker (2016). However, even though the zero vector resulting from (entry-wise) shrinkage methods might be a good approximation of π∗ in the $\ell_\infty$-norm, this approximation is poor for inference problems. Hence, this illustrates the drawback of the Wald principle in non-sparse models in general, as even optimal estimators might not be good enough for this principle to perform well.

[Figure 1: Plot of the asymptotic Type I error of the high-dimensional Wald test in a linear model with independent design. The horizontal axis denotes a, where $\gamma^* = ap^{-1/2}\mathbf{1}_p$, and the vertical axis denotes the rejection probability F(α, a) under the null hypothesis; curves shown for α = 0.01, 0.02, 0.05, 0.1.]
Observe that, in practice, rejecting the hypothesis that a certain regression coefficient is zero is typically interpreted as evidence supporting a new scientific discovery, e.g., a treatment effect. However, due to the non-robustness to the lack of sparsity demonstrated above, researchers could, with high probability, obtain "discoveries" that do not exist. Therefore, new inferential tools are called for that do not reject true hypotheses too often, regardless of whether the sparsity condition holds.

1.3. Relationship with the existing literature

Our work contributes to the existing literature on inference in dense high-dimensional models, which is scarce and focuses almost exclusively on linear models. Among the existing work, we are unaware of any contributions that allow n/p → 0 while not imposing any sparsity assumptions. Even a more moderate regime with n/p → δ ∈ (0, 1) has been scarcely investigated. Janson et al. (2015) study the inference problem for the signal-to-noise ratio, whereas inference on the noise level is studied by Dicker (2014). In such settings, Bühlmann (2013) proposes a ridge-based method for the inference problem (2); however, tight Type I error control is achieved at the expense of sub-optimal power. In a recent paper, Chernozhukov, Hansen and Liao (2015) study a hybrid model in which the parameter vector has a few large entries and many small entries. The authors propose a "lava" procedure for estimation by imposing both Lasso and ridge penalties; however, inference has not been addressed there. What distinguishes our treatment of this problem is that we directly allow "non-polynomial dimensionality", i.e., $\log p = o(n^\tau)$ with τ > 0.

The significance testing problem (2) in sparse, high-dimensional models is much better understood, with roughly three broad techniques. The "de-biasing" technique (Zhang and Zhang, 2014; Javanmard and Montanari, 2014a; Van de Geer et al., 2014) and the double-selection method (Belloni et al., 2014) aim to construct an asymptotically normal estimator of β∗ and use it for inference. Score-based approaches using certain orthogonalizations (Ning and Liu, 2014; Chernozhukov, Hansen and Spindler, 2015) have also been proposed. These methods require $\|\gamma^*\|_0 = o(\sqrt{n}/\log p)$. Under Gaussian designs and errors, the sparsity assumption can be somewhat weakened (Javanmard and Montanari, 2015; Cai and Guo, 2015), but the condition $\|\gamma^*\|_0 = O(n/\log p)$ is still imposed. Recent work of Zhu and Bradic (2016a,b) considers one-sample and two-sample high-dimensional linear models with potentially non-sparse parameters. It is worth pointing out that the first article discusses linear testing with exclusively Gaussian designs and innovations and is designed to be efficient for dense loading vectors. In contrast, we design a test here that is efficient and optimal for dense models with non-Gaussian distributions and possible heteroscedasticity – a setting rarely allowed in high dimensions. Additionally, the latter article focuses on a different inference problem of two-sample testing. As both linear testing and two-sample testing are more challenging problems, these two articles do not address the power issue in the case of fully dense parameters. In this article, however, we show that the proposed doubly-robust test achieves the optimal detection rate even for extremely dense, heteroscedastic and high-dimensional models, thereby broadening the scope of existing work. As a consequence, we are able to provide
guarantees for models of a hybrid nature, where the model is a composition of a dense and a sparse signal, under weak assumptions on the model and on the distribution of the errors.

The main contribution of our work is to provide a test for problem (2) with tight control over both Type I and Type II errors while making no assumptions on the sparsity or structure of the nuisance parameters. The methodology proposed in this paper fundamentally differs from the existing work above. It is based on the idea of exploiting the implication of the null hypothesis: instead of directly estimating the parameter under testing, we test a moment condition that is equivalent to the null hypothesis. This is the key feature of our methodology. Notice that β∗ is quite difficult, if not impossible, to estimate in the presence of a non-sparse γ∗ with p ≫ n, causing problems for methods that rely on accurate estimation of β∗. In contrast, our proposed approach of testing moment conditions benefits from a special design of the constructed moment condition, such that inaccurate estimation of β∗ does not present any problem to the asymptotic validity of the test.

1.4. Notation and organization

Throughout the article, we use bold upper-case letters for matrices and lower-case letters for vectors. Moreover, ⊤ denotes the matrix transpose and $I_p$ denotes the p × p identity matrix. For a vector $v \in \mathbb{R}^k$, $v_j$ denotes its j-th entry and its $\ell_q$-norm is defined as follows: $\|v\|_q = (\sum_{i=1}^k |v_i|^q)^{1/q}$ for $q \in (0, \infty)$, $\|v\|_\infty = \max_{1\le i\le k}|v_i|$ and $\|v\|_0 = \sum_{i=1}^k \mathbb{1}\{v_i \neq 0\}$, where $\mathbb{1}$ denotes the indicator function. For any matrix A, $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the minimal and maximal singular values of A. For two sequences $a_n, b_n > 0$, we write $a_n \asymp b_n$ to denote that there exist positive constants $C_1, C_2 > 0$ such that, for all n, $a_n \le C_1 b_n$ and $b_n \le C_2 a_n$.

We also introduce two definitions that will be used frequently. The sub-Gaussian norm of a random variable X is defined as $\|X\|_{\psi_2} = \sup_{q\ge 1} q^{-1/2}(\mathbb{E}|X|^q)^{1/q}$, whereas the sub-Gaussian norm of a random vector $Y \in \mathbb{R}^k$ is $\|Y\|_{\psi_2} = \sup_{\|v\|_2=1}\|v^\top Y\|_{\psi_2}$. A random variable or a vector is said to be sub-Gaussian if its sub-Gaussian norm is finite. Moreover, a random variable X is said to have an exponential-type tail with parameter (b, γ) if, for all x > 0, $P(|X| > x) \le \exp[1 - (x/b)^\gamma]$. This is a generalization of the sub-Gaussian property; see Vershynin (2010).

The rest of the paper is organized as follows. Section 2 discusses the main idea behind the proposed CorrT procedure. It introduces a novel moment construction technique and a construction of a self-normalizing test statistic related to the constructed moment. Subsection 2.2 designs tuning-adaptive estimators that are particularly useful for estimation in potentially non-sparse models. The main theoretical results are summarized in Section 3.1. Section 3.2.1 discusses the sparsity-adaptive property of the proposed test and its relation to the oracle, i.e., the optimal test in the case where the model is truly sparse. Section 4 presents numerical examples and contrasts the outcomes with two state-of-the-art methods. The detailed proofs of the main results and of a number of auxiliary results are collected in the Supplementary Materials.
2. CorrT: a robust methodology

In this section, we introduce a test statistic named CorrT, which will prove useful for the construction of inference methods that do not depend on the underlying model sparsity. We begin by introducing a two-step methodology that first "transforms" the null hypothesis into a moment condition and then designs a suitable estimator of the moment in question.

2.1. Moment condition

Since high-quality estimation of β∗ is not guaranteed in the presence of a high-dimensional, dense γ∗, we take an "indirect" approach and proceed by formulating the testing problem (2) differently, in order to eliminate the sparsity requirement. We begin by observing that a new variable, a pseudo-response,
$$V := Y - Z\beta_0$$
satisfies the linear model V = Xγ∗ + e, with error e = Z(β∗ − β0) + ε. In this new model, whenever the null hypothesis holds, the design X is uncorrelated with the error e, and e = ε. However, under the alternative hypothesis H1, the error e might be correlated with the design X through Z, indicating a possible problem due to confounding effects. To solve this problem, we formally introduce a model which accounts for the dependence between X and Z. Namely, we propose to consider a linear model of the form
$$Z = X\theta^* + u, \qquad (3)$$
where $\theta^* \in \mathbb{R}^p$ is a sparse and unknown parameter and $u \in \mathbb{R}^n$ is the model error. We allow $u_i$ to be dependent on $\varepsilon_i$ (see Condition 1 below). Additionally, we allow model misspecification above by considering heteroscedastic innovations $u_i$. Such an assumption is in line with the setup in Wang et al. (2015) and Belloni et al. (2014), and can be an alternative way of stating the assumption of sparse precision matrices; see Van de Geer et al. (2014), where a node-wise Lasso method essentially estimates model (3). Notice that this includes orthogonal features as a special case.

Notice that, under (1) and (3), one can rewrite the null hypothesis H0 in terms of, what is in effect, a univariate moment, $\mathbb{E}\left[(V - X\gamma^*)^\top(Z - X\theta^*)\right]$. Since V − Xγ∗ = e = Z(β∗ − β0) + ε, we observe the following relationship:
$$\mathbb{E}\left[(V - X\gamma^*)^\top(Z - X\theta^*)\right]/n = (\beta^* - \beta_0)\,\mathbb{E}(n^{-1}u^\top u).$$
Hence, solving the inference problem (2) becomes equivalent to testing
$$H_0: \mathbb{E}\left[(V - X\gamma^*)^\top(Z - X\theta^*)\right] = 0 \quad \text{versus} \quad H_1: \mathbb{E}\left[(V - X\gamma^*)^\top(Z - X\theta^*)\right] \neq 0. \qquad (4)$$
This radically changes the strategy for solving the initial hypothesis testing problem. The inner product structure of the moment condition above allows for certain inaccurate estimation in V − Xγ∗ and alleviates the need for a "good" estimator of a dense vector γ∗; see Section 2.3 for more discussion. When estimators for γ∗ and θ∗ are carefully designed (as formulated below), our construction is robust to the inaccuracy in the estimator for γ∗.
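The moment identity above is easy to check by simulation. The sketch below, under hypothetical choices of γ∗ (fully dense) and θ∗ (one-sparse), verifies that the empirical moment concentrates around $(\beta^* - \beta_0)\mathbb{E}(u_1^2)$: zero under the null and non-zero under the alternative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 500
beta0, beta_star = 0.0, 0.0                    # null holds; set beta_star != 0 for H1
gamma_star = rng.normal(size=p) / np.sqrt(p)   # dense nuisance (hypothetical choice)
theta_star = np.zeros(p); theta_star[0] = 0.7  # sparse feature model (3)

moments = []
for _ in range(500):
    X = rng.normal(size=(n, p))
    u = rng.normal(size=n)
    Z = X @ theta_star + u
    eps = rng.normal(size=n)
    Y = X @ gamma_star + Z * beta_star + eps
    V = Y - Z * beta0
    # empirical version of the moment in (4), evaluated at the true gamma*, theta*
    moments.append((V - X @ gamma_star) @ (Z - X @ theta_star) / n)

print(np.mean(moments))   # ~ (beta_star - beta0) * E[u^2]; ~0 under H0
```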
2.2. Self-adaptive and sparsity-adaptive estimation

We now provide estimates of the unknown parameters γ∗ and θ∗, which will be used to construct the CorrT test for the moment condition (4). On the estimation side, there are many procedures available for fitting sparse regression models, e.g., the Lasso estimator, the Dantzig selector and the self-tuning methods of Belloni et al. (2011), Sun and Zhang (2012) and Gautier and Tsybakov (2013). However, as the model may no longer be sparse, a candidate estimator for γ∗ must be more stable than the vanilla Lasso or Dantzig-type estimators, i.e., the candidate estimator must behave "reasonably well" even in dense settings. We propose a novel estimation scheme with the following property: when the estimation target fails to be sparse, the estimator is stable; when the estimation target is sparse, the estimator automatically achieves consistency and does not require knowledge of the noise level. We achieve this by combining two estimators, one that is optimal under sparse settings and another that satisfies a basic requirement of "stability" in non-sparse settings.

Our method is based on the solution path of the following linear program. For a > 0, let
$$\tilde\gamma(a) := \arg\min_{\gamma \in \mathbb{R}^p} \|\gamma\|_1 \quad \text{s.t.} \quad
\begin{cases}
\|n^{-1}X^\top(V - X\gamma)\|_\infty \le \eta_0 a \\
\|V - X\gamma\|_\infty \le \|V\|_2/\log^2 n \\
n^{-1}V^\top(V - X\gamma) \ge \rho_n n^{-1}\|V\|_2^2,
\end{cases} \qquad (5)$$
where the constants are $\eta_0 = 1.1 n^{-1/2}\Phi^{-1}(1 - p^{-1}n^{-1})$ and $\rho_n = 0.01/\sqrt{\log n}$. The first inequality in (5), much like the Dantzig selector, provides a necessary gradient condition guaranteeing that the gradient of the loss is close to zero. Observe that the constant a is different from $(\mathbb{E}\varepsilon_i^2)^{1/2}$, as the model is heteroscedastic in nature; the optimal choice of a is driven by (6) below. The constant η0 is scale-free and is derived from moderate deviation results for self-normalized sums. The second and third inequalities introduce stability in estimation whenever the model parameter is grossly non-sparse: under the null hypothesis, these constraints limit the growth of the estimated residuals and prevent a choice of a that is too large. Additionally, it is worth pointing out that the estimator above changes with the null hypothesis, as V = Y − Zβ0 – and because of this aspect the proposed method is able to achieve adaptivity to the problem of interest. If one is interested in a different model parameter, it suffices to reparametrize the initial model and repeat the estimation.

Estimation of the variance is extremely important for any testing problem. Since we consider a heteroscedastic model, the variance of the error terms might not be informative. For this reason, we invoke moderate deviation results for self-normalized sums
and would like a to target the quantity $a_{0,\gamma}$, where $a_{0,\gamma}^2 = \max_{1\le j\le p} n^{-1}\sum_{i=1}^n x_{i,j}^2\varepsilon_i^2$. This quantity mimics the celebrated White (1980) heteroscedasticity-robust standard error. Hence, we treat the unknown quantity $a_{0,\gamma}$ as a parameter, which we estimate by choosing it along the path of $a \mapsto \tilde\gamma(a)$ as the largest element of the set $S_\gamma$ defined as
$$S_\gamma = \left\{ a \ge 0 : \frac{3}{2}\,a \;\ge\; \sqrt{\max_{1\le j\le p} n^{-1}\sum_{i=1}^n x_{i,j}^2\left(v_i - x_i^\top\tilde\gamma(a)\right)^2} \;\ge\; \frac{1}{2}\,a \right\}. \qquad (6)$$
In other words, $\hat a_\gamma = \arg\max\{a : a \in S_\gamma\}$. The requirements in the set $S_\gamma$ can be viewed as a data-dependent way of detecting sparsity and selecting the tuning parameter a. Our final estimator is then defined as the following combination:
$$\hat\gamma = \tilde\gamma(\hat a_\gamma)\,\mathbb{1}\{S_\gamma \neq \emptyset\} + \tilde\gamma\left(2\|X\|_\infty\|V\|_2/\sqrt{n}\right)\mathbb{1}\{S_\gamma = \emptyset\}. \qquad (7)$$
As with many high-dimensional estimators, including the Lasso, the solution of the above optimization problem might not be unique, and $\tilde\gamma(a)$ is defined as any minimizer of the optimization problem (5); see Section 2.4 for computational issues. The theory established in Section 3 holds for any minimizer.

This estimator is accurate under sparsity and stable for dense models. When γ∗ is sparse, we show that $S_\gamma \neq \emptyset$ with probability approaching one and that the constraint on $\|n^{-1}X^\top(V - X\gamma)\|_\infty$ and the definition of $S_\gamma$ induce the self-adapting property by adjusting $\hat a_\gamma$ to a magnitude similar to the true level $\sqrt{\mathbb{E}[x_{i,1}^2\varepsilon_i^2]}$. Thus, for sparse γ∗, $\tilde\gamma(\hat a_\gamma)$ is consistent and behaves like other sparsity-based estimators, such as the Lasso or Dantzig selector with an ideal choice of tuning parameters. When γ∗ is not sparse, the constraints in the definition of $\tilde\gamma(\cdot)$ guarantee that $\|X^\top(V - X\hat\gamma)\|_\infty/\|V - X\hat\gamma\|_2$ is not growing too fast; see Lemma 3 in the Supplementary Materials. Hence, the resulting estimator $\hat\gamma$ is not only stable but is also able to automatically "detect" sparseness and exploit it to achieve estimation accuracy.

The estimation of θ∗ is done in a similar manner. Define $a \mapsto \tilde\theta(a)$ by
$$\tilde\theta(a) := \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_1 \quad \text{s.t.} \quad
\begin{cases}
\|n^{-1}X^\top(Z - X\theta)\|_\infty \le \eta_0 a \\
\|Z - X\theta\|_\infty \le \|Z\|_2/\log^2 n \\
n^{-1}Z^\top(Z - X\theta) \ge \rho_n n^{-1}\|Z\|_2^2,
\end{cases} \qquad (8)$$
where η0 and ρn are as defined before. Let $\hat a_\theta = \arg\max\{a : a \in S_\theta\}$, where
$$S_\theta = \left\{ a \ge 0 : \frac{3}{2}\,a \;\ge\; \sqrt{\max_{1\le j\le p} n^{-1}\sum_{i=1}^n x_{i,j}^2\left(z_i - x_i^\top\tilde\theta(a)\right)^2} \;\ge\; \frac{1}{2}\,a \right\}.$$
The final estimator is then defined as
$$\hat\theta = \tilde\theta(\hat a_\theta)\,\mathbb{1}\{S_\theta \neq \emptyset\} + \tilde\theta\left(2\|X\|_\infty\|Z\|_2/\sqrt{n}\right)\mathbb{1}\{S_\theta = \emptyset\}. \qquad (9)$$
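To make the construction concrete, the sketch below implements the program (5) with an off-the-shelf LP solver and the selection rule (6)-(7) over a coarse grid of a values. The grid is a simplification we introduce for illustration; Section 2.4 instead traces the exact solution path with the parametric simplex method. The procedure for $\hat\theta$ in (8)-(9) is identical with V replaced by Z.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def gamma_tilde(X, V, a, eta0, rho_n):
    """Solve the linear program (5) for a fixed a; gamma = b_plus - b_minus, b >= 0."""
    n, p = X.shape
    G, g = X.T @ X / n, X.T @ V / n
    s = np.linalg.norm(V) / np.log(n) ** 2
    A = np.vstack([np.hstack([ G, -G]),                  # gradient constraint, upper side
                   np.hstack([-G,  G]),                  # gradient constraint, lower side
                   np.hstack([ X, -X]),                  # sup-norm constraint, upper side
                   np.hstack([-X,  X]),                  # sup-norm constraint, lower side
                   np.hstack([ g, -g]).reshape(1, -1)])  # correlation constraint
    b_ub = np.concatenate([eta0 * a + g, eta0 * a - g,
                           V + s, s - V, [(1 - rho_n) * V @ V / n]])
    res = linprog(np.ones(2 * p), A_ub=A, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:] if res.success else None

def gamma_hat(X, V, grid=np.geomspace(1e-2, 10, 25)):
    """Combination estimator (7): gamma_tilde at the largest a in S_gamma, with fallback."""
    n, p = X.shape
    eta0 = 1.1 * norm.ppf(1 - 1 / (n * p)) / np.sqrt(n)
    rho_n = 0.01 / np.sqrt(np.log(n))
    best = None
    for a in grid:                                   # ascending grid keeps largest feasible a
        gam = gamma_tilde(X, V, a, eta0, rho_n)
        if gam is None:
            continue
        r = V - X @ gam
        scale = np.sqrt(np.max(np.mean(X ** 2 * (r ** 2)[:, None], axis=0)))
        if 1.5 * a >= scale >= 0.5 * a:              # membership in S_gamma, cf. (6)
            best = gam
    if best is None:                                 # S_gamma empty: fallback value in (7)
        a_fb = 2 * np.abs(X).max() * np.linalg.norm(V) / np.sqrt(n)
        best = gamma_tilde(X, V, a_fb, eta0, rho_n)
    return best
```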
Since we assume sparsity of θ∗ (the dependence structure of the features), by the previous discussion, $\hat\theta$ is a consistent estimator and does not require knowledge of the noise level $(\mathbb{E}u_1^2)^{1/2}$. In fact, we allow $u_i$ to be heteroscedastic too, so knowing $(\mathbb{E}u_1^2)^{1/2}$ would not simplify the problem.

Compared to existing self-tuning procedures, the proposed estimator is more suitable for our purposes. The square-root Lasso of Belloni et al. (2011) and the scaled Lasso of Sun and Zhang (2012) have demonstrated the advantages of tuning-free estimation in sparse regressions; however, these procedures might not easily yield estimators that both (1) satisfy the constraints in (8), which are needed to handle the non-sparse case, and (2) easily adapt to heteroscedasticity of the model. The self-tuned Dantzig estimator of Gautier and Tsybakov (2013) can accommodate these additional constraints but lacks the combination structure in (7) that guarantees both accuracy under sparse models and robustness under dense models.

2.3. CorrT Test

We propose a plug-in statistic for testing the moment condition expressed in (4). We consider two-sided alternatives unless otherwise stated. The test statistic we propose is self-normalizing and takes the form of a t-statistic.

Let $\hat\gamma$ and $\hat\theta$ be defined in (7) and (9), respectively. We propose to consider the following correlation test (CorrT) statistic:
$$T_n(\beta_0) = \frac{\hat\varepsilon^\top\hat u}{\sqrt{\sum_{i=1}^n \hat\varepsilon_i^2\,\hat u_i^2}}, \qquad (10)$$
where $\hat\varepsilon = V - X\hat\gamma$ and $\hat u = Z - X\hat\theta$. The notation $T_n(\beta_0)$ indicates that the test statistic is constructed directly with the knowledge of the null hypothesis H0 : β∗ = β0. Whenever possible, we suppress its dependence on β0 and write Tn instead. The self-normalizing nature of the test statistic also has the advantage of allowing an analytical critical value, as we show that Tn(β0), under the null hypothesis, converges in distribution to N(0, 1). Hence, a test with nominal size α ∈ (0, 1) rejects the hypothesis (2) if and only if $|T_n(\beta_0)| > \Phi^{-1}(1-\alpha/2)$.

We briefly discuss the mechanism of our test and outline the reason for its robustness to the lack of sparsity in γ∗. The product structure in (10) helps to decouple the bias introduced by $\hat\gamma$, after which we are able to establish uncorrelatedness between the estimated residuals $\hat\varepsilon$ and u under the null hypothesis H0. We can show, without assuming sparsity of γ∗, that
$$n^{-1/2}\hat\varepsilon^\top\hat u = n^{-1/2}\hat\varepsilon^\top u + O_P\left(\sqrt{\log p}\,\|\hat\theta - \theta^*\|_1\right),$$
where, under the null hypothesis, the first term on the right-hand side has zero expectation and the second term vanishes fast enough. The construction of $\hat\gamma$ as in (7) is fundamentally critical to de-correlate $V - X\hat\gamma$ and u. We note that commonly used estimators in sparse regression of Y against X and Z do not deliver such de-correlation and, for non-sparse γ∗, do not possess tractable properties.
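Given the two residual vectors, the statistic (10) and the resulting decision rule take only a few lines. A minimal sketch, assuming $\hat\gamma$ and $\hat\theta$ have already been computed as above:

```python
import numpy as np
from scipy.stats import norm

def corrT(V, Z, X, gamma_hat, theta_hat, alpha=0.05):
    """CorrT statistic (10) with the level-alpha decision rule."""
    eps_hat = V - X @ gamma_hat            # residuals of the pseudo-response model
    u_hat = Z - X @ theta_hat              # residuals of the feature model (3)
    Tn = eps_hat @ u_hat / np.sqrt(np.sum(eps_hat ** 2 * u_hat ** 2))
    pval = 2 * (1 - norm.cdf(abs(Tn)))
    return Tn, pval, abs(Tn) > norm.ppf(1 - alpha / 2)
```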
2.4. Computational Aspects: parametric simplex method

The proposed method allows for efficient implementation. We point out that both optimization problems (5) and (8) can be written in the form of a parametric linear programming problem, where the "additional unknown parameter" a is present only in the right-hand side of the linear constraints. It is well known that, if a problem can be cast as a parametric linear programming problem, then the parametric simplex method can solve the problem for all values of the parameter in almost the same time as solving the problem for one value of the parameter; see Vanderbei (2014) and Pang et al. (2014). Hence, the computational burden of obtaining the solution path $a \mapsto \tilde\gamma(a)$ is essentially that of a single linear program. We now explicitly formalize the optimization problem (5) as a parametric linear program; the formulation of (8) is analogous. Let a > 0, and write $\tilde\gamma(a) = \hat b^+(a) - \hat b^-(a)$, where $\hat b(a) = (\hat b^+(a)^\top, \hat b^-(a)^\top)^\top$ is defined as the solution to the following parametric right-hand-side linear program
$$\hat b(a) = \arg\max_{b\in\mathbb{R}^{2p}} M_1^\top b \quad \text{subject to} \quad M_2 b \le M_3 + M_4 a, \qquad (11)$$
where the matrices $M_1 \in \mathbb{R}^{2p\times 1}$, $M_2 \in \mathbb{R}^{(2p+2n+1)\times 2p}$, $M_3 \in \mathbb{R}^{2p+2n+1}$ and $M_4 \in \mathbb{R}^{2p+2n+1}$ are taken to be $M_1 = -\mathbf{1}_{2p\times 1}$ and
$$M_2 = \begin{pmatrix} n^{-1}X^\top X & -n^{-1}X^\top X \\ -n^{-1}X^\top X & n^{-1}X^\top X \\ X & -X \\ -X & X \\ n^{-1}V^\top X & -n^{-1}V^\top X \end{pmatrix}, \quad
M_3 = \begin{pmatrix} n^{-1}X^\top V \\ -n^{-1}X^\top V \\ V + \mathbf{1}_{n\times 1}\|V\|_2/\log^2 n \\ -V + \mathbf{1}_{n\times 1}\|V\|_2/\log^2 n \\ (1-\rho_n)\,n^{-1}V^\top V \end{pmatrix}, \quad
M_4 = \begin{pmatrix} \eta_0\mathbf{1}_{p\times 1} \\ \eta_0\mathbf{1}_{p\times 1} \\ \mathbf{0}_{n\times 1} \\ \mathbf{0}_{n\times 1} \\ 0 \end{pmatrix}.$$
It is known that the solution of a parametric simplex algorithm is a piecewise linear and convex function of a. For example, we can split the interval (0, ∞) into $2^k$ non-overlapping intervals $(0, 2], (2, 4], \ldots, (2^{k+1}-4, 2^{k+1}-2], (2^{k+1}-2, \infty)$ and perform $2^k - 1$ dual simplex pivot steps before termination; see Magnanti and Orlin (1988) for details.

3. Inferential Asymptotic Properties

In this section, we present theoretical properties of the proposed method and consider an asymptotic regime in which p and the sparsity level of γ∗ are allowed to grow to infinity faster than n → ∞. Throughout this article, we assume that the observed variables $\{(x_i, z_i, y_i)\}_{i=1}^n$ are independent random variables (vectors) with mean zero, where $x_i$, $z_i$ and $y_i$ denote the i-th row of X, the i-th entry of Z and of Y, respectively. We say that a vector is a sub-Gaussian vector if all of its coordinates are sub-Gaussian random variables (see Javanmard and Montanari (2014a) for an example). In order to establish the theoretical claims, we require the following regularity condition, where the constants $\kappa_1, \ldots, \kappa_5 \in (0, \infty)$ do not depend on n and p.

Condition 1. Suppose that $\{(\varepsilon_i, z_i, x_i)\}_{i=1}^n$ is an i.i.d. sequence. (i) Let $w_i = (z_i, x_i^\top)^\top \in \mathbb{R}^{p+1}$ and $\Sigma_W = \mathbb{E}[w_1 w_1^\top] \in \mathbb{R}^{(p+1)\times(p+1)}$ satisfy $0 < \kappa_1 \le \sigma_{\min}(\Sigma_W) \le \sigma_{\max}(\Sigma_W) \le \kappa_2 < \infty$
and $w_i$ has a sub-Gaussian norm bounded by a constant. (ii) In addition, assume that the $\varepsilon_i$ are independent conditional on (X, Z), with $0 < \mathbb{E}(\varepsilon_i^2) < \kappa_3$ and $\mathbb{E}(x_i\varepsilon_i) = 0$. (iii) Suppose that $\{u_i \mid X, \varepsilon\}_{i=1}^n$ is a sub-Gaussian vector with independent coordinates such that $\mathbb{E}(u_i \mid X, \varepsilon) = 0$ and $\lim_{n\to\infty} P\left(\kappa_4 < \sigma_{u,i}^2 = \mathbb{E}(u_i^2 \mid X, \varepsilon) < \kappa_5\right) = 1$.

A few comments are immediate. First, Condition 1(i) is less restrictive than the Gaussian design assumptions directly imposed in Zhu and Bradic (2016a) or Janson et al. (2015) – among the references that allow dense model parameters. Well-behaved covariance matrices (see, e.g., Van de Geer et al. (2014) and Javanmard and Montanari (2014a)) are quite standard among inferential results. Condition 1(ii) is strictly less confining than conditions imposing homoscedastic and unconditionally independent errors. Namely, we allow possible dependencies between $\varepsilon_i$ and $x_i$, therefore enabling heterogeneity. We are unaware of existing work that allows optimal inference in heteroscedastic and high-dimensional models. Lastly, in Condition 1(iii) we remove not only Gaussianity but also feature homoscedasticity (a condition regularly imposed in the literature), consider a wide spectrum of feature correlations, and allow for extremely large p with n/p → 0.

Remark 1. Observe that Condition 1 imposes very weak model structure. In fact, general and nonparametric deviations from the linear model are allowed, both in the main model and in the feature model. In particular, partial linear models of the form
$$y_i = g(x_i) + z_i\beta^* + \varepsilon_i \qquad (12)$$
are allowed. Observe that the function g can be approximated as $g(x_i) = x_i^\top\gamma^* + r_i$, where $r_i$ is bounded; here $r_i$ may depend on $x_i$. Then, the results obtained in this article hold for model (12) as well. In this case, the feature model (3) can be replaced with $z_i = h(x_i) + u_i$ as long as $h(x_i) = x_i^\top\theta^* + w_i$ with a bounded approximation error $w_i$.
The only model assumptions we are willing to make are presented in Condition 2 below.

Condition 2. $\|\gamma^*\|_2 \le \kappa_6$ and $s_\theta = o\left(\sqrt{n/(\log(p\vee n))^5}\right)$, where $s_\theta = \|\theta^*\|_0$.
The assumption on $s_\theta$ imposes sparsity on the first row of the precision matrix $\Sigma_W^{-1}$. The rate for $s_\theta$ is slightly stronger than the conditions in Belloni et al. (2014) and Ning and Liu (2014), who impose $o(\sqrt{n}/\log p)$, and in Van de Geer et al. (2014), who impose $o(n/\log p)$; however, we do not impose that every row of this matrix is sparse, just the first. This can be interpreted as the cost of allowing for non-sparse γ∗ as well as general heteroscedastic model errors. Moreover, we observe that the method of Javanmard and Montanari (2014a) does not require a sparsity condition on θ∗; however, it still relies on the sparsity of γ∗. In fact, our test can automatically detect which one of the two is sparse and utilize it for effective inference; therefore, the practitioner does not have to know which of the two is sparse when applying our method.
Additionally, in studying the estimation problem for signals with many non-zero entries, it is common to consider parameters in bounded convex balls, e.g., Donoho and Johnstone (1994), Raskutti et al. (2011) and Janson et al. (2015). Thus, we require $\|\gamma^*\|_2 = O(1)$. However, we can replace this condition with $\|\gamma^*\|_2 = o(\sqrt{n})$ and obtain the same results without significantly changing the proof. It is worth pointing out that the bounded $\ell_2$-norm of γ∗ does not impose constraints on its sparsity; see, e.g., Janson et al. (2015).

3.1. Size properties: validity regardless of sparsity

For the CorrT statistic $T_n(\beta_0)$, we provide the following robustness guarantees.

Theorem 2. Let Conditions 1 and 2 hold. Then, under the null hypothesis (2), for all α ∈ (0, 1),
$$\lim_{n,p\to\infty} P\left(|T_n(\beta_0)| > \Phi^{-1}(1-\alpha/2)\right) = \alpha,$$
(13)
where Tn (β0 ) is defined in Equation (10).
Theorem 2 formally establishes that the new CorrT test is asymptotically exact in testing β∗ = β0 while allowing log p = o(n). In particular, CorrT is robust to dense γ∗, in the sense that even under dense γ∗, our procedure does not generate false positive results. Compared to existing methods, we have moved away from the unverifiable model sparsity assumption and hence can handle more realistic problems.

Remark 2. Our result is theoretically intriguing, as it circumvents limitations of the "inference based on estimation" principle. This principle relies on accurate estimation, which is challenging, if not impossible, for non-sparse and high-dimensional models. To see the difficulty, consider the full model parameter $\pi^* := (\beta^*, \gamma^{*\top})^\top \in \mathbb{R}^{p+1}$. The minimax rate of estimation in terms of $\ell_2$-loss for parameters in a (p+1)-dimensional $\ell_q$-ball with $q \in [0, 1]$ and of radius $r_n$ is $r_n(n^{-1}\log p)^{1-q/2}$ (see Raskutti et al. (2011)). Theorem 2 says that CorrT is valid even when π∗ cannot be consistently estimated. For example, suppose that $\log p \asymp n^c$ for a constant c > 0 and $\pi_j^* = 1/\sqrt{p}$ for j = 1, . . . , p + 1. Then, as n → ∞, $\|\pi^*\|_q (n^{-1}\log p)^{1-q/2} \to \infty$ for any $q \in [0, 1]$, suggesting potential failure in estimation. Instead of solely relying on estimation of π∗, we impose the null hypothesis in our inference procedure and fully exploit its implication. With this sophistication in constructing the test, the inaccuracy in estimating π∗ does not impair the validity (Type I error control) of our approach.

Remark 3. With an almost analogous argument to the proof of Theorem 2, we can show that the conclusion (13) of Theorem 2 still holds if we replace $s_\theta$ with $s_\gamma$ in the statement of Condition 2. This means that, as long as one of γ∗ and θ∗ is sparse, CorrT delivers valid inference. In this sense, CorrT automatically detects and adapts to sparse structures from different sources in the data, either in the model parameter or in the covariance matrix of the features; prior knowledge of the exact source is not needed.
We note that confidence sets for β∗ can be constructed by inverting the CorrT test. Let 1 − α be the nominal coverage level. We define
$$C_{1-\alpha} := \left\{\beta : |T_n(\beta)| \le \Phi^{-1}(1-\alpha/2)\right\}. \qquad (14)$$
Theorem 2 guarantees the validity of this confidence set.
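In practice, the set (14) can be computed by re-running the test over a grid of candidate values; recall that the estimator (5) depends on β0 through V = Y − Zβ0, so $\hat\gamma$ must be re-fit at each grid point. A sketch, where corrT_statistic is a hypothetical wrapper around the steps (5)-(10):

```python
import numpy as np
from scipy.stats import norm

def corrT_confidence_interval(Y, Z, X, beta_grid, alpha=0.05):
    """Invert the CorrT test over a grid of candidate beta values, cf. (14)."""
    kept = []
    for b0 in beta_grid:
        V = Y - Z * b0                       # pseudo-response under beta = b0
        Tn = corrT_statistic(V, Z, X)        # hypothetical wrapper around (5)-(10)
        if abs(Tn) <= norm.ppf(1 - alpha / 2):
            kept.append(b0)
    return (min(kept), max(kept)) if kept else None
```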
Corollary 3. Let Conditions 1 and 2 hold. Then the confidence set $C_{1-\alpha}$ in (14) has exact coverage asymptotically:
$$\lim_{n,p\to\infty} P\left(\beta^* \in C_{1-\alpha}\right) = 1 - \alpha.$$
Corollary 3 implies that the confidence set (14) provides exact asymptotic coverage even when the nuisance parameter γ∗ is non-sparse and $s_\gamma/p \to 1$ with n/p → 0 as n → ∞. To the best of our knowledge, this result is unique in the existing literature. As we step further into the non-sparse regime (as we take more and more entries of γ∗ to be non-zero), existing results could provide confidence sets with coverage probability far below the nominal level. In the example discussed in Section 1.2 with $s_\gamma = p$, Figure 1 indicates that the 99% confidence interval based on the de-biasing method can have coverage probability around 20%.

We would also like to point out that Theorem 2 holds uniformly over a large parameter space and over a range of null hypotheses. To formally state the result, we define the nuisance parameter $\xi^* = (\gamma^*, \theta^*, F)$, where F is the distribution of $(x_i, \varepsilon_i, u_i)$. Notice that the distribution of the observed data $\{(y_i, x_i, z_i)\}_{i=1}^n$ is determined by $(\beta^*, \xi^*)$. The probability distribution under $(\beta^*, \xi^*)$ is denoted by $P_{(\beta^*,\xi^*)}$. The space for the nuisance parameter we consider is
$$\Xi(s_\theta) = \left\{\xi = (\gamma, \theta, F) : \text{Condition 1 holds with } \kappa_1, \ldots, \kappa_5 > 0,\ \|\gamma\|_2 \le \kappa_6 \text{ and } \|\theta\|_0 \le s_\theta\right\},$$
where $\kappa_1, \ldots, \kappa_6 > 0$ are constants. We have the following result.

Corollary 4. If $s_\theta = o\left(\sqrt{n/(\log(p\vee n))^5}\right)$, then for all α ∈ (0, 1),
$$\lim_{n,p\to\infty} \sup_{(\beta,\xi)\in\mathbb{R}\times\Xi(s_\theta)} \left| P_{(\beta,\xi)}\left(|T_n(\beta)| > \Phi^{-1}(1-\alpha/2)\right) - \alpha \right| = 0.$$
Corollary 4 says that Theorem 2 holds uniformly in a large class of distributions. In particular, Corollary 4 allows the null hypothesis to change with n.

3.2. Power properties

In this subsection, we investigate the power properties of CorrT. Let us observe that it is still unclear what statistical efficiency means for inference in non-sparse high-dimensional models. Various minimax results on both estimation (Raskutti et al., 2011) and inference (Ingster et al., 2010) suggest that no procedure is guaranteed to accurately identify non-sparse parameters in a uniform sense. In light of these results, it appears that, for
dense high-dimensional models, it is quite difficult, if possible at all, to obtain tests that are powerful uniformly over a large class of distributions, e.g., a bounded $\ell_2$-ball for γ∗. However, our test does have desirable power properties for certain classes of models of practical interest. First, we show that our test is efficient for sparse models. Second, we show that our test has the optimal detection rate $O(n^{-1/2})$ for important dense models.

3.2.1. Adaptivity: efficiency under sparsity

In Section 1.2, we demonstrated that the lack of robustness can cause serious problems for existing inference methods designed for sparse problems. A certain oracle procedure with knowledge of $s_\gamma = \|\gamma^*\|_0$ would proceed as follows: (1) use existing sparsity-based methods for sparse models in order to achieve efficiency in inference and (2) resort to certain conservative approaches to retain validity for dense problems. However, the sparsity level $s_\gamma$ is rarely known in practice. Hence, an important question is whether or not it is possible to design a procedure that can automatically adapt to the unknown sparsity level $s_\gamma$ and match the above oracle procedure in the following sense. We say that a procedure for testing the hypothesis (2) is sparsity-adaptive if (i) it does not require knowledge of $s_\gamma$, (ii) it provides valid inference under any $s_\gamma$ and (iii) it achieves efficiency under sparse γ∗.

In this subsection, we show that the CorrT test possesses this sparsity-adaptive property. It is clear from Section 2 that CorrT does not require knowledge of $s_\gamma$. In Section 3.1, we showed that CorrT provides valid inference without any assumption on $s_\gamma$. We now show the third property, efficiency under sparse γ∗. To formally discuss our results, we consider testing
$$H_0: \beta^* = \beta_0 \quad \text{versus} \quad H_{1,h}: \beta^* = \beta_0 + h/\sqrt{n}, \qquad (15)$$
where $h \in \mathbb{R}$ is a fixed constant. $H_{1,h}$ in (15) is called a Pitman alternative and is typically used to assess the asymptotic efficiency of tests; see Van der Vaart (2000) and Lehmann and Romano (2006) for classical treatments of local power.

Theorem 5. Let Conditions 1 and 2 hold together with $\log p = o(\sqrt{n})$. Let $\Sigma_X = \mathbb{E}[x_i x_i^\top] \in \mathbb{R}^{(p-1)\times(p-1)}$. Suppose that $\|\gamma^*\|_0 = o(\sqrt{n}/\log^2(p\vee n))$, $\mathbb{E}(\varepsilon_1^2) \ge \tau_1$, $\min_{1\le j\le p}\mathbb{E}[x_{1,j}^2\varepsilon_1^2] \ge \tau_2$ and $\mathbb{E}u_1^2\big/\sqrt{\mathbb{E}\varepsilon_1^2 u_1^2} \to \kappa$ for some constants $\tau_1, \tau_2 > 0$ and $\kappa \in \mathbb{R}$. Then, under $H_{1,h}$ in (15),
$$\lim_{n,p\to\infty} P_{\beta^*}\left(|T_n(\beta_0)| > \Phi^{-1}(1-\alpha/2)\right) = \Psi(h, \kappa, \alpha),$$
where $\Psi(h, \kappa, \alpha) = \Phi\left(-\Phi^{-1}(1-\alpha/2) + h\kappa\right) + \Phi\left(-\Phi^{-1}(1-\alpha/2) - h\kappa\right)$.
Theorem 5 establishes the power properties of the proposed CorrT test.
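The local power function Ψ can be evaluated numerically; the sanity checks below confirm that Ψ(0, κ, α) = α and that the power increases in |h|κ. Only SciPy is assumed, and the function name is ours.

```python
from scipy.stats import norm

def local_power(h, kappa, alpha):
    """Asymptotic power Psi(h, kappa, alpha) against the Pitman alternative (15)."""
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + h * kappa) + norm.cdf(-z - h * kappa)

print(local_power(0.0, 1.0, 0.05))  # 0.05: size at the null (h = 0)
print(local_power(3.0, 1.0, 0.05))  # ~0.85: power grows with |h| * kappa
```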
Remark 4. In the case of homoscedastic errors with $\sigma_\varepsilon^2 = \mathbb{E}\varepsilon_1^2$ and $\sigma_u^2 = \mathbb{E}u_1^2$, we can compare the local power of CorrT with that of existing methods. In particular, we consider the local power of Van de Geer et al. (2014) and note that a similar analysis applies to Belloni et al. (2014) and Ning and Liu (2014). Let $\hat b_{\mathrm{Lasso}}$ denote the de-biased Lasso
estimator defined in Van de Geer et al. (2014). Under homoscedasticity, our condition $\mathbb{E}u_1^2\big/\sqrt{\mathbb{E}\varepsilon_1^2 u_1^2} \to \kappa$ translates into $\sigma_u/\sigma_\varepsilon \to \kappa$. Applying Theorem 2.3 of Van de Geer et al. (2014) to our setup, we obtain $\sqrt{n}(\hat b_{\mathrm{Lasso}} - \beta^*) \to_d N(0, \kappa^{-2})$. Hence, a natural test, referred to as the Van de Geer et al. (2014) test, is to reject H0 in (2) if and only if $|\hat b_{\mathrm{Lasso}} - \beta_0| > \kappa^{-1}\Phi^{-1}(1-\alpha/2)/\sqrt{n}$. Under $H_{1,h}$ in (15), the asymptotic normality of $\hat b_{\mathrm{Lasso}}$ implies that $\sqrt{n}(\hat b_{\mathrm{Lasso}} - \beta_0) \to_d N(h, \kappa^{-2})$. Hence, it is not hard to see that the power of the Van de Geer et al. (2014) test against $H_{1,h}$ in (15) is asymptotically equal to $\Psi(h, \kappa, \alpha)$ defined in Theorem 5. Since $\hat b_{\mathrm{Lasso}}$ is shown to be a semi-parametrically efficient estimator of β∗, CorrT is asymptotically equivalent to tests based on efficient estimators.

Moreover, results from Javanmard and Montanari (2014b) suggest that our test is also minimax optimal whenever the model is homoscedastic, indicating that the procedure, apart from sparsity, adapts to the presence or absence of heterogeneous errors in (1). By Theorem 2.3 therein (adapted to the Gaussian setting), a minimax optimal α-level test of β∗ = β0 against $H_{1,h}$ in (15) has power at most
$$\Phi\left(a_n h\sigma_u\sigma_\varepsilon^{-1} - \Phi^{-1}(1-\alpha/2)\right) + \Phi\left(-a_n h\sigma_u\sigma_\varepsilon^{-1} - \Phi^{-1}(1-\alpha/2)\right) + F_{n-s_\gamma+1}(n - s_\gamma + l_n). \qquad (16)$$
In the display above, $l_n = \sqrt{n\log n}$, $a_n = \sqrt{(n - s_\gamma + l_n)/n}$ and $F_k(x) = P(W_k \ge x)$ with $W_k \sim \chi^2(k)$. By Bernstein's inequality applied to a sum of i.i.d. $\chi^2(1)$ random variables, one can easily show that $F_{n-s_\gamma+1}(n - s_\gamma + l_n) = o(1)$. Hence, as n → ∞, the limit of the bound in (16) is equal to $\Psi(h, \kappa, \alpha)$ defined in Theorem 5. However, Theorem 5 also holds under a heteroscedastic model (1). Observe that the only condition regarding heteroscedasticity concerns the second moment between the design and the errors, which in homoscedastic models corresponds to non-negative variances of the design and error. Moreover, the power above holds in models where ε and u are possibly dependent; in that sense, model (3) is treated as an approximate, not exact, feature model.

Theorems 2 and 5 establish the sparsity-adaptive property discussed at the beginning of Section 3.2.1: CorrT is shown to automatically detect sparse structures in γ∗ and utilize them to optimize power, while maintaining validity even in the absence of sparsity.

3.2.2. Extremely dense models

One important class of non-sparse models is what we shall refer to as "extremely dense models". In these models, a potentially important case is that the entries of the model parameter are all individually small but collectively strong as the dimension of the model explodes. In our testing problem (2), this translates into an extremely dense nuisance parameter γ∗ with $\|\gamma^*\|_0 = p \gg n$. Consistent estimation of such extremely dense nuisance parameters might not be possible for p ≫ n. However, we are able to discuss the power of the proposed test in the presence of a particular dense nuisance parameter and obtain a strong result indicating that the proposed test controls Type II errors extremely well. To the best of our knowledge, power results in dense models allowing
$\|\gamma^*\|_0 \gg n^c$, for all c > 0, have not been discussed in the existing literature, indicating that the next result is unique. For that purpose, let $\tau_1, \tau_2$ be positive constants independent of n and p.

Theorem 6. Let Conditions 1 and 2 hold together with $\log p = o(\sqrt{n})$. Let $\Sigma_X = \mathbb{E}[x_i x_i^\top] \in \mathbb{R}^{(p-1)\times(p-1)}$. Suppose that $\mathbb{E}(\varepsilon_1^2) \ge \tau_1$, $\min_{1\le j\le p}\mathrm{Var}\left[x_{1,j}(x_1^\top\gamma^* + \varepsilon_1)\right] \ge \tau_2$, $\|\Sigma_X\gamma^*\|_\infty \le \tau_2\sqrt{\log(p\vee n)/(2n)}\,/25$ and $\mathbb{E}u_1^2\big/\sqrt{\mathbb{E}(x_1^\top\gamma^* + \varepsilon_1)^2 u_1^2} \to \kappa$ for some constants $\tau_1, \tau_2 > 0$ and $\kappa \in \mathbb{R}$. Then, under $H_{1,h}$ in (15),
$$\lim_{n,p\to\infty} P_{\beta^*}\left(|T_n(\beta_0)| > \Phi^{-1}(1-\alpha)\right) = \Psi(h, \kappa, \alpha),$$
where Ψ(h, κ, α) is defined in Theorem 5.
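The key requirement of Theorem 6 restricts $\|\Sigma_X\gamma^*\|_\infty$ rather than the sparsity of γ∗. The sketch below, for a hypothetical Toeplitz design and the fully dense choice $\gamma^* = p^{-1/2}\mathbf{1}_p$, illustrates that $\|\Sigma_X\gamma^*\|_\infty$ decays like $p^{-1/2}$ and thus sits well below the $\sqrt{\log p/n}$ rate (constants aside), even though every entry of γ∗ is non-zero.

```python
import numpy as np

n, p, rho = 200, 2000, -0.5
# Toeplitz design covariance: Sigma_{jk} = rho^{|j-k|}
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
gamma = np.full(p, 1 / np.sqrt(p))     # fully dense, bounded l2-norm (||gamma||_2 = 1)

print(np.linalg.norm(Sigma @ gamma, np.inf))  # ~0.015: decays like 1/sqrt(p)
print(np.sqrt(np.log(p) / n))                 # ~0.2: the rate sqrt(log p / n)
```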
We can draw a few conclusions from Theorem 6. The most obvious is that for n, p → ∞ with $\sqrt{\log p/n} = o(1)$, allowing n/p → 0, the Type II error of the proposed CorrT test, against alternatives that are larger than $O(n^{-1/2})$, converges to zero. Note that we make no assumptions at all on the sparsity of the nuisance parameter γ∗. The only assumption on γ∗ utilized above concerns the size of its individual elements weighted by the design covariance matrix $\Sigma_X$. For the case of independent columns of the design matrix and the case of Toeplitz designs, the condition $\|\Sigma_X\gamma^*\|_\infty = O(\sqrt{(\log p)/n})$ is satisfied if $\|\gamma^*\|_\infty = O(\sqrt{(\log p)/n})$, as long as γ∗ lies in a bounded $\ell_2$-ball; with slight changes in the technical proofs, we can allow γ∗ to live in an $\ell_2$-ball with radius $o(\sqrt{n}/\log p)$. More generally, $\|\gamma^*\|_\infty = O(\sqrt{(\log p)/n})$ is a sufficient condition under any covariance matrix $\Sigma_X$ satisfying $\max_{1\le j\le p}\|\Sigma_{X,j}\|_1 = O(1)$ (a condition imposed for consistent matrix estimation).

Theorem 6 offers new insight into the inference of dense models. When we test the full parameter (β∗, γ∗) against an alternative, the minimax test might not have power against a "fixed alternative" (deviation bounded away from zero) if the parameter is not sparse; see Ingster et al. (2010). Theorem 6 says that testing single entries of a potentially dense model is an entirely different problem; we can indeed have power against alternatives of the order $O(n^{-1/2})$, just like the case of fixed p. Hence, this rate is optimal. This also implies that any confidence intervals computed by inverting the test Tn are optimal regardless of the sparsity of γ∗.

3.2.3. Hybrid high-dimensional model

To showcase the adaptivity of our method and to provide a bridge between sparse and dense models, we consider what is perhaps a more practical scenario for the nuisance parameter γ∗, the so-called "sparse + dense" model, where γ∗ is the sum of a sparse vector and an extremely dense vector. The estimation of high-dimensional models with this hybrid sparsity is discussed by Chernozhukov, Hansen and Liao (2015), who derive bounds for prediction errors. However, their results do not yield an inference procedure, and thus, the best statistical test in these models, even in moderately high dimensions, is still an open problem. Our result offers the first solution to this issue; with appropriate
modification to the proof of Theorem 6, we can show that our test has power in these hybrid high-dimensional models.

Theorem 7. Let Conditions 1 and 2 hold together with $\log p = o(\sqrt{n})$. Let $\Sigma_X = \mathbb{E}[x_i x_i^\top] \in \mathbb{R}^{(p-1)\times(p-1)}$. Suppose that $\gamma^* = \pi^* + \mu^*$ for π∗ and µ∗ such that (1) $\|\pi^*\|_0 = o(\sqrt{n}/\log^{5/2}(p\vee n))$, (2) $\mathbb{E}(\varepsilon_1^2) \ge \tau_1$, (3) $\min_{1\le j\le p}\mathrm{Var}\left[x_{1,j}(x_1^\top\mu^* + \varepsilon_1)\right] \ge \tau_2$, (4) $(\pi^* + \mu^*)^\top\Sigma_X\mu^* + \mathbb{E}(\varepsilon_1^2) \ge \tau_3$, (5) $\|\Sigma_X\mu^*\|_\infty \le \tau_2\sqrt{\log(p\vee n)/(2n)}\,/25$ and (6) $\mathbb{E}u_1^2\big/\sqrt{\mathbb{E}(x_1^\top\mu^* + \varepsilon_1)^2 u_1^2} \to \kappa$, where $\tau_1, \tau_2, \tau_3 > 0$ and $\kappa \in \mathbb{R}$ are constants. Then, under $H_{1,h}$ in (15),
$$\lim_{n,p\to\infty} P_{\beta^*}\left(|T_n(\beta_0)| > \Phi^{-1}(1-\alpha)\right) = \Psi(h, \kappa, \alpha),$$
where Ψ(h, κ, α) is defined in Theorem 5.
Remark 5. Theorem 7 says that when high-dimensional models have parameters with the hybrid structure, our test is unique in the existing literature; as a test of single entries, it has power against alternatives larger than $O(n^{-1/2})$, the optimal rate in low-dimensional models. Condition (4) of Theorem 7 is a natural restriction on the interaction between the sparse part and the dense part of the model. When $\Sigma_X = I_p$, condition (4) is automatically satisfied if π∗ and µ∗ have disjoint supports. Since $(\pi^* + \mu^*)^\top\Sigma_X\mu^* + \mathbb{E}(\varepsilon_1^2) = \|\Sigma_X^{1/2}(\pi^*/2 + \mu^*)\|_2^2 + \mathbb{E}(\varepsilon_1^2) - \|\Sigma_X^{1/2}\pi^*\|_2^2/4$, a sufficient condition for condition (4) is $\|\Sigma_X^{1/2}\pi^*\|_2^2 \le 2\mathbb{E}(\varepsilon_1^2)$.

Remark 6. In summary, our procedure achieves the optimal detection rate while adapting to any level of sparsity or lack thereof. Therefore, CorrT is extremely useful for applied scientific studies where a priori knowledge of the sparsity level is unavailable. Since there are currently no procedures for testing the sparsity assumption, CorrT is exceptionally practical. It possesses a regime-encompassing property in that it behaves optimally in all three regimes: sparse, dense and hybrid, and does so automatically, without user interference, while allowing heteroscedastic errors, therefore providing a comprehensive tool for significance testing.

4. Numerical Examples

In this section, we evaluate the proposed method in the finite-sample setting by observing its behavior on simulated data.

4.1. Simulation Examples

We simulate the linear model Y = Wπ∗ + ε as follows. The design matrix $W \in \mathbb{R}^{n\times p}$ consists of independent and identically distributed (i.i.d.) rows from one of the following distributions:
Table 1. Average Type I error of the CorrT test, Debias (de-biasing test) and Score test over 100 repetitions with n = 200 and p = 500. Rows represent different sparsity levels, whereas columns represent different design and error settings.

              LTD + LTE, ρ = 0       LTD + HTE, ρ = 0       LTD + LTE, ρ = −1/2
          CorrT  Debias  Score    CorrT  Debias  Score    CorrT  Debias  Score
s = 1      0.05   0.05   0.05     0.03   0.01    0.02     0.06   0.03    0.06
s = 3      0.02   0.03   0.03     0.06   0.06    0.05     0.16   0.05    0.18
s = 5      0.02   0.02   0.02     0.04   0.06    0.06     0.08   0.06    0.07
s = 10     0.07   0.09   0.10     0.04   0.02    0.02     0.09   0.08    0.11
s = 20     0.05   0.11   0.10     0.09   0.08    0.08     0.05   0.06    0.12
s = 50     0.04   0.13   0.11     0.03   0.18    0.21     0.08   0.16    0.23
s = 100    0.05   0.27   0.24     0.03   0.21    0.21     0.02   0.19    0.16
s = n      0.03   0.36   0.37     0.07   0.38    0.36     0.16   0.34    0.35
s = p      0.05   0.57   0.56     0.05   0.59    0.61     0.07   0.42    0.47

              LTD + HTE, ρ = −1/2    HTD + LTE, ρ = 0       HTD + HTE, ρ = 0
          CorrT  Debias  Score    CorrT  Debias  Score    CorrT  Debias  Score
s = 1      0.06   0.05   0.04     0.04   0.14    0.09     0.04   0.06    0.05
s = 3      0.11   0.03   0.11     0.07   0.17    0.09     0.09   0.15    0.08
s = 5      0.08   0.07   0.08     0.03   0.05    0.03     0.04   0.05    0.05
s = 10     0.09   0.06   0.12     0.07   0.08    0.10     0.02   0.12    0.02
s = 20     0.04   0.05   0.09     0.08   0.13    0.11     0.07   0.16    0.11
s = 50     0.05   0.11   0.08     0.10   0.25    0.18     0.02   0.17    0.16
s = 100    0.05   0.22   0.19     0.07   0.28    0.20     0.07   0.29    0.21
s = n      0.05   0.20   0.24     0.05   0.33    0.37     0.05   0.35    0.33
s = p      0.04   0.44   0.41     0.04   0.54    0.55     0.09   0.52    0.54
LTD Light-tailed design: $N(0, \Sigma_{(\rho)})$, with the (i, j) entry of $\Sigma_{(\rho)}$ being $\rho^{|i-j|}$.
HTD Heavy-tailed design: each row has the same distribution as $\Sigma_{(\rho)}^{1/2}U$, where $U \in \mathbb{R}^p$ contains i.i.d. random variables with Student's t-distribution with 3 degrees of freedom (the third moment does not exist).

We consider both uncorrelated designs (ρ = 0) and correlated designs (ρ = −1/2). The error term $\varepsilon \in \mathbb{R}^n$ is drawn as a vector of i.i.d. random variables from either N(0, 1) (light-tailed error, LTE) or Student's t-distribution with 3 degrees of freedom (heavy-tailed error, HTE). Let $s = \|\pi^*\|_0$ denote the model sparsity; we vary the simulation settings from extremely sparse, s = 3, to extremely dense, s = p. We set the model parameters as $\pi_j^* = 2/\sqrt{n}$ whenever 2 ≤ j ≤ 4, $\pi_j^* = 0$ whenever j > max{s, 4}, and the other entries of π∗ are i.i.d. random variables uniformly distributed on $(0, 4/\sqrt{n})$. We test the hypothesis $H_0: \pi_3^* = 2/\sqrt{n}$, so the rejection probability represents the Type I error.

We compare our method with the following procedures. First, we compare with the de-biasing method of Van de Geer et al. (2014). Second, we compare with the Score algorithm of Ning and Liu (2014). In both approaches, we choose the scaled Lasso to estimate π∗ and treat the true precision matrix and the variance of the model as known. In a certain sense, these procedures are quasi-oracle, as they use part of the information on the unknown true probability distribution of the data. We do this for a few reasons,
among which the primary one is to reduce the arbitrariness of choosing tuning parameters (and hence the quality of inference) in implementing these methods. Moreover, these methods should not be expected to behave better under their original setting, with unknown $\Sigma_{(\rho)}$ and noise level, than under our "ideal setting". All rejection probabilities are computed using 100 random samples, and all tests are carried out at the nominal size of 5%.

We collect the results in Table 1, where we clearly observe the instability of the competing de-biasing and score methods, with Type I error diverging steadily away from 5% as the sparsity s grows. In contrast, CorrT remains stable even when the model size equals p. This is true even as we change the correlation among the features and the thickness of the tails of the design and error distributions. Observe that both the de-biasing and the Score tests rapidly fail to control the Type I error rate, often approaching a random-guess result where the observed Type I error is close to 50%.

We also investigate the power properties by testing $H_0: \pi_3^* = 2/\sqrt{n}$. The data is generated by the same model as in Table 1, except that the true value is $\pi_3^* = 2/\sqrt{n} + h/\sqrt{n}$. Table 2 presents full power curves for various values of h, which measures the magnitude of the deviation from the null hypothesis. Hence, the far left of each curve shows the Type I error (h = 0), whereas the other points on the curves show power against the alternative (h ≠ 0). In all the power curves, the design matrices have a Toeplitz covariance matrix $\Sigma_{(\rho)}$ with ρ = −1/2. The two plots in the first row correspond to extremely sparse models with s = 3: LTD + LTE on the left and LTD + HTE on the right. In both figures, we observe that CorrT compares equally to the existing methods. The second row corresponds to growing s, with s = n; we clearly observe that CorrT outperforms both the de-biasing and score tests by providing firm Type I error control while also reaching full power quickly. Lastly, we present power curves for s = p; for these extremely dense models, the de-biasing and score methods can have Type I error close to 50%, whereas CorrT still provides valid inference. Interestingly, CorrT still achieves full power against alternatives that are $1/\sqrt{n}$ away from the null.
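For completeness, the following sketch reproduces the simulation design described above (LTD design with ρ = −1/2 and light-tailed errors; the heavy-tailed variants replace the Gaussian draws with $t_3$ draws). The variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, rho = 200, 500, 20, -0.5

# Toeplitz covariance Sigma_{jk} = rho^{|j-k|}; light-tailed (Gaussian) design
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
W = rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T

pi_star = np.zeros(p)
pi_star[1:4] = 2 / np.sqrt(n)                   # strong entries j = 2, 3, 4 (1-indexed)
if s > 4:                                       # remaining entries up to sparsity s
    pi_star[4:s] = rng.uniform(0, 4 / np.sqrt(n), size=s - 4)

eps = rng.normal(size=n)                        # LTE; use rng.standard_t(3, size=n) for HTE
Y = W @ pi_star + eps
# test H0: entry j = 3 equals 2/sqrt(n): take Z = W[:, 2], X = remaining columns
Z, X = W[:, 2], np.delete(W, 2, axis=1)
```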
4.2. Real Data We use the real dataset for the illustration of the proposed method. The samples were selected from the transNOAH breast cancer trial (GEO series GSE50948), available for download at “http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50948”. Genomewide gene expression profiling was performed using micro RNA from biopsies from 114 pre-treated patients with HER2+ breast cancer. The complete data contains gene expression values of about 20000 genes located on different chromosomes. Clinically, breast cancer is classified as hormone receptor (HR)-positive, HER2+ and triple-negative breast cancer. The “HER2-positive” subtype of breast cancer overexpresses the human epidermal growth factor receptor 2 (HER2). BRCA1 is a human tumor suppressor gene that is normally expressed in the cells of breast and other tissue, where they help repair damaged DNA. Research suggests that the BRCA1 proteins
Zhu and Bradic/Non-sparse Hypothesis Testing
21
Table 2 Power properties of the CorrT test across different sparsity settings in high dimensional linear models. We consider extremely sparse, s = n and s = p cases presented through top to bottom row. We also consider design setting with light tailed and heavy tailed distributions, left to right columns. 1.00
1.00
0.75
0.75
CorrT Debias
colour Power
Power
colour 0.50
CorrT 0.50
Debias
Score
Score
0.25
0.00 0.00
0.25
0.25
0.50
0.75
0.00 0.00
1.00
0.25
h
0.50
0.75
1.00
h
1.00
1.00
0.75
0.75
CorrT Debias
colour Power
Power
colour 0.50
CorrT 0.50
Debias
Score
Score
0.25
0.00 0.00
0.25
0.25
0.50
0.75
0.00 0.00
1.00
0.25
h
0.50
0.75
1.00
h
1.00
1.00
0.75
0.75
CorrT Debias
colour Power
Power
colour 0.50
CorrT 0.50
Debias
Score
0.25
0.00 0.00
Score
0.25
0.25
0.50
h
0.75
1.00
0.00 0.00
0.25
0.50
h
0.75
1.00
Zhu and Bradic/Non-sparse Hypothesis Testing
22
Table 3 Value of the Test Statistic of GSE50948 real data set concerning 114 HER+ cancer patients with 20000 genes mapped. Gene
Biological association
Test Statistic CorrT Debias Score
IGF2R Nmi RBBP4 NARS2
breast cancer tumor suppressor endogenously associated with BRCA1 breast cancer breast cancer
-4.692 -4.239 -4.186 -4.163
-4.285 -2.956 -3.314 -5.000
-4.445 -2.669 -2.806 -4.983
B3GALNT1 C3orf62 LTB TNFAIP1 CCPG1 LRRIQ3 LOC100507537 ELOVL4
lung cancer lung cancer lung cancer lung cancer prostate cancer colorectal cancer bladder cancer ataxia
1.151 -1.274 -0.131 1.231 -1.597 -1.025 -0.137 -1.354
2.082 -2.143 -2.107 2.181 -2.154 -2.480 -1.966 -2.152
2.065 -2.139 -2.143 2.118 -2.251 -2.240 -1.135 -2.136
regulate the activity of other genes including tumor suppressors and regulators of the cell division cycle. However, the association between BRCA1 and other genes is not completely understood. Moreover, it is believed that BRCA1 may regulate pathways that remove the damages in DNA introduced by the certain drugs. Thus understanding associations between BRCA1 and other genes provides a potentially important tool for tailoring chemotherapy in cancer treatment. HER2+ breast cancer is biologically heterogeneous and developing successful treatment is highly important. We apply the method developed in Section 2 to this dataset with the goal of testing particular associations between BRCA1 and and a few other genes conditional on all remaining genes present in the study. Simply applying Lasso procedure results in an estimate of all zeros. Results are reported in Table 3. Therein we report the size of the test statistic of the proposed test, CorrT, as well as the De-biasing test and the Score test. We observe that there are a number of genes where the three tests report largely different values – namely CorrT reports a non-significant finding whereas the De-biasing and the Score test report significant finding. These genes include B3GALNT1, C3orf62, TNFAIP1 and LTB gene that have been previously linked to the lung cancer (Bosse et al., 2016; Wang et al., 2016; Zhang et al., 2017; Poczobutt et al., 2016), as well as CCPG1, LRRIQ3 and LOC100507537 linked to prostate, colorectal and bladder cancer, respectively (Tang et al., 2016; Arriaga et al., 2017; Terracciano et al., 2017). Such finding would indicate that this dataset likely does not follow a sparse model and that previous methods were reporting false positives. Observe that even mutations related to certain disease (not related to cancer), such is ataxia and ELOVL4 gene, are being reported as significant by the De-biasing and the Score test; CorrT in contrast, does not report significant finding. Chromatin regulators have been known to act as a guide for cancer treatment choice (Kitange et al., 2016). Here we identify retinoblastoma binding protein RBBP4, a chromatin modeling factor, as being highly associated with BRCA1 in all HER2+ patients. This indicates that early detection and chemoterapy treatment should target RBBP4.
Zhu and Bradic/Non-sparse Hypothesis Testing
23
Results of Table 3 demonstrate the gene that a partial or complete loss of NARS2 is a significant event associated with the HER2+ breast cancer. Significance of this gene was confirmed in Holm et al. (2012) where a significant correlation was found between the gene CCND1 (known trigger of the ER positive breast cancer) and NARS2. Our analysis enriches these findings and shows statistically significant association between NARS2 and BRCA1 for HER2+ cancer cells. Lastly, Table 3 showcases significance of a known breast tumor suppressor gene IGF2R (Oates et al., 1998) Mannose 6-phosphate/IGF2 receptor (M6P/IGF2R) is mutated in breast cancer, leading to loss of IGF2 degradation (Ellis et al., 1998) leading to its usage as a sensitivity marker for radiation, chemotherapy, and endocrine therapy. Germline mutations in BRCA1 predispose individuals to breast and ovarian cancers. A novel endogenous association of BRCA1 with Nmi (N-Myc-interacting protein) in breast cancer cells was detected in Li et al. (2002). Nmi was found to interact specifically with BRCA1, both in-vitro and in-vivo, by binding to two major domains in BRCA1. Table 3 showcases significance of the association between Nmi and BRCA1 gene. The above analysis showcases parallel and new discoveries of our method to the newly established results in medicine, and provides evidence that our methods can be used to discover new scientific findings in applications involving high-dimensional datasets. 5. Discussion This article considers inference on single regression parameters in high-dimensional linear models. We consider the setup in which no assumption is imposed on the sparsity of the model. We construct the test statistic by proposing a novel moment-based approach that incorporates the null hypothesis. The proposed procedure is proved to have tight Type I error control regardless of whether or not the model is sparse. Our test also has desirable power properties. Our test achieves efficiency in terms of local power when the model is indeed sparse; moreover, the proposed test also has optimal detection rate for extremely dense models and hybrid (sparse+extremely dense) models even if p n. For these reasons, our test greatly complements existing literature since existing sparsity-based methods might not control Type I error for non-sparse models and sparsity assumptions are difficult to test. Our results also opens doors to other research areas. First, the inference on single entries of high-dimensional parameters is a very different problem from inference of the entire high-dimensional parameter when the model is not sparse. For the latter problem, Ingster et al. (2010) among others show that the minimax detection rate does not converge to zero; for the former problem, our test has detection rate O(n−1/2 ) for many dense models. A natural question is to investigate the minimax detection rate for inference of single entries in high-dimensional parameters. Second, our methodology can be used or extended for other more general testing problems. For the problem of simultaneously testing multiple components of the regression parameter, a similar moment-based strategy can be applied. Our method can also be extended for inference on signal-to-noise ratios where in contrast to Janson et al. (2015) we would not need to require ΣW = Ip
Zhu and Bradic/Non-sparse Hypothesis Testing
24
and p n. Appendix A: Proofs of theoretical results A.1. Proof of Theorem 1 b = 0) → Proof of Theorem 1. The proof proceeds in two steps. We first show that P (θ 1 and then prove the desired result. b = 0) → 1. Step 1: Prove that P (θ We show that, with probability approaching one, the vector of all zeros, satisfies the KKT condition of the Lasso optimization problem (see Lemma 2.1 of B¨ uhlmann and Van de Geer (2011)), i.e., p P kn−1 W> Yk∞ ≤ 16 n−1 log p → 1. (17)
P Let ξi,j = ap−1/2 pl=1,l6=j xi,l + εi , where xi,l denotes the l-th entry of xi . Notice that (1) {xi,j }ni=1 and {ξi,j }ni=1 are independent by construction of the model (design have i.i.d. rows) , (2) yi = ap−1/2 xi,j + ξi,j by the definition of the model and ξi,j and (3) 2 = a2 (1 − p−1 ) + 1 for ∀1 ≤ i ≤ n and ∀1 ≤ j ≤ p. Eξi,j Notice that x2i,j is a chi-squared random variable with one degree of freedom. Since x2i,j − 1 is a chi-squared random variable right it has bounded sub-exponential norm, Proposition 5.16 of Vershynin (2010) and the union bound imply that for some constants c1 , c2 > 0 and for any z > 0 P
! n p −1/2 X 2 (xi,j − 1) > z log p n j=1 i=1 √ 2 z log p z log p ≤ 2p exp −c1 min , . c22 c1 n−1/2
! p n X p −1/2 X 2 P max n (xi,j − 1) > z log p ≤ 1≤j≤p i=1
It follows that there exists a constant c3 > 0 such that ( ) n X p −1 2 −1 P (An ) → 1 with An = max n xi,j − 1 > c3 n log p . 1≤j≤p
(18)
i=1
Pn Since {xi,j }ni=1 is independent of {ξi,j }ni=1 , we have that n−1/2 Pn i=12xi,j2ξi,j conditional n −1 2 on {xi,j }i=1 is Gaussian with mean zero and variance n i=1 xi,j σξ , where σξ = 2 −1 a (1 − p ) + 1.
Zhu and Bradic/Non-sparse Hypothesis Testing
25
p Let w > 0 satisfy 2 log w = 1.452 log p / 1 + c3 n−1 log p . Then, on the event An ,
=
= (i)
≤
(ii)
=
(iii)
≤
(iv)
=
! n p −1/2 X n P n xi,j ξi,j > 1.45σξ log p {xi,j }i=1 i=1 Pn √ n−1/2 i=1 xi,j ξi,j 1.45 log p n q P >q P Pn {xi,j }i=1 n 2 2 −1 −1 σξ n n i=1 xi,j i=1 xi,j √ log p 1.45 1 − Φ q P n−1 ni=1 x2i,j √ 1.45 log p 1 − Φ q p −1 1 + c3 n log p p 2 log w . 1−Φ w−1
1.452 log p
, exp − p −1 2 1 + c3 n log p
where (i) holds by the definition of An , (ii) holds by the definition of w, (iii) holds by Lemma 1 and (iv) holds by the definition of w. By the law of iterated expectations and the union bound, we have that ! n p −1/2 X P max n xi,j ξi,j > 1.45σξ log p (19) 1≤j≤p i=1 2 −1.45 log p (i) + P (Acn ) = o(1), ≤ p exp (20) p 2 1 + c3 n−1 log p
Zhu and Bradic/Non-sparse Hypothesis Testing
26
where (i) holds by p → ∞, n−1 log p → 0 and (18). Notice that p P kn−1/2 W> Yk∞ > 1.5 (a2 + 1) log p ! n n X p (i) −1/2 X = P max n xi,j ξi,j + ap−1/2 n−1/2 x2i,j > 1.5 (a2 + 1) log p 1≤j≤p i=1 i=1 ! n X p ≤ P max n−1/2 xi,j ξi,j > 1.45σξ log p 1≤j≤p i=1 ! n p p −1/2 −1/2 X 2 +P max ap log p n xi,j > 1.5 a2 + 1 − 1.45σξ 1≤j≤p i=1 ! n p p X p (ii) (iii) −1 = o(1) + P max n log p = o(1), x2i,j > a−1 p/n 1.5 a2 + 1 − 1.45σξ 1≤j≤p i=1
where (i) holds by yi = ap−1/2 xi,j + ξi,j , (ii) holds by (20) and (iii) holds by (18) and p p p p p/n 1.5 a2 + 1 − 1.45σξ log p p(log p)/n → ∞. √ b = 0) → 1. Since 1.5 a2 + 1 ≤ 16 ∀a ∈ [−10, 10], we have proved (17). Hence, P (θ
∗ Step 2: Establish the desired result. Since yi = x> i γ + εi is independent of zi 2 2 2 ∗ 2 2 and Ezi yi = Eyi = kγ k2 + 1 = a + 1, the central limit theorem implies that
n
−1/2
n X i=1
zi yi →d N (0, a2 + 1).
b = 0) → 1 (by step 1), √n(βe − β ∗ ) = n−1/2 Z > Y = n−1/2 Pn zi yi with Since P (θ i=1 probability approaching one. Hence, √ n(βe − β ∗ ) →d N (0, a2 + 1). bΣ b W Θ) b 1,1 = n−1 Z> Z = n−1 Pn z 2 , the law of large numbers implies that Since (Θ i=1 i bΣ b W Θ) b 1,1 = 1 + oP (1). By Slutzsky’s lemma, we have (Θ √ e n(β − β ∗ ) q →d N (0, a2 + 1). (21) b b b (ΘΣW Θ)1,1 Notice that
√ e ∗ n(β − β ) α P (β ∗ ∈ CI1−α ) = P −Φ−1 1 − ≤ Φ−1 1 − ≤q 2 2 bΣ b W Θ) b 1,1 (Θ α p 2 α p 2 (i) = Φ Φ−1 1 − / a + 1 − Φ −Φ−1 1 − / a +1 2 2 α p 2 (ii) −1 = 2Φ Φ 1− / a + 1 − 1, 2
α
Zhu and Bradic/Non-sparse Hypothesis Testing
27
where (i) holds by (21) and (ii) holds by the identity Φ(−z) = 1 − Φ(z) for z ≥ 0. The proof is complete.
A.2. Proof of Theorem 2 The proof of Theorem 2 is a consequence of a series of lemmas presented below. √ −1 (1 − w −1 ) ≤ Lemma 1. For any w ≥ 5, Φ 2 log w. For any w ≥ 14, Φ−1 (1 − w−1 ) ≥ √ log w. Proof of Lemma 1. By Lemma 2 on page 175 of Feller (1968), we have that for any z > 0 (z −1 − z −3 ) exp(−z 2 /2) exp(−z 2 /2) √ √ ≤ 1 − Φ(z) ≤ . (22) 2π 2πz √ 2π · 1 · exp(12 /2) < of z 7→ Fix an arbitrary w ≥ 5. Since √ √ 5, the continuity 2 2 2πz exp(z /2) implies that there exists z0 ≥ 1 with w = 2πz0 exp(z0 /2). By (22), we have 1 − Φ(z0 ) ≤ w−1 . In other words, (i) p Φ−1 1 − w−1 ≤ z0 < 2 log w,
√ where (i) holds by w = 2πz0 exp(z02 /2) > exp(z02 /2) (since z0 ≥ 1). The first result follows. To see √ the second result, we observe the elementary√inequality x−1 − x−3 ≥ x−2 for any x ≥ ( 5 + 1)/2. Hence, (22) implies that for z ≥ ( 5 + 1)/2, 1 − Φ(z) ≥
(z −1 − z −3 ) exp(−z 2 /2) 1 √ ≥√ . 2π 2πz 2 exp(z 2 /2)
√ In other words, for z ≥ ( 5 + 1)/2, −1 Φ 1− √
1 2πz 2 exp(z 2 /2)
≥ z.
√ √ Now let w = exp(z 2 ). Then z = log w and we have that for w ≥ 14 > exp[( 5 + 1)2 /4], p 1 −1 Φ 1− √ √ ≥ log w. 2π w log w √ √ It is straight-forward to verify that for any w ≥ 14, 1/( 2π w log w) ≥ 1/w and hence 1 1 −1 −1 Φ ≥Φ 1− 1− √ √ . w 2π w log w The second result follows from the above two displays. The proof is complete.
Zhu and Bradic/Non-sparse Hypothesis Testing
28
Lemma 2. Let Condition 1 hold. Suppose that H0 in (2) holds. Then, with probability tending to one, γ ∗ lies in the P feasible set of the optimization problem (5) for a = a∗,γ , where a2∗,γ = n−1 max1≤j≤p ni=1 x2i,j ε2i . Proof of Lemma 2. Under H0 in (2), Y − Zβ0 = Xγ ∗ + ε. Then, it suffices to verify the following claims: (a) P kn−1 X> εk∞ ≤ η0 a∗,γ → 1. (b) P kεk∞ ≤ kVk2 / log2 n → 1. (c) P n−1 V> ε ≥ n−1 kVk22 ρn → 1.
We proceed in three steps with each step corresponding to one of the above claims. In the rest of the proof, we denote by vi the i-th entry of V. the claim (a). Fix P δ ≥ 2. For 1 ≤ j ≤ p, define Ln,j = PnStep 1: Establish n 2+δ = nE|x ε |2+δ and B 2 2 E|x ε | = i,j i 1,j 1 n,j i=1 i=1 E(xi,j εi ) = nE(x1,j ε1 ) . Since x1,j and ε1 have bounded sub-Gaussian norms, Ln,j /n is bounded above; since E(x1,j ε1 )2 is bounded away from zero and infinity, Bn,j /n is bounded away from zero and infinity. Hence, there exists a constant K1 > 0 such that for 1 ≤ j ≤ p, 1/(2+δ)
Bn,j /Ln,j
≥ nδ/(2+δ) K1 .
Let an = Φ−1 (1−p−1 n−1 ). By Theorem 7.4 of Pe˜ na et al. (2008), there exists an absolute constant A > 0 such that for 1 ≤ j ≤ p, " Pn 2+δ # x ε 1 + a i,j i n P qPi=1 ≥ an ≤ [1 − Φ(an )] 1 + A δ/(2+δ) K n 2 2 n 1 i=1 xi,j εi
and
" Pn 2+δ # x ε − 1 + a i,j i n i=1 P qP ≥ an ≤ Φ(−an ) 1 + A . n 2 ε2 nδ/(2+δ) K1 x i=1 i,j i p Since Φ(−an ) = 1 − Φ(an ) = p−1 n−1 and an = Φ−1 (1 − p−1 n−1 ) ≤ 2 log(pn) (due to Lemma 1), the above two displays imply that, for 1 ≤ j ≤ p, !2+δ p Pn 2 log(pn) 1 + x ε | | i,j i . P qPi=1 ≥ an ≤ 2p−1 n−1 1 + A n 2 2 nδ/(2+δ) K1 x ε i=1 i,j i
By the union bound, we have that Pn | xi,j εi | P max qPi=1 ≥ an ≤ 2n−1 1 + A n 1≤j≤p 2 2 i=1 xi,j εi
!2+δ p 1 + 2 log(pn) (i) = o(1), (23) δ/(2+δ) n K1
where (i) holds by log p = o(n) and δ ≥ 2. Since η0 = 1.1n−1/2 an , we have that −1/2 Pn n xi,j εi qP i=1 ≥ η0 = o(1). P max n 1≤j≤p 2 2 i=1 xi,j εi
Zhu and Bradic/Non-sparse Hypothesis Testing
29
Claim (a) follows by the fact that −1/2 Pn P n max1≤j≤p n−1/2 ni=1 xi,j εi n−1 kX> εk∞ i=1 xi,j εi qP qP = ≤ max . n n 1≤j≤p a∗,γ x 2 ε2 x 2 ε2 max 1≤j≤p
i=1
i=1
i,j i
i,j i
Step 2: Establish the claim (b). By the law of large numbers, ∗ 2 2 2 n−1 kVk22 = oP (1) + Ev12 = oP (1) + E(x> 1 γ ) + Eε1 ≥ oP (1) + Eε1 .
Since Eε21 is bounded away from zero, there exists a constant M > 0 such that √ P (kVk2 / n ≥ M ) → 1. On the other √ hand, εi has bounded sub-Gaussian norm and √ √ thus by the union bound, kεk∞ = OP ( log n). Since n/ log2 n log n, claim (b) follows. Step 3: Establish the claim (c). By the law of large numbers, n−1 V> ε = oP (1) + Ev1 ε1 = oP (1) + Eε21 . Again, since Eε21 is bounded away from zero, there exists a constant M 0 > 0 such that P (n−1 V> ε ≥ M 0 ) → 1. On the other hand, due to n−1 kVk22 = Ev12 + oP (1), ρn = o(1) and Ev12 = O(1), we have that n−1 kVk22 ρn = oP (1). Claim (c) follows. We have showed the claims (a)-(c). The proof is complete. √ Lemma 3. Let Condition 1 hold. Let σ bε = kV − Xb γ k2 / n. Suppose that H0 in (2) holds. Then, with probability tending to one, we have that (1) kn−1 X> (V − Xb γ )k∞ /b σ ≤ 2kXk∞ η0 ρ−1 n and √ ε 2 (2) kV − Xb γ k∞ /b σε ≤ n/(ρn log n). P √ Proof of Lemma 3. We define a2∗,γ = max1≤j≤p n−1 ni=1 x2i,j ε2i and a0,γ = kXk∞ kVk2 / n. We first show that P (2a0,γ ≥ a∗,γ ) → 1. By the law of large numbers, n−1
n X i=1
∗ 2 (4vi2 − ε2i ) = E(4vi2 − ε2i ) + oP (1) = 3σε2 + 4E(x> i γ ) + oP (1),
where σε2 = Eε2i . Hence, P
4a20,γ
≤
a2∗,γ
=P
4n
kXk2∞ kVk22
≤ max n 1≤j≤p
4n−1 kVk22 ≤ max n−1
=P
1≤j≤p
4n−1 kVk22 ≤ n−1
≤P (i)
−1
n X i=1
ε2i
n X i=1
!
−1
n X i=1
x2i,j ε2i
!
2 2 kXk−2 ∞ xi,j εi
∗ 2 ≤ P 3σε2 + 4E(x> i γ ) + oP (1) ≤ 0
!
(ii) = o(1),
(24)
Zhu and Bradic/Non-sparse Hypothesis Testing
30
where (i) follows by (24) and (ii) follows by the fact that σε is bounded away from zero. In other words, P (2a0,γ ≥ a∗,γ ) → 1. Notice that the feasible set of the optimization problem (5) is increasing in a. By Lemma 2, γ ∗ lies in the feasibility set of the optimization problem (5) for a = a∗,γ with probability approaching one. Hence, P (2a0,γ ≥ a∗,γ ) → 1 implies P (A) → 1, where A denotes the event that γ ∗ lies in the feasibility set of the optimization problem b is well defined. (5) for a = 2a0,γ . Observe that on the event A, γ By the last constraint in (5), we have that on the event A, n−1/2 kVk2 σ bε = n−1 kVk2 kV − Xb γ k2 ≥ n−1 V> (V − Xb γ ) ≥ ρn n−1 kVk22 .
Hence, on the event A ,
√
nρ−1 bε . n σ T b=γ e (2a0,γ ) and thus Define the event M = {Sγ = ∅}. On the event M A, γ kVk2 ≤
√ (i) kn−1 X> (V − Xb γ )k∞ ≤ 2η0 a0,γ = 2η0 kXk∞ kVk2 / n ≤ 2η0 kXk∞ ρ−1 bε , n σ
where (i) follows by T (25). c b=γ e (b A, Sγ 6= ∅, γ aγ ) for some σ bγ ∈ Sγ and therefore, On the event M v u n X 1 (i) u −1 t b )2 b aγ ≤ max n x2i,j (vi − x> i γ 1≤j≤p 2 i=1 v u n X u −1 t b )2 = kXk∞ σ ≤ kXk∞ n (vi − x> bε , i γ
(25)
(26)
(27)
i=1
where (i) follows by the definition of Sγ in (6). Notice that on the event Mc (i)
(ii)
kn−1 X> (V − Xb γ )k∞ ≤ η0 b aγ ≤ 2kXk∞ η0 σ bε
T
A, (28)
b = γ e (b where (i) follows by the first constraint in (5) and the fact that γ aγ ) and (ii) follows by (27). Since P (A) → 1 and ρn ≤ 1, part (1) follows by (26) and (28). In regards to part (2), notice that on the event A, kV − Xb γ k∞
√ nb σε ≤ kVk2 / log n ≤ , ρn log2 n
(i)
2
(ii)
where (i) follows by the second constraint in (5) and (ii) follows by (25). Part (2) follows. Lemma 4. Let Condition 1 hold. Then, with probability tending to one, θ ∗ lies inPthe feasible set of the optimization problem (8) for a = a∗,θ , where a2∗,θ = max1≤j≤p n−1 ni=1 x2i,j u2i .
Zhu and Bradic/Non-sparse Hypothesis Testing
31
Proof of Lemma 4. The argument is identical to the proof of Lemma 2, except that V, γ ∗ , a∗,γ and {εi }ni=1 are replaced by Z, θ ∗ , a∗,θ and {ui }ni=1 , respectively. Since Z = Xθ ∗ + u holds regardless of whether or not H0 holds, the same reasoning in the proof of Lemma 2 applies. Lemma 5. Let Condition 1 hold. If s/p → 0 and s log p = o(n), Then there exists a constant K > 0 such that ! kn−1/2 Xvk2 P min min > K → 1. kvJ k2 A⊂{1,··· ,p}, |A|≤s kvAc k1 ≤kvA k1 Proof of Lemma 5. We invoke Theorem 6 of Rudelson and Zhou (2013) with 1/2 (A, q, p, s0 , k0 , δ) = (ΣX , p, p, s, 1, 1/2). Clearly, RE(s, 3, A) condition holds; see Definition 1 in Rudelson and Zhou (2013). −1/2 Let Ψ = XΣX . Notice that rows of Ψ are independent vectors with mean zero and variance Ip and hence are isotropic random vectors; see Definition 5 of Rudelson and Zhou (2013). By Condition 1, rows of Ψ are also ψ2 vectors for some constant α > 0. By the well-behaved eigenvalues of ΣX in Condition 1, we have that m defined in the statement of Theorem 6 of Rudelson and Zhou (2013) satisfies m ≤ min{Cs, p} for some constant C > 0. Since s/p → 0 and s log p = o(n), we have that, for large n, 2000mα4 60ep n≥ log . δ2 mδ By the well-behaved eigenvalues of ΣX , K(s, 1, A) defined in Rudelson and Zhou (2013) is bounded below by a constant C0 > 0. Since X = ΨA, it follows, by Theorem 6 of therein (in particular Equation (14) therein), that for large n, ! kn−1/2 Xvk2 min min P > (1 − δ)/C0 ≥ 1 − 2 exp(−δ 2 n/2000α4 ). kvJ k2 A⊂{1,··· ,p}, |A|≤s kvAc k1 ≤kvA k1 The proof is complete. b ∈ Rp and a matrix M ∈ Rn×p . Suppose Lemma 6. Consider vectors H ∈ Rn and b, b b ∞ ≤ ησ. that for some σ, η > 0, kn−1 M> (H − Mb)k∞ ≤ ησ and kn−1 M> (H − Mb)k Assume that kn−1/2 Mvk2 min min >K (29) kvJ k2 A⊂{1,··· ,p}, |A|≤s kvAc k1 ≤kvA k1
b 1 ≤ kbk0 , then for some K > 0. If kbk0 ≤ s and kbk
b − bk1 ≤ 4K −2 ησs kb
and v v u u n n X X u u √ 2 > 2 > 2 2 −1 −1 t t b − max n Mi,j (Hi − Mi b) max n Mi,j (Hi − Mi b) ≤ 4kMk∞ σK −1 η s, 1≤j≤p 1≤j≤p i=1 i=1
Zhu and Bradic/Non-sparse Hypothesis Testing
32
where Mi,j , Mi> and Hi denote the (i, j) entry of M, the i-th row of M and the i-th entry of H, respectively and kMk∞ = max1≤j≤p, 1≤i≤n |Mi,j |. Proof of Lemma 6. The argument closely follows the proof of Theorem 7.1 of Bickel b − b. Since kbk b 1 ≤ kbk0 , it follows that et al. (2009). Define J = support(b) and ∆ = b b J k1 + k∆J c k1 = kb b J k1 + kb b J c k1 ≤ kbJ k1 and thus k∆J c k1 ≤ k∆J k1 , implying that kb k∆J k1 ≤ k∆k1 ≤ 2k∆J k1 .
(30)
By the triangular inequality, we have b ∞ ≤ 2ησ. kn−1 M> M∆k∞ ≤ kn−1 M> (H − Mb)k∞ + kn−1 M> (H − Mb)k
(31)
Therefore,
(i) (ii) (iii) √ n−1 ∆> M> M∆ ≤ k∆k1 kn−1 M> M∆k∞ ≤ 4ησk∆J k1 ≤ 4ησ sk∆J k2 ,
(32)
where (i) follows by Holder’s inequality, (ii) follows by (30) and (31) and (iii) follows by Holder’s inequality and |J| ≤ s. Since k∆J c k1 ≤ k∆J k1 and |J| ≤ s, we have that n−1 ∆> M> M∆ ≥ K 2 k∆J k22 (due to the assumption in (29)). This and the above display imply that √ k∆J k2 ≤ 4K −2 ησ s. (33) √ The first claim follows by Holder’s inequality: k∆J k1 ≤ sk∆J k2 q ≤ 4K −2 ησs. P 2 a2 To show the second claim, we define k · kM,j on Rn by kakM,j = n−1 ni=1 Mi,j i
for a = (a1 , ..., an )> . For any a, b ∈ Rn , we define aM,j = (M1,j a1 , ..., Mn,j an )> ∈ Rn and bM,j = (M1,j b1 , ..., Mn,j bn )> ∈ Rn and notice that (i)
ka + bkM,j = kaM,j + bM,j k2 ≤ kaM,j k2 + kbM,j k2 = kakM,j + kbkM,j , where (i) follows by the triangular inequality for the Euclidean norm in Rn . Hence, k · kM,j is a semi-norm for any 1 ≤ j ≤ p. The semi-norm property of k · kM,j for all 1 ≤ j ≤ p implies that v u n X u 2 (M > ∆)2 −1 t b max kH − MbkM,j − kH − MbkM,j ≤ max kM∆kM,j = n Mi,j i 1≤j≤p
1≤j≤p
i=1
v u n X u −1 t (Mi> ∆)2 ≤ kMk∞ n i=1
p = kMk∞ n−1 ∆> M> M∆ √ ≤ 4kMk∞ K −1 ησ s,
(i)
Zhu and Bradic/Non-sparse Hypothesis Testing
33
where (i) follows by (32) and (33). The second claim follows by observing b M,j − max kH − MbkM,j ≤ max kH − Mbk b M,j − kH − MbkM,j . max kH − Mbk 1≤j≤p
1≤j≤p
1≤j≤p
The proof is complete.
b ∗ k1 = OP n−1/2 sθ log(p ∨ n) and kX> (Z− Lemma 7. Let Condition 1 hold. Then kθ−θ b ∞ = OP (√n log(p ∨ n)). Xθ)k P Proof of Lemma 7. We define a2∗,θ = max1≤j≤p n−1 ni=1 x2i,j u2i and the events A1 = {θ ∗ lies in the feasible set of the optimization problem (8) for a = a∗,θ .} ) ( kn−1/2 Xvk2 >K A2 = min min kvJ k2 A⊂{1,··· ,p}, |A|≤s kvAc k1 ≤kvA k1
A3 = {a∗,θ ∈ Sθ }
A4 = {b aθ ≤ 3a∗,θ } , where K > 0 is the constant defined in Lemma 5. By Lemmas 4 and 5, \ P A1 A2 → 1.
(34)
We proceed in three steps. We first show that P (A3 ) → 1 and then that P (A4 ) → 1. Finally, we shall show the desired results. T Step 1: Establish that P (A3 ) → 1. Notice that, on the event A1 A2 , kn−1 X> (Z− e ∗,θ ))k∞ ≤ η0 a∗,θ , kn−1 X> (Z − Xθ ∗ )k∞ ≤ η0 a∗,θ and kθ(a e ∗,θ )k1 ≤ kθ ∗ k1 . Then we Xθ(a ∗ b e can apply Lemma 6 with T (H, M, b, b, σ, s, η) = (Z, X, θ , θ(a∗,θ ), a∗,θ , sθ , η0 ) and obtain that on the event A1 A2 , v v u u n n X X u u √ ∗ 2 > 2 > e ∗,θ ))2 − t max n−1 t max n−1 xi,j (zi − xi θ(a xi,j (zi − xi θ )2 ≤ 4a∗,θ kXk∞ K −1 η0 sθ . 1≤j≤p 1≤j≤p i=1 i=1
Notice that a2∗,θ = max n−1 1≤j≤p
Pn
2 i=1 xi,j (zi
∗ 2 − x> i θ ) . Thus, the above display implies that
r Pn e ∗,θ ))2 θ(a max n−1 i=1 x2i,j (zi − x> i 1≤j≤p √ (i) − 1 ≤ 4kXk∞ K −1 η0 sθ = oP (1), a∗,θ
p p where (i) holds by η0 ≤ 1.1 2n−1 log(pn) (due to Lemma 1), kXk∞ = OP ( log(pn)) (due to the sub-Gaussian property and the union bound) and the rate condition for sθ . Therefore,
Zhu and Bradic/Non-sparse Hypothesis Testing
1 P (A3 ) = P 2 ≤
r
max n−1
1≤j≤p
Pn
2 >e 2 i=1 xi,j (zi − xi θ(a∗,θ ))
a∗,θ
34
3 ≤ → 1. 2
(35)
b = θ(b e σθ ) for Step 2: Establish that P (A4 ) → 1. On the event A3 , Sθ 6= ∅ and thus θ some b aθ ≥ a∗,θ (since b aθ is the maximal element of Sθ ). Notice that the feasible set of the e optimization problem (8) is nondecreasing in aTand T thus the mapping of a 7→ kθ(a)k 1 e e is non-increasing. Therefore, on the event A1 A2 A3 , kθ(b aθ )k1 ≤ kθ(a∗,θ )k1 . By e ∗,θ )k1 ≤ kθ ∗ k1 (on the event A1 ), it follows that on the event A1 T A2 T A3 , kθ(a
e aθ )k1 ≤ kθ ∗ k1 , kn−1 X> (V−Zθ(b e aθ ))k∞ ≤ η0 b kθ(b aθ and kn−1 X> (V−Zθ∗ )k∞ ≤ η0 a∗,θ ≤ η0 b aθ .
b σ, s, η) = (Z, X, θ ∗ , θ(b e aθ ), b Hence, we can apply Lemma 6TwithT(H, M, b, b, aθ , sθ , η0 ) and obtain that on the event A1 A2 A3 , v u n X u 2 > 2 e /b t max n−1 x (z − x θ(b a )) − a i θ ∗,θ i,j i aθ 1≤j≤p i=1 v v u u n n X X u u ∗ 2 2 > 2 > −1 2 −1 t t e xi,j (zi − xi θ(b aθ )) − max n xi,j (zi − xi θ ) /b aθ = max n 1≤j≤p 1≤j≤p i=1 i=1 √ ≤4kXk∞ K −1 η0 sθ = oP (1). In other words, v u n X u t max n−1 e aθ ))2 = a∗,θ + oP (1)b x2i,j (zi − x> aθ . i θ(b 1≤j≤p
i=1
Step 3: Establish the desired results. Now notice that on the event A1 (i) e aθ )k1 ≤ kθ ∗ k1 kθ(b (ii) (iii) −1 X> (V − Zθ(b e aθ ))k∞ ≤ η0 b kn a ≤ 3η0 a∗,θ θ (iv) kn−1 X> (V − Zθ ∗ )k ≤ η a ≤ 3η a . ∞ 0 ∗,θ 0 ∗,θ
T
A2
T
A3
T
where (i) is proved in Step 2, (ii) holds by the definition of Sθ and (iii) and (iv) hold by the definitions of A3 and A1 , respectively. Hence, we can apply Lemma 6 b σ, s, η) = (Z, X, θ ∗ , θ(b e aθ ), 3a∗,θ , sθ , η0 ) and obtain that on the event with M, b,Tb, T (H,T A1 A2 A3 A4 , e σθ ) − θ∗ k1 ≤ 12K −2 η0 a∗,θ sθ = 13.2n−1/2 K −2 Φ−1 (1 − n−1 p−1 )a∗,θ sθ kθ(b (i) p ≤ 13.2n−1/2 K −2 2 log(pn)a∗,θ sθ ,
(36)
A4 ,
Zhu and Bradic/Non-sparse Hypothesis Testing
35
where (i) follows by Lemma 1. We notice that a2∗,θ
=n
−1
max
1≤j≤p
n X i=1
x2i,j u2i
≤
kXk2∞ n−1
n X i=1
(i)
u2i = kXk2∞ [E(u21 ) + oP (1)] (ii)
= OP (log(pn))[E(u21 ) + oP (1)]
= OP (log(pn)), p where (i) follows by the law of large numbers and (ii) follows by kXk∞ = OP ( log(pn)) (due to the sub-Gaussian property and the T T T union bound). The first result follows by (36), the above display and P (A1 A2 A3 A4 ) → 1. T T T For the second result, we notice that on the event A1 A2 A3 A4 , (i)
(ii)
(iii) b ∞ ≤ η0 b kn−1 X> (Z − Xθ)k aθ ≤ 3η0 a∗,θ = OP (n−1/2 log(p ∨ n))
b = θ(b e where (i) follows by the first constraint in (8) and the fact that θ p aθ ) on the event A3 , (ii) follows by the p definition of A4 and (iii) follows by a∗,θ = OP ( log(pn)) (proved above) and η0 = O( n−1 log(pn)) (due to Lemma 1). This proves the second result. Proof of Theorem 2. Let H0 hold. By Lemmas 2 and 7, we have that, with probability tending to one, the optimization problems (5) and (8) are feasible and thus the test statistic Tn (β0 ) is well defined. Recall that V = Y − Zβ0 . Since H0 in (2) holds, we have V = Xγ ∗ + ε.
(37)
P Define a2∗,γ = max1≤j≤p ni=1 x2i,j ε2i . Let A denote the event that (i) γ ∗ lies in the feasible set of the optimization problem (5) for a = a∗,γ , (ii) kn−1 X> (V − Xb γ )k∞ /b σε ≤ √ γ k∞ /b σε ≤ n/(ρn log2 n). By Lemmas 2 and 3, 2kXk∞ η0 ρn−1 and (iii) kV − Xb P (A) → 1. (38) P P b2 = σ e 2 = n ξ 2 and D bε−2 n−1 ni=1 εb2i u b2i , where Define ξi = n−1/2 σ bε−1 εbi ui 1{A}, D i=1 i b b and u εbi = vi − x> bi = zi − x> i γ i θ. We define Pn b e ξi n−1/2 (V − Xb D γ )> X(θ ∗ − θ) e Tn = × i=1 + . b e b D D σ b D ε | {z } | {z } Tn,1
Tn,2
By straight-forward computation, one can verify that on the event A, Tn (β0 ) = Ten . By (38), it suffices to show that Ten →d N (0, 1) under H0 . We do so in two steps. First, e D b = 1 + oP (1) and Tn,2 = oP (1); second, we show Tn,1 →d N (0, 1). we show that D/ e D b = 1 + oP (1) and Tn,2 = oP (1). Step 1: show that D/
Zhu and Bradic/Non-sparse Hypothesis Testing
36
Notice that on the event A, n n X X 2 b 2 e 2 −2 −1 2 2 2 2 −2 −1 bε n εbi (b ui − ui ) ≤ max u bi − ui σ bε n εb2i D − D = σ 1≤i≤n i=1
i=1
b − θ ∗ )k2 = kX(θ ∞ 2 b ≤ kXk kθ − θ ∗ k2 ∞
1
= OP (log(pn))OP s2θ n−1 log2 (p ∨ n)
(i)
= oP (1), (39) p where (i) follows by Lemma 7 and kXk∞ = OP ( log(pn)) (due to the bounded sub2 = E(u2 | X, ε). Observe Gaussian norm of entries in X and the union bound). Let σu,i i that (38) implies ! n n X X e2 2 2 P D − n−1 σu,i bε−2 n−1 εb2i (u2i − σu,i )1{A} → 1. (40) =σ i=1 i=1 2 )b bε−2 max1≤i≤n εb2i 1{A}. By the Define qi = (u2i − σu,i ε2i σ bε−2 1{A} and dn = n−1 σ definition of A, (i) 1 dn = n−1 kV − Xb γ k2∞ σ bε−2 1{A} ≤ 2 , (41) ρn log4 n where (i) follows by the definition of A. Let Fn,0 be the σ-algebra generated by (ε, X). By assumption, {ui }ni=1 is independent across i conditional on Fn,0 . Notice that (V, X) only depends on (ε, X) (due to (37)) and that γ b and σ bε are computed using only (V, X). Therefore, {b εi }ni=1 , σ bε and 1{A} are Fn,0 -measurable. 2 Let K > 0 be a constant that upper bounds the sub-exponential norm of u2i − σu,i conditional on Fn,0 ; such a constant K exists because ui has bounded sub-Gaussian norm 2 |F 2 2 n conditional on Fn,0 . Since E(u2i −σu,i n,0 ) = 0 and {ui −σu,i }i=1 is independent across i conditional on Fn,0 , we apply Proposition 5.16 of Vershynin (2010) to the conditional probability measure P (· | Fn,0 ) and obtain that there exists a universal constant c > 0 such that on the event A, ∀t > 0, ! n t t2 −1 X P n qi > t Fn,0 ≤ 2 exp −c min Pn 4 −4 , 2 n−2 Kdn K bi σ bε i=1 ε i=1 (i) t2 t ≤ 2 exp −c min , 2 2 K dn Kdn (ii) t2 ρ4n log8 n tρ2n log4 n , , ≤ 2 exp −c min K2 K where (i) follows by n X i=1
εb4i σ bε−4 ≤ σ bε−2
max εb2i
1≤i≤n
σ bε−2
n X i=1
εb2i
!
= nb σε−2 max εb2i = n2 dn 1≤i≤n
Zhu and Bradic/Non-sparse Hypothesis Testing
37
and (ii) follows by (41). By (38) and ρn log2 n → ∞, the above display implies that ! ! n n X e2 −1 X −1 2 c P D − n σu,i > t ≤ P (A ) + P n ∀t > 0. qi > t = o(1) i=1
i=1
The above display implies that
e 2 − n−1 D
It follows by (39) that
Since P (c1 ≤ n−1
Pn
2 i=1 σu,i
We observe that
e D = b D
s
b 2 = n−1 D
n X
2 σu,i = oP (1).
(42)
2 σu,i + oP (1).
(43)
i=1
n X i=1
≤ c2 ) → 1 for some constants c1 , c2 > 0, we have that P 2 + o (1) n−1 ni=1 σu,i P P = 1 + oP (1). n 2 −1 n i=1 σu,i + oP (1)
b − θ ∗ k1 n−1/2 k(V − Xb γ )> Xk∞ kθ b σ bε D √ b − θ ∗ k1 (i) 2 nkXk η ρ−1 kθ ∞ 0 n ≤ b D √ −1 −1/2 log(p ∨ n) (ii) 2 nkXk∞ η0 ρn OP sθ n q = = oP (1), P 2 + o (1) n−1 ni=1 σu,i P
|Tn,2 | ≤
(44)
where (i) follows by the definition of A and (ii) follows by (43) and Lemma 7. Step 2: show that Tn,1 →d N (0, 1). Define Fn,i as the σ-algebra generated by (X, ε, u1 , ..., ui ). Since E(ui+1 | Fn,i ) = 0 (by assumption), we have that {(ξi , Fn,i )}ni=1 is a martingale difference array, i.e., E(ξi+1 | Fn,i ) = 0. We invoke the martingale central limit theorem. By Theorem 3.4 of Hall and Heyde (1980), it suffices to verify the following conditions. (a) max1≤i≤n |ξi | = oP (1). (b) P E max1≤i≤n ξi2 P = o(1). 2 + o (1). (c) ni=1 ξi2 = n−1 ni=1 σu,i P Pn −1 2 (d) limδ→0 lim inf n→∞ P n i=1 σu,i > δ = 1. Notice that 2 −1 −2 2 2 E max ξi ≤ E n σ bε kV − Xb γ k∞ max ui 1{A} 1≤i≤n 1≤i≤n (i) 1 ≤ 2 E max u2i 1≤i≤n ρn log4 n 1 (ii) = 2 O(log n) = o(1), ρn log4 n
Zhu and Bradic/Non-sparse Hypothesis Testing
38
where (i) follows by the definition of A and (ii) follows by the uniformly bounded subexponential norm of u2i and Corollary 2.6 of Boucheron et al. (2013). This proves claim e claim (c) follows (b). Notice that claim (a) follows by claim (b). By the definitionPof D, n −1 2 by (42). Claim (d) follows by the assumption that P (c1 ≤ n i=1 σu,i ≤ c2 ) → 1 for some constants c1 , c2 > 0. The proof is complete. A.3. Proof of Corollary 4 Proof of Corollary 4. Similar to the comment on Corollary 2.1 of Van de Geer et al. (2014), the proof of Corollary 4 is exactly the same as in Theorem 2 once we notice the following. Take an arbitrary sequence (β0n , ξn ) ∈ R × Ξ(sθ ). Notice that the properties of the test statistic Tn (β0n ) depend on X, Z and V exclusively. Moreover observe that under H0 , V = Xγ + ε. Moreover, in the arguments in the proof of Lemmas 2-7 and Theorem 2, we use bounds that hold in finite samples with universal constants as well as law of large numbers and martingale central limit theorem for triangular arrays. Hence, Lemmas 2-7 and the arguments in the proof of Theorem 2 still go through when ξn is a sequence. A.4. Proof of Theorems 5 and 6 Proof of Theorem 5. It suffices to notice that Theorem 5 is a special case of Theorem 7 in which the decomposition now reads γ ∗ = π ∗ + µ∗ with π ∗ = γ ∗ and µ∗ = 0. Hence, (π ∗ + µ∗ )> ΣX µ∗ + E(ε21 ) = E(ε21 ) ≥ τ1 . In other words, in the notation of Theorem 7, τ3 = τ1 . The other assumptions of Theorem 7 are trivially satisfied. The result then follows by Theorem 7. Proof of Theorem 6. It suffices to notice that Theorem 6 is a special case of Theorem 7 in which the decomposition now reads γ ∗ = π ∗ + µ∗ with π ∗ = 0 and µ∗ = γ ∗ . Hence, (π ∗ +µ∗ )> ΣX µ∗ +E(ε21 ) = (µ∗ )> ΣX µ∗ +E(ε21 ) ≥ E(ε21 ) ≥ τ1 . In other words, in the notation of Theorem 7, τ3 = τ1 . The other assumptions of Theorem 7 are trivially satisfied. The result then follows by Theorem 7. A.5. Poof of Theorem 7 The following result is due to Borovkov (2000) and pointed out by Merlev`ede et al. (2011). Lemma 8. Suppose that W1 , ..., Wn are i.i.d random variables such that EW1 = 0, EW12 ≥ C1 and P (|W1 | > t) ≤ C2 exp(−C3 tα ) for any t > 0, where C1 , C2 , C3 > 0 and
Zhu and Bradic/Non-sparse Hypothesis Testing
39
α ∈ (0, 1) are constants. Then for any z > 0, ! n −1 X P n Wi > z ≤ n exp (−M1 (nz)α ) + exp −M2 (nz)2 , i=1
where M1 , M2 > 0 are constants depending only on C1 , C2 , C3 and α.
Proof. Notice that |W1 |α has bounded sub-exponential norm. Hence, E exp(D1 |W1 |α ) ≤ D2 for some constants D1 , D2 > 0. Hence, by Equation (1.4) of Merlev`ede et al. (2011), we have that for any z > 0, ! n X P Wi > nz ≤ n exp (−D3 (nz)α ) + exp −D4 (nz)2 . i=1
The proof is complete.
Lemma 9. Suppose that {(wi,1 , . . . , wi,p )}ni=1 is an i.i.d sequence of p-dimensional ran√ dom vectors with mj = E(w1,j ) and log p = o( n). Suppose that min1≤j≤p V ar(wi,j ) ≥ 2K1 and for all 1 ≤ j ≤ p, the sub-exponential norm of w1,j is upper bounded by −1 −1 √ K2 , where √ K1 , K2 > 0 are constants. If max1≤j≤p |mj | ≤ DΦ(1 − p n )/ n with D ≤ 0.04 K 1 , then ! n −1 X P max n wi,j ≥ η0 A → 0, 1≤j≤p i=1 P 2 . where A2 = max1≤j≤p n−1 ni=1 wi,j Proof. Define t = Φ(1 − p−1 n−1 ) and ξi,j = wi,j − mj as well as the following events Pn i=1 ξi,j ≤t M1 = max qP n 1≤j≤p 2 i=1 ξi,j ( ) ( ) n n X \ X −1 2 −1 2 M2 = min n ξi,j ≥ K1 min n wi,j ≥ K1 1≤j≤p
1≤j≤p
i=1 i=1 ) n X p M3 = max n−1 ξi,j ≤ K3 n−1 log p 1≤j≤p i=1 v v u u n n X X u u 2 − 1.05ttn−1 2 ≤0 , M4 = max Dt + ttn−1 ξi,j wi,j 1≤j≤p
(
i=1
i=1
where K3 > 0 is a constant to be chosen. The proof proceeds in two steps. We first prove that M1 , M2 , M3 and M4 occur with probability approaching one and then show the desired result. Step 1: show that M1 , M2 , M3 and M4 occur with probability approaching one.
Zhu and Bradic/Non-sparse Hypothesis Testing
40
2 is bounded away from zero and E|ξ |2+δ is bounded Fix any δ ≥ 2. Notice that Eξi,j i,j away from infinity. As in the proof of Lemma 2 (Step 1), Theorem 7.4 of Pe˜ na et al. (2008) implies that there exist constants A, C1 > 0 such that for any 1 ≤ j ≤ p, " Pn 2+δ # ξ | | 1 + t i,j i=1 ≥ t ≤ 2Φ(−t) 1 + A . P qP n 2 nδ/(2+δ) C1 ξ i=1 i,j
By the union bound, we have that " Pn 2+δ # | ξ | 1 + t (i) i,j i=1 P (M1 ) = P max qP = o(1), ≥ t ≤ 2pΦ(−t) 1 + A δ/(2+δ) n 1≤j≤p 2 n C1 ξ i=1 i,j
where (i) follows by the fact that Φ(−t) = 1−Φ(t) = p−1 n−1 and t = Φ−1 (1−p−1 n−1 ) ≤ p 2 log(pn) (due to Lemma 1). By the bounded sub-exponential norm of ξi,j , there exist 2 > z) ≤ constants C2 , C3 > 0 such that for any 1 ≤ j ≤ p and any z > 0, P (ξ1,j √ C1 exp(−C2 n). Hence, ! ! n n X X −1 2 −1 2 2 2 P n ξi,j ≤ K1 = P n (ξi,j − Eξi,j ) ≤ K1 − Eξi,j i=1
i=1
≤P
n
−1
n X
2 (ξi,j
i=1
−
2 Eξi,j )
≤ −K1
!
p ≤ n exp −C3 nK1 + exp −C4 (nK1 )2
(i)
for some constants C3 , C4 > 0, where (i) follows by Lemma 8. By the union bound and √ log p = o( n), we have that P
min n−1
1≤j≤p
n X i=1
2 ξi,j ≤ K1
!
≤
p X j=1
P
n−1
n X i=1
≤ pn exp −C3
2 ξi,j ≤ K1
!
p nK1 + p exp −C4 (nK1 )2 = o(1).
Similarly, the above display holds if ξi,j is replaced by wi,j . Hence, P (M2 ) → 1. Moreover, notice that ! p n X X p ξi,j ≥ K3 n−1 log p ≤ P max n−1 P 1≤j≤p i=1
! n X p −1 ξi,j ≥ K3 n−1 log p n j=1 i=1 √ 2 (i) K3 log p K3 n log p ≤ 2p exp −c min , K2 K22
Zhu and Bradic/Non-sparse Hypothesis Testing
41
for some universal constant c > 0, where (i) follows by Proposition 5.16 of Vershynin √ (2010). Hence, for K3 = 2K2 / c, P (M3 ) → 1. Recall the elementary inequality that for any a, b ≥ 0, a + b + 1/2 ≥ thus √ √ |a − b| |a − b| √ ≥ . a − b = √ a + b + 1/2 a+ b
√
a+
√
b and (45)
Moreover, observe that tn−1
p
log p = Φ−1 (1 − n−1 p−1 )n−1
(ii) p p (i) log p = O n−1 (log p)(log(pn)) = o(1), (46)
√ where (i) follows by LemmaT1 andT(ii) holds by log p = o( n). Hence, on the event M1 M2 M3 , v v −1 Pn u u 2 2 n n n X X (i) u u i=1 (wi,j − ξi,j ) 2 2 tn−1 Pn wi,j ≥ tn−1 ξi,j − 2 2 −1 1/2 + n i=1 (wi,j + ξi,j ) i=1 i=1 v −1 Pn u 2 n n X u i=1 (2mj ξi,j + mj ) 2 − Pn = tn−1 ξi,j 2 2 −1 1/2 + n i=1 (2ξi,j + 2mj ξi,j + mj ) i=1 v Pn 2 u n mj + 2mj n−1 i=1 ξi,j X u 2 − = tn−1 ξi,j Pn Pn 2 2 + 2 n−1 −1 1/2 + m i=1 i=1 ξi,j ) i=1 ξi,j + 2mj (n j v p u n 2 + 2|m |K X (ii) u m n−1 log p j 3 j 2 − p ≥ tn−1 ξi,j 1/2 + m2j + 2K1 − 2 |mj | K3 n−1 log p i=1 v u √ n X u D2 t2 n−1 + 2Dtn−1 K3 log p 2 −1 t √ ξi,j − ≥ n 1/2 + 2K1 − 2Dtn−1 K3 log p i=1 v u √ n X (iii) u D2 t2 n−1 + 2Dtn−1 K3 log p 2 −1 t ≥ n ξi,j − , (47) 1/2 + K1 i=1
where (i) follows by the previous elementary inequality in (45), √ (ii) follows by the definitions of M2 and M3 , (iii) follows by the fact that 2Dtn−1 K3 log p < K1 for large n
Zhu and Bradic/Non-sparse Hypothesis Testing
42
T T (due to (46)). Now we have that on the event M1 M2 M3 , v v u u n n X X u u 2 − 1.05ttn−1 2 /t max Dt + ttn−1 ξi,j wi,j 1≤j≤p
i=1
i=1
v u √ n 2 2 −1 −1 X (i) u + 2Dtn K3 log p 2 + tD t n wi,j /t ≤ max Dt − 0.05ttn−1 1≤j≤p 1/2 + K1 i=1 √ (ii) p D2 t2 n−1 + 2Dtn−1 K3 log p ≤ D − 0.05 K1 + 1/2 + K1 p (iii) = D − 0.05 K1 + o(1).
where (i) follows by √ (47), (ii) follows by the definition of M2 and (iii) follows by (46). Since D ≤ 0.04 K 1 , the above display implies that v v u u n n X X u u 2 − 1.05ttn−1 2 ≤ 0 → 1. P (M4 ) = P max Dt + ttn−1 ξi,j wi,j 1≤j≤p
i=1
i=1
Step 2: prove the desired result. The desired result now follows by the following observation ! n X P n−1 wi,j ≥ η0 A i=1 v u n n √ X X u 2 ξi,j ≥ 1.1t max tn−1 wi,j =P max nmj + n−1/2 1≤j≤p 1≤j≤p i=1 i=1 v u n n X X u 2 ≤P Dt + max ξi,j ≥ 1.1t max tn−1 wi,j 1≤j≤p 1≤j≤p i=1 i=1 v u n n X X u 2 and M + P (Mc ) wi,j ≤P Dt + max ξi,j ≥ 1.1t max tn−1 1 1 1≤j≤p 1≤j≤p i=1 i=1 v v u u n n X X u u 2 ≥ 1.1t max tn−1 2 + P (Mc ) ≤P Dt + max ttn−1 ξi,j wi,j 1 1≤j≤p
i=1
1≤j≤p
v u n X u 2 ≥ 1.1t max ≤P Dt + max ttn−1 ξi,j
1≤j≤p
i=1
1≤j≤p
v u n X u 2 ≥ 1.1t max ≤P 1.05t max tn−1 wi,j
1≤j≤p
i=1
(i)
=P (Mc1 ) + P (Mc4 ) = o(1),
1≤j≤p
i=1
v u n X u 2 and M + P (Mc ) + P (Mc ) tn−1 wi,j 4 1 4 i=1
v u n X u 2 + P (Mc ) + P (Mc ) tn−1 wi,j 1 4 i=1
Zhu and Bradic/Non-sparse Hypothesis Testing
43
where (i) holds by Step 1. Lemma 10. Let the conditions in the statment of Theorem 7 hold. Suppose that H1,h ∗ −1/2 hu + ε . Then, in (15) holds. Define γh,n = π ∗ + n−1/2 hθ ∗ and ε(h),i = x> i i i µ +n with probability tending to one, γh,n lies in the feasible set of the optimization problem Pn 2 −1 2 2 (5) for a = ah,γ , where ah,γ = n max1≤j≤p i=1 xi,j ε(h),i . P Moreover, max1≤j≤p |n−1 ni=1 xi,j ε(h),i | = OP (n−1/2 log(p ∨ n)).
Proof of Lemma 10. Let V = Y − Zβ0 and ε(h) = (ε(h),1 , . . . , ε(h),n )> . In the rest of the proof, we denote by vi the i-th entry of V. Notice that under H1,h in (15), ∗ −1/2 ∗ > ∗ vi = yi − zi β0 = n−1/2 hzi + x> h(x> i γ + εi = n i θ + ui ) + xi γ + εi
= x> i γ h,n + ε(h),i .
Then for the first result, it suffices to verify the following claims: (a) P kn−1 X> ε(h) k∞ ≤ η0 a∗,γ → 1. (b) P kε(h) k∞ ≤ kVk2 / log2 n → 1. (c) P n−1 V> ε(h) ≥ n−1 kVk22 ρn → 1.
We proceed in four steps corresponding to above three claims as well as the second result. Step 1: Establish claim (a). We apply Lemma 9. Let wi,j = xi,j ε(h),i and notice that ∗ −1/2 V ar(w1,j ) = V ar[x1,j (x> hu1 )] 1 µ + ε1 + n (i)
∗ = n−1 h2 Eu21 + V ar[x1,j (x> 1 µ + ε1 )] ≥ 2K1
with K1 = τ2 /2,
where (i) holds by the assumption of Theorem 7. Then 1 p −1 √ τ2 n log p 25 2 (ii) 1 p −1 −1 √ ≤ τ2 n Φ (1 − p−1 ) 25 2 1 p −1 −1 < √ τ2 n Φ (1 − n−1 p−1 ) 25 2 p √ = 0.04 K1 Φ−1 (1 − n−1 p−1 )/ n, (i)
max |Ew1,j | = kΣX µ∗ k∞ ≤
1≤j≤p
where (i) holds by the assumption of Theorem 7 and (ii) follows by Lemma 1. Since both xi,j and ε(h),i have bounded sub-Gaussian norms, it follows that wi,j has a bounded subexponential norm. Therefore, claim (a) follows by Lemma 9. Step 2: Establish claim (b). By the law of large numbers n−1 kVk22 − Ev12 = n−1
n X i=1
(vi2 − Evi2 ) = oP (1).
(48)
Zhu and Bradic/Non-sparse Hypothesis Testing
44
∗ 2 −1 2 2 2 2 Notice that Ev12 = E(x> bounded away from 1 (γ h,n +µ )) + n h Eui +Eε1 ≥ Eε1 is√ zero. Hence, there exists a constant M > 0 such that P (kVk2 ≥ nM ) → 1.√On the other hand, ε(h),i has a bounded sub-Gaussian norm and thus kε(h) k∞ = OP ( log n). √ √ Since n/ log2 n log n, claim (b) follows.
Step 3: Establish claim (c). Notice that n−1 V> ε(h) = E[v1 ε(h),1 ] + n−1
n X (vi ε(h),i − E[vi ε(h),i ]) i=1
(i)
= E[v1 ε(h),1 ] + oP (1)
= (π ∗ + n−1/2 hθ ∗ + µ∗ )> ΣX µ∗ + E(ε21 ) + n−1 h2 E(u21 ) + oP (1) (ii)
= (π ∗ + µ∗ )> ΣX µ∗ + E(ε21 ) + oP (1),
where (i) holds by the law of large numbers and (ii) holds by the fact that kθ ∗ k2 , kµ∗ k2 , the eigenvalues of ΣX and E(u21 ) are bounded. By assumption, (π ∗ +µ∗ )> ΣX µ∗ +E(ε21 ) is bounded below by a positive constant. Hence, we have that P (n−1 V> ε(h) ≥ M 0 ) → 1 for some constant M 0 > 0. By (48) and the boundedness of E(v12 ), we have that n−1 kVk22 ρn = OP (ρn ) = oP (1). Hence, claim (c) follows. We have showed the claims (a)-(c), completing the proof for the first result. p Step 4: Establish the second result. By claim (a) p above and η0 ≤ 1.1 2n−1 log(pn) (due to Lemma 1), it suffices to show that a∗,γ = OP ( log(pn)), which is obtained by the following observation a2∗,γ = max n−1 1≤j≤p
n X i=1
x2i,j ε2(h),i ≤ kXk2∞ n−1
n X i=1
(i)
ε2(h),i = kXk2∞ (Eε2(h),1 + oP (1)) (ii)
= OP (log(pn))OP (1),
where (i) follows by the law of large numbers and (ii) follows by the sub-Gaussian properties of entries in X and the union bound. We have proved the second result. Lemma 11. Let the conditions in the statment of Theorem 7 hold. Suppose that H1,h in (15) holds. Then kb γ − γ h,n k1 = OP n−1/2 (kπ ∗ k0 + kθ ∗ k0 ) log(p ∨ n) , where γh,n = π ∗ + n−1/2 hθ ∗ . ∗ −1/2 hu + ε and a2 Proof of Lemma 11. Define V = Y − Zβ0 , ε(h),i = x> i i i µ +n h,γ =
Zhu and Bradic/Non-sparse Hypothesis Testing
45
P n−1 max1≤j≤p ni=1 x2i,j ε2(h),i as well as the following events A1 = γ h,n lies in the feasible set of the optimization problem (5) for a = ah,γ . ) ( kn−1/2 Xvk2 >K A2 = min min kvJ k2 A⊂{1,··· ,p}, |A|≤s kvAc k1 ≤kvA k1 A3 = {ah,γ ∈ Sγ }
A4 = {b aγ ≤ 3ah,γ } , where K > 0 is the constant defined in Lemma 5. By Lemmas 10 and 5, \ P A1 A2 → 1.
√ Notice that kγ h,n k0 ≤ kπ ∗ k0 + kθ ∗ k0 = o( n/ log p). The rest of the proof proceeds with essentially the same arguments as in the proof of Lemma 7: show P (A3 ) → 1, P (A4 ) → 1 e ∗,θ ), θ(b e aθ ), sθ ) is reand then the desired result. The only difference is that (Z, θ ∗ , θ(a ∗ ∗ e aγ ), kπ k0 +kθ k0 ). We omit the details to avoid repetition. e (ah,γ ), θ(b placed by (V, γ h,n , γ
Proof of Theorem 7. Define γh,n = π ∗ +n−1/2 hθ ∗ and ε(h) = (ε(h),1 , . . . , ε(h),n )> with ∗ ε(h),i = εi + n−1/2 hui + x> i µ . Recall that V = Y − Zβ0 . Notice that under H1,h in (15), V = n−1/2 hZ + Xγ ∗ + ε = Xγ h,n + ε(h) . 2 b2 and D 2 = E[(x> µ∗ + ε )2 u2 ], where ε b 2 = n−1 Pn u b and bi = vi − x> Let D i i i i i γ i=1 bi ε > b u bi = zi − xi θ. Notice that
Tn (β0 ) =
b n−1/2 (V − Xb γ )> (Z − Xθ) . b D
(49)
The rest of proof proceeds in two steps which establish the behavior of the numerator and the denominator, respectively. b Step 1: Establish that n−1/2 (V − Xb γ )> (Z − Xθ)/D →d N (hκ, 1). Observe that b n−1/2 (V − Xb γ )> (Z − Xθ)
−1/2 > b + n−1/2 (γ − γ b . (50) b )> X> (Z − Xθ) = n−1/2 ε> ε(h) X(θ ∗ − θ) h,n (h) u + n {z } | {z } | {z } | A1
A2
A3
Notice that
b ∞ b )k1 kn−1 X> (Z − Xθ)k |A3 | ≤ kn1/2 (γ h,n − γ (i) b ∞ = OP n−1/2 (kπ ∗ k0 + kθ ∗ k0 ) log(p ∨ n) kn−1 X> (Z − Xθ)k (ii) ≤ OP n−1/2 (kπ ∗ k0 + kθ ∗ k0 ) log(p ∨ n) OP (log(p ∨ n)) = oP (1),
(51)
Zhu and Bradic/Non-sparse Hypothesis Testing
46
where (i) follows by Lemma 11 and (ii) follows by Lemma 7. For A2 , we observe that b 1 |A2 | ≤ kn−1/2 X> ε(h) k∞ k(θ ∗ − θ)k (i)
b 1 = OP (log(p ∨ n))k(θ ∗ − θ)k
(ii)
= OP (log(p ∨ n))OP (n−1/2 sθ log(p ∨ n)) = oP (1),
(52)
where (i) follows by the second result in Lemma 10 and (ii) follows by Lemma 7. For A1 , we notice that A1 = n
−1/2
n X
ε(h),i ui = hn
i=1
−1
n X
u2i
+n
−1/2
i=1
(i)
=
hE(u21 )
+ oP (1) + n
n X
∗ ui (x> i µ + εi )
i=1
−1/2
n X
∗ ui (x> i µ + εi ),
(53)
i=1
where (i) follows by the law of large numbers. Now we combine (51), (52) and (53) with (50) to obtain P ∗ b hE(u21 ) + oP (1) + n−1/2 ni=1 ui (x> n−1/2 (V − Xb γ )> (Z − Xθ) i µ + εi ) = D D P ∗ n−1/2 ni=1 ui (x> (i) i µ + εi ) + oP (1) = hκ + D (ii)
→d N (hκ, 1),
(54)
where (i) follows by the fact that D is bounded away from zero and the assumption that E(u21 )/D → κ and (ii) follows by the central limit theorem. b Step 2: Establish that D/D = 1 + oP (1). ∗ > b Since u bi − ui = xi (θ − θ), we have that ∗ b max |b u2i − u2i | = max |b ui + ui | · |x> i (θ − θ )| 1≤i≤n ∗ > b ∗ b ≤ max |2ui | + |x> i (θ − θ )| |xi (θ − θ )|
1≤i≤n
1≤i≤n
b − θ ∗ k1 )kXk∞ kθ b − θ ∗ k1 ≤ (2kuk∞ + kXk∞ kθ ≤ OP n−1 s2θ log3 (p ∨ n) ,
(i)
(55)
where (i) follows by Lemma 7 and the√sub-Gaussian property ofpui and xi (which, by the union bound, implies kuk∞ = OP ( log n) and kXk∞ = OP ( log(pn))). Similarly, using Lemma 11, we can obtain max |b ε2(h),i − ε2(h),i | = OP n−1 (kπ ∗ k0 + kµ∗ k0 )2 log3 (p ∨ n) . (56) 1≤i≤n
Zhu and Bradic/Non-sparse Hypothesis Testing
47
Now we observe n X −1 b2i − ε2(h),i u2i ) (b ε2(h),i u n i=1 n n X −1 X 2 2 −1 2 2 2 2 ≤ max u bi n (b ui − ui ) (b ε(h),i − ε(h),i ) + max ε(h),i n 1≤i≤n 1≤i≤n i=1 i=1 2 max u bi − u2i ≤ max u b2i max εb2(h),i − ε2(h),i + max ε2(h),i 1≤i≤n 1≤i≤n 1≤i≤n 1≤i≤n 2 2 2 2 2 2 2 2 max u ≤ max ui + max u bi − ui bi − ui max εb(h),i − ε(h),i + max ε(h),i 1≤i≤n
1≤i≤n
1≤i≤n
1≤i≤n
1≤i≤n
(i)
= OP (log n) + OP n−1 s2θ log3 (p ∨ n) OP n−1 (kπ ∗ k0 + kµ∗ k0 )2 log3 (p ∨ n) + OP (log n)OP n−1 s2θ log3 (p ∨ n) =oP (1),
(57)
where (i) follows by (55) and (56) together with maxi u2i = OP (log n) and maxi ε2(h),i = OP (log n) (due to the sub-Gaussian norms of ui and ε(h),i and the union bound). The law of large numbers implies that n
−1
n X
∗ 2 −1 2 2 ε2(h),i u2i = oP (1)+Eε2(h),1 u21 = oP (1)+Eu21 (x> i µ +εi ) +n h Eu1 = oP (1)+D.
i=1
Therefore, (57) and the above display imply that b 2 − D2 = n−1 D
n X i=1
(b ε2(h),i u b2i − ε2(h),i u2i ) + n−1
n X i=1
ε2(h),i u2i − D = oP (1).
Since D is bounded away from zero, we have p b D2 + oP (1) D = = 1 + oP (1). D D
(58)
Now we combine (54) and (58) with (49) to obtain that Tn (β0 ) →d N (hκ, 1). Therefore,
P |Tn (β0 )| > Φ−1 (1 − α/2) → P |hκ + ζ| > Φ−1 (1 − α/2) ,
where ζ ∼ N (0, 1). By elementary computations, we have P |hκ + ζ| > Φ−1 (1 − α/2) = Ψ(h, κ, α). The proof is complete.
Zhu and Bradic/Non-sparse Hypothesis Testing
48
References Arriaga, Juan Mart´ın, Alicia In´es Bravo, Jos´e Mordoh and Michele Bianchini (2017), ‘Metallothionein 1g promotes the differentiation of ht-29 human colorectal cancer cells’, Oncology Reports 37(5), 2633–2651. Belloni, Alexandre, Victor Chernozhukov and Christian Hansen (2014), ‘Inference on treatment effects after selection among high-dimensional controls’, The Review of Economic Studies 81(2), 608–650. Belloni, Alexandre, Victor Chernozhukov and Lie Wang (2011), ‘Square-root lasso: pivotal recovery of sparse signals via conic programming’, Biometrika 98(4), 791–806. Bickel, Peter J, Ya´acov Ritov and Alexandre B Tsybakov (2009), ‘Simultaneous analysis of lasso and dantzig selector’, The Annals of Statistics 37(4), 1705–1732. Borovkov, Aleksandr Alekseeviˇc (2000), ‘Estimates for the distribution of sums and maxima of sums of random variables without the cramer condition’, Siberian Mathematical Journal 41(5), 811–848. Bosse, Konstanze, Silvia Haneder, Christian Arlt, Christian H Ihling, Thomas Seufferlein and Andrea Sinz (2016), ‘Mass spectrometry-based secretome analysis of non-small cell lung cancer cell lines’, Proteomics 16(21), 2801–2814. Boucheron, St´ephane, G´abor Lugosi and Pascal Massart (2013), Concentration inequalities: A nonasymptotic theory of independence, Oxford university press. B¨ uhlmann, P. (2013), ‘Statistical significance in high-dimensional linear models’, Bernoulli 19(4), 1212–1242. B¨ uhlmann, Peter and Sara Van de Geer (2011), Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media. Cai, T Tony and Zijian Guo (2015), ‘Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity’, arXiv preprint arXiv:1506.05539 . Chatterjee, A, SN Lahiri et al. (2013), ‘Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap’, The Annals of Statistics 41(3), 1232–1259. Chernozhukov, Victor, Christian Hansen and Martin Spindler (2015), ‘Valid postselection and post-regularization inference: An elementary, general approach’, Annual Review of Economics . Chernozhukov, Victor, Christian Hansen and Yuan Liao (2015), ‘A lava attack on the recovery of sums of dense and sparse signals’. Dicker, Lee H. (2014), ‘Variance estimation in high-dimensional linear models’, Biometrika 101(2), 269. Dicker, Lee H. (2016), ‘Ridge regression and asymptotic minimax estimation over spheres of growing dimension’, Bernoulli 22(1), 1–37. Donoho, David and Jiashun Jin (2015), ‘Higher criticism for large-scale inference, especially for rare and weak effects’, Statist. Sci. 30(1), 1–25. Donoho, David L. and Iain M. Johnstone (1994), ‘Minimax risk over lp -balls for lq -error’, Probability Theory and Related Fields 99(2), 277–303. Dudoit, Sandrine and Mark J. van der Laan (2007), Multiple Testing Procedures and Applications to Genomics, Springer.
Zhu and Bradic/Non-sparse Hypothesis Testing
49
Efron, Bradley (2010), Large-scale inference : empirical Bayes methods for estimation, testing, and prediction, Institute of mathematical statistics monographs, Cambridge University Press, Cambridge. Ellis, Matthew J., Sara Jenkins, John Hanfelt, Maura E. Redington, Marian Taylor, Russel Leek, Ken Siddle and Adrian Harris (1998), ‘Insulin-like growth factors in human breast cancer’, Breast Cancer Research and Treatment 52(1), 175–184. Fan, Jianqing and Jinchi Lv (2010), ‘A selective overview of variable selection in high dimensional feature space’, Statistica Sinica 20(1), 101?148. Fan, Jianqing and Runze Li (2001), ‘Variable selection via nonconcave penalized likelihood and its oracle properties’, Journal of the American statistical Association 96(456), 1348–1360. Feller, William (1968), An introduction to probability theory and its applications: volume I, Vol. 3, John Wiley & Sons London-New York-Sydney-Toronto. Gautier, Eric and Alexandre B Tsybakov (2013), Pivotal estimation in high-dimensional regression via linear programming, in ‘Empirical Inference’, Springer, pp. 195–204. Hall, Peter and Christopher C Heyde (1980), Martingale limit theory and its application, Academic press New York. Hall, Peter, Jiashun Jin and Hugh Miller (2014), ‘Feature selection when there are many influential features’, Bernoulli 20(3), 1647–1671. Hastie, Trevor, Robert Tibshirani and Jerome Friedman (2009), The elements of statistical learning: data mining, inference and prediction, 2 edn, Springer. Holm, Karolina, Johan Staaf, G¨oran J¨onsson, Johan Vallon-Christersson, Haukur Gunnarsson, Adalgeir Arason, Linda Magnusson, Rosa B. Barkardottir, Cecilia Hegardt, Markus Ringn´er and ˚ Ake Borg (2012), ‘Characterisation of amplification patterns and target genes at chromosome 11q13 in ccnd1-amplified sporadic and familial breast tumours’, Breast Cancer Research and Treatment 133(2), 583–594. Ingster, Yuri I., Alexandre B. Tsybakov and Nicolas Verzelen (2010), ‘Detection boundary in sparse regression’, Electron. J. Statist. 4, 1476–1526. Janson, Lucas, Rina Foygel Barber and Emmanuel Cand`es (2015), ‘Eigenprism: Inference for high-dimensional signal-to-noise ratios’, arXiv preprint arXiv:1505.02097 . Javanmard, Adel and Andrea Montanari (2014a), ‘Confidence intervals and hypothesis testing for high-dimensional regression’, The Journal of Machine Learning Research 15(1), 2869–2909. Javanmard, Adel and Andrea Montanari (2014b), ‘Hypothesis testing in highdimensional regression under the gaussian random design model: Asymptotic theory’, Information Theory, IEEE Transactions on 60(10), 6522–6554. Javanmard, Adel and Andrea Montanari (2015), ‘De-biasing the lasso: Optimal sample size for gaussian designs’, arXiv preprint arXiv:1508.02757 . Jin, Jiashun and Zheng T.. Ke (2014), ‘Rare and Weak effects in Large-Scale Inference: methods and phase diagrams’, ArXiv e-prints . Kitange, GasparJ., AnnC. Mladek, MarkA. Schroeder, JennyC. Pokorny, BrettL. Carlson, Yuji Zhang, AshaA. Nair, Jeong-Heon Lee, Huihuang Yan, PaulA. Decker, Zhiguo Zhang and JannN. Sarkaria (2016), ‘Retinoblastoma binding protein 4 modulates
Zhu and Bradic/Non-sparse Hypothesis Testing
50
temozolomide sensitivity in glioblastoma by regulating {DNA} repair proteins’, Cell Reports 14(11), 2587 – 2598. Kraft, Peter and David J. Hunter (2009), ‘Genetic risk prediction ? are we there yet?’, New England Journal of Medicine 360(17), 1701–1703. Lehmann, Erich L and Joseph P Romano (2006), Testing statistical hypotheses, Springer Science & Business Media. Li, Huchun, Tae-Hee Lee and Hava Avraham (2002), ‘A novel tricomplex of brca1, nmi, and c-myc inhibits c-myc-induced human telomerase reverse transcriptase gene (htert) promoter activity in breast cancer’, Journal of Biological Chemistry 277(23), 20965– 20973. Magnanti, T. L. and J. B. Orlin (1988), ‘Parametric linear programming and anti-cycling pivoting rules’, Mathematical Programming 41(1), 317–325. Merlev`ede, Florence, Magda Peligrad and Emmanuel Rio (2011), ‘A bernstein type inequality and moderate deviations for weakly dependent sequences’, Probability Theory and Related Fields 151(3-4), 435–474. Ning, Yang and Han Liu (2014), ‘A general theory of hypothesis tests and confidence regions for sparse high dimensional models’, arXiv preprint arXiv:1412.8765 . Oates, Adam J, Lisa M Schumaker, Sara B Jenkins, Amelia A Pearce, Stacey A DaCosta, Banu Arun and Matthew JC Ellis (1998), ‘The mannose 6-phosphate/insulinlike growth factor 2 receptor (m6p/igf2r), a putative breast tumor suppressor gene’, Breast cancer research and treatment 47(3), 269–281. Pang, Haotian, Han Liu and Robert J Vanderbei (2014), ‘The fastclime package for linear programming and large-scale precision matrix estimation in r.’, Journal of Machine Learning Research 15(1), 489–493. Pe˜ na, Victor H, Tze Leung Lai and Qi-Man Shao (2008), Self-normalized processes: Limit theory and Statistical Applications, Springer Science & Business Media. Poczobutt, Joanna M, Teresa T Nguyen, Dwight Hanson, Howard Li, Trisha R Sippel, Mary CM Weiser-Evans, Miguel Gijon, Robert C Murphy and Raphael A Nemenoff (2016), ‘Deletion of 5-lipoxygenase in the tumor microenvironment promotes lung cancer progression and metastasis through regulating t cell recruitment’, The Journal of Immunology 196(2), 891–901. Pritchard, Jonathan K (2001), ‘Are rare variants responsible for susceptibility to complex diseases?’, The American Journal of Human Genetics 69(1), 124–137. Raskutti, Garvesh, Martin J Wainwright and Bin Yu (2011), ‘Minimax rates of estimation for high-dimensional linear regression over-balls’, Information Theory, IEEE Transactions on 57(10), 6976–6994. Rudelson, Mark and Shuheng Zhou (2013), ‘Reconstruction from anisotropic random measurements’, Information Theory, IEEE Transactions on 59(6), 3434–3447. Shao, Jun (2003), Mathematical Statistics, 2nd edn, Springer-Verlag New York Inc. Sun, Tingni and Cun-Hui Zhang (2012), ‘Scaled sparse linear regression’, Biometrika 99(4), 879–898. Tang, Nou-Ying, Fu-Shin Chueh, Chien-Chih Yu, Ching-Lung Liao, Jen-Jyh Lin, TeChun Hsia, King-Chuen Wu, Hsin-Chung Liu, Kung-Wen Lu and Jing-Gung Chung
Zhu and Bradic/Non-sparse Hypothesis Testing
51
(2016), ‘Benzyl isothiocyanate alters the gene expression with cell cycle regulation and cell death in human brain glioblastoma gbm 8401 cells’, Oncology reports 35(4), 2089– 2096. Terracciano, Daniela, Matteo Ferro, Sara Terreri, Giuseppe Lucarelli, Carolina DElia, Gennaro Musi, Ottavio de Cobelli, Vincenzo Mirone and Amelia Cimmino (2017), ‘Urinary long non-coding rnas in non-muscle invasive bladder cancer: new architects in cancer prognostic biomarkers’, Translational Research . Tibshirani, Robert (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society (Series B) 58, 267–288. Van de Geer, Sara, Peter B¨ uhlmann, Yaacov Ritov and Ruben Dezeure (2014), ‘On asymptotically optimal confidence regions and tests for high-dimensional models’, The Annals of Statistics 42(3), 1166–1202. Van der Vaart, Aad W (2000), Asymptotic statistics, Vol. 3, Cambridge university press. Vanderbei, Robert J (2014), Linear Programming: Foundations and Extensions, Springer. Vershynin, Roman (2010), ‘Introduction to the non-asymptotic analysis of random matrices’, arXiv preprint arXiv:1011.3027 . Wang, Jingshu, Qingyuan Zhao, Trevor Hastie and Art B Owen (2015), ‘Confounder adjustment in multiple hypotheses testing’, arXiv preprint arXiv:1508.04178 . Wang, Yubo, Rui Han, Zhaojun Chen, Ming Fu, Jun Kang, Kunlin Li, Li Li, Hengyi Chen and Yong He (2016), ‘A transcriptional mirna-gene network associated with lung adenocarcinoma metastasis based on the tcga database’, Oncology reports 35(4), 2257– 2269. Ward, Rachel A. (2009), ‘Compressed sensing with cross validation’, Information Theory, IEEE Transactions on 55(12), 5773–5782. Wasserman, Larry and Kathryn Roeder (2009), ‘High-dimensional variable selection’, The Annals of Statistics 37(5A), 2178–2201. White, Halbert (1980), ‘A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity’, Econometrica: Journal of the Econometric Society pp. 817–838. Zhang, Cun-Hui and Stephanie S Zhang (2014), ‘Confidence intervals for low dimensional parameters in high dimensional linear models’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 217–242. Zhang, Ming, Change Gao, Yi Yang, Gaofeng Li, Jian Dong, Yiqin Ai, Qianli Ma and Wenhui Li (2017), ‘Mir-424 promotes non-small cell lung cancer progression and metastasis through regulating the tumor suppressor gene tnfaip1’, Cellular Physiology and Biochemistry 42(1), 211–221. Zhu, Yinchu and Jelena Bradic (2016a), ‘Linear hypothesis testing in dense highdimensional linear models’, arXiv preprint arXiv:1610.02987 . Zhu, Yinchu and Jelena Bradic (2016b), ‘Two-sample testing in non-sparse highdimensional linear models’, arXiv preprint arXiv:1610.04580 .