Sparse Minimum Discrepancy Approach to Sufficient Dimension Reduction with Simultaneous Variable Selection in Ultrahigh Dimension Wei Qian, Shanshan Ding and R. Dennis Cook
Abstract

Sufficient dimension reduction (SDR) is known to be a powerful tool for achieving data reduction and data visualization in regression and classification problems. In this work, we study ultrahigh-dimensional SDR problems and propose solutions under a unified minimum discrepancy approach with regularization. When p grows exponentially with n, consistency results in both central subspace estimation and variable selection are established simultaneously for important SDR methods, including sliced inverse regression (SIR), principal fitted component (PFC) and sliced average variance estimation (SAVE). Special sparse structures of large predictor or error covariance are also considered for potentially better performance. In addition, the proposed approach is equipped with a new algorithm to efficiently solve the regularized objective functions and a new data-driven procedure to determine structural dimension and tuning parameters, without the need to invert a large covariance matrix. Simulations and a real data analysis are offered to demonstrate the promise of our proposal in ultrahigh-dimensional settings.
Key Words: Inverse regression, central subspace, sparsity, sliced inverse regression, principal fitted component, sliced average variance estimation

1. Introduction
Consider a regression setting where Y ∈ R1 is a univariate response and X ∈ Rp is a p-dimensional predictor vector. SDR reduces the dimensionality of X by finding a p × d (d ≤ p) dimension reduction matrix Γ such that, given the linear combinations ΓT X, the response Y is independent of X. A dimension reduction matrix is not normally identifiable in SDR studies, but the column subspace of Γ, denoted span(Γ) and referred to as an SDR subspace, is identifiable. The intersection of all SDR subspaces is itself an SDR subspace under mild conditions and is then called the central subspace (Cook, 1994, 1998). Finding a good estimator of the central subspace is a common goal in SDR. Sliced inverse regression (SIR; Li, 1991) and sliced average variance estimation (SAVE; Cook and Weisberg, 1991) are two early techniques that can be used to estimate the central subspace
Wei Qian and Shanshan Ding are Assistant Professors, Department of Applied Economics and Statistics, University of Delaware. R. Dennis Cook is Professor, School of Statistics, University of Minnesota.
under certain conditions. Since then, numerous model-free SDR methods have been developed to improve estimation (e.g., Cook and Ni, 2005; Li and Wang, 2007; Li, 2007; Wang and Xia, 2008; Li et al., 2010; Yin and Li, 2011; Ma and Zhu, 2012; Ding and Cook, 2015a,b, among many others). In contrast, Cook (2007) presented a model-based SDR technique, called principal fitted component (PFC; see also Cook and Forzani, 2008; Ding and Cook, 2014), to obtain maximum likelihood estimation of the central subspace. Under the setting with a diverging number of predictors, Zhu et al. (2006) established estimation consistency for SIR when p = o(√n), and Wu and Li (2011) studied consistencies of SDR methods under p = o(n/log n) with variable selection.

In contemporary regression applications, we commonly encounter high-dimensional data with predictor dimension p much larger than sample size n, and many advances have been made over the past two decades on methods for the analysis of such data. The l1-regularized linear regression methods such as the lasso (Tibshirani, 1996) and adaptive lasso (Zou, 2006) have become increasingly popular tools with extensive theoretical and application studies. See, for example, Bickel et al. (2009); Zhang (2010); Bühlmann and Van De Geer (2011) and references therein for many important shrinkage methods and theoretical results. In addition, pioneered by Fan and Lv (2008), numerous variable screening procedures have been proposed to reduce predictor size and deliver consistent variable selection (e.g., Zhu et al., 2012; Li et al., 2012; Yu, Dong and Shao, 2016; among many others). Stepwise-type procedures are also available (e.g., Yu, Dong and Zhu, 2016; Qian et al., 2017, and many references therein).

SDR promises to be a powerful tool for data reduction and visualization in high-dimensional regression contexts, and several new methods have been proposed in recent years. These new methods can be arranged roughly into four categories. The first includes SDR methods for high-dimensional data without variable selection. For example, under a novel abundant high-dimensional setting that allows p > n, Cook et al. (2012) showed asymptotic results for a model-based SDR framework. Because of the target of these studies, this type of solution assumes that nearly all the predictors are relevant to the response. Thus, they do not achieve variable selection while doing dimension reduction. The recent work of Lin et al. (2015) provided a theoretical foundation for sparse SDR in which only a small subset of the predictors is relevant to the response. The second category employs different types of regularization to develop shrinkage-based SDR methods for simultaneous central subspace estimation and variable selection. In the large-p-small-n setting, Li and Yin (2008) utilized a least-squares formulation of SDR for the development of regularized estimation. This is promising, but the theoretical properties of this approach
are unknown. Yu et al. (2013) showed estimation consistency by applying the Dantzig selector to solve the generalized eigenvalue problem for SIR under sparse covariance assumptions. Most recently, Wang et al. (2016) re-cast SIR into a "pseudo" sparse reduced-rank regression problem and showed consistency in central subspace estimation. Both of these studies focused mainly on high-dimensional sparse SIR estimation. In addition, the consistency results in these papers were established mainly for central subspace estimation, not for variable selection. Tan et al. (2016) proposed an l0-constrained non-convex optimization method to target the leading eigenvector in central subspace estimation, which can be ideal if the structural dimension d = 1 but is not applicable for larger structural dimensions.

The third type of solution for high-dimensional SDR covers the new sequential SDR procedures proposed by Yin and Hilafu (2015) and Hilafu and Yin (2016). This framework provides estimators of the central subspace and variable selection via a sequential process that allows p > n. It incorporates well-established (non-sparse or sparse) SDR methods and has shown successful applications in high-dimensional data analysis, although its asymptotic properties have yet to be established. The fourth type of solution utilizes thresholding-type procedures. The thresholding technique has shown important applications for variable screening purposes in, e.g., Yu, Dong and Shao (2016). A promising diagonal thresholding screening SIR algorithm (Lin et al., 2015) was developed for sparse predictor covariance scenarios and its estimation consistency was established under ultrahigh dimension, although it does not provide a sparse central subspace for variable selection and the consistency results are not yet clear for non-sparse covariance.

In this article we propose a new and unified sparse minimum discrepancy approach to a family of SDR problems, including both model-free and model-based SDR contexts, in which p is allowed to grow at an exponential rate with n. Our proposal differs from past methods in at least three important aspects. First, we provide a general framework for high-dimensional SDR problems and establish consistency results in both central subspace estimation and variable selection for SIR, PFC and SAVE simultaneously. Under the sparse SDR framework with p ≫ n, to the best of our knowledge, most previous work has mainly focused on SIR, with consistency in central subspace estimation but not in variable selection (or vice versa). No study has provided simultaneous analysis for PFC and SAVE. Our proposal fills these gaps and provides a more general solution for SDR problems in ultrahigh dimension. Second, our approach allows many quantities, such as the structural dimension, the number of important predictors and the number of slices, to diverge with n. Unlike many high-dimensional SDR methods, our proposal does not necessarily require a sparsity
condition on the predictor covariance matrix (or its inverse), or the maximum eigenvalue of the predictor covariance matrix to be upper bounded. At the same time, with an "approximately sparse" predictor covariance matrix, it can incorporate large covariance matrix estimation to better exploit this special structure and achieve gains in both theory and empirical applications, without compromising computational efficiency. Third, we develop a new algorithm that can efficiently solve a general class of high-dimensional sparse minimum discrepancy problems. This algorithm embeds Stiefel manifold optimization into a parallelizable majorized block coordinate descent algorithm, without requiring the inverse of a large covariance matrix. We further equip our approach with a sequential procedure to provide data-driven determination of all important thresholding/tuning parameters and the SDR structural dimension.

The rest of the article is organized as follows. We review and reformulate the general minimum discrepancy method in Section 2. Our sparse SDR framework and variable selection procedures are also described in Section 2. Detailed formulations and consistency results for SIR, PFC and SAVE are given in Sections 3, 4 and 5. In Section 3.2 we review thresholding-based methods for large covariance matrix estimation and incorporate them into our estimation procedures. Our computational algorithm is presented in Section 6. Simulation studies and a real data analysis that illustrate the use of our general approach are presented in Sections 7 and 8. We conclude with a discussion in Section 9. All technical proofs are given in the Supplement.

2. SDR via the Minimum Discrepancy Approach
Let {(Xi, Yi)}, i = 1, . . . , n, be a random sample from some unknown joint distribution of (X, Y) with X ∈ Rp and Y ∈ R1. Let SY|X denote the central subspace, which is assumed to exist. Given any matrix A, define SA := span(A) to be the column span of A. Many SDR methods can be written as a minimization problem using an objective function of the form

L1n(Γ, V) = tr[(Υn − ΣnΓV)T Ωn (Υn − ΣnΓV)],    (1)
where Υn, Σn and Ωn are sample estimates of the population matrices M̃, W and W−1, respectively. Here, M̃ is a p × l kernel matrix associated with a particular SDR method, where d := dim(SM̃) ≤ l ≤ p, W is some p × p positive definite matrix, and Γ ∈ Rp×d and V ∈ Rd×l represent parameters to be estimated by minimization of L1n. We will focus on SIR, PFC, and SAVE in this article, and the specific forms of these matrices designed for high-dimensional SIR, PFC and SAVE will be described in Sections 3, 4 and 5.
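As a concrete illustration of the discrepancy in (1), the following Python/NumPy sketch evaluates L1n(Γ, V) for given sample matrices. It is only a toy example with placeholder inputs and our own function names; it is not taken from the paper's implementation.

```python
import numpy as np

def discrepancy(Upsilon_n, Sigma_n, Omega_n, Gamma, V):
    """L_{1n}(Gamma, V) = tr[(Upsilon_n - Sigma_n Gamma V)^T Omega_n (Upsilon_n - Sigma_n Gamma V)]."""
    R = Upsilon_n - Sigma_n @ Gamma @ V      # p x l residual matrix
    return np.trace(R.T @ Omega_n @ R)

# toy inputs: p predictors, l kernel columns, d candidate directions
rng = np.random.default_rng(0)
p, l, d = 10, 5, 2
Upsilon_n = rng.standard_normal((p, l))
Sigma_n = np.cov(rng.standard_normal((100, p)), rowvar=False)
Omega_n = np.eye(p)                          # placeholder weight; the population weight is W^{-1}
Gamma, V = rng.standard_normal((p, d)), rng.standard_normal((d, l))
print(discrepancy(Upsilon_n, Sigma_n, Omega_n, Gamma, V))
```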
Table 1: Formulations used for SIR, PFC, and SAVE.

| Method | M̃ | W | M | ∆ | Υn |
|--------|----|---|---|---|----|
| SIR  | M̃1 = U1Dg | Σ | Σ−1M̃1 | Ip ⊗ Σ | Û1Dĝ |
| PFC  | M̃2 = ΨB | Φ | Φ−1M̃2 | Ip ⊗ Φ | (XTF)(FTF)−1 |
| SAVE | M̃3 = U2Dg²U2T | Σ | Σ−1M̃3 | Ip ⊗ Σ | Û2Dĝ²Û2T |
The general form of (1) is an adaptation of the minimum discrepancy approach proposed by Cook and Ni (2005). Indeed, the population version of the objective function (1) can be written as

L1(Γ, V) = tr[(M̃ − WΓV)T W−1 (M̃ − WΓV)]    (2)
         = [vec(M) − vec(ΓV)]T ∆ [vec(M) − vec(ΓV)],    (3)

where M = W−1M̃, ∆ = Ip ⊗ W, and (3) shows the close connection between Cook and Ni (2005) and our proposal. SDR methods generally focus on matrices M (or M̃) that, under certain conditions, satisfy SM ⊆ SY|X. Unless stated otherwise, we adopt the commonly assumed condition that SM = SY|X, while leaving the implications of proper containment SM ⊂ SY|X to the discussion in Section 9. Since ∆ is positive definite and dim(SM) = d, any minimizer (Γ0, V0) of (3) satisfies M = Γ0V0, where Γ0 forms a basis matrix of SY|X and V0 is the coordinate matrix. After expansion of (2), minimization of the function does not actually depend on W−1; thus, the population optimization does not necessarily involve large covariance matrix inversion. This demonstrates that minimization of (1) produces the desired results in the population. Table 1 provides a quick reference for the population matrices we will use for SIR, PFC and SAVE in ultrahigh dimensions; the notations in the table will be defined in each method's own section. The proposed approach can potentially be adapted to other SDR methods such as directional regression (DR) and principal Hessian directions (pHd) by using different matrix choices.

For identifiability, nearly all SDR studies impose constraints on Γ. In this work, we leave Γ unconstrained and use the alternative constraint that VVT = Id, which is motivated by the following observation.

Proposition 1. Let (Γ̂01, V̂01) and (Γ̂11, V̂11) be the solutions of (Γ, V) in the minimization problem (1) under the constraints ΓTΓ = Id and VVT = Id, respectively. If V̂01 and Γ̂11 are of full rank, then span(Γ̂01) = span(Γ̂11).

In Proposition 1, two different constraints for (1) are considered, one on Γ and the other on V.
Minimizers of (1) under the two constraints are equivalent in the sense that they generate the same subspace. Following this alternative framework, we reformulate the discrepancy function and investigate SDR methods under ultrahigh-dimensional scenarios, with appropriate sparsity assumptions on SY|X to be introduced next.

To study high-dimensional problems, it is often assumed that only a subset of predictors is associated with the response. Specifically, given any index set A ⊆ {1, 2, · · · , p} with size u = |A| (u < p), the predictor X can be partitioned into two components XA ∈ Ru and XAc ∈ Rp−u. Then XAc is called redundant if the conditional independence condition

Y ⊥⊥ XAc | XA    (4)

is satisfied. Let HA ∈ Rp×u be the sub-matrix of Ip corresponding to the column indices in A. Then condition (4) is equivalent to the statement that the subspace span(HA) is an SDR subspace. Inspired by Cook (2004), Yin and Hilafu (2015) formally defined the central variable selection subspace (CVS) H0 to be the intersection of all the subspaces span(HA) that satisfy (4). Then there exists A0 ⊆ {1, 2, · · · , p} such that H0 = span(HA0) with q := dim(H0). Since SY|X ⊆ H0, XA0c is redundant and Γ0Tej = 0 for j ∈ A0c, where Γ0 is a basis matrix of SY|X and ej is the unit vector with the jth element being 1. Naturally, we call XA0 important variables since Γ0Tej ≠ 0 for j ∈ A0 (otherwise, XA0c∪{j} becomes redundant, which makes H0 contradict the definition of the CVS). Sparse SDR requires that q := |A0| < p. To achieve simultaneous variable selection, we note the equivalence that A0 ≡ {1 ≤ j ≤ p : ejTΓ0Γ0Tej > 0}.

Motivated by the sparse SDR approach in Li (2007) and the group-lasso method (Yuan and Lin, 2006), Chen et al. (2010) imposed a coordinate-independent penalty Pv(Γ) on objective functions of classical SDR methods, where Pv(Γ) = ∑_{j=1}^p vj‖γj‖2, γj = ΓTej, and v = (v1, · · · , vp) are the penalty weights. This penalty with equal weights also appears in Wang et al. (2016) to study a modified SIR problem in high dimension, but their motivations and problem settings are different from our proposal, and are not applicable to PFC or SAVE. By Proposition 1 and the arguments above, to identify the correct sparsity structure of SY|X under p ≫ n scenarios, we propose to adopt this coordinate-independent regularization approach and impose the penalty Pv(Γ) with tuning parameter λn on (1) under the alternative constraint VVT = Id, giving the objective function

L2n(Γ, V) = (1/2) tr[(Υn − ΣnΓV)T Ωn (Υn − ΣnΓV)] + λn Pv(Γ), subject to VVT = Id.    (5)

Given its minimizer (Γ̂, V̂) = argmin_(Γ,V) L2n(Γ, V), we simultaneously estimate SY|X by span(Γ̂)
and estimate A0 by Â0 = {1 ≤ j ≤ p : ejTΓ̂Γ̂Tej > 0}. The (unequal) penalty weights v = (v1, · · · , vp)T are inspired by the adaptive lasso (Zou, 2006), which has shown promising applications in high-dimensional linear regression problems with consistent variable selection (see, e.g., Huang et al., 2008; Zhou et al., 2009; Wei and Huang, 2010). Specifically, we set vj = (ejTΓ̃nΓ̃nTej)−ρ/2, where (Γ̃n, Ṽn) is the minimizer of (Γ, V) in (5) with a properly chosen tuning parameter λ̃n and equal penalty weights v = 1p, and ρ > 0 is a pre-specified positive constant. Sections 3–5 will establish theoretical results for SIR, PFC, and SAVE in ultrahigh dimension, while the computing algorithm for minimizing (5) is given in Section 6.
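To make the regularization concrete, here is a minimal Python/NumPy sketch of the coordinate-independent penalty Pv(Γ) and the adaptive weights vj = (ejTΓ̃nΓ̃nTej)−ρ/2. The small constant added to guard against exactly zero rows of the initial estimate is our own practical safeguard, not part of the paper's formulation.

```python
import numpy as np

def row_norms(Gamma):
    # gamma_j = Gamma^T e_j is the j-th row of Gamma; return its l2 norm for each j
    return np.linalg.norm(Gamma, axis=1)

def penalty(Gamma, v):
    # P_v(Gamma) = sum_j v_j * ||gamma_j||_2
    return np.sum(v * row_norms(Gamma))

def adaptive_weights(Gamma_init, rho=1.0, eps=1e-10):
    # v_j = (e_j^T Gamma_tilde Gamma_tilde^T e_j)^(-rho/2) = ||gamma_tilde_j||_2^(-rho)
    return (row_norms(Gamma_init) + eps) ** (-rho)

rng = np.random.default_rng(1)
Gamma_tilde = rng.standard_normal((10, 2))   # stand-in for an initial (equal-weight) estimate
v = adaptive_weights(Gamma_tilde, rho=1.0)
print(penalty(Gamma_tilde, v))
```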
3. SIR in Ultrahigh Dimension
We consider SIR in this section and also establish notation that will be used in subsequent sections. Denote the marginal mean and variance of X by µ and Σ. Recall that SY|X is the central subspace with d = dim(SY|X). If Y is continuous, it is necessary when using basic SDR methods such as SIR and SAVE to discretize the response into h (h ≥ d) non-overlapping intervals Iy, y ∈ {1, . . . , h}. These intervals are called slices, and the response value is re-assigned to be the index of its corresponding slice; the new response is then Ỹ = y if Y ∈ Iy. Under the coverage condition, SỸ|X = SY|X for some integer constant h. Define my = E(X|Ỹ = y) − µ for y = 1, · · · , h and U1 = (m1, m2, · · · , mh) ∈ Rp×h. The linearity condition, which requires that E(X|Γ0TX) be a linear function of Γ0TX for a basis matrix Γ0 of SY|X, is often assumed in SDR studies to ensure useful properties for U1. It is needed only at Γ0 and not at all Γ, and it was shown to hold asymptotically in high-dimensional settings by Hall and Li (1993). Under the linearity condition, my ∈ ΣSY|X and thus, under coverage, SU1 = ΣSY|X (Li, 1991; Cook, 1998). Also define py = P(Y ∈ Iy), gy = √py and g = (g1, · · · , gh)T. Given any vector w, let Dw = diag(w).

Specifically for SIR, we set the population-level matrix in (2) to be M̃ = M̃1 := U1Dg, W = Σ and M = Σ−1M̃1, as listed in Table 1. Under the settings discussed above and in Section 2, we have SM̃1 = ΣSY|X. The matrix Dg is used here to give a proper weighting during estimation. Thus, it follows that M̃1 = ΣΓ0V0, where Γ0 ∈ Rp×d is a basis of SY|X and V0 ∈ Rd×h contains coordinates. The sample version of objective function (1) is obtained by replacing the population quantities by their sample versions.

To construct the matrix sample estimate Υn in (1) for M̃1, let X̄ be the sample mean of the Xi's, Ny be the sample size of slice y (y = 1, . . . , h), X̄y = ∑_{Yi∈Iy} Xi/Ny be the sample mean of slice y, p̂y = Ny/n, ĝy = √p̂y and ĝ = (ĝ1, · · · , ĝh)T. Then we set Υn = Û1Dĝ, where Û1 = (Z̄1, Z̄2, · · · , Z̄h) ∈ Rp×h and Z̄y = X̄y − X̄. Recall that Σn and Ωn denote the sample counterparts of W and W−1, respectively. As with the population function (2), although (1) shows Ωn in its formulation, the minimization of (1) does not need to involve the estimation of Ωn (a large matrix inversion), by treating ΣnΩn = Ip. For Σn, we will consider two possible estimation methods, sample covariance estimation and thresholded covariance estimation, separately in Sections 3.1 and 3.2.

In the theoretical development, we allow the number of slices h, the structural dimension d, and the number of important predictors q to all diverge with the sample size n. Throughout this section, we assume that the predictor X = (X1, · · · , Xp)T has the following properties.

(A1) For all ϵ > 0 and 1 ≤ j ≤ p, there is a constant C > 0 such that P(|Xj − µj| > ϵ) ≤ 2 exp(−ϵ²/2C).

(A2) For all ϵ > 0, 1 ≤ j ≤ p and 1 ≤ y ≤ h, there are constants C1, C2 > 0 such that P(|Xj − µyj| > ϵ | Y ∈ Iy) ≤ C1 exp(−ϵ²/2C2).

(A3) There are constants σ∗, σ̃ > 0 such that σij < σ̃ for every 1 ≤ i, j ≤ p and λmin(Σ) > σ∗, where λmin(Σ) is the minimum eigenvalue of Σ.

(A4) There is a positive constant cb such that P(Y ∈ Iy) ≥ cb/h for all 1 ≤ y ≤ h.

Condition (A1) requires the elements of X to be uniformly sub-Gaussian, which is a common assumption in high-dimensional studies. Condition (A2) requires that for all slices, the conditional distribution of Xj | (Y ∈ Iy) is sub-Gaussian (1 ≤ j ≤ p). Note that (A2) implies (A1). If h is upper bounded, then (A1) suffices as it also implies (A2). We also note that without (A2) or with a weaker condition, consistency of central subspace estimation can still hold but with a slower rate than that obtained in the following theorems. Rather than using the commonly assumed upper-bound constraint on the maximum eigenvalue λmax(Σ), (A3) requires a weaker condition that each element of Σ is uniformly upper bounded, and it essentially rules out asymptotic collinearity in the predictors. We also use (A4) to ensure that the marginal probability of each slice is not too small.
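For readers who want to see the sample quantities above in code, the following Python/NumPy sketch forms Υn = Û1Dĝ and the sample covariance from slices of a continuous response. The quantile-based slicing and the function names are our illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def sir_kernel(X, y, h):
    """Form the SIR sample kernel Upsilon_n = U1_hat D_ghat from h slices of y."""
    n, p = X.shape
    Xbar = X.mean(axis=0)
    edges = np.quantile(y, np.linspace(0, 1, h + 1))          # slice boundaries at empirical quantiles
    labels = np.clip(np.searchsorted(edges[1:-1], y, side="right"), 0, h - 1)
    U1 = np.zeros((p, h))
    g_hat = np.zeros(h)
    for s in range(h):
        idx = labels == s
        g_hat[s] = np.sqrt(idx.mean())                        # g_hat_y = sqrt(N_y / n)
        U1[:, s] = X[idx].mean(axis=0) - Xbar                 # Z_bar_y = X_bar_y - X_bar
    return U1 * g_hat                                         # scale column y by g_hat_y: U1_hat D_ghat

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
y = X[:, 0] + 0.1 * rng.standard_normal(200)
Upsilon_n = sir_kernel(X, y, h=5)
Sigma_n = np.cov(X, rowvar=False)                             # sample covariance choice of Section 3.1
```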
3.1. SIR with Sample Covariance
As the first choice for Σn, we use the sample covariance matrix; that is, Σn = Σ̂n := ∑_{i=1}^n (Xi − X̄)(Xi − X̄)T/n. Define p̄n = max(p, n). We allow p to grow exponentially fast with n under ultrahigh-dimensional scenarios in (A5).

(A5) Assume h²q log p̄n = O(n^{1−2ξ}) and dq² log p̄n = O(n^{1−2ξ}) for some constant 0 < ξ < 1/2.

Condition (A5) implies that q = o(√(n/log p̄n)) and h = o(√(n/log p̄n)). Hence both q and h are allowed to diverge but not to diverge too fast. Such an assumption is reasonable since dimensions observed when p ≫ n are typically small and good values of h are not much bigger than d. We further require the following conditions on the kernel matrix M.

(A6) Assume min_{j∈B0} ejT MMT ej > cφ n^{−φ} for some 0 ≤ φ < 2ξ.

(A7) The nonzero singular values of M are bounded away from 0.

Specifically, (A6) is the marginal utility quantity condition (Zhu et al., 2012) that is used to control the level of the signal; it equivalently means that each row vector (in Rh) of M that corresponds to the important variables should have norm (or signal) not smaller than cφ n^{−φ}. Condition (A7) is satisfied when the nonzero eigenvalues of cov(E(X|Ỹ)) are bounded away from zero, which is needed for deriving projection matrix consistency in central subspace estimation. We can now describe important aspects of the asymptotic behavior for SIR. We use ‖·‖F to denote the Frobenius norm.

Theorem 1. Under Conditions (A1)–(A7), with tuning parameter sequences λ̃n and λn given in Theorem 6 of Supplement A.3, the minimizer (Γ̂, V̂) satisfies
1) central subspace estimation consistency: ‖P_SΓ̂ − P_SY|X‖F = Op((h + (dq)^{1/2})√(q log p̄n/n));
2) variable selection consistency: P(Â0 = A0) → 1 as n → ∞.

The results in Theorem 1 allow q, h and d to diverge all together with n, and (A5) essentially makes sure that the central subspace estimation error can converge to zero. For special cases, if d ≍ h ≍ q^{1/2} or, more generally, h = o((dq)^{1/2}), then ‖P_SΓ̂ − P_SY|X‖F = Op(q√(d log p̄n/n)). If h and d are upper bounded, the convergence rate is Op(q√(log p̄n/n)); if q is upper bounded, the rate is Op(h√(log p̄n/n)). Also, if p diverges at a polynomial rate (p = n^α with 0 < α < 1) and h, q are upper bounded, we have ‖P_SΓ̂ − P_SY|X‖F = Op(√(log n/n)), which is faster than the existing rate Op(√(p/n)) in Wu and Li (2011). The original asymptotic results of Li (1991) required that p and h be bounded from above. In that case, d and q are also bounded above, and we obtain ‖P_SΓ̂ − P_SY|X‖F = Op(√(log n/n)), which is somewhat slower than the known root-n rate; classical SDR methods such as Cook and Ni (2005) should suffice to handle this special case.
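The subspace estimation loss appearing in Theorem 1 can be computed directly once basis matrices are available. The short sketch below (ours, not the paper's code) evaluates ‖P_SΓ̂ − P_SY|X‖F from two basis matrices.

```python
import numpy as np

def projection(B):
    """Orthogonal projection onto span(B) for a full-column-rank basis matrix B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def subspace_loss(Gamma_hat, Gamma0):
    """Frobenius-norm distance between the projections onto span(Gamma_hat) and span(Gamma0)."""
    return np.linalg.norm(projection(Gamma_hat) - projection(Gamma0), "fro")

rng = np.random.default_rng(3)
Gamma0 = rng.standard_normal((10, 2))                    # stand-in for a basis of S_{Y|X}
Gamma_hat = Gamma0 + 0.1 * rng.standard_normal((10, 2))
print(subspace_loss(Gamma_hat, Gamma0))
```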
The tuning parameters λ̃n and λn (and those in Theorems 2–5) are theoretical constructs that contain several unknown constants. In practice, rather than trying to find plugged-in values for the tuning parameters, we propose a data-driven sequential procedure in Section 6.2 to find the tuning parameters and structural dimension, and show their empirical performance in both simulation and real data analysis. Like many other shrinkage methods, tuning parameter selection consistency remains a challenging open question, and we leave it for future investigations.

3.2. SIR with Thresholded Covariance Estimation
Estimation of a covariance matrix is known to be challenging when the dimension p is much larger than the sample size n. Various methods have been proposed to estimate high-dimensional and "approximately sparse" covariance matrices (Bickel and Levina, 2008a,b; Karoui, 2008; Lam and Fan, 2009; Rothman et al., 2009; Rothman, 2012; Cai and Zhou, 2012; Xue et al., 2012). Thresholding methods (Bickel and Levina, 2008a) have low computational cost, and are therefore attractive when p is very large. In this section, we use a variant of the generalized thresholding technique (Rothman et al., 2009) to estimate Σ, assuming approximate sparsity:

(A8) The covariance matrix Σ satisfies that max_{1≤i≤p} ∑_{j=1}^p |σij|^κ is upper bounded with 0 ≤ κ < 1.

In particular, if κ = 0 in (A8), we have a truly sparse covariance matrix in which the number of nonzero elements in each row is upper bounded; that is, for Σ = [σij]_{1≤i,j≤p}, it is assumed that max_{1≤i≤p} ∑_{j=1}^p I(σij ≠ 0) is upper bounded. Given a thresholding parameter τ > 0, Rothman et al. (2009) proposed a univariate thresholding rule sτ(·) that satisfies three conditions: (i) |sτ(z)| ≤ |z|; (ii) sτ(z) = 0 for |z| ≤ τ; (iii) |sτ(z) − z| ≤ τ. Popular choices of the thresholding rules include hard-thresholding, lasso (Tibshirani, 1996; Donoho et al., 1995), SCAD (Fan and Li, 2001) and adaptive lasso (Zou, 2006). See Rothman et al. (2009) for detailed discussion on the generalized requirements for sτ(·). We use the lasso soft-thresholding rule sτ(z) = sign(z)(|z| − τ)+ in our empirical studies.

To estimate Σ, we set Σn = Σ̃n := Sτ(Σ̂), where Sτ(·) is a matrix thresholding function. We choose not to impose thresholding on diagonal elements. Specifically, with Σ̂ = [σ̂ij]_{1≤i,j≤p} and Σ̃n = [σ̃ij]_{1≤i,j≤p}, we set

σ̃ij = σ̂ij if i = j,  and  σ̃ij = sτ(σ̂ij) if i ≠ j.    (6)
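A minimal sketch of the thresholded estimator in (6), using the lasso soft-thresholding rule on the off-diagonal entries only; the value of τ below is arbitrary and would be selected data-adaptively in practice.

```python
import numpy as np

def soft_threshold_cov(S, tau):
    """Apply s_tau(z) = sign(z) * (|z| - tau)_+ to the off-diagonal entries of S,
    leaving the diagonal untouched, as in (6)."""
    T = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 50))
S = np.cov(X, rowvar=False)                   # sample covariance Sigma_hat
Sigma_tilde = soft_threshold_cov(S, tau=0.2)  # thresholded estimator Sigma_tilde_n
```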
(A9) Assume h²q log p̄n = O(n^{1−2ξ}) and q² log p̄n = O(n^{1−2ξ}) for constant 0 < ξ < 1/2. In addition, assume that either dq² log p̄n = O(n^{1−2ξ}) or q(log p̄n)^{1−κ} = O(n^{1−κ−2ξ}) holds.
Condition (A9) sets the constraints on h, d and q to ensure consistency in central subspace estimation. In particular, if Σ is truly sparse with κ = 0, then (A9) simplifies to h²q log p̄n = O(n^{1−2ξ}) and q² log p̄n = O(n^{1−2ξ}). Theorem 2 establishes the asymptotic properties of SIR with thresholded covariance estimation.

Theorem 2. Let ln = (n/log p̄n)^κ. Under (A1)-(A4) and (A6)-(A9), with tuning parameter sequences τ, λ̃n and λn given in Theorem 7 of Supplement A.3, the minimizer (Γ̂, V̂) satisfies
1) central subspace estimation consistency: ‖P_SΓ̂ − P_SY|X‖F = Op((h + √((dq) ∧ ln))√(q log p̄n/n));
2) variable selection consistency: P(Â0 = A0) → 1 as n → ∞.

By considering the sparsity structure of (A8), the convergence rate of Theorem 2 improves over that of Theorem 1. For example, assume Σ is truly sparse with κ = 0. Then, if d ≍ h ≍ q^{1/2}, ‖P_SΓ̂ − P_SY|X‖F = Op(q√(log p̄n/n)) (compared to Op(q√(d log p̄n/n)) in Theorem 1); if h is upper bounded, the convergence rate is Op(√(q log p̄n/n)) (compared to Op(q√(log p̄n/n)) in Theorem 1). The empirical performance of SIR is reported in Section 7 (Table 2).

4. PFC in Ultrahigh Dimension
Another important example of SDR methods is the model-based approach called PFC (Cook, 2007; Cook and Forzani, 2008), where we begin with the inverse model for the conditional distribution of X | (Y = y):

X = µ + ΨBf(y) + ε,    (7)

where µ = E(X), the random error ε has mean E(ε|Y) = 0 and positive definite variance Var(ε|Y) = Φ, Ψ is a p × d semi-orthogonal matrix, f(·) is a known h̃-dimensional vector-valued function of Y for some constant h̃ ≥ d, and the link matrix B is a full-rank d × h̃ matrix with bounded elements. The mean function in (7) does not have to be precise, as there is considerable flexibility in the choice of f (Cook and Forzani, 2008, Sec. 3.2). Without loss of generality, we assume E(f) = 0 and always center the sample version of f in practice. Cook and Forzani (2008, Thm. 2.1) show that under (7), E[(X − µ) | Y = y] = ΨBf(y) ∈ ΦSY|X for all y. Define U = E(ffT). Then ΨB = E[(X − µ)fT]U−1 and span(ΨB) = ΦSY|X. Therefore, we set the population-level matrix in (2) to be M̃ = M̃2 := ΨB and W = Φ. Letting Γ0 ∈ Rp×d be a basis of SY|X, we have M̃2 = ΦΓ0V0, where V0 ∈ Rd×h̃ carries the coordinates, and (Γ0, V0) is a minimizer of (2).
To construct the estimator Υn in (1) of M̃2, we define fi = f(Yi) ∈ Rh̃, f̄ = ∑_{i=1}^n fi/n, F = (f1 − f̄, · · · , fn − f̄)T ∈ Rn×h̃ and X = (X1 − X̄, · · · , Xn − X̄)T ∈ Rn×p. Then we set Υn = (XTF)(FTF)−1, which is equivalent to the least-squares regression coefficient of centered X on centered f. Note that Σn in (1) is now an estimator of the error covariance Φ. We will consider two specific estimators of Φ, the sample residual covariance matrix and a thresholded residual covariance matrix, separately in Sections 4.1 and 4.2.

For the theoretical development, using a notion in Cook et al. (2012), we next define a signal function to quantify the rate of increase in the population signal of the regression coefficients: let ω(p): Rp ↦ R1 be a positive monotonically increasing function so that the elements of the d × d positive definite matrix Gd(p) := ΨTΦ−2Ψ/ω(p) are uniformly upper bounded for all p. Denote ω = ω(p) for notational brevity. To establish estimation consistency of SY|X and variable selection consistency, we require the following assumptions.

(B1) For all ϵ > 0, 1 ≤ j ≤ p and 1 ≤ k ≤ h̃, there is a constant C > 0 such that P(|εj| > ϵ) ≤ 2 exp(−ϵ²/2C) and P(|fk(Y)| > ϵ) ≤ 2 exp(−ϵ²/2C).

(B2) There is some positive constant u∗ such that λmin(U) ≥ u∗, where U = E(ffT), as defined previously.

(B3) There are constants ψ∗, ψ̃ > 0 such that ψij < ψ̃ for every 1 ≤ i, j ≤ p and λmin(Φ) > ψ∗, where ψij = [Φ]ij.

(B4) Assume min_{j∈B0} ejT MMT ej > cφ ω n^{−φ} for some 0 ≤ φ < 2ξ < 1.

(B5) The nonzero eigenvalues of BTGdB/‖B‖2² and ‖B‖2 are bounded away from zero.

Assumption (B1) gives the sub-Gaussian conditions on ε = (ε1, · · · , εp)T in addition to f(Y) = (f1(Y), · · · , fh̃(Y))T. Assumptions (B2) and (B3) are similar to (A3) and basically rule out asymptotic collinearity. Assumption (B4) is the marginal utility condition similar to (A6). In addition, we assume (B5), which regulates the kernel matrix and the link matrix B for deriving projection matrix consistency in central subspace estimation.

4.1. PFC with Sample Residual Covariance
For the choice of Σn, without sparsity assumptions on Φ, we can simply use the sample covariance of the regression residuals as its estimator. Specifically, recall that X is the n × p design matrix of centered Xi's. Let PF = F(FTF)−1FT be the projection matrix onto the column span of F = (f1 − f̄, · · · , fn − f̄)T. Then we set Σn = Φ̂n := XT(In − PF)X/n. The following condition allows p to increase exponentially with n, and implies that the number of link functions h̃ = o(√(nω/log p̄n)) and that the number of important variables q = o(√(n/log p̄n)).

(B6) Assume h̃²q log p̄n = O(ωn^{1−2ξ}) and dq² log p̄n = O(n^{1−2ξ}) for some 0 < ξ < 1/2.

Theorem 3. Under Conditions (B1)-(B6), with tuning parameter sequences λ̃n and λn given in Theorem 8 of Supplement A.4, the minimizer (Γ̂, V̂) satisfies
1) central subspace estimation consistency: ‖P_SΓ̂ − P_SY|X‖F = Op((h̃/ω^{1/2} + (dq)^{1/2})√(q log p̄n/n));
2) variable selection consistency: P(Â0 = A0) → 1 as n → ∞.

Theorem 3 indicates that under the ultrahigh-dimensional setting, both central subspace estimation and variable selection consistency can be simultaneously established for PFC without requiring sparsity conditions on the error covariance matrix or upper-bounded eigenvalues. The results also allow q, h̃ and d to diverge all together with n. In particular, if d ≍ h̃ ≍ q^{1/2} or, more generally, ω^{−1}h̃² = o(dq), then ‖P_SΓ̂ − P_SY|X‖F = Op(q√(d log p̄n/n)). If h̃ is upper bounded, we have convergence rate Op(q√(log p̄n/n)); if q is upper bounded, the rate is Op((1 ∨ (ω^{−1/2}h̃))√(log p̄n/n)).
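To illustrate the PFC sample quantities, the sketch below computes Υn = (XTF)(FTF)−1 and the residual covariance Φ̂n = XT(In − PF)X/n for a user-supplied basis f(·); the cubic polynomial basis used in the toy call is only an illustrative stand-in (Section 7 uses Legendre polynomials).

```python
import numpy as np

def pfc_components(X, y, f_basis):
    """Return (Upsilon_n, Phi_hat_n) for PFC given a basis function f(.)."""
    n = len(y)
    Xc = X - X.mean(axis=0)                             # centered n x p design matrix
    F = f_basis(y)
    F = F - F.mean(axis=0)                              # center the sample version of f
    FtF_inv = np.linalg.inv(F.T @ F)
    Upsilon_n = Xc.T @ F @ FtF_inv                      # least-squares coefficient of X on f
    P_F = F @ FtF_inv @ F.T                             # projection onto the column span of F
    Phi_hat_n = Xc.T @ (np.eye(n) - P_F) @ Xc / n       # sample residual covariance
    return Upsilon_n, Phi_hat_n

rng = np.random.default_rng(5)
y = rng.standard_normal(200)
X = np.outer(y, np.ones(10)) + rng.standard_normal((200, 10))
f_poly = lambda t: np.column_stack([t, t**2, t**3])     # illustrative polynomial basis for f(y)
Upsilon_n, Phi_hat_n = pfc_components(X, y, f_poly)
```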
4.2. PFC with Thresholded Residual Covariance
We consider here the plausible scenario that the large error covariance matrix Φ is approximately sparse by adopting the following conditions.

(B7) The error covariance matrix Φ satisfies that max_{1≤i≤p} ∑_{j=1}^p |ψij|^κ is upper bounded with 0 ≤ κ < 1.

(B8) Assume h̃²q log p̄n = O(ωn^{1−2ξ}) and q² log p̄n = O(n^{1−2ξ}) for constant 0 < ξ < 1/2. In addition, assume that either dq² log p̄n = O(n^{1−2ξ}) or q(log p̄n)^{1−κ} = O(n^{1−κ−2ξ}) holds.

Under (B7), rather than simply using the sample residual covariance Φ̂, for more efficient estimation, we consider imposing the generalized thresholding rule for the estimation of Φ; that is, in the objective function (1), set Σn = Φ̃n := Sτ(XT(In − PF)X/n), where Sτ(·) is the matrix thresholding function described in (6). The lasso soft-thresholding rule is used for the off-diagonal terms of Sτ(·) in empirical studies. If the error covariance matrix Φ is truly sparse with κ = 0, then (B8) simplifies to h̃²q log p̄n = O(ωn^{1−2ξ}) and q² log p̄n = O(n^{1−2ξ}).
Theorem 4. Let ln = (n/log p̄n)^κ. Under (B1)–(B5), (B7) and (B8), with tuning parameter sequences τ, λ̃n and λn given in Theorem 9 of Supplement A.4, the minimizer (Γ̂, V̂) satisfies
1) central subspace estimation consistency: ‖P_SΓ̂ − P_SY|X‖F = Op((h̃/ω^{1/2} + d^{1/2}(q^{1/2} ∧ ln))√(q log p̄n/n));
2) variable selection consistency: P(Â0 = A0) → 1 as n → ∞.

In addition to consistency in both central subspace estimation and variable selection, we note that the convergence rate of the central subspace in Theorem 4 improves over Theorem 3, which does not consider the error covariance sparsity structure. For example, assume that Φ is truly sparse with κ = 0. Then if d ≍ h̃ ≍ q^{1/2}, ‖P_SΓ̂ − P_SY|X‖F = Op(√(dq log p̄n/n)), which is faster compared to Op(q√(d log p̄n/n)) in Theorem 3; if h̃ is upper bounded, the convergence rate is Op(√(q log p̄n/n)), which is again faster than Op(q√(log p̄n/n)) in Theorem 3. The empirical performance of the PFC methods will be illustrated in Section 7 (Table 4), which also suggests that it can be beneficial for central subspace estimation to impose the extra step of residual covariance thresholding if the approximate sparsity condition is satisfied.

5. SAVE in Ultrahigh Dimension
It is known that for highly symmetric regression, the coverage condition of SIR may not hold. Second-moment-based SDR methods such as SAVE (Cook and Weisberg, 1991) can be useful at finding directions with strong curvature. Specifically, SAVE considers a conditional second moment Λy = Σ − Σy, where Σ = cov(X) and Σy = cov(X|Y ∈ Iy) with Iy being slice y (y = 1, . . . , h). It is known that SΛy ⊆ ΣSY|X under the linearity and constant covariance conditions. Accordingly, define U2 = (Λ1, · · · , Λh) ∈ Rp×ph, M̃ = M̃3 := U2Dg²U2T and W = Σ, where Dg is the diagonal matrix defined in Section 3. Then under coverage, SM̃3 ≡ ΣSY|X. Correspondingly, M̃3 can be represented as M̃3 = ΣΓ0V0, where Γ0 ∈ Rp×d is a basis of SY|X, V0 ∈ Rd×p contains the coordinates, and (Γ0, V0) is a minimizer of (2). To construct the sample estimate Υn in (1) for M̃3, let Σy,n denote the sample version of Σy and recall that Σn is the estimator of Σ. The explicit forms of Σy,n and Σn will be given later. Then we set Υn = Û2Dĝ²Û2T, where Û2 = (Ξ1, · · · , Ξh) and Ξy = Σn − Σy,n. Under the conventional condition that p is fixed or much smaller than n, this extension to SAVE is straightforward. However, under high-dimensional settings, the extension becomes much more challenging, as M̃3 involves the conditional second moments and their estimation inherits similar
challenges encountered in large covariance matrix estimation. Specifically, if we simply use sample covariance estimation, the l2-norm estimation error bound for M̃3 becomes Op(√(p log p̄n/n)), which is not ideal for high-dimensional study. As a result, some "approximate sparsity" assumptions may need to be imposed. Assume that the distribution of X and the conditional distributions of X|Y ∈ Iy (1 ≤ y ≤ h) are jointly sub-Gaussian, as stated in (C1).

(C1) For all ϵ > 0 and all t in the unit sphere of Rp, there is a constant C > 0 such that P(|tT(X − µ)| > ϵ) ≤ 2 exp(−ϵ²/2C). In addition, for all ϵ > 0 and all t in the unit sphere of Rp, there are constants C1, C2 > 0 such that for every 1 ≤ y ≤ h, P(|tT(X − µ)| > ϵ | Y ∈ Iy) ≤ C1 exp(−ϵ²/2C2).

In particular, if h is upper bounded, the first component in (C1) immediately implies the second component, and therefore suffices in the following studies. The next condition is needed to handle situations in which the conditional covariance matrices Σy are "approximately sparse." Similar assumptions have been applied to other high-dimensional problems and can be true in many applications (see, e.g., Shao et al., 2011). Let Σy = [σyij]_{1≤i,j≤p} for 1 ≤ y ≤ h.

(C2) For every conditional covariance matrix Σy, max_{1≤i≤p} ∑_{j=1}^p |σyij|^κ is upper bounded with some 0 ≤ κ < 1.

In addition, we assume that (A8) is satisfied. As a simple but illustrative example, if there exists an index set A1 ⊆ A0c such that XA0∪A1 ⊥⊥ X(A0∪A1)c and |A0 ∪ A1| is upper bounded, then (C2) already implies (A8), and span(Σ−1SΛy) is sparse. In fact, under the weaker condition that the lκ-norm of every E(X|Ỹ) is upper bounded for some 0 ≤ κ < 1, since Σ = cov(E(X|Ỹ)) + E(cov(X|Ỹ)), (A8) is also satisfied. Similar to Section 3.2, we can apply the generalized thresholding scheme to estimate Σy as well. Specifically, define the sample conditional covariance Σ̂y,n = ∑_{Yi∈Iy} (Xi − X̄y)(Xi − X̄y)T/Ny =: [σ̂yij]_{1≤i,j≤p}. Then we set the estimator of Σy to be Σy,n = Sτ(Σ̂y,n) =: [σ̃yij]_{1≤i,j≤p} for some properly chosen τ.

(C3) Assume that h²q log p̄n = O(n^{1−2ξ}), q² log p̄n = O(n^{1−2ξ}) and q(h log p̄n)^{1−κ} = O(n^{1−κ−2ξ}) holds for some constant 0 < ξ
2), where β 4 = (1, 1, 1, 0, 0, · · · , 0)T ,
as an example to illustrate the performance of SAVE in high dimension. Except for the model, all other simulation settings remain the same as in the previous cases. The predictor dimension p is 1000 and the sample size n is 200. Again, the aforementioned three scenarios for Σ are considered. In this case, the true dimension of the central subspace is one and the response Y is binary. Since h = 2, SIR naturally suggests the structural dimension d to be one without selection. For SAVE, we again applied the parameter selection procedure proposed in Section 6 to determine the structural dimension. Table 3 summarizes the results regarding dimension selection, central subspace estimation, and variable selection from the proposed SIR and SAVE methods. We can see that in all three scenarios, SAVE is able to correctly select the structural dimension d = 1. In addition, the results regarding the averaged spectral-norm loss, Cv, and ICv confirm our expectation that neither SC-SIR nor TC-SIR can handle this case, while TC-SAVE, the SAVE method with thresholding described in Section 5, performs relatively well in both central subspace estimation and variable selection.

Table 3: Simulation results of Case 4 based on 100 runs.

| Σ | Method | Averaged spectral-norm loss | Frequency d=1 (%) | Frequency d=2 (%) | Frequency d=3 (%) | Cv | ICv |
|---|--------|-----------------------------|-------------------|-------------------|-------------------|----|-----|
| (a) | Oracle  |                 | 100 | 0 | 0 | 3    | 0      |
| (a) | SC-SIR  | 0.9995 (0.0001) |     |   |   | 0.31 | 105.00 |
| (a) | TC-SIR  | 0.9994 (0.0001) |     |   |   | 1.08 | 338.37 |
| (a) | TC-SAVE | 0.195 (0.008)   | 100 | 0 | 0 | 3    | 10.72  |
| (b) | SC-SIR  | 0.9997 (0.0001) |     |   |   | 0.24 | 90.67  |
| (b) | TC-SIR  | 0.9995 (0.0001) |     |   |   | 1.07 | 356.64 |
| (b) | TC-SAVE | 0.291 (0.013)   | 100 | 0 | 0 | 3    | 9.09   |
| (c) | SC-SIR  | 0.9995 (0.0001) |     |   |   | 0.27 | 86.01  |
| (c) | TC-SIR  | 0.9996 (0.0001) |     |   |   | 0.88 | 286.52 |
| (c) | TC-SAVE | 0.280 (0.013)   | 100 | 0 | 0 | 3    | 20.20  |
We next provide an example of PFC under the inverse regression setting. Assume that the response Y follows the standard normal distribution, ε ∈ Rp follows a normal distribution with mean 0 and covariance Φ, and Y ⊥⊥ ε. We consider the same three covariance structures for Φ as those of Σ in the previous cases: Scenario (a) has an exponential decay structure and Scenarios (b) and (c) have the block-diagonal structure. The data were generated according to the model

Case 5: X = ΨνY + ε, where νY = (Y, Y²)T, Ψ = Φβ and β = (β51, β52).

The columns of β are β51 = (1, 1, 1, 1, 1, 0, · · · , 0)T ∈ Rp and β52 = (1, −1, 1, −1, 1, 0, · · · , 0)T ∈ Rp.
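A hedged sketch of generating data from the Case 5 model is given below. The exponential-decay covariance used for Φ (entries ρ^|i−j| with ρ = 0.5) is our own illustrative stand-in for Scenario (a); the paper's exact decay parameter and block-diagonal structures are not reproduced here.

```python
import numpy as np

def generate_case5(n=200, p=1000, rho=0.5, seed=None):
    """Simulate (X, Y) from Case 5: X = Psi nu_Y + eps, nu_Y = (Y, Y^2)^T, Psi = Phi beta."""
    rng = np.random.default_rng(seed)
    Phi = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # assumed decay structure
    beta = np.zeros((p, 2))
    beta[:5, 0] = 1.0                       # beta_51 = (1, 1, 1, 1, 1, 0, ..., 0)^T
    beta[:5, 1] = [1, -1, 1, -1, 1]         # beta_52 = (1, -1, 1, -1, 1, 0, ..., 0)^T
    Psi = Phi @ beta
    Y = rng.standard_normal(n)                                           # Y ~ N(0, 1)
    nu = np.column_stack([Y, Y**2])
    eps = rng.multivariate_normal(np.zeros(p), Phi, size=n)              # eps ~ N(0, Phi), independent of Y
    X = nu @ Psi.T + eps
    return X, Y

X, Y = generate_case5(n=200, p=200, seed=0)   # smaller p than the simulations, to keep the toy run fast
```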
The central subspace SY|X in Case 5 is span(β). All the other simulation settings are the same as in the previous cases. We denote the PFC method using the sample residual covariance Σn = Φ̂n by SC-PFC, and denote the PFC method using the thresholded residual covariance matrix estimate Σn = Φ̃n = Sτ(Φ̂n) by TC-PFC. To apply SC-PFC and TC-PFC, we chose f(Y) = (f1(Y), f2(Y), f3(Y))T to be the first three Legendre polynomials; that is, f1(Y) = Y, f2(Y) = (3Y² − 1)/2 and f3(Y) = (5Y³ − 3Y)/2. For comparison, we also applied SC-SIR and TC-SIR to the simulated data.

Table 4: Simulation results of Case 5 based on 100 runs.

| Φ | Method | Averaged spectral-norm loss | Frequency d=1 (%) | Frequency d=2 (%) | Frequency d=3 (%) | Cv | ICv |
|---|--------|-----------------------------|-------------------|-------------------|-------------------|----|-----|
| (a) | Oracle |               | 0  | 100 | 0 | 5    | 0     |
| (a) | SC-SIR | 0.770 (0.021) | 34 | 63  | 3 | 4.40 | 15.81 |
| (a) | TC-SIR | 0.629 (0.030) | 32 | 67  | 1 | 4.78 | 0.54  |
| (a) | SC-PFC | 0.470 (0.013) | 0  | 100 | 0 | 5.00 | 3.69  |
| (a) | TC-PFC | 0.315 (0.023) | 0  | 91  | 9 | 5.00 | 4.76  |
| (b) | SC-SIR | 0.857 (0.016) | 41 | 55  | 4 | 3.96 | 15.45 |
| (b) | TC-SIR | 0.732 (0.026) | 34 | 58  | 8 | 4.74 | 0.45  |
| (b) | SC-PFC | 0.542 (0.015) | 2  | 98  | 0 | 5.00 | 12.52 |
| (b) | TC-PFC | 0.472 (0.024) | 9  | 85  | 6 | 4.99 | 7.28  |
| (c) | SC-SIR | 0.805 (0.017) | 20 | 73  | 7 | 4.07 | 15.50 |
| (c) | TC-SIR | 0.550 (0.033) | 27 | 72  | 1 | 4.77 | 1.68  |
| (c) | SC-PFC | 0.665 (0.012) | 0  | 99  | 1 | 5.00 | 11.65 |
| (c) | TC-PFC | 0.468 (0.020) | 0  | 93  | 7 | 5.00 | 12.14 |
The simulation results summarized in Table 4 show that SC-PFC and TC-PFC can perform favorably compared to the SIR methods. In particular, TC-PFC has the smallest estimation error in central subspace estimation in all three scenarios. By using the procedures proposed in Section 6.2, both SC-PFC and TC-PFC provide predominantly correct structural dimensions, and both successfully identify almost all important variables with a very small compromise in false discovery.

Besides the (approximately) sparse predictor covariances, it is also interesting to consider non-sparse scenarios, as our methods do not necessarily rely on the sparse covariance assumption. Specifically, we considered Σ = p−1ATA for SIR, where A is a 1000 × 1000 matrix with independent standard normal entries. The results from repeating the simulation procedures described above are summarized in Table 5, where SC-SIR performs reasonably well in all three cases. Not surprisingly, TC-SIR no longer outperforms SC-SIR as Σ is non-sparse. Under the SAVE setting, however, we do not expect TC-SAVE to perform well as its consistency results rely on Σ satisfying sparsity condition (A8). Under the PFC setting, if the error covariance Φ is non-sparse, the proposed SC-PFC method can still perform satisfactorily. For brevity, the simulation results for the PFC and SAVE settings are left to Supplement B. In addition, we note that the proposed method is designed for sparse rather than abundant settings. Indeed, previous conditions on q such as (A5) require q to grow slower than √n, and the proposed method is not intended for settings with a very large number of important predictor variables.

Table 5: Simulation results of Cases 1-3 based on 100 runs with Σ = p−1ATA.

| Case | Method | Averaged spectral-norm loss | Frequency d=1 (%) | Frequency d=2 (%) | Frequency d=3 (%) | Cv | ICv |
|------|--------|-----------------------------|-------------------|-------------------|-------------------|----|-----|
| 1 | Oracle |               | 100 | 0   | 0 | 5    | 0     |
| 1 | SC-SIR | 0.252 (0.018) | 98  | 2   | 0 | 4.95 | 14.62 |
| 1 | TC-SIR | 0.498 (0.026) | 81  | 15  | 4 | 4.98 | 39.59 |
| 2 | Oracle |               | 0   | 100 | 0 | 8    | 0     |
| 2 | SC-SIR | 0.475 (0.019) | 0   | 99  | 1 | 7.79 | 21.55 |
| 2 | TC-SIR | 0.626 (0.019) | 0   | 93  | 7 | 7.67 | 24.95 |
| 3 | Oracle |               | 0   | 100 | 0 | 8    | 0     |
| 3 | SC-SIR | 0.434 (0.015) | 1   | 98  | 1 | 7.93 | 16.88 |
| 3 | TC-SIR | 0.580 (0.017) | 2   | 97  | 1 | 7.93 | 15.18 |
8. Real Data Example
In this section, we demonstrate the application of our proposal in a high-dimensional real data example with multi-class responses. The data come from a small round blue-cell tumors (SRBCT) microarray experiment (Khan et al., 2001) in which a sparse predictor covariance structure (Rothman et al., 2009) seems plausible. There are 63 training and 20 testing tissue samples, and these tissue samples have four tumor classes: EWS, BL, NB and RMS. Each sample contains expression values of 2,308 genes that were filtered down from the original 6,567 genes (Khan et al., 2001). We will show that our approach not only enables investigators to find important variables/genes and achieve very accurate prediction, but also simultaneously constructs sufficient predictors from different directions of the estimated central subspace to allow convenient visualization and to gain new explainable and useful information. We applied both TC-SIR and TC-PFC to the training dataset with tuning parameters and structural dimension determined by the procedures in Section 6.2. Given the four-category response, we naturally used a vector of indicator variables of the first three categories as the PFC f(·) function (Cook, 2007). Therefore, for both SIR and PFC, we considered the possible structural dimensions d = 1, 2 or 3. Both methods selected dimension d = 3, with TC-SIR finding an active set of 14 genes and TC-PFC finding 37 genes. With TC-SIR as the illustration, we used the three different regression directions (denoted by DR1, DR2 and DR3) of Γ̂ as well as the
Figure 1: 3D scatter plots with estimated sufficient predictor variables from TC-SIR (axes SP1, SP2, SP3): (a) training samples; (b) testing samples. Both plots use initial letters to represent tumor classes: EWS uses E; BL uses B; NB uses N; RMS uses R.
Figure 2: (a) Heat maps of sufficient predictors of TC-SIR using the training samples. Columns are sorted by four tumor types: EWS, BL, NB and RMS (left to right). (b) Loadings of each regression direction. In both figures, the lighter (darker) the color, the more positive (negative) the value. Column labels are gene IDs.
gene expression values to construct the sufficient predictors (denoted by SP1, SP2 and SP3). The 3D-scatterplots of the sufficient predictors show that both the training samples (Figure 1(a)) and testing samples (Figure 1(b)) are very well-separated into four clusters of their respective tumor
types. The distinct patterns of the different tumor types can be better visualized with the heat maps in Figure 2(a) using the values of the sufficient predictors. In Figure 2(a), the training samples (columns) are sorted by the tumor types (from left to right: EWS, BL, NB and RMS), labeled on the horizontal axis. In each cell of the heat map, the dotted horizontal line represents 0, and the distance of the solid black line gives the relative magnitude and sign of the cell's value. Correspondingly, the lighter (darker) the color, the more positive (negative) the value. Then, we can clearly see that EWS, BL and NB type tumors tend to have relatively large positive values in the first, second and third regression directions, respectively, while RMS tends to have relatively large negative values in all three directions. The heat map in Figure 2(b) shows the loadings of the three directions and, to some extent, reflects the relative contribution of the important genes in each direction, where the important genes are labeled on the horizontal axis. Also, with the low-dimensional 3D sufficient predictors, we can easily build a K-nearest neighbor classification model using the training data (see the sketch after Table 6). Satisfactorily, the resulting predictive models of both TC-SIR and TC-PFC correctly identify the true tumor types for all tissue samples in the testing set.

Table 6: Classification and model selection results with 100 random partitionings of the SRBCT data.

| Method | Frequency d=1 (%) | Frequency d=2 (%) | Frequency d=3 (%) | Avg. Classif. Error Rate (Training Sets) | Avg. Classif. Error Rate (Testing Sets) | Avg. Model Size |
|--------|-------------------|-------------------|-------------------|------------------------------------------|-----------------------------------------|-----------------|
| DS-MSIR |   |   |    | 0.0076          | 0.0632          | 9.06  |
| TC-SIR  | 0 | 2 | 98 | 0.0015 (0.0005) | 0.0310 (0.0034) | 20.13 |
| TC-PFC  | 0 | 3 | 97 | 0.0050 (0.0011) | 0.0433 (0.0043) | 34.10 |
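The classification step mentioned above can be written in a few lines: project the gene expression values onto the estimated directions to obtain the sufficient predictors, then apply a K-nearest-neighbor rule. The plain NumPy KNN below and the synthetic inputs are only illustrative; Γ̂ would come from the fitted TC-SIR or TC-PFC, and the SRBCT data are not reproduced here.

```python
import numpy as np

def sufficient_predictors(X, Gamma_hat):
    # SP_i = Gamma_hat^T X_i: project each sample onto the estimated directions (DR1, DR2, DR3)
    return X @ Gamma_hat

def knn_predict(SP_train, y_train, SP_test, k=5):
    """A plain K-nearest-neighbor classifier on the low-dimensional sufficient predictors."""
    preds = []
    for s in SP_test:
        dist = np.linalg.norm(SP_train - s, axis=1)
        nearest = y_train[np.argsort(dist)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

rng = np.random.default_rng(6)
Gamma_hat = rng.standard_normal((50, 3))                      # stand-in for the estimated basis
X_train, y_train = rng.standard_normal((60, 50)), rng.integers(0, 4, 60)
X_test = rng.standard_normal((20, 50))
y_pred = knn_predict(sufficient_predictors(X_train, Gamma_hat), y_train,
                     sufficient_predictors(X_test, Gamma_hat), k=5)
```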
We performed the data splitting experiment of Yu, Dong and Shao (2016) and randomly partitioned the original data into 55 training tissue samples and 28 testing tissue samples. With the randomly generated training and testing datasets, we applied TC-SIR and TC-PFC to perform central subspace estimation and sample classification with the same procedures as previously described. The data splitting experiment was repeated 100 times. Table 6 summarizes the results. As a performance benchmark for classification, we cite the results of the DS-MSIR method in Yu, Dong and Shao (2016). Clearly, TC-SIR and TC-PFC predominantly choose the structural dimension to be 3. Both have lower averaged classification error rates than those of the benchmark for the training and testing sets, with TC-PFC selecting the largest model size among the three. It is also worth noting that the benchmark DS-MSIR was designed as a powerful variable screening tool, and with the screened variables, LDA was chosen as the classification method in Yu, Dong and Shao (2016); accordingly, our proposal may serve as a useful second-step procedure following
variable screening of DS-MSIR by creating the sufficient predictors so that popular nonparametric classification methods like KNN may be applied in subsequent analysis.

9. Discussion
For model-free methods, we implicitly assumed the complete coverage condition SM ≡ SY|X in our previous development. Under the incomplete coverage that SM ⊂ SY|X, it is still worthwhile to focus on estimation of SM to infer about a proper subset of SY|X. It is also of interest to estimate B0 := B0(M) = {1 ≤ j ≤ p : ejTMMTej > 0}. Indeed, Proposition 2 in Supplement A.2 suggests that B0 ⊆ A0 and thus B0 will include at least some important variables while excluding all redundant variables. In fact, whether or not the coverage condition holds, Theorems 6-10 in Supplement A suggest that our proposed methodology achieves the estimation consistency of SM at the same rate as in the previously presented theorems, and maintains the variable selection consistency for B0. If the coverage condition holds, we immediately have the consistency results of Theorems 1-5 on the full central subspace SY|X. These connections between the estimation of SM and that of SY|X are reminiscent of the seminal work in Cook and Ni (2005), which is extended in this work to the important ultrahigh-dimensional scenarios. The formal formulations of the aforementioned connections, theorems and their proofs are all left in Supplement A. As interesting but open questions for future investigation, we note that the number of slices required to cover the full central subspace is typically unknown, and developing a general procedure of optimal slice number selection that is applicable to ultrahigh-dimensional SDR methods deserves a comprehensive study of its own. Also, it remains to be seen whether our approach can be extended to sparse inverse covariance scenarios without compromise in computational efficiency.

ACKNOWLEDGEMENTS

The authors sincerely thank the Editor, the Associate Editor and anonymous referees for their valuable comments that helped improve this manuscript significantly. Ding's research is partially supported by DE-CTR ACCEL/NIH U54 GM104941 SHoRe award, and the University of Delaware GUR award.

REFERENCES

Bickel, P. J. and Levina, E. (2008a), 'Covariance regularization by thresholding', The Annals of Statistics 36(6), 2577–2604.
Bickel, P. J. and Levina, E. (2008b), ‘Regularized estimation of large covariance matrices’, The Annals of Statistics 36(1), 199–227. Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009), ‘Simultaneous analysis of lasso and dantzig selector’, The Annals of Statistics 37(4), 1705–1732. Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A. (1984), Classification and regression trees, CRC press. B¨ uhlmann, P. and Van De Geer, S. (2011), Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media. Cai, T. T. and Zhou, H. H. (2012), ‘Minimax estimation of large covariance matrices under l1 -norm’, Statistica Sinica 22(4), 1319–1349. Chen, L. and Huang, J. Z. (2012), ‘Sparse reduced-rank regression for simultaneous dimension reduction and variable selection’, Journal of the American Statistical Association 107(500), 1533–1545. Chen, X., Zou, C. and Cook, R. D. (2010), ‘Coordinate-independent sparse sufficient dimension reduction and variable selection’, The Annals of Statistics 38(6), 3696–3723. Cook, R. D. (1994), ‘On the interpretation of regression plots’, Journal of the American Statistical Association 89(425), 177–189. Cook, R. D. (1998), Regression graphics: ideas for studying regressions through graphics, John Wiley & Sons. Cook, R. D. (2004), ‘Testing predictor contributions in sufficient dimension reduction’, The Annals of Statistics 32(3), 1062–1092. Cook, R. D. (2007), ‘Fisher lecture: Dimension reduction in regression’, Statistical Science 22(1), 1–26. Cook, R. D. and Forzani, L. (2008), ‘Principal fitted components for dimension reduction in regression’, Statistical Science 23(4), 485–501. Cook, R. D., Forzani, L. and Rothman, A. J. (2012), ‘Estimating sufficient reductions of the predictors in abundant high-dimensional regressions’, The Annals of Statistics 40(1), 353–384. Cook, R. D. and Ni, L. (2005), ‘Sufficient dimension reduction via inverse regression’, Journal of the American Statistical Association 100(470), 410–428. Cook, R. D. and Weisberg, S. (1991), ‘Discussion of “sliced inverse regression for dimension reduction”’, Journal of the American Statistical Association 86(414), 328–332. Ding, S. and Cook, R. D. (2014), ‘Dimension folding PCA and PFC for matrix-valued predictors’, Statistica Sinica 24, 463–492. Ding, S. and Cook, R. D. (2015a), ‘Higher-order sliced inverse regressions’, Wiley Interdisciplinary Reviews: Computational Statistics 7(4), 249–257. Ding, S. and Cook, R. D. (2015b), ‘Tensor sliced inverse regression’, Journal of Multivariate Analysis 133, 216–231. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995), ‘Wavelet shrinkage: asymptopia?’, Journal of the Royal Statistical Society. Series B (Methodological) 57(2), 301–369. Fan, J. and Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its oracle properties’, Journal of the American statistical Association 96(456), 1348–1360. Fan, J. and Lv, J. (2008), ‘Sure independence screening for ultrahigh dimensional feature space’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911. Hall, P. and Li, K.-C. (1993), ‘On almost linearity of low dimensional projections from high dimensional data’, The
Annals of Statistics 21(2), 867–889. Hilafu, H. and Yin, X. (2016), ‘Sufficient dimension reduction and variable selection for large-p-small-n data with highly correlated predictors’, Journal of Computational and Graphical Statistics, accepted . Huang, J., Ma, S. and Zhang, C.-H. (2008), ‘Adaptive lasso for sparse high-dimensional regression models’, Statistica Sinica 18(4), 1603–1618. Karoui, N. E. (2008), ‘Operator norm consistent estimation of large-dimensional sparse covariance matrices’, The Annals of Statistics 36(6), 2717–2756. Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, P. S. (2001), ‘Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks’, Nature Medicine 7(6), 673–679. Lam, C. and Fan, J. (2009), ‘Sparsistency and rates of convergence in large covariance matrix estimation’, Annals of statistics 37(6), 4254–4278. Li, B., Kim, M. K. and Altman, N. (2010), ‘On dimension folding of matrix-or array-valued statistical objects’, The Annals of Statistics 38, 1094–1121. Li, B. and Wang, S. (2007), ‘On directional regression for dimension reduction’, Journal of the American Statistical Association 102(479), 997–1008. Li, K.-C. (1991), ‘Sliced inverse regression for dimension reduction’, Journal of the American Statistical Association 86(414), 316–327. Li, L. (2007), ‘Sparse sufficient dimension reduction’, Biometrika 94(3), 603–613. Li, L. and Yin, X. (2008), ‘Sliced inverse regression with regularizations’, Biometrics 64(1), 124–131. Li, R., Zhong, W. and Zhu, L. (2012), ‘Feature screening via distance correlation learning’, Journal of the American Statistical Association 107(499), 1129–1139. Lin, Q., Zhao, Z. and Liu, J. S. (2015), ‘On consistency and sparsity for sliced inverse regression in high dimensions’, The Annals of Statistics, accepted . Ma, Y. and Zhu, L. (2012), ‘A semiparametric approach to dimension reduction’, Journal of the American Statistical Association 107(497), 168–179. Qian, W., Li, W., Sogawa, Y., Fujimaki, R., Yang, X. and Liu, J. (2017), ‘An interactive greedy approach to group sparsity in high dimension’, preprint arXiv:1707.02963 . Qian, W. and Yang, Y. (2016), ‘Kernel estimation and model combination in a bandit problem with covariates’, Journal of Machine Learning Research 17(149), 1–37. Qian, W., Yang, Y. and Zou, H. (2016), ‘Tweedie’s compound Poisson model with grouped elastic net’, Journal of Computational and Graphical Statistics 25(2), 606–625. Rothman, A. J. (2012), ‘Positive definite estimators of large covariance matrices’, Biometrika 99(3), 733–740. Rothman, A. J., Levina, E. and Zhu, J. (2009), ‘Generalized thresholding of large covariance matrices’, Journal of the American Statistical Association 104(485), 177–186. Shao, J., Wang, Y., Deng, X., Wang, S. et al. (2011), ‘Sparse linear discriminant analysis by thresholding for high dimensional data’, The Annals of Statistics 39(2), 1241–1265. Tan, K. M., Wang, Z., Liu, H. and Zhang, T. (2016), ‘Sparse generalized eigenvalue problem: Optimal statistical rates via truncated rayleigh flow’, preprint arXiv:1604.08697 .
29
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288. Tseng, P. and Yun, S. (2009), ‘A coordinate gradient descent method for nonsmooth separable minimization’, Mathematical Programming 117(1), 387–423. Wang, H. and Xia, Y. (2008), ‘Sliced regression for dimension reduction’, Journal of the American Statistical Association 103(482), 811–821. Wang, T., Zhao, H., Chen, M. and Zhu, L. (2016), ‘Estimating a sparse reduction for general regression in high dimensions’, Statistics and Computing pp. 1–14. Wei, F. and Huang, J. (2010), ‘Consistent group selection in high-dimensional linear regression’, Bernoulli 16(4), 1369–1384. Wu, T. T., Lange, K. et al. (2010), ‘The MM alternative to EM’, Statistical Science 25(4), 492–505. Wu, Y. and Li, L. (2011), ‘Asymptotic properties of sufficient dimension reduction with a diverging number of predictors’, Statistica Sinica 21, 707–730. Xue, L., Ma, S. and Zou, H. (2012), ‘Positive-definite l1 -penalized estimation of large covariance matrices’, Journal of the American Statistical Association 107(500), 1480–1491. Yin, X. and Hilafu, H. (2015), ‘Sequential sufficient dimension reduction for large p, small n problems’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(4), 879–892. Yin, X. and Li, B. (2011), ‘Sufficient dimension reduction based on an ensemble of minimum average variance estimators’, The Annals of Statistics 39(6), 3392–3416. Yu, Z., Dong, Y. and Shao, J. (2016), ‘On marginal sliced inverse regression for ultrahigh dimensional model-free feature selections’, The Annals of Statistics 44(6), 2594–2623. Yu, Z., Dong, Y. and Zhu, L.-X. (2016), ‘Trace pursuit: A general framework for model-free variable selection’, Journal of the American Statistical Association 111(514), 813–821. Yu, Z., Zhu, L., Peng, H. and Zhu, L. (2013), ‘Dimension reduction and predictor selection in semiparametric models’, Biometrika 100(3), 641–654. Yuan, M. and Lin, Y. (2006), ‘Model selection and estimation in regression with grouped variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67. Zhang, C.-H. (2010), ‘Nearly unbiased variable selection under minimax concave penalty’, The Annals of Statistics 38(2), 894–942. Zhou, S., van de Geer, S. and B¨ uhlmann, P. (2009), ‘Adaptive lasso for high dimensional regression and Gaussian graphical modeling’, preprint arXiv:0903.2515 . Zhu, L., Miao, B. and Peng, H. (2006), ‘On sliced inverse regression with high-dimensional covariates’, Journal of the American Statistical Association 101(474), 630–643. Zhu, L.-P., Li, L., Li, R. and Zhu, L.-X. (2012), ‘Model-free feature screening for ultrahigh-dimensional data’, Journal of the American Statistical Association 106(496), 1464–1475. Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American Statistical Association 101(476), 1418–1429. Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of Computational and Graphical Statistics 15(2), 265–286.
30
Supplement to “Sparse Minimum Discrepancy Approach to Sufficient Dimension Reduction with Simultaneous Variable Selection in Ultrahigh Dimension” A. A.1.
Supplement: Propositions, Theorems and Proofs
Proof of Proposition 1
Proof of Proposition 1. We first note that the minimizers of (1) under both constraints are unique ˆ 1 , Vˆ1 ) and up to a d × d orthonormal matrix. Take the constraint ΓT Γ = Id for example. If (Γ ˆ 2 , Vˆ2 ) are two minimizers, by the constraint, both Γ ˆ 1 and Γ ˆ 2 are full-rank matrices and there (Γ ˆ1 = Γ ˆ 2 G1 . Therefore, (Γ ˆ 2 , G1 Vˆ1 ) is also a exists a d × d orthonormal matrix G1 such that Γ minimizer, and since for a fixed Γ, L1n (Γ, V ) is a convex function of V , we can conclude that ˆ 01 , Vˆ01 ) and (Γ ˆ 11 , Vˆ11 ) Vˆ1 = GT Vˆ2 . To see the equivalence in terms of the Grassman manifold, let (Γ 1
ˆ 11 is full-rank, be minimizers under the constraints ΓT Γ = Id and V V T = Id , respectively. Since Γ ˆ T11 Γ ˆ 11 )−1/2 and observe that Γ ˆ 11 Q satisfies the constraint ΓT Γ = Id . Therefore, we can define Q = (Γ ˆ 11 , Vˆ11 ). Similarly, we can show that L1n (Γ ˆ 01 , Vˆ01 ) ≥ ˆ 01 , Vˆ01 ) ≤ L1n (Γ ˆ 11 Q, Q−1 Vˆ11 ) = L1n (Γ L1n (Γ ˆ 11 , Vˆ11 ), which implies that (Γ ˆ 11 Q, Q−1 Vˆ11 ) is also a minimizer under constraint ΓT Γ = Id . By L1n (Γ ˆ 01 = Γ ˆ 11 QG2 . Therefore, solution uniqueness, there is a d × d orthonormal matrix G2 such that Γ ˆ 01 ) = span(Γ ˆ 11 ). span(Γ A.2.
Connections between SM and SY |X
T ej = 0 and Proposition 2. Let βM be any basis matrix of SM and assume SM ⊆ SY |X . Then βM
eTj M M T ej = 0 for every j ∈ Ac0 . If we further assume the coverage condition SY |X ⊆ SM , then T ej 6= 0 and eTj M M T ej > 0 for every j ∈ A0 . βM
Proof of Proposition 2. The first statement follows immediately by noting that SM ⊆ SY |X ⊆ H0 . T For the second statement, if βM ej = 0, then the variables corresponding to the index set Ac0 ∪ {j}
are redundant variables, which makes H0 contradict the definition of CVS. Also, since M T ej 6= 0 for j ∈ A0 , we have eTj M M T ej > 0. A.3.
Proof of Theorems 1 and 2
Following discussion in Section 9, we first present two theorems (Theorems 6 and 7) without assuming coverage. Then in together with the coverage condition and Proposition 2, Theorems 6 and 7 immediately imply Theorems 1 and 2, respectively. 1
ˆ n. First, we consider SIR when we use sample covariance Σn = Σ p ˜ n = 2˜ Theorem 6. Suppose (A1)–(A5) are satisfied. If λ c(h + (dq)1/2 ) log p¯n /n with a constant ˜ V˜ ) satisfies c˜ given in (A.15), then the minimizer (Γ, p ˜ V˜ − M kF = Op (h + (dq)1/2 ) q log p¯n /n . kΓ (A.1) p ρ/2 If we further assume (A6) and (A7) and let λn = 21−ρ c˜cφ (h + (dq)1/2 ) log p¯n /n1+ρφ and 2ρ(ξ − ˆ Vˆ ) satisfies φ/2) ≥ 1 − 2ξ, then the minimizer (Γ, p kPSΓˆ − PSM kF = Op (h + (dq)1/2 ) q log p¯n /n , (A.2) P (Aˆ0 = B0 ) → 1
as n → ∞.
(A.3)
Proof of Theorem 6. We first show (A.2) holds. Let (Γ0 , V0 ) be a minimizer of (2) with Γ0 = ˆ = (ˆ ˆ Vˆ ) ≤ (γ 01 , · · · , γ 0p )T and the constraint that V0 V0T = Id . Let Γ γ 1, · · · , γ ˆ p )T . Since L2n (Γ, L2n (Γ0 , V0 ), then ˆ Vˆ )+ 1 tr(Vˆ T Γ ˆ T Σn Γ ˆ Vˆ )+λ −tr(ΥTn Γ 2
p X
vj kˆ γj k2 ≤
1 −tr(ΥTn Γ0 V0 )+ tr(V0T ΓT0 Σn Γ0 V0 )+λ 2
j=1
p X
vj kγ 0j k2 ,
j=1
which implies p X 1 A1 + A2 + A3 + λ vj kˆ γ j k2 2 j=1
ˆ Vˆ − Γ0 V0 )] + tr[V0T ΓT0 (Σn − Σ)(Γ ˆ Vˆ − Γ0 V0 )] := − tr[(Υn − ΣΓ0 V0 )T (Γ p X 1 T ˆ ˆ ˆ ˆ vj kˆ γ j k2 + tr[(ΓV − Γ0 V0 ) Σn (ΓV − Γ0 V0 )] + λ 2 j=1
≤λ
p X
vj kγ 0j k2
(A.4)
j=1
Next, we provide an upper bound for A1 . Without loss of generality, assume µ = 0. Given slice P ˜ 1, X ˜ 2, · · · , X ˜ Ny be the sample covariates y (1 ≤ y ≤ h), define Ny = ni=1 I(Yi ∈ Iy ), and let X ¯ 1, Z ¯ 2, · · · , Z ¯ h) ∈ corresponding to the slice y. Recall from Section 3 that Υn = Uˆ1 Dgˆ , where Uˆ1 = (Z p ¯y = X ¯ y − X, ¯ pˆy = Ny /n, gˆy = pˆy and g ˆ = (ˆ Rp×h , Z g1 , · · · , gˆh )T . Define µy = E(X|Y ∈ Iy ). Given unit vector ej ∈ Rp (1 ≤ j ≤ p) and > 0, using (A2) and the similar arguments for (A.51) in Lemma 4, we have that there exist constants v, c0 > 0 such that for every k ≥ 2, k!v 2 ck−2 0 . E(|eTj (Xi − µy )|k | Yi ∈ Iy ) ≤ 2 √ Also, by Hoeffding’s inequality, for any constant c1 > 2, r log p¯n 2 P |ˆ p y − p y | ≥ c1 ≤ 4. n p¯n 2
(A.5)
Then for every > 0 and large enough n, PNy eT (X N PNy eT (X ˜ k − µy ) ˜ k − µy ) y k=1 j k=1 j P > =P ≤ py + P > , Ny ≥ npy Ny n Ny P N n y T ˜ X 2 k=1 ej (Xk − µy ) > Ny = j ≤ 4 + P p¯n Ny j=dnpy e
≤
n X 2 j2 + exp − p¯4n 2(v 2 + c0 ) j=dnpy e
2 2 npy 2 . 1 − exp(− ) , + exp − p¯4n 2(v 2 + c0 ) 2(v 2 + c0 ) where the first inequality follows by (A.5) and the second inequality follows by an extended q p Bernstein inequality (see, e.g., Lemma 1 in Qian and Yang, 2016). Taking = c2 log with np∗ p p∗ = min1≤y≤h py and c22 > 12v 2 , for large enough n such that c2 log p¯n /(np∗ ) ≤ v 2 /(2c0 ), we ≤
have from above and (A4) that r PNy eT (X ˜ k − µy ) h log p¯n 4 −1/2 k=1 j P > c2 cb ≤ 4. Ny n p¯n Consequently, by union bound, for large enough n, with probability greater than 1 − 4h/¯ p3n , r ¯ yj − µyj | ≤ c2 c−1/2 h log p¯n , max |X (A.6) b 1≤y≤h, 1≤j≤p n ¯ y and µy , respectively. Similarly, we can obtain that ¯ yj and µyj are the jth elements of X where X r ¯ j | ≥ c2 log p¯n ≤ 2 , P |X n p¯4n and with probability larger than 1 − 2/¯ p3n , r ¯ j | ≤ c2 log p¯n , max |X (A.7) 1≤j≤p n ¯ Let Z¯yj be the jth element of Z ¯ y and note that |µyj | ≤ c3 where Xj is the jth element of X. for some constant c3 . Then by (A.5), (A.6) and (A.7), we have that for large enough n, with probability greater than 1 − c5 /¯ p3n for some c5 > 0, for every 1 ≤ j ≤ p and 1 ≤ y ≤ h, p p √ ¯ yj − X ¯ j ) pˆy − µyj √py | |Z¯yj pˆy − µyj py | = |(X p p p ¯ yj − µyj ) pˆy | + |µyj ( pˆy − √py )| + |X ¯ j pˆy | ≤ |(X r r h log p¯n c1 c3 log p¯n −1/2 ≤ 2c2 cb + 1/2 , n n p∗
(A.8)
which implies that r max kej (Υn − ΣΓ0 V0 )k2 ≤ c4 h
1≤j≤p
3
log p¯n n
(A.9)
and r |A1 | ≤ c4 h −1/2
where c4 = (2c2 + c1 c3 )cb
p
log p X ˆ kδ j k2 , n j=1
(A.10)
and δˆj = Vˆ T γ ˆ j − V0T γ 0j .
To provide the upper bound for A2 , we first note the known result (see, e.g., Bickel and Levina, 2008a) that under (A1), there exist constants c7 , c8 > 0 such that for every 1 ≤ i, j ≤ p, 8n2 P (|ˆ σij − σij | > ) ≤ c7 exp − , (A.11) c8 q √ where σ ˆij and σij are (Σn )i,j and (Σ)i,j , respectively. Take = c6 lognp¯n with c6 > c8 and define σ ˆm = max1≤i,j≤p |ˆ σij − σij |. Using the union bound, with probability greater than 1 − c7 /p6 , r log p¯n σ ˆm < c6 . (A.12) n ˜ 1 R1 and In addition, Γ0 V0 = Σ−1 M # " l X ˜ 1 R1 R1T M ˜ T = Σ−1/2 M py E(Z|Y = y)E(ZT |Y = y) Σ−1/2 , 1
y=1
where Z = Σ−1/2 (X − µ). Then by (A3), kΓ0 k2F ≤ tr(Σ−1 cov(E(Z|Y ))) ≤ d/σ∗ . As a result, denoting the jth column of Σn − Σ by ξ j , we have √ X σ ˆm dq √ T . kΓ0 ξ j k2 = k (ˆ σjk − σjk )γ 0k k2 ≤ σ ˆm qkΓ0 kF ≤ 1/2 σ ∗ k∈B0 The display above together with (A.12) implies that r p c6 (dq)1/2 log p¯n X ˆ A2 ≤ kδ j k2 . 1/2 n j=1 σ∗
(A.13)
(A.14)
Next, we assume (A.8) and (A.12) hold. Then by (A.4), (A.10) and (A.14), p X 1 A 3 + λn vj kδˆj k2 2 j=1 r p p p X X X log p¯n X ˆ ˆ kδ j k2 ≤λn vj kγ 0j k2 − λn vj kˆ γ j k 2 + λn vj kδ j k2 + c˜n n j=1 j=1 j=1 j∈B0 r p X log p¯n X ˆ ˆ ≤2λn vj kδ j k2 + c˜n kδ j k2 (A.15) n j=1 j∈B0 √ −1/2 with c˜n := c˜(h + dq) and c˜ := c4 + c6 σ∗ , where the last inequality holds because |kγ 0j k2 − kˆ γ j k2 | ≤ kδˆj k2 for every j ∈ B0 and kˆ γ j k = kδˆj k2 for every j ∈ B0c . The display above implies that X 1 A3 + λn vj − c˜n 2 c j∈B 0
r
X log p¯n ˆ kδ j k2 ≤ λn vj + c˜n n j∈B 0
4
r
log p¯n ˆ kδ j k2 , n
and consequently, if r λn vj ≤ 2˜ cn
log p¯n , ∀j ∈ B0 and λn vj ≥ 2˜ cn n
r
log p¯n , ∀j ∈ B0c , n
(A.16)
then X 1X 1 1X λn vj kδˆj k2 ≤ A3 + λn vj kδˆj k2 ≤ 3 c˜n 2 j∈Bc 2 2 j∈Bc j∈B 0
r
log p¯n ˆ kδ j k2 , n
0
0
which gives X
kδˆj k2 ≤ 3
j∈B0c
X
kδˆj k2 .
(A.17)
j∈B0
ˆ Vˆ − Γ0 V0 )T Σ(Γ ˆ Vˆ − Γ0 V0 )], we get Denoting B3 := tr[(Γ |A3 − B3 | p p X X T (ˆ σjk − σjk )δˆj δˆk = j=1 k=1
X X X X X X T T T ˆ ˆ ˆ ˆ ˆ ˆ (ˆ σjk − σjk )δ j δ k (ˆ σjk − σjk )δ j δ k + 2 ≤ (ˆ σjk − σjk )δ j δ k + j∈B0c k∈B0c
j∈B0 k∈B0
≤ˆ σm
X
kδˆj k2
2
j∈B0
r ≤16c6
j∈B0 k∈B0c
X 2 X 2 + 3 kδˆj k2 + 6 kδˆj k2 j∈B0
j∈B0
log p¯n X ˆ 2 kδ j k2 , n j∈B
(A.18)
0
where the last two inequalities follow by (A.17) and (A.12). Also, define Bˆ11 be the index subset in B0c that corresponds to the q largest kδˆj k2 ’s for j ∈ B0c . Define B˜0 = B0 ∪ Bˆ11 . Then, by (A.15) and (A.18), r
r
log p¯n X ˆ 2 kδ j k2 , n j∈B0 0 r 2q log p¯ X 1/2 log p¯n X ˆ 2 n ≤ 6˜ cn kδˆj k22 + 64c6 q kδ j k2 n n ˜ ˜
ˆ Vˆ − Γ0 V0 )T Σ(Γ ˆ Vˆ − Γ0 V0 )] ≤ 6˜ tr[(Γ cn
log p¯n X ˆ kδ j k2 + 32c6 n j∈B
j∈B0
j∈B0
which implies that X j∈B˜0
kδˆj k22
1/2
p 6˜ cn 2q log p¯n /n p ≤ P B3 /( j∈B˜0 kδˆj k22 ) − 64c6 q log p¯n /n p 6˜ cn 2q log p¯n /n p ≤ σ∗ − 64c6 q log p¯n /n r 12˜ cn q log p¯n ≤ σ∗ n
5
(A.19)
for large n. In addition, by definition of Bˆ11 , p−q X X 1 X ˆ 2 1 X ˆ 2 2 ˆ kδ j k2 ≤ kδ j k2 ≤ kδ j k2 , 2 k q c c c ˆ j∈B j∈B k=q+1 j∈B0 \B11
0
which implies with (A.17) that X j∈B0c \Bˆ11
kδˆj k22 ≤
0
X 9 X ˆ 2 kδ j k2 ≤ 9 kδˆj k22 . q j∈B ˜ j∈B0
0
Then the display above together with (A.19) implies that r q log p¯n −1 ˆ Vˆ − Γ0 V0 kF ≤ 48˜ . (A.20) kΓ cn σ ∗ n Therefore, by (A7), Wedin’s theorem and probability bounds for (A.8) and (A.12), we have that (A.2) holds if we can show (A.16) is satisfied. ˜ n , similar To show (A.16), define β0 = minj∈B0 (eTj M M T ej )1/2 . Note that by the choice of λ ˜ Then by (A5) arguments for (A.17) and (A.19) can be applied to get corresponding results for Γ. and (A6), with large enough n, for every j ∈ B0 , r k˜ γ j k2 ≥ β0 − 12˜ cn σ ∗
β0 q log p¯n ≥ , n 2
(A.21)
and for every j ∈ B0c , r
q log p¯n . n From above two displays, for every j ∈ B0 , by the choice of λn and (A6), r ρ λn λn log p¯n λn λn vj = ≤ 1/2 = 2˜ cn , ρ ≤ ρ −φ/2 k˜ γ j k2 (β0 /2) n cφ n /2 k˜ γ j k2 ≤ 36˜ cn σ ∗
(A.22)
and for every j ∈ B0c , λn λ pn λn vj = ≥ cm nρ(ξ−φ/2) c˜n ρ ≥ ρ k˜ γ j k2 (36˜ cn σ∗ q log p¯n /n)
r
log p¯n ≥ 2˜ cn n
r
log p¯n , n
(A.23)
where cm is some positive constant. Therefore, (A.16) is satisfied and (A.2) holds. To show that (A.3) holds, note that under (A.8) and (A.12), we have kˆ γ j k2 ≥ β0 /2 by a similar argument of (A.21). Therefore, P (B0 ⊆ Aˆ0 ) → 1 as n → ∞. Also, given any j ∈ B c , if kˆ γ j k2 > 0, 0
by Karush-Kuhn-Tucker (KKT) conditions, ˆ Vˆ )Vˆ T k2 = λvj . keTj (Υn − Σn Γ
6
(A.24)
Note that ˆ Vˆ Υn − Σn Γ ˆ Vˆ − Γ0 V0 )] − [Σ(Γ ˆ Vˆ − Γ0 V0 )] + [(Σn − Σ)Γ0 V0 ] = [Υn − ΣΓ0 V0 ] − [(Σn − Σ)(Γ =: A11 − A12 − A13 + A14 .
(A.25) q
Assume (A.8) and (A.12) hold. Then by (A.9), we have keTj A11 k2 ≤ c4 lognp¯n ; by (A.20), we have q P ˆ Vˆ − Γ0 V0 kF ≤ c¯1 σ keTj A12 k2 ≤ σ ˆm 1≤j≤p kδˆj k2 ≤ c¯1 c6 c˜n q log p/n and keTj A13 k2 ≤ σ ˜ kΓ ˜ c˜n q lognp¯n , √ q where c¯1 = 48σ∗ ; by (A.13), we have keTj A14 k2 ≤ c6 1/2dq lognp¯n . These inequalities together with σ∗
(A.25) imply that there is a positive constant c˜6 such that r keTj (Υn
ˆT
ˆ Vˆ )V k2 ≤ c˜6 cq q − Σn Γ
log p . n q
But by (A5), (A.23) and the choice of ρ, we have q = o(n ) and λvj > c˜6 c˜n q lognp¯n , which contradicts (A.24). Therefore, P (B0 ⊇ Aˆ0 ) → 1 as n → ∞, and consequently, (A.3) holds. ˜ and v = 1p , we can show (7) using Lastly, noting that (A.16) is satisfied with the choice of λ ρ(ξ−φ/2)
same arguments for (A.20). We complete the proof of Theorem 6. ˜ n. Next, we provide the results with the thresholded covariance estimation Σn = Σ p ˜n = Theorem 7. Suppose (A1), (A3), (A8) and (A9) are satisfied. If τ = 2c6 log p¯n /n and λ p p 2˜ c∗ h+ (dq) ∧ ln ) log p¯n /n, where ln = (n/ log p¯n )κ , c6 is given after (A.11) and c˜∗ is in (A.33), ˜ V˜ ) satisfies then the minimizer (Γ, p p ˜ ˜ kΓV − M kF = Op h + (dq) ∧ ln q log p¯n /n . p p ρ/2 If we further assume (A6) and (A7) and let λn = 21−ρ c¯∗ cφ h + (dq) ∧ ln ) log p¯n /n1+ρφ and ˆ Vˆ ) satisfies 2ρ(ξ − φ/2) ≥ 1 − 2ξ, then the minimizer (Γ, p p kPSΓˆ − PSM kF = Op h + (dq) ∧ ln q log p¯n /n , (A.26) P (Aˆ0 = B0 ) → 1 as n → ∞. Proof of Theorem 7. The proof can be modified from that of Theorem 6. First, under the event ˜ − Σk2 and an improved upper bound for A2 in (A.14). (A.12), we show an upper bound for kΣ P Assume (A.12) holds. Define s to be the upper bound of max1≤i≤p pj=1 |σij |κ . Then, similar to
7
the proof of Theorem 1 in Bickel and Levina (2008a), we have that for every 1 ≤ i ≤ p, p X |ˆ σij − σij |I(|ˆ σij | ≥ τ, |σij | ≥ τ ) ≤ σ ˆm sτ −κ ,
(A.27)
j=1 p
p p X X X |ˆ σij |I(|ˆ σij | < τ, |σij | ≥ τ ) |ˆ σij − σij |I(|σij | ≥ τ ) + |σij |I(|ˆ σij | < τ, |σij | ≥ τ ) ≤ j=1
j=1
j=1
≤σ ˆm sτ −κ + sτ 1−κ ,
(A.28)
p
p
X X |ˆ σij |I(|ˆ σij | ≥ τ, |σij | < τ ) ≤ B0i + |σij |I(|σij | < τ ) ≤ B0 + sτ 1−κ , j=1
j=1
where B0i :=
p X
|ˆ σij − σij |I(|ˆ σij | ≥ τ, |σij | < τ )
j=1 p
≤
p
X 1 1 |ˆ σij − σij |I(|σij | > τ ) + |ˆ σij − σij |I(|ˆ σij | ≥ τ, |σij | < τ ) 2 2 j=1 j=1
X
κ
≤σ ˆm (2σij /τ ) + σ ˆm
p X
1 I(|ˆ σij − σij | > τ ) 2 j=1
≤ 2κ sˆ σm τ −κ ,
(A.29)
and (A.29) follows by (A.12) and the choice of τ . Therefore, p X |ˆ σij |I(|ˆ σij | ≥ τ, |σij | < τ ) ≤ (2κ σ ˆm + τ )sτ −κ .
(A.30)
j=1
Then, note that following the argument similar to Theorem 1 of Rothman et al. (2009), we have kSτ (Σ) − Σk2 ≤ kSτ (Σ) − Σk1 ≤ sτ 1−κ , Also, note that ˜ n − Sτ (Σ)k2 ≤ kΣ ˜ n − Sτ (Σ)k1 kΣ X X ≤ˆ σm + max |ˆ σij |I(ˆ σij ≥ τ, |σij | < τ ) + max |σij |I(ˆ σij < τ, |σij | ≥ τ ) 1≤i≤p
+ max
1≤i≤p
1≤i≤p
j6=i
X
j6=i
(|ˆ σij − σij | + |sτ (ˆ σij ) − σ ˆij | + |σij − Sτ (σij )|)I(ˆ σij ≥ τ, |σij | ≥ τ )
i6=j
≤ˆ σm + (2κ σ ˆm + τ )sτ −κ + (ˆ σm + τ )sτ −κ + σ ˆm sτ −κ + 2sτ 1−κ , where the last inequality follows by (A.27)-(A.30) and |sτ (z) − z| ≤ τ . Therefore, log p¯ 1−κ 2 n ˜ ˜ kΣn − Σk2 ≤ kSτ (Σ) − Σk2 + kΣn − Sτ (Σ)k2 ≤ 13c6 s . (A.31) n ˜ n − Σ)k2 ≤ 13c1/26 s ( log p¯n ) 1−κ 2 . Then, using (A.31) and the derivations similar This implies that kΓT0 (Σ n σ∗
8
for (A.14), we have r A2 ≤ (13s +
1)c6 σ∗−1/2 ((dq)
1/2
∧ ln )
p log p¯n X ˆ kδ j k2 . n j=1
(A.32)
˜ n is positive definite since The upper bound for A1 is given in (A.10). Also, for large enough n, Σ for every v ∈ Sp−1 , ˜ n v ≥ vT Σv − |vT (Σ ˜ n − Σ)v| ≥ σ∗ − kΣ ˜ − Σk2 > 0. vT Σ Define c˜∗n = [(13s + 1)c6 σ∗−1/2 + c4 ](h +
p p (dq) ∧ ln ) =: c˜∗ (h + (dq) ∧ ln ).
(A.33)
Then we note that the rest of proof for Theorem 7 can be completed the same way as that of Theorem 6 by replacing c˜ with c˜∗ and replacing c˜n with c˜∗n . Thus we omit the straightforward details. A.4.
Proof of Theorems 3 and 4
¯ be the sample means of Given the sample (Xi , Yi ), i = 1, · · · , n, recall that fi = f (Yi ). Let ¯f and ε ¯, respectively. fi ’s and εi ’s, respectively. Let εij and ε¯j be the jth (1 ≤ j ≤ p) elements of εi and ε ˜ elements of fi and ¯f , respectively. Denote U = E(F T F ). Let fik and f¯k be the kth (1 ≤ k ≤ h) Define E0 and E to be the n×p design matrix of εi and εi −¯ ε, respectively. The proof of Theorems 3 and 4 relies on the following three useful lemmas. Lemma 1. Let n be positive numbers with n → 0 and nn → ∞. Suppose (B1) holds. Then for all large n, there exist constants β1 , β2 > 0 such that n X ˜ 1 exp(−nβ2 2 ). P max | (εij − ε¯j )(fik − f¯k )| ≥ nn ≤ hpβ n ˜ 1≤j≤p, 1≤k≤h
(A.34)
i=1
Proof of Lemma 1. By (B1) and using arguments similar to that of Lemma 4 in Supplement A.5, ˜ and r ≥ 2, E(|εij fik |r ) ≤ cK r!K r−2 for we have that for any given integers 1 ≤ j ≤ p, 1 ≤ k ≤ h some positive constants cK and K. Then, by the extended Hoeffding inequality, for large n, n n2 X n2n ≤ 2 exp − n . P | εij fik | ≥ nn ≤ 2 exp − 2(cK + Kn ) 4cK i=1 Also, we can similarly obtain that for some positive constant c¯K , √ √ P (|¯ εj f¯k | ≥ n ) ≤ P (|¯ εj | ≥ n ) + P (|f¯k | ≥ n ) ≤ 2 exp(−nn /4¯ cK ). P P Since ni=1 (εij − ε¯j )(fik − f¯k ) = ni=1 ij fik − n¯ εj f¯k , the two displays above implies (A.34) by union bound. 9
Lemma 2. Suppose conditions in Lemma 1 and (B2) are satisfied. Then for all large n, there exists constants β3 , β4 > 0 such that ˜ n ≤h ˜ 2 β3 exp(−nβ4 2 ), P kUˆ −1 − U −1 k2 ≥ h n
(A.35)
P where Uˆ = n−1 ni=1 (fi − ¯f )(fi − ¯f )T . Proof of Lemma 2. It is known that (see, e.g., Theorem 1 of Bickel and Levina, 2008a) n X −1 ˜ 2 β21 exp(−nβ22 2 ) =: pn1 fij fik | > n ≤ h P max |n n ˜ 1≤i,j≤h
i=1
˜ 23 exp(−nβ24 n ) =: pn2 . P ( max |f¯j |2 ≥ n ) ≤ hβ ˜ 1≤j≤h
Therefore, P (kUˆ − U k2 ≥ 2hn ) ≤ pn1 + pn2 .
(A.36)
Also, using that Uˆ −1 − U −1 = Uˆ −1 (U − Uˆ )U −1 and Uˆ = U (I + U −1 (Uˆ − U )), we can show by (A.36) and (B2) that for large n, with probability greater than 1 − 2pn1 , kU −1 k22 kU − Uˆ k2 u2∗ kU − Uˆ k2 kUˆ −1 − U −1 k2 ≤ ≤ ≤ 4hu2∗ n . −1 ˆ ˆ 1 − kU k2 kU − U k2 1 − u∗ kU − U k2 2 Then replacing 4u∗ n above with n , we obtain (A.35). Lemma 3. For all large n, there exist positive constants cψ and β5 such that r log p¯n β5 ˆ P max |ψij − ψij | ≥ cψ ≤ 6. 1≤i,j≤p n p¯n
(A.37)
Proof of Lemma 3. Note that (Ip − PF )X = (Ip − PF )E = (Ip − PF − P1n )E0 . This implies that ˆ n = XT (Ip − PF )X/n = E0T (Ip − PF − P1n )E0 /n. Therefore, by Proposition G.1 of Cook et al. Φ (2012), there exist positive constants β31 and β32 such that for > 0, P max |ψˆij − ψij | ≥ ≤ p2 β31 exp(−nβ32 ( ∧ 2 )). 1≤i,j≤p p Then (A.37) is obtained by taking = cψ log p¯n /n with constant cψ > 8/β32 . Next, we present two theorems (Theorems 8 and 9) without assuming coverage. Then under the coverage condition and Proposition 2, Theorems 8 and 9 immediately imply Theorems 3 and 4, respectively. ˆ n. The following theorem for PFC uses sample residual covariance Σn = Φ p ˜ n = 2˜ ˜ + (ωdq)1/2 ) log p¯n /n with Theorem 8. Suppose (B1)–(B3) and (A5) are satisfied. If λ cf (h ˜ V˜ ) satisfies c˜f given in (A.43), then the minimizer (Γ, p ˜ 1/2 + (dq)1/2 ) q log p¯n /n . ˜ V˜ − M kF = Op (h/ω ω −1/2 kΓ 10
p ρ/2 ˜ If we further assume (B4) and let λn = 21−ρ c˜f cφ (h + (ωdq)1/2 ) log p¯n /n1+ρφ and 2ρ(ξ − φ/2) ≥ ˆ Vˆ ) satisfies 1 − 2ξ, then the minimizer (Γ, p ˜ 1/2 + (dq)1/2 ) q log p¯n /n , kPS − PS kF = Op (h/ω (A.38) ˆ Γ
M
P (Aˆ0 = B0 ) → 1 as n → ∞. ¯ = ΨB(fi − ¯f ) + (εi − ε ¯), which can be Proof of Theorem 8. First, note from (7) that Xi − X written as the matrix form X = F B T ΨT + E. Therefore, Υn − ΦΓ0 V0 = XT F (F T F )−1 − ΨB = E T F (F T F )−1 . Then, to provide upper bound for keTj (Υn − ΦΓ0 V0 )k2 , define η j = Eej (1 ≤ j ≤ p). Also note that (F T F )−1 F η j = [(F T F )−1 − nU −1 ](F T η j ) + nU −1 (F T η j ). Therefore, using Lemma 1 and p Lemma 2 and taking n = cf log p¯n /n, where cf is a positive constant satisfying c2f (β2 ∧ β4 ) ≥ 4, ˜ 2 /¯ we obtain that for all large n, with probability greater than 1 − cβ h p3n (cβ = β1 ∨ β3 ), r T −1 ˜ log p¯n , max k(F F ) F η j k2 ≤ cf h 1≤j≤p n which implies that A1 in (A.4) satisfies r p log p¯n X ˆ ˜ |A1 | ≤ cf h kδ j k2 . (A.39) n j=1 Next, note that by Lemma 3, with probability greater than 1 − β5 /¯ p6n , r log p¯n ψˆm =: max |ψij − ψˆij | ≤ cψ . 1≤i,j≤p n Also, since Φ−1 ΨB = Γ0 V0 , for every 1 ≤ j ≤ p and large n, X √ ˆ n − Φ)ej k2 = k kV0T ΓT0 (Φ (ψkj − ψˆkj )(B T ΨT Φ−1 ek )k2 ≤ ψˆm qkB T ΨT Φ−1 kF
(A.40)
k∈B0
≤ cb ψˆm
p p qtr(ΨT Φ−2 Ψ) ≤ 2cb ψˆm qωtr(Gd ),
where cb = kBk2 . Then, when (A.40) holds, we have r p log p¯n X ˆ 1/2 A2 ≤ 2cψ c˜g cb (ωdq) kδ j k2 , n j=1
(A.41)
(A.42)
where c˜g is the uniform upper bound for all elements in Gd . Let c˜f := cf + 2cψ c˜g cb
(A.43)
˜ + (ωdq)1/2 . After obtaining upper bounds for A1 and A2 in (A.39) and (A.42), and c˜f,n := h respectively, the rest of the proof follows straightforwardly from the similar arguments as that of Theorem 2 by replacing c˜ with c˜f and replacing c˜n with c˜f,n . Thus we omit the rest of details.
11
˜ n. The next theorem for PFC uses the thresholded error covariance estimation Σn = Φ p ˜+ ˜ n = 2˜ Theorem 9. Suppose (B1)–(B7) and (A9) are satisfied. If τ = 2cψ log p¯n /n and λ c∗f (h p (ωd(q ∧ ln ))1/2 ) log p¯n /n, where ln = (n/ log p)κ , cψ is given in (A.37) and c˜∗f is in (A.48), then ˜ V˜ ) satisfies the minimizer (Γ, p ˜ 1/2 + (d(q ∧ ln ))1/2 ) q log p¯n /n . ˜ V˜ − M kF = Op (h/ω ω −1/2 kΓ p ρ/2 ˜ + (ωd(q ∧ ln ))1/2 ) (q ∧ ln )ω log p¯n /n1+ρφ and If we further assume (B4) and let λn = 21−ρ c˜∗f cφ (h ˆ Vˆ ) satisfies 2ρ(ξ − φ/2) ≥ 1 − 2ξ, then the minimizer (Γ, p ˜ 1/2 + (d(q ∧ ln ))1/2 ) q log p¯n /n , kPSΓˆ − PSM kF = Op (h/ω (A.44) P (Aˆ0 = B0 ) → 1 as n → ∞. Proof of Theorem 9. Following the arguments in the proof of Theorem 3, we intend to provide an upper bound for A2 under the event that (A.40) holds. By (4), we can use the same proof technique for that of (A.31) to show that 1−κ ˜ n − Φk2 ≤ 13cψ s log p¯n 2 , kΦ (A.45) n P where s is the upper bound of max1≤i≤p pj=1 |ψij |κ . Then, by (A.42) and (A.45), we obtain that r p n log p¯n o X ˆ −1 1/2 ˜ A2 ≤ min kΦ ΨBk2 kΦn − Φk2 , 2cψ c˜g cb (ωdq) kδ j k2 n j=1 r p n log p¯ 1−κ log p¯n o X ˆ 2 n 1/2 1/2 1/2 ≤ min 52cb kGd k2 ω scψ , 2cψ c˜g cb (ωdq) kδ j k2 n n j=1 r p log p¯n X ˆ 1/2 ≤ (52cψ s˜ c1/2 kδ j k2 (A.46) g + 2cψ cg1 )cb (ωd(ln ∧ q)) n j=1 (A.47) Define c˜∗f := cf + (52cψ s˜ cg1/2 + 2cψ cg1 )cb
(A.48)
˜ + (ωd(q ∧ ln ))1/2 ). The remaining proof can be completed by following similar and c˜∗f,n := c˜∗f (h procedures as the proof of Theorem 8 by replacing c˜f with c˜∗f and replacing c˜f,n with c˜∗f,n . Thus the details are omitted. A.5.
Proof of Theorem 5
The proof of Theorem 5 relies on the following lemma.
12
Lemma 4. Suppose (C1) and (A4) hold. Then with probability greater than 1 − 12h/¯ p2n , r log p¯n max |ˆ σyij − σyij | ≤ c7 h , (A.49) 1≤y≤h n 1≤i,j≤p
for some positive constant c7 and large enough n. Proof of Lemma 4. Without loss of generality, assume µ = 0. Consider predictor X = (X1 , · · · , Xp )T and response Y . Given a slice y (1 ≤ y ≤ h), let µy = E(X|Y = y) = (µy1 , · · · , µyp )T . Note that by (C1), there is a constant cg > 0 such that |σyij | ≤ cg for all 1 ≤ y ≤ h and 1 ≤ i, j ≤ p. Then, for every 1 ≤ i, j ≤ p and > 2cg , P (|(Xi − µyi )(Xj − µyj ) − σyij | > | Y ∈ Iy ) ≤P (|(Xi − µyi )(Xj − µyj )| ≥ | Y ∈ Iy ) 2 p p ≤P (|Xi − µyi | ≥ /2 | Y ∈ Iy ) + P (|Xj − µyj | ≥ /2 | Y ∈ Iy ) ≤2C1 exp(−/(8C2 )), which implies that with C3 = (2C1 ) ∨ exp(cg /(4C2 )), P (|(Xi − µyi )(Xj − µyj ) − σyij | > | Y ∈ Iy ) ≤ C3 exp(−/(8C2 )).
(A.50)
Let Tij = (Xi − µyi )(Xj − µyj ). Then by (A.50), for every 1 ≤ i, j ≤ p and k ≥ 1, E(|Tij − σyij |k | Y ∈ Iy ) Z ∞ = ktk−1 P (|Tij − σyij | > t | Y ∈ Iy ) dt Z0 ∞ ≤ ktk−1 C3 exp(−t/(8C2 )) dt 0
k!˜ v 2 c˜0k−2 , 2 where c˜0 = 8C2 and v˜2 = 128C22 C3 . ≤
(A.51)
P Then, given a slice y (1 ≤ y ≤ h), define the sample size in slice y to be Ny = ni=1 I(Yi ∈ Iy ). ˜ 1, X ˜ 2, · · · , X ˜ Ny be the response corresponding to the slice y, X ˜ ki (1 ≤ k ≤ Ny , 1 ≤ i ≤ p) be Let X ˜ k, X ¯ y be the sample mean of slice y, and X ¯ y . Then, ¯ yi be the ith element of X the ith element of X Ny X ˜ ki − µyi )(X ˜ kj − µyj ) (X − σyij > P Ny k=1 Ny X ˜ ki − µyi )(X ˜ kj − µyj ) npy (X npy ≤P Ny ≥ +P − σyij > , Ny ≥ 2 Ny 2 k=1 n X 2 j2 ≤ 4 + exp − 2 , p¯n 4(˜ v + c˜0 ) j=dnpy /2e
13
(A.52)
where the last inequality follows by (A.5) and the extended Bernstein inequality. Then taking q q p¯n log p¯n v2 with C > 6˜ v , since C ≤ 2˜ for large enough n, we have by (A.52) that = C4 log 4 4 np∗ np∗ c0 with probability greater than 1 − 4/¯ p4n , s r Ny X ˜ ki − µyi )(X ˜ kj − µyj ) (X log p h log p¯n −1/2 − σyij ≤ C4 ≤ C 4 cb . Ny np∗ n k=1
(A.53)
Also, note that there is a constant µ∗ such that µyj ≤ µ∗ for every 1 ≤ y ≤ h and 1 ≤ j ≤ p. By (A.6), with probability greater than 1 − 8/¯ p4n , ¯ yi X ¯ yj | ≤ |(µyi − X ¯ yi )µyj | + |(µyi − X ¯ yi )(µyj − X ¯ yj )| + |µyi (µyj − X ¯ yj )| |µyi µyj − X r h log p¯n −1/2 ∗ ≤ 3µ c2 cb . (A.54) n Therefore, noting that σ ˆyij − σyij
Ny X ˜ ki − µyi )(X ˜ kj − µyj ) (X ¯ yi X ¯ yj , = − σyij + µyi µyj − X N y k=1
by (A.53) and (A.54), with probability greater than 1 − 12h/¯ p2n , we have (A.49) where c7 =: −1/2
C 4 cb
−1/2
+ 3µ∗ c2 cb
. This completes the proof of Lemma 4.
Following discussion in Section 9, we first present the following Theorem 10 without assuming coverage. Then with the coverage condition and Proposition 2, Theorem 10 immediately implies Theorem 5. 1−κ
Theorem 10. Let ¯ln = max(h(log p¯n /n)κ/2 , h 2 ). Suppose (C1)-(C3), (A3) and (A4) are satisp 1−κ ˜ n = 2¯ fied. If τ = 2c7 log p¯n /n and λ c ¯ln (log p¯n /n) 2 with c7 given in Lemma 4 and c¯ given in ˜ V˜ ) satisfies (A.61), then the minimizer (Γ, 1−κ ˜ V˜ − M kF = Op q 1/2 log p¯n 2 . kΓ n p ρ/2 If we further assume (A6) and (A7) and let λn = 21−ρ c¯cφ (log p¯n )1−κ /n1−κ+ρφ and 2ρ(ξ−φ/2) ≥ ˆ Vˆ ) that satisfies 1 − 2ξ, then there is a minimizer (Γ, log p¯n 1−κ 2 kPSΓˆ − PSM kF = Op ¯ln q 1/2 , (A.55) n P (Aˆ0 = B0 ) → 1 as n → ∞. (A.56) Proof of Theorem 10. First, note that the inequality (A.4) still holds. Under (A.12), we still have (A.31). With the generalized thresholding scheme, using Lemma 4 and arguments similar to that of (A.31), we have that under (A.49), for every y kΣy,n − Σy k2 ≤ 13c7 s 14
h log p¯ 1−κ 2 n
n
.
(A.57)
By (A.5), with probability greater than 1 − 2h/¯ p4n , for every 1 ≤ y ≤ h, r pˆy log p¯n . (A.58) 1 − ≤ c1 c−1 b h py n Next, we assume that (A.12), (A.49) and (A.58) hold. Then, by (A.31), (A.57) and (A.58), 1−κ p ˜ n − Σy,n ) pˆy − (Σ − Σy )√py k2 ≤ 13(c6 + c7 )s h log p¯n 2 , k(Σ n which implies in together with (A.58) that there exists some constant C6 > 0 r log p¯ log p 1−κ h log p¯n 1−κ 2 n T 2 ¯ . max kej (Υn − ΣΓ0 V0 )k2 ≤ C6 max h , =: C6 ln 1≤j≤p n n n As a result, p log p 1−κ X 2 ¯ |A1 | ≤ C6 ln kδˆj k2 . (A.59) n j=1 Also, noting that there exist positive constants σ ∗ and C2∗ that are upper bounds of kΣk2 and kΣy k2 , respectively, by (A.31) and kΓ0 V0 k2 ≤ (σ ∗ + C2∗ )2 /σ∗ , p p 1−κ log p¯ 1−κ X X 2 2 n −1 ∗ 2 log p ¯ ˆ |A2 | ≤ 13c6 sσ∗ (σ + C2 ) kδ j k2 =: C6 kδˆj k2 . n n j=1 j=1 Then by (A.4), (A.59), (A.60) and the arguments of (A.15), we have p p log p¯ 1−κ X X X 1 2 n ¯ ˆ kδˆj k2 , A3 + λn vj kˆ γ j k2 ≤ 2λn vj kδ j k2 + c¯ ln 2 n j=1 j=1 j∈B
(A.60)
(A.61)
0
where c¯ = C6 ∨ C¯6 . From above, log p¯ 1−κ log p¯ 1−κ X X 1 2 2 n n A3 + λvj − c¯ ¯ln kδˆj k2 ≤ kδˆj k2 . λvj + c¯ ¯ln 2 n n c j∈B j∈B
(A.62)
0
0
Consequently, if log p 1−κ log p 1−κ 2 2 ¯ ¯ λvj ≤ 2¯ c ln , ∀j ∈ B0 and λvj ≥ 2¯ c ln , ∀j ∈ B0c , n n
(A.63)
by (A.62), X
kδˆj k2 ≤ 3
j∈B0c
X
kδˆj k2 .
(A.64)
j∈B0
In addition, with arguments similar to (A.18), we can show that r log p¯n X ˆ 2 |A3 − B3 | ≤ 48c6 kδ j k2 . n j∈B 0
15
(A.65)
Define Bˆ11 and B˜0 as given after (A.18). Then, by (A.62) and (A.65), r log p¯ 1−κ X log p¯n X ˆ 2 2 n T ¯ ˆ ˆ ˆ ˆ ˆ kδ j k2 + 96c6 kδ j k2 tr[(ΓV − Γ0 V0 ) Σ(ΓV − Γ0 V0 )] ≤ 6¯ c ln n n j∈B0 j∈B0 r X 1/2 log p¯ 1−κ log p¯n X ˆ 2 2 n kδˆj k22 + 96c6 q ≤ 6¯ c ¯ln q 1/2 kδ j k2 n n ˜ ˜ j∈B0
j∈B0
which implies that with large enough n, 1−κ X 1/2 1−κ 6¯ c ¯ln (log p¯n /n) 2 2 ˆ p kδ j k2 ≤ ≤ 12¯ cσ∗−1 ¯ln q 1/2 (log p¯n /n) 2 . P 2 B3 /( j∈B˜0 kδˆj k2 ) − 96c6 q log p¯n /n j∈B˜ 0
In together with (A.64), ˆ Vˆ − Γ0 V0 kF ≤ c¯∗ ¯ln q 1/2 kΓ
log p 1−κ 2
, (A.66) n where c¯∗ = 48¯ c/σ∗ . Therefore, by (A.66), (A7), Wedin’s theorem and probability bounds of (A.12), (A.49) and (A.58), we have that (A.55) holds if we can show (A.63) is satisfied. To show (A.63), define β0 = minj∈B0 (eTj M3 M3T ej )1/2 . Note that with large enough n, for every j ∈ B0 , 1−κ 1−κ γ j k2 ≤ 43 c¯∗ ¯ln q 1/2 ( lognp¯n ) 2 . As a k˜ γ j k2 ≥ β0 − 14 c¯∗ ¯ln q 1/2 ( lognp¯n ) 2 ≥ β20 and for every j ∈ B0c , k˜ result, for every j ∈ B0 , by the choice of λ and (A6), we can verify that (A.63) indeed holds using similar arguments of (A.22) and (A.23). This proves (A.55). To show variable selection results of (A.56), we can follow the same arguments as that of Theorem 6 by noting that under (A.12), (A.49) and (A.58), the four components in (A.25) satisfy P 1−κ ˆm 1≤j≤p kδˆj k2 ≤ 2c6 c¯∗ ¯ln q( lognp¯n )1−κ/2 , keTj A13 k2 ≤ keTj A11 k2 ≤ C6 ¯ln ( lognp¯n ) 2 , keTj A12 k2 ≤ σ 1−κ 1−κ c¯∗ σ ∗ ¯ln q( log p¯n ) 2 and keT A14 k2 ≤ (σ ∗ +C ∗ )2 σ −1 c6 s( log p¯n ) 2 , which implies that there is a positive 2
j
n
constant C7 such that
keTj (Υn
∗
n
ˆ Vˆ )Vˆ T k2 ≤ C7 ¯ln q( log p¯n ) 1−κ 2 . − Σn Γ The proof of Theorem 5 is n
complete.
B.
Supplement: Additional Numerical Results
Following the discussion on simulation results in Section 7, we provide the results corresponding to the non-sparse predictor covariance Σ = p−1 AT A of Case 4 in Table 7 and the non-sparse error covariance Φ = p−1 AT A of Case 5 in Table 8. For Case 4 shown in Table 7, since our proposal for TC-SAVE requires Σ to satisfy the sparsity condition (A8), it is not surprising that the central subspace estimation does not perform so well, although the variable selection performance of TCSAVE appears acceptable. In Table 8, SC-PFC performs better than TC-PFC. These empirical results confirm our claim that for the first-moment SDR methods, we do not require sparsity 16
condition on predictor/error covariance (or its inverse). Table 7: Simulation results of Cases 4 based on 100 runs with Σ = p−1 AT A. Method
Oracle SC-SIR TC-SIR TC-SAVE
Averaged
Frequency (%)
spectral-norm loss
d=1
d=2
d=3
0.999 (0.001) 0.999 (0.001) 0.946 (0.013)
100 17
0 33
0 50
Cv
IC v
3 0.28 0.82 2.44
0 94.80 306.37 3.98
Table 8: Simulation results of Cases 5 based on 100 runs with Φ = p−1 AT A. Method
Averaged spectral-norm loss
Oracle SC-SIR TC-SIR SC-PFC TC-PFC
0.708 0.406 0.304 0.329
(0.023) (0.022) (0.016) (0.027)
Frequency (%) d=1
d=2
d=3
0 8 0 0 0
100 87 97 99 89
0 5 3 1 11
Cv
IC v
5 4.93 4.96 5.00 5.00
0 17.42 4.80 5.91 11.68
In addition, to further confirm that our first-moment SDR methods do not necessarily require sparsity conditions on predictor/error covariance (or its inverse), and is able to perform simultaneous variable selection, structural dimension selection as well as central subspace estimation, we repeat our simulation studies using SC-SIR for Cases 1-3 with non-sparse predictor covariance Σ = p−1 AT A. In particular, we replace Seq-SIR with another promising benchmark method known as DT-SIR (Lin et al., 2015). In particular, DT-SIR is based on a thresholding technique and is well-designed for sparse covariance matrix scenarios that satisfy both a sparsity condition of Bickel and Levina (2008b) and the sparsity condition (A8) with κ = 0. Following Lin et al. (2015), the true structural dimension is assumed to be known for the benchmark method. Results in Table 9 confirm that, compared to the benchmark, SC-SIR remains to perform well in estimation under the non-sparse covariance scenarios, in addition to satisfactory variable selection and structural dimension selection results.
17
Table 9: A comparative simulation study of Cases 1-3 based on 100 runs with Σ = p−1 AT A. Case
Method
Averaged
Frequency (%)
spectral-norm loss
d=1
d=2
d=3
Cv
IC v
1
Oracle SC-SIR DT-SIR
0.271 (0.016) 0.525 (0.017)
100 100 -
0 0 -
0 0 -
5 4.95 -
0 14.33 -
2
Oracle SC-SIR DT-SIR
0.501 (0.018) 0.593 (0.015)
0 0 -
100 98 -
0 2 -
8 7.92 -
0 23.16 -
3
Oracle SC-SIR DT-SIR
0.436 (0.015) 0.724 (0.014)
0 0 -
100 99 -
0 1 -
8 7.96 -
0 16.51 -
18