A Conditional Dependence Measure with Applications to Undirected Graphical Models



arXiv:1501.01617v2 [stat.ME] 8 Jan 2015

Jianqing Fan^1, Yang Feng^2, and Lucy Xia^1

^1 Department of Operations Research and Financial Engineering, Princeton University
^2 Department of Statistics, Columbia University

Abstract

Measuring conditional dependence is an important topic in statistics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure is proposed. The measure is derived by using distance covariance after adjusting the common observable factors or covariates. The corresponding conditional independence test is given with the asymptotic null distribution unveiled. The latter gives a somewhat surprising result: the estimating errors in factor loading matrices, while of root-n order, do not have material impact on the asymptotic null distribution of the test statistic, which is also in the root-n domain. It is also shown that the new test has strict control over the asymptotic significance level and can be calculated efficiently. A generic method for building dependency graphs using the new test is elaborated. Numerical results and real data analysis show the superiority of the new method.

Key Words: conditional dependency, dependency graph, distance covariance, graphical model, independence test.

This work was partially supported by National Science Foundation grants DMS-1206464, DMS-1308566, DMS-1406266 and National Institutes of Health grant R01-GM072611.


1 Introduction

Undirected graphs are important tools for capturing dependence among random variables and have drawn tremendous attention in various fields including signal processing, bioinformatics and network modeling (Wainwright and Jordan, 2008). Let z = (z^{(1)}, \cdots, z^{(p)}) be a p-dimensional random vector. We denote the undirected graph corresponding to z by (V, E), where the vertices V correspond to the components of z and the edges E = \{e_{ij}, 1 \le i, j \le p\} indicate whether nodes z^{(i)} and z^{(j)} are conditionally independent given the remaining nodes. In particular, the edge e_{ij} is absent if and only if z^{(i)} ⊥⊥ z^{(j)} | z \setminus \{z^{(i)}, z^{(j)}\}. When z follows a multivariate Gaussian distribution with mean µ and covariance matrix Σ, the precision matrix Ω = (ω_{ij})_{p×p} = Σ^{-1} captures exactly this relationship; that is, ω_{ij} = 0 if and only if e_{ij} is absent (Lauritzen, 1996; Edwards, 2000). Therefore, under the Gaussian assumption, this problem reduces to the estimation of the precision matrix, for which a rich literature on model selection and parameter estimation exists in both low-dimensional and high-dimensional settings, including Dempster (1972), Drton and Perlman (2004), Meinshausen and Bühlmann (2006), Friedman et al. (2008), Fan et al. (2009), and Cai et al. (2011).

While the Gaussian graphical model can be useful, the stringent requirement of normality is not always satisfied in real applications, where the observed data often have fat tails or are skewed (Xue et al., 2012). To relax the Gaussian assumption, Liu et al. (2009) proposed the nonparanormal model, in which one finds transformations that Gaussianize the data and then works within the Gaussian graphical model framework to estimate the network structure. Under the nonparanormal model, Xue et al. (2012) proposed rank-based estimators to approximate the precision matrix. The nonparanormal model, although flexible, still assumes that the transformed data follow a multivariate Gaussian distribution, which can also be restrictive. Instead of using these nonparametric methods to find transformations and then working within the Gaussian graphical model, we propose a more natural way of constructing graphs. That is, we work directly on the conditional dependence structure by introducing a measure of conditional dependence between nodes i and j given the remaining nodes or other factors (variables). Then, we can introduce a hypothesis testing procedure to decide whether the edge e_{ij} is present or not.

In economics, there is an abundant literature on conditional independence tests. Linton and Gozalo (1996) proposed two nonparametric tests of conditional independence based on a generalization of the empirical distribution function; however, a complicated bootstrap procedure is needed to calculate the critical values of the test, which limits its practical value for high-dimensional applications. Su and White (2007, 2008, 2014) proposed conditional independence tests based on the Hellinger distance, the conditional characteristic function, and empirical likelihood, respectively. However, all of those tests either involve tuning parameters or are computationally expensive.

Motivated by the above problems, we consider the following factor model setup. Suppose x_1, \cdots, x_n ∈ R^p and y_1, \cdots, y_n ∈ R^q are generated from the model

x_i = B_x f_i + \epsilon_{i,x}, \quad y_i = B_y f_i + \epsilon_{i,y}, \quad i = 1, \cdots, n,    (1)

where B_x and B_y are factor loading matrices of dimension p × K and q × K respectively, \{(\epsilon_{i,x}, \epsilon_{i,y})\}_{i=1}^n are i.i.d. idiosyncratic errors with the same distribution as (\epsilon_x, \epsilon_y), and \{f_i\}_{i=1}^n are i.i.d. observations of the K-dimensional vector of common factors f. Here, we assume independence between \{(\epsilon_{i,x}, \epsilon_{i,y})\}_{i=1}^n and \{f_i\}_{i=1}^n. Our goal is to test whether x and y are independent given f, i.e.,

H_0: x ⊥⊥ y | f  vs.  H_1: not H_0.    (2)

Under model (1), this testing problem is equivalent to testing whether \epsilon_x and \epsilon_y are independent, i.e.,

H_0: \epsilon_x ⊥⊥ \epsilon_y  vs.  H_1: not H_0.    (3)

In the case of building graphical models, x and y are nodes of a graph and f represents the rest of the nodes. To complete our proposal, we need a suitable measure of dependence between variables. In this regard, many different measures of dependence have been proposed. Some of them rely heavily on Gaussian assumptions, such as the Pearson correlation, which measures linear dependence and for which uncorrelatedness is equivalent to independence only under the Gaussian distribution, or Wilks' Lambda (Wilks, 1935), where normality is adopted to calculate the likelihood ratio. To deal with nonlinear dependence and non-Gaussian distributions, statisticians have proposed rank-based dependence measures, including Spearman's ρ and Kendall's τ, which are more robust than the Pearson correlation against deviations from normality. However, these dependence measures are usually only effective for monotonic types of dependence. In addition, under the null hypothesis that two variables are independent, no general statistical distribution of the coefficients associated with these measures has been derived. Other related works include Hoeffding (1948), Blomqvist (1950), Blum et al. (1961), and some methods described in Hollander et al. (2013) and Anderson (1958). Taking these into consideration, distance correlation (Székely et al., 2007) was introduced to address all these deficiencies. The benefits of distance correlation are two-fold: first, zero distance correlation implies independence, and hence it is a true dependence measure. Second, distance correlation can measure the dependence between any two vectors, which can be of different dimensions. It also outperforms classical methods such as the Puri-Sen likelihood ratio tests (Puri and Sen, 1971), which are not directly applicable when the dimension exceeds the sample size. Due to these advantages, we employ distance correlation (or distance covariance) in this paper as our measure of dependence.

Another important application of the proposed test is to build graphs that incorporate covariate information. Some existing research along this line includes Fan et al. (2011), Yin and Li (2011) and Cai et al. (2013); however, these methods either impose a Gaussian assumption on the errors or estimate a precision matrix that does not represent the general dependence structure. In this work, the proposed conditional dependence measure requires only mild assumptions on the errors.

The main contribution of this paper is two-fold. First, under the factor model assumption, we propose a computationally efficient conditional independence test. Both the response vectors and the common factors can be of different dimensions. Second, we apply this test to build conditional dependency graphs and covariate-adjusted dependency graphs, as a relaxation of the Gaussian graphical model.

The rest of this paper is organized as follows. In Section 2, we present our new procedure for testing conditional independence via distance covariance (C-DCov) and describe how to construct conditional dependency graphs based on the proposed test. Section 3 gives theoretical properties, including the asymptotic distribution of the test statistic under the null hypothesis as well as the type I error guarantee. Section 4 contains numerical studies and Section 5 demonstrates the performance of C-DCov via two real data sets. We conclude the paper with a short discussion in Section 6. Several technical lemmas and all proofs are relegated to the appendix.

2 Methods

First, we introduce some notation. For a p-dimensional random vector z, |z| represents its Euclidean norm. A collection of n i.i.d. observations of z is denoted by \{z_1, \cdots, z_n\}, where z_k = (z_k^{(1)}, \cdots, z_k^{(p)}) represents the k-th sample. For any matrix M, \|M\|_F and \|M\| denote its Frobenius norm and operator norm, respectively. Also, given random vectors x and y, their characteristic functions are denoted by g_x and g_y respectively, and the joint characteristic function is denoted by g_{x,y}.

2.1 A review of distance covariance

As an important tool, distance covariance is briefly reviewed in this section, with further details available in Székely et al. (2007). We introduce several definitions as follows.

Definition 1. (w-weighted L_2 norm) Let c_d = \pi^{(d+1)/2} / \Gamma((d+1)/2) for any positive integer d, where \Gamma is the Gamma function. Then for a function \gamma defined on R^p × R^q, the w-weighted L_2 norm of \gamma is defined by

\|\gamma(\tau, \rho)\|_w^2 = \int_{R^{p+q}} |\gamma(\tau, \rho)|^2 w(\tau, \rho)\, d\tau\, d\rho, \quad where \ w(\tau, \rho) = (c_p c_q |\tau|^{1+p} |\rho|^{1+q})^{-1}.

Definition 2. (Distance covariance) The distance covariance between vectors x ∈ R^p and y ∈ R^q with finite first moments is the nonnegative number V(x, y) defined by

V^2(x, y) = \|g_{x,y}(\tau, \rho) − g_x(\tau) g_y(\rho)\|_w^2.

Suppose we observe a random sample \{(x_k, y_k) : k = 1, \cdots, n\} from the joint distribution of (x, y).

Definition 3. (Empirical distance covariance) The empirical distance covariance between vectors x ∈ R^p and y ∈ R^q is the nonnegative number V_n(x, y) defined by

V_n^2(x, y) = S_1(x, y) + S_2(x, y) − 2 S_3(x, y),

where

S_1(x, y) = \frac{1}{n^2}\sum_{k,l=1}^n |x_k − x_l||y_k − y_l|, \quad S_2(x, y) = \frac{1}{n^2}\sum_{k,l=1}^n |x_k − x_l| \cdot \frac{1}{n^2}\sum_{k,l=1}^n |y_k − y_l|,

S_3(x, y) = \frac{1}{n^3}\sum_{k=1}^n \sum_{l,m=1}^n |x_k − x_l||y_k − y_m|.
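For concreteness, these empirical quantities can be computed directly from the two pairwise-distance matrices. The following is a minimal Python sketch (our own helper, not part of Székely et al. (2007); the function and variable names are ours), which is reused in later sketches:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcov2_stats(X, Y):
    """Return (V_n^2, S_2) of Definition 3 for samples X (n x p) and Y (n x q)."""
    n = X.shape[0]
    a = squareform(pdist(X))            # a[k, l] = |x_k - x_l|
    b = squareform(pdist(Y))            # b[k, l] = |y_k - y_l|
    S1 = (a * b).mean()                 # (1/n^2) sum_{k,l} |x_k - x_l| |y_k - y_l|
    S2 = a.mean() * b.mean()            # product of the two average pairwise distances
    S3 = (a.sum(axis=1) * b.sum(axis=1)).sum() / n**3
    return S1 + S2 - 2.0 * S3, S2
```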

With the above definitions, Lemma 1 depicts the consistency of V_n(x, y), Lemma 2 gives the asymptotic distribution of V_n(x, y) under the null hypothesis that x and y are independent, and Corollary 1 reveals properties of the test statistic nV_n^2/S_2 proposed in Székely et al. (2007).

Lemma 1. (Theorem 2 in Székely et al. (2007)) If E|x| < ∞ and E|y| < ∞, then almost surely \lim_{n\to\infty} V_n(x, y) = V(x, y).

Lemma 2. (Theorem 5 in Székely et al. (2007)) If x and y are independent and E(|x| + |y|) < ∞, then

nV_n^2 \xrightarrow{D} \|\zeta(\tau, \rho)\|_w^2 \quad as \ n \to \infty,

where \zeta(\cdot) denotes a complex-valued zero-mean Gaussian random process with covariance function

R(u, u') = (g_x(\tau − \tau') − g_x(\tau)\overline{g_x(\tau')})(g_y(\rho − \rho') − g_y(\rho)\overline{g_y(\rho')}),

where u = (\tau, \rho), u' = (\tau', \rho') and \overline{g_x(\tau')} is the complex conjugate of g_x(\tau').

Corollary 1. (Corollary 2 in Székely et al. (2007)) If E(|x| + |y|) < ∞, then:

(i) If x and y are independent, then nV_n^2 \xrightarrow{D} Q as n → ∞, where Q = \sum_{j=1}^{\infty} \lambda_j Z_j^2, the Z_j are i.i.d. N(0, 1) random variables, and \{\lambda_j\} are non-negative constants depending on the distribution of (x, y); E(Q) = 1.

(ii) If x and y are dependent, then nV_n^2/S_2 \xrightarrow{P} \infty as n → ∞.


2.2 Conditional independence test via distance covariance (C-DCov)

Now we are in a position to propose a test for problem (2). From the equivalence of (2) and (3), it seems that one can directly apply the distance covariance techniques to \epsilon_x and \epsilon_y. However, realizations of the true errors \epsilon_{i,x} are not observed. As a result, we first provide an estimate of the errors and calculate the distance covariance of the estimated errors. The conditional independence test is summarized in the following steps.

Step 1: Estimate the factor loading matrices B_x and B_y by ordinary least squares (OLS). The estimates are denoted by \hat{B}_x and \hat{B}_y.

Step 2: Estimate the error vectors \epsilon_{i,x} and \epsilon_{i,y} by
\hat\epsilon_{i,x} = x_i − \hat{B}_x f_i = (B_x − \hat{B}_x) f_i + \epsilon_{i,x},
\hat\epsilon_{i,y} = y_i − \hat{B}_y f_i = (B_y − \hat{B}_y) f_i + \epsilon_{i,y}, \quad i = 1, \cdots, n.

Step 3: Calculate the empirical distance covariance between \hat\epsilon_x and \hat\epsilon_y as
V_n^2(\hat\epsilon_x, \hat\epsilon_y) = S_1(\hat\epsilon_x, \hat\epsilon_y) + S_2(\hat\epsilon_x, \hat\epsilon_y) − 2 S_3(\hat\epsilon_x, \hat\epsilon_y).

Step 4: Define the C-DCov test statistic as T(x, y, f) = nV_n^2(\hat\epsilon_x, \hat\epsilon_y)/S_2(\hat\epsilon_x, \hat\epsilon_y).

Step 5: With a predetermined significance level α, reject the null hypothesis when T(x, y, f) > (\Phi^{-1}(1 − α/2))^2.

Remark 1. In Step 1, when the dimensionality of f is large, one can replace the OLS estimator by the penalized least squares estimator, which will be elaborated in Section 2.4. Theoretical properties of the proposed conditional independence test will be studied in Section 3. See Theorem 3 for the justification of the critical value used in Step 5.
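Assembled end to end, Steps 1-5 might look like the following sketch (assuming data matrices X, Y, F that stack the n observations row-wise and the dcov2_stats helper from Section 2.1; this is our own illustration, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def cdcov_test(X, Y, F, alpha=0.05):
    """C-DCov sketch: OLS residuals with respect to the factors F (Steps 1-2),
    distance covariance of the residuals (Step 3), statistic and decision (Steps 4-5)."""
    n = F.shape[0]
    # Steps 1-2: least squares solves F @ coef ~ X, so coef is B_x transposed.
    coef_x, *_ = np.linalg.lstsq(F, X, rcond=None)
    coef_y, *_ = np.linalg.lstsq(F, Y, rcond=None)
    ex, ey = X - F @ coef_x, Y - F @ coef_y      # estimated residuals eps_hat
    # Steps 3-4: T = n * V_n^2(eps_hat_x, eps_hat_y) / S_2(eps_hat_x, eps_hat_y).
    v2, s2 = dcov2_stats(ex, ey)
    T = n * v2 / s2
    # Step 5: reject when T exceeds (Phi^{-1}(1 - alpha/2))^2.
    return T, bool(T > norm.ppf(1.0 - alpha / 2.0) ** 2)
```

A permutation-calibrated critical value, recommended later in Section 4, can be swapped in for the Gaussian threshold of Step 5.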

2.3 Building graphs via conditional independence tests

Generally speaking, there are two different types of applications for building dependency graphs. In the first type, the factors are external, having impacts on the majority of variables in a graph. In this scenario, the observed data are \{(f_i, z_i)\}_{i=1}^n, in which f_i represents the observed factors or covariates that have influence on the outcomes z_i in a graph. We then employ model (1) to adjust for the factor or covariate effect for every component of z. Given the factors or covariates, the conditional graph is then constructed based on the distance covariance of the residuals. That is, for each pair (z^{(j)}, z^{(k)}), we apply the test statistic T(z^{(j)}, z^{(k)}, f) of the previous section to check whether the corresponding components are conditionally dependent, by setting x = z^{(j)} and y = z^{(k)} in model (1). See Section 5.1 for such an application.

The second type of application is to create the factors or covariates internally, as exemplified in Section 5.2. An interesting question is to identify the conditional independence relationship between two nodes given the remaining nodes, i.e., z^{(j)} ⊥⊥ z^{(k)} | z \setminus \{z^{(j)}, z^{(k)}\}. We assume

z_i^{(j)} = \beta_{1,jk}^\top f_i + \epsilon_i^{(j)}, \quad z_i^{(k)} = \beta_{2,jk}^\top f_i + \epsilon_i^{(k)}, \quad i = 1, \cdots, n,

where f_i = (z_i^{(-j,-k)})^\top represents all coordinates of z_i other than z_i^{(j)} and z_i^{(k)}, and \beta_{1,jk} and \beta_{2,jk} are (p − 2)-dimensional regression coefficients. The absence of the edge e_{jk} between nodes j and k coincides with the corresponding conditional independence, if only the linear space spanned by f (denoted by L(f)) is considered. This motivates us to construct a graph by directly testing z^{(j)} ⊥⊥ z^{(k)} | L(z \setminus \{z^{(j)}, z^{(k)}\}). Therefore, for each node pair (j, k), we define z^{(-j,-k)} as the data without z^{(j)} and z^{(k)}, and T^{(j,k)} = T(z^{(j)}, z^{(k)}, z^{(-j,-k)}) using the same steps as in Section 2.2. The statistic is used to test the conditional independence hypothesis:

H_0: \epsilon^{(j)} ⊥⊥ \epsilon^{(k)}  vs.  H_1: not H_0.    (4)

If the test based on T^{(j,k)} rejects at level α, the edge e_{jk} between node j and node k is drawn; if H_0 is accepted, the edge e_{jk} is absent.
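For this second type of application, the whole graph can be assembled by looping the test over node pairs; a rough sketch (reusing the cdcov_test helper sketched in Section 2.2; names are ours):

```python
import numpy as np

def build_graph(Z, alpha=0.05):
    """For each pair (j, k), treat the remaining columns of Z as internal factors
    and draw the edge e_jk when the C-DCov test rejects at level alpha."""
    n, p = Z.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        for k in range(j + 1, p):
            rest = np.delete(np.arange(p), [j, k])
            _, reject = cdcov_test(Z[:, [j]], Z[:, [k]], Z[:, rest], alpha=alpha)
            adj[j, k] = adj[k, j] = reject
    return adj
```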

2.4 Conditional independence test for high-dimensional settings

In the conditional independence test proposed in Section 2.2, it is implicitly assumed that the dimension of f is relatively small compared with the sample size n. In many scenarios, it is of interest to extend this test to the case where f is high-dimensional. One example is the construction of dependency graphs as described in Section 2.3. If there are many nodes in the graph, the corresponding f would be a high-dimensional vector. This motivates us to modify the first step of the conditional independence test introduced in Section 2.2 as follows.

Step 1': Estimate the factor loading matrices B_x and B_y by the penalized least squares (PLS) estimators \tilde{B}_x and \tilde{B}_y defined as follows:

\tilde{B}_x = \arg\min_B \|x − Bf\|_2^2 + \sum_{j,k} p_\lambda(|B_{jk}|),    (5)

where p_\lambda(\cdot) is a penalty function with penalty level λ. It can be taken as the ridge penalty or a folded-concave penalty (Fan and Li, 2001).
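With the ridge penalty p_\lambda(t) = λ t^2, the minimization in (5) has a closed form; a minimal sketch under our stated assumptions (our own helper; F again stacks the factors row-wise):

```python
import numpy as np

def ridge_loadings(X, F, lam=1.0):
    """Ridge version of Step 1': minimize ||X - F B^T||_F^2 + lam * ||B||_F^2.
    Returns the p x K (or q x K) loading estimate B_tilde."""
    K = F.shape[1]
    coef = np.linalg.solve(F.T @ F + lam * np.eye(K), F.T @ X)   # (K x p)
    return coef.T
```

The residuals X - F @ ridge_loadings(X, F, lam).T then replace the OLS residuals in Steps 2-5.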

3 Theoretical Results

In this section, we derive the asymptotic distribution of our conditional independence test under the null hypothesis. First, we introduce several assumptions on \epsilon_x, \epsilon_y and f.

Condition 1. E\epsilon_x = E\epsilon_y = 0, E|\epsilon_x| < ∞, E|\epsilon_y| < ∞, and E|f|^2 < ∞.

Condition 2. Let \epsilon_{il,x}, \epsilon_{it,y} and f_{ik} be the l-th, t-th and k-th components of \epsilon_{i,x}, \epsilon_{i,y} and f_i, respectively. We also define h_x as the density function of a random variable x.

(i) There exist constants r_1 > 0 and b_1 > 0 such that for any s > 0, 1 ≤ l ≤ p and 1 ≤ t ≤ q,
P(|\epsilon_{il,x}| > s) ≤ \exp(−(s/b_1)^{r_1}), \quad P(|\epsilon_{it,y}| > s) ≤ \exp(−(s/b_1)^{r_1}).

(ii) There exist constants r_2 > 0 and b_2 > 0 such that for any s > 0 and 1 ≤ k ≤ K,
P(|f_{ik}| > s) ≤ \exp(−(s/b_2)^{r_2}).

(iii) The densities of |\epsilon_{1,x} − \epsilon_{2,x}| and |\epsilon_{1,y} − \epsilon_{2,y}| are bounded on [0, 1], i.e.,
\max_{t\in[0,1]} h_{|\epsilon_{i,x} − \epsilon_{j,x}|}(t) ≤ M, \quad \max_{t\in[0,1]} h_{|\epsilon_{i,y} − \epsilon_{j,y}|}(t) ≤ M.

Condition 1 puts mild moment conditions on \epsilon_x, \epsilon_y and f. Condition 2 assumes the tails of \epsilon_x, \epsilon_y and f are not too heavy; it is more general than the sub-exponential assumption and implies the finite moment conditions in Condition 1.

Theorem 1. Under Condition 1,

V_n^2(\hat\epsilon_x, \hat\epsilon_y) \xrightarrow{P} V^2(\epsilon_x, \epsilon_y).

In particular, when \epsilon_x and \epsilon_y are independent, V_n^2(\hat\epsilon_x, \hat\epsilon_y) \xrightarrow{P} 0.

Theorem 1 shows that the sample distance covariance between the estimated residual vectors converges to the population one. It enables us to use the distance covariance of the estimated residual vectors to construct the conditional independence test as described in Section 2.2. The result of Theorem 1 is not implied by that of Theorem 2 below, since independence between \epsilon_x and \epsilon_y is not assumed.

Theorem 2. Under Conditions 1 and 2, and the null hypothesis that \epsilon_x ⊥⊥ \epsilon_y (or equivalently x ⊥⊥ y | f),

nV_n^2(\hat\epsilon_x, \hat\epsilon_y) \xrightarrow{D} \|\zeta\|^2,

where \zeta is a zero-mean Gaussian process as in Lemma 2.

Theorem 2 provides an asymptotic distribution for the test statistic T(x, y, f) under the null hypothesis, which is the basis of the results of Theorem 3. It also indicates that the estimation errors in \hat\epsilon_x and \hat\epsilon_y are indeed negligible for the null distribution. This is somewhat surprising since the estimation errors in the residuals are of order O_p(n^{-1/2}) while the test statistic in Theorem 2 is also on the O(n^{-1/2}) scale, which makes the technical proof challenging.

Corollary 2. Under the same conditions as Theorem 2,

nV_n^2(\hat\epsilon_x, \hat\epsilon_y)/S_2(\hat\epsilon_x, \hat\epsilon_y) \xrightarrow{D} Q, \quad where \ Q = \sum_{j=1}^{\infty} \lambda_j Z_j^2,

the Z_j are i.i.d. N(0, 1) random variables, and \{\lambda_j\} are non-negative constants depending on the distribution of (\epsilon_x, \epsilon_y); E(Q) = 1.

Theorem 3. Suppose C-DCov rejects independence when

nV_n^2(\hat\epsilon_x, \hat\epsilon_y) / S_2(\hat\epsilon_x, \hat\epsilon_y) > (\Phi^{-1}(1 − α/2))^2,    (6)

where \Phi(\cdot) is the cumulative distribution function of N(0, 1). Let \alpha_n(x, y, f) denote the achieved significance level. Then for all 0 < α ≤ 0.215,

(i) \lim_{n\to\infty} \alpha_n(x, y, f) ≤ α,
(ii) \sup_{\epsilon_x ⊥⊥ \epsilon_y} \lim_{n\to\infty} \alpha_n(x, y, f) = α.

Part (i) of Theorem 3 indicates that the proposed test with rejection region (6) has an asymptotic type I error of at most α. As described in Székely et al. (2007), the theoretical critical value in (6) can sometimes be too conservative in practice. Therefore, we recommend a data-driven estimate of the critical value, which will be described in detail in Section 4. Part (ii) of Theorem 3 implies that there exists a pair (\epsilon_x, \epsilon_y) such that the pre-specified significance level α is achieved asymptotically. In other words, (6) is a critical region with size α for the nonparametric testing problem (3).

4 Monte Carlo Experiments

In this section, we investigate the performance of C-DCov with three simulation examples. In Example 4.1, we consider a factor model and test the conditional independence between two vectors x and y given their common factor f via C-DCov. In Examples 4.2 and 4.3, we consider a Gaussian graphical model and a discrete graphical model, respectively.

Example 4.1. [Factor model] Let p = 5, q = 10 and K = 3. We generate the rows of B_x, the rows of B_y and \{f_i\}_{i=1}^n independently from N(0, I_K). We generate n i.i.d. copies \{r_i\}_{i=1}^n from the log-normal distribution lnN(0, Σ), where Σ is an equal-correlation matrix of size (p + q) × (p + q) with Σ_{jk} = ρ + (1 − ρ)1\{j = k\}. The errors \epsilon_{i,x} and \epsilon_{i,y} are the centered versions of the first p coordinates and the last q coordinates of r_i. Then \{x_i\}_{i=1}^n and \{y_i\}_{i=1}^n are generated according to x_i = B_x f_i + \epsilon_{i,x} and y_i = B_y f_i + \epsilon_{i,y}, respectively. We calculate T(x, y, f) in the C-DCov test, and T_0(x, y, f), in which we replace \hat\epsilon_{i,x} and \hat\epsilon_{i,y} with the true \epsilon_{i,x} and \epsilon_{i,y}, as an oracle test for comparison. To obtain the null distributions of T(x, y, f) and T_0(x, y, f) for small sample size n, we compute R(n) replicates of T(x, y, f) and T_0(x, y, f) by randomly decoupling the observation indices of \hat\epsilon_x and \hat\epsilon_y respectively, where we choose R(n) = \lfloor 200 + 5000/n \rfloor following Székely et al. (2007). In other words, we use the permuted data \{(\hat\epsilon_{i,x}, \hat\epsilon_{\pi(i),y})\}_{i=1}^n to compute the test statistic, denoted by T_\pi, where \{\pi(1), \cdots, \pi(n)\} is a random permutation of \{1, \cdots, n\}.
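In code, this decoupling and its repetitions can be sketched as follows (dcov2_stats from Section 2.1; function and argument names are ours):

```python
import numpy as np

def permutation_null(ex, ey, n_perm, seed=0):
    """Approximate the null distribution of T by recomputing the statistic on
    randomly decoupled residual pairs {(eps_hat_{i,x}, eps_hat_{pi(i),y})}."""
    n = ex.shape[0]
    rng = np.random.default_rng(seed)
    stats = np.empty(n_perm)
    for r in range(n_perm):
        pi = rng.permutation(n)              # random permutation of {1, ..., n}
        v2, s2 = dcov2_stats(ex, ey[pi])
        stats[r] = n * v2 / s2
    return stats                             # compare T to, e.g., its (1 - alpha) quantile
```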

This process is repeated R(n) times, resulting in an estimate of the null distribution of T(x, y, f). The null distribution of T_0(x, y, f) can be obtained in a similar fashion. In this example, we set the significance level α = 0.1. We vary the sample size from 30 to 200 with increments of 10 and show the empirical power based on a testing sample of size 1000 for both T(x, y, f) and T_0(x, y, f) in Figure 1 for ρ ∈ {0.05, 0.1, · · · , 0.4}. From Figure 1, it is clear that as the sample size or ρ increases, the empirical power also increases in general. Also, comparing panels (a) and (b) of Figure 1, we can see that they are nearly indistinguishable, which indicates that the C-DCov test works as well as the oracle test where the true regression coefficients are known. When ρ = 0, Table 1 reports the empirical type I error for both C-DCov and the oracle version. It is clear that the type I error is under good control for all sample sizes.

Table 1: Type I error of Example 1

Test based on \hat\epsilon_x and \hat\epsilon_y
n             30      40      60      80      100     120     140     160     180     200
Type I error  0.120   0.122   0.106   0.090   0.114   0.116   0.101   0.113   0.104   0.115

Test based on \epsilon_x and \epsilon_y
n             30      40      60      80      100     120     140     160     180     200
Type I error  0.100   0.105   0.093   0.087   0.112   0.114   0.096   0.107   0.100   0.112

Example 4.2. [Gaussian graphical model] We consider a Gaussian graphical model with precision matrix Ω = Σ^{-1}, where Ω is a tridiagonal precision matrix of size p × p, associated with an autoregressive process of order one. We set p = 30 and the (i, j) element of Σ to σ_{i,j} = \exp(−|s_i − s_j|), where s_1 < s_2 < · · · < s_p. In addition, the increments s_i − s_{i−1} are i.i.d. Unif(0.5, 1), i = 2, · · · , p. We construct dependency graphs through C-DCov at pre-specified significance level α = 0.05 with sample sizes n = 150, 200, 250, 300. Then we compare the graphs with those induced by the estimators corresponding to the LASSO, adaptive LASSO and SCAD penalized likelihoods for the precision matrix (Friedman et al., 2008; Fan et al., 2009). We examine their performance by reporting two types of errors: "FP", the number of false positive errors (i.e., the true entry of the precision matrix is zero but is estimated as nonzero), and "FN", the number of false negative errors (i.e., the true entry of the precision matrix is nonzero but is estimated as zero). We follow the implementation in Fan et al. (2009) for the three penalized likelihood estimators. Results over 1000 replications of the different methods are reported for each simulation example in Table 2 and Figure 2.

Figure 1: Power-sample size graph of Example 1. Panel (a) T(x, y, f), panel (b) T_0(x, y, f); empirical power is plotted against the sample size n for ρ ∈ {0.05, 0.1, · · · , 0.4}.

Table 2: The average number of false positive and false negative errors over 1000 replicates for Example 4.2 as the sample size varies.

                        FN                                         FP
n      LASSO   adaptive LASSO   SCAD    C-DCov      LASSO     adaptive LASSO   SCAD     C-DCov
150    0.002   0.012            0.014   1.318       308.995   160.574          88.823   61.174
200    0       0                0       0.370       296.681   161.251          97.255   54.986
250    0       0                0       0.084       269.503   157.174          88.022   52.440
300    0       0                0       0.026       285.880   151.823          83.388   50.454

Table 2 shows that as the sample size increases, both the FN and the FP of C-DCov decrease. Interestingly, the penalized likelihood methods do not exhibit such a monotonic trend in FP as the sample size increases. In addition, C-DCov outperforms all other methods in terms of FP across the different sample sizes, and when the sample size n gets large enough (here, n = 300), C-DCov achieves an FN similar to the other methods. Ideally, the choice of α = 0.05 would lead to an expected number of false positives of 0.05 × [p^2 − p − (p − 1) − (p − 1)] = 40.6, which is not far from the FP of C-DCov when n = 300.
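For reference, the covariance structure of this example can be generated in a few lines; a sketch under our stated assumptions (not the authors' simulation code):

```python
import numpy as np

def example42_sigma(p=30, rng=None):
    """sigma_ij = exp(-|s_i - s_j|) with increments s_i - s_{i-1} ~ Unif(0.5, 1);
    the inverse of this covariance matrix is tridiagonal."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.cumsum(rng.uniform(0.5, 1.0, size=p))
    return np.exp(-np.abs(s[:, None] - s[None, :]))

# Draw n Gaussian samples with this covariance, e.g. for n = 300:
Z = np.random.default_rng(1).multivariate_normal(np.zeros(30), example42_sigma(30), size=300)
```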

Figure 2: The average recovered sparsity pattern of Example 4.2.

In Figure 2, we show the average sparsity patterns (frequency of nonzero estimates over 1000 replicates) of the three penalized methods and C-DCov, together with the true sparsity pattern. We use the grayscale version of the Matlab function "imagesc" to plot these sparsity pattern graphs. It is clear from the figure that the newly proposed C-DCov approach recovers the sparsity pattern very well and has the fewest false positives. Among the penalized likelihood estimators, it appears that the folded-concave penalty SCAD leads to the sparsest solution on average.

Example 4.3. [Discrete graphical model] In this example, we consider the "cheating students scenario" in the UGM package implemented in Schmidt (2010). Suppose we have ten students S1, S2, · · · , S10 taking an exam in a classroom. They sit next to each other in a row, forming the chain S1 - S2 - · · · - S10, and can see their neighbors' answers.

Their answers x_1, x_2, · · · , x_{10} are binary random variables and form an undirected graphical model of ten nodes (p = 10) and nine edges, with discrete joint probability distribution

p(x_1, x_2, · · · , x_{10}) = \frac{1}{Z} \prod_{i=1}^{10} \phi_i(x_i) \prod_{e=1}^{9} \phi_e(x_e, x_{e+1}),

where \{\phi_i\}_{i=1}^{10} and \{\phi_e\}_{e=1}^{9} are potential functions for the nodes and edges respectively, and Z is the normalization constant. Suppose several students prepare for the exam better than others. For instance, when working independently, let us assume the 2nd, the 6th and the 10th students have 75%, 80% and 75% chances of getting the correct answer for each question, respectively; the other students get the correct or the wrong answer with probability 50%. Therefore, the potential functions for the nodes take the form

\phi_2(x) = \phi_{10}(x) = \begin{cases} 3 & \text{if } x = 1 \\ 1 & \text{if } x = 0, \end{cases} \qquad \phi_6(x) = \begin{cases} 4 & \text{if } x = 1 \\ 1 & \text{if } x = 0, \end{cases} \qquad \phi_i(x) = 1, \ \forall i \in \{1, 3, 4, 5, 7, 8, 9\}.

In addition, suppose students believe that their neighbors' answers carry useful information and have an incentive to learn from their neighbors. In our example, we assume that the potential functions for the edges take the form

\phi_e(x_e, x_{e+1}) = \begin{cases} 2 & \text{if } x_e = x_{e+1} \\ 1 & \text{if } x_e \neq x_{e+1}, \end{cases} \qquad \forall\, e = 1, \cdots, 9.

We use the Matlab package UGM to generate the answers of the ten students to 200 questions (n = 200) independently in each experiment. This results in 200 × 10 binary observations. We replicate the process 1000 times. Table 3 reports the average FN and FP over the 1000 replications for C-DCov at α = 0.05 and for the three penalized methods described in Example 4.2. From the table, it is clear that C-DCov has a much smaller FP than the penalized estimators, as in Example 4.2, with an FN close to zero. Furthermore, we plot the average sparsity patterns of the graphs resulting from the different methods in Figure 3. From the figures, we can see that C-DCov closely mimics the true sparsity pattern while the penalized likelihood methods identify many false edges.

Table 3: FP and FN for the discrete graphical model

        LASSO    adaptive LASSO   SCAD    C-DCov (α = 0.05)
FN      0.014    0.03             0.064   0.088
FP      41.132   25.664           17.5    12.122

Figure 3: Average sparsity pattern recovery of Example 4.3, where three students prepare better for the exam than their seven other classmates.
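For readers without Matlab/UGM, the chain model of Example 4.3 can be simulated approximately with a simple Gibbs sampler. The sketch below is our own, not the UGM implementation; it uses the node potentials 3, 4, 3 for students 2, 6, 10 and the 2-versus-1 edge potentials, and successive sweeps should be thinned or restarted to mimic independently answered questions:

```python
import numpy as np

def gibbs_chain_sample(n_samples, burn=200, seed=0):
    """Gibbs sweeps over the 10-node chain MRF: the full conditional of x_i is
    proportional to phi_i(x_i) * phi(x_{i-1}, x_i) * phi(x_i, x_{i+1})."""
    rng = np.random.default_rng(seed)
    phi = np.ones((10, 2))
    phi[1, 1], phi[5, 1], phi[9, 1] = 3.0, 4.0, 3.0     # students 2, 6, 10 (0-indexed)
    edge = lambda a, b: 2.0 if a == b else 1.0
    x = rng.integers(0, 2, size=10)
    out = np.empty((n_samples, 10), dtype=int)
    for t in range(burn + n_samples):
        for i in range(10):
            w = [phi[i, v]
                 * (edge(x[i - 1], v) if i > 0 else 1.0)
                 * (edge(v, x[i + 1]) if i < 9 else 1.0) for v in (0, 1)]
            x[i] = int(rng.random() < w[1] / (w[0] + w[1]))
        if t >= burn:
            out[t - burn] = x
    return out
```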


5 Real Data

5.1 Financial Data

In the first empirical example, we consider the Fama-French three-factor model (Fama and French, 1993). We collect daily excess returns of 90 stocks among the S&P 100 stocks which are available between August 19, 2004 and August 19, 2005. We choose the starting date as Google's Initial Public Offering date and consider one year of daily excess returns from then on. In particular, we consider the following Fama-French three-factor model

r_{it} − r_{ft} = β_{i,MKT}(MKT_t − r_{ft}) + β_{i,SMB} SMB_t + β_{i,HML} HML_t + u_{it},

for i = 1, · · · , 90 and t = 1, · · · , 252. At time t, r_{it} represents the return of stock i, r_{ft} is the risk-free return rate, and MKT, SMB and HML constitute the market, size and value factors, which are observable at time t. We perform the C-DCov test on all pairs of stocks and study the dependence between stocks conditional on the Fama-French three factors. At significance level α = 0.001, we find that 15.01% of the pairs of stocks are conditionally dependent given the three factors, which implies that the three factors may not be sufficient to explain the dependencies among stocks. As a comparison, we also implemented the conditional independence test with the distance-covariance-based test replaced by a Pearson-correlation-based test. It turns out that 9.54% of the pairs are significant at the same significance level. This shows that the C-DCov test is more powerful in discovering significant conditionally dependent pairs than the Pearson correlation test. We then investigate the top 5 pairs of stocks with the largest test statistic values under the C-DCov test. They are (BHI, SLB), (CVX, XOM), (HAL, SLB), (COP, CVX), and (BHI, HAL). Interestingly, all six stocks involved are closely related to the oil industry. This reveals strong dependence among oil industry stocks that cannot be well explained by the Fama-French three-factor model; the industry sector plays a role in capturing additional risks. In addition, we examine the stock pairs that are conditionally dependent under the C-DCov test but not under the Pearson correlation test. The two most significant pairs are (C, USB) and (MRK, PFE). The first pair is in the financial industry (Citigroup and U.S. Bancorp) and the second pair belongs to the

pharmaceutical sector (Merck & Co. and Pfizer). This shows that by using the proposed C-DCov, some interesting conditional dependence structures could be recovered.

5.2 Breast Cancer Data

In this section, we explore the difference in genetic networks between breast cancer patients who achieve pathologic Complete Response (pCR) and patients who do not. Achieving pCR is defined as having no invasive and no in situ residuals left in the breast in the surgical specimen. As studied in Kuerer et al. (1999) and Kaufmann et al. (2006), pCR has predicted long-term outcomes in several neoadjuvant studies and hence serves as a potential surrogate marker for survival. In this study, we use the normalized gene expression data of 130 patients with stage I-III breast cancers analyzed by Hess et al. (2006). Among the 130 patients, 33 achieved pCR (class 1), while the other 97 patients did not achieve pCR (class 2). To construct the conditional dependence network for each class, we first perform a two-sample t-test between the two groups and select the 100 most significant genes, i.e., those with the smallest p-values. Afterwards, we construct networks of these 100 selected genes for each class using C-DCov at significance level α = 0.01. Note that in this case p = 100 and the corresponding sample sizes in the two groups are n_1 = 33 and n_2 = 97, respectively. Since we are in the scenario of p > n, as described in Section 2.4, we adopt ridge regression to estimate the regression coefficients.

For a network, the degree of a particular node describes how many edges are connected to it, and the average degree serves as a measure of the connectivity of the graph. In Figure 4, we summarize the distributions of degrees for the genetic networks of class 1 and class 2, respectively. We see that the average degree of the genetic network for class 1 is 17.525, which is much smaller than that for class 2 (47.172). To look at the networks more closely, we select 7 genes among the 100 genes and draw the corresponding networks in Figure 5. We see that for class 1, where pCR is achieved, gene MCCC2 is a hub and is connected with four other genes. However, in the network for class 2, gene MCCC2 is disconnected from the other six genes. On the other hand, gene MAPT is isolated in the network for class 1, but is connected with two other genes in class 2.

Figure 4: Degree distribution of the genetic networks.

Figure 5: Genetic networks for the two classes based on 7 selected genes; (a) class 1, (b) class 2.

These findings imply that these two classes may have very different conditional dependence structures, and hence are likely to have different precision matrices. As a result, when classification is the target, linear discriminant analysis may be too simple to capture the actual decision boundary.

6 Discussion

The proposed conditional independence test is based on the linearity assumption. A natural extension of the proposed method is to work under an additive model framework where the factors contribute to the response in a nonparametric way. The current theoretical results assume that the number of factors is finite. How to extend the theory to the case where the factors are high-dimensional would be an interesting direction for future work.

A Proofs

We apply a Taylor expansion to |\hat\epsilon_{i,x} − \hat\epsilon_{j,x}| at \epsilon_{i,x} − \epsilon_{j,x} and get

|\hat\epsilon_{i,x} − \hat\epsilon_{j,x}| = |\epsilon_{i,x} − \epsilon_{j,x}| + \frac{c_{i,j,x}^\top}{|c_{i,j,x}|}(B_x − \hat{B}_x)(f_i − f_j) \doteq |\epsilon_{i,x} − \epsilon_{j,x}| + D_{i,j,x},
|\hat\epsilon_{i,y} − \hat\epsilon_{j,y}| = |\epsilon_{i,y} − \epsilon_{j,y}| + \frac{c_{i,j,y}^\top}{|c_{i,j,y}|}(B_y − \hat{B}_y)(f_i − f_j) \doteq |\epsilon_{i,y} − \epsilon_{j,y}| + D_{i,j,y},    (7)

where c_{i,j,x} = \lambda_1(\hat\epsilon_{i,x} − \hat\epsilon_{j,x}) + (1 − \lambda_1)(\epsilon_{i,x} − \epsilon_{j,x}) and c_{i,j,y} = \lambda_2(\hat\epsilon_{i,y} − \hat\epsilon_{j,y}) + (1 − \lambda_2)(\epsilon_{i,y} − \epsilon_{j,y}), for some \lambda_1, \lambda_2 ∈ [0, 1].

Lemma 3. Under Condition 2, we have the following bounds on the estimators \hat{B}_x and \hat{B}_y:

\|B_x − \hat{B}_x\|_F = O_p(1/\sqrt{n}), \quad \|B_y − \hat{B}_y\|_F = O_p(1/\sqrt{n}).    (8)

Proof. The results can be derived similarly to Lemma B.2 in Fan et al. (2011).

Lemma 4. Under Condition 2, we have the following characterization of the norm of the factors f_i:
\max_i |f_i| = O_p((\log n)^{1/r_2}).

Proof. By Condition 2, we see that

P\Big(\max_{i\in\{1,\cdots,n\}} |f_i| > C(\log n)^{1/r_2}\Big) \le n P\big(|f_i| > C(\log n)^{1/r_2}\big) \le n \exp\Big(−\Big(\frac{C}{b_2}\Big)^{r_2} \log n\Big) = n^{1 − (C/b_2)^{r_2}}.

Therefore, for every \epsilon > 0, we can always find a pair (C, n_0) such that

P\big(\max_i |f_i| > C(\log n)^{1/r_2}\big) \le \epsilon,    (9)

for any n ≥ n_0. For n < n_0, we can find corresponding C_n's to ensure P(\max_{i\in\{1,\cdots,n\}} |f_i| > C_n(\log n)^{1/r_2}) \le \epsilon for each specific n < n_0. To make (9) hold for all n, we take the maximum of C, C_1, \cdots, C_{n_0}. Thus, \max_i |f_i| = O_p((\log n)^{1/r_2}).

A.1 Proof of Theorem 1

From now on, we use S_1 as an abbreviation for S_1(\hat\epsilon_x, \hat\epsilon_y), and similarly for S_2 and S_3. Using the Taylor expansion in (7), we have the decomposition V_n^2(\hat\epsilon_x, \hat\epsilon_y) − V_n^2(\epsilon_x, \epsilon_y) = T_1 + T_2 + T_3, where

T_1 = \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} |\epsilon_{i,y} − \epsilon_{j,y}| + \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} \cdot \frac{1}{n^2}\sum_{k,l=1}^n |\epsilon_{k,y} − \epsilon_{l,y}| − \frac{2}{n^3}\sum_{i=1}^n\sum_{j,k=1}^n D_{i,j,x} |\epsilon_{i,y} − \epsilon_{k,y}|,    (10)

T_2 = \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,y} |\epsilon_{i,x} − \epsilon_{j,x}| + \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,y} \cdot \frac{1}{n^2}\sum_{k,l=1}^n |\epsilon_{k,x} − \epsilon_{l,x}| − \frac{2}{n^3}\sum_{j=1}^n\sum_{k,i=1}^n D_{i,j,y} |\epsilon_{i,x} − \epsilon_{k,x}|,    (11)

T_3 = \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} D_{i,j,y} + \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} \cdot \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,y} − \frac{2}{n^3}\sum_{i=1}^n\sum_{j,k=1}^n D_{i,j,x} D_{i,k,y}.    (12)

There are two facts that we easily see:
(i) for all \{i, j\}, |D_{i,j,x}| \le 2\|B_x − \hat{B}_x\|_F \max_i |f_i| by definition;
(ii) \frac{1}{n^2}\sum_{i,j=1}^n |\epsilon_{i,x} − \epsilon_{j,x}| = O_p(1), since E|\epsilon_{i,x} − \epsilon_{j,x}| is uniformly bounded over all (i, j) pairs and so is E(\frac{1}{n^2}\sum_{i,j=1}^n |\epsilon_{i,x} − \epsilon_{j,x}|).

Therefore, the three terms T_1, T_2 and T_3 can be bounded as follows:

T_1 \le 8\|B_x − \hat{B}_x\|_F \max_i |f_i| \frac{1}{n^2}\sum_{i,j=1}^n |\epsilon_{i,y} − \epsilon_{j,y}| = O_p((\log n)^{1/r_2} n^{-1/2}),
T_2 \le 8\|B_y − \hat{B}_y\|_F \max_i |f_i| \frac{1}{n^2}\sum_{i,j=1}^n |\epsilon_{i,x} − \epsilon_{j,x}| = O_p((\log n)^{1/r_2} n^{-1/2}),
T_3 \le 8\|B_x − \hat{B}_x\|_F \|B_y − \hat{B}_y\|_F \max_i |f_i|^2 = O_p((\log n)^{2/r_2} n^{-1}),

where we used Lemma 3 and Lemma 4. Hence V_n^2(\hat\epsilon_x, \hat\epsilon_y) − V_n^2(\epsilon_x, \epsilon_y) \xrightarrow{P} 0. This combined with Lemma 1 leads to

V_n^2(\hat\epsilon_x, \hat\epsilon_y) \xrightarrow{P} V^2(\epsilon_x, \epsilon_y).


Lemma 5. For the c_{i,j,x} and c_{i,j,y} defined in (7), we have the following approximation error bounds on the normalized versions:

\left| \frac{c_{i,j,x}}{|c_{i,j,x}|} − \frac{\epsilon_{i,x} − \epsilon_{j,x}}{|\epsilon_{i,x} − \epsilon_{j,x}|} \right| \le \frac{4}{|\epsilon_{i,x} − \epsilon_{j,x}|} \|B_x − \hat{B}_x\|_F \max_{i\in\{1,\cdots,n\}} |f_i|,    (13)

\left| \frac{c_{i,j,y}}{|c_{i,j,y}|} − \frac{\epsilon_{i,y} − \epsilon_{j,y}}{|\epsilon_{i,y} − \epsilon_{j,y}|} \right| \le \frac{4}{|\epsilon_{i,y} − \epsilon_{j,y}|} \|B_y − \hat{B}_y\|_F \max_{i\in\{1,\cdots,n\}} |f_i|.    (14)

Proof. Similar to Lemma 3, it suffices to show (13). First, we will show

\left| \frac{c_{i,j,x}}{|c_{i,j,x}|} − \frac{\epsilon_{i,x} − \epsilon_{j,x}}{|\epsilon_{i,x} − \epsilon_{j,x}|} \right| \le \left| \frac{\hat\epsilon_{i,x} − \hat\epsilon_{j,x}}{|\hat\epsilon_{i,x} − \hat\epsilon_{j,x}|} − \frac{\epsilon_{i,x} − \epsilon_{j,x}}{|\epsilon_{i,x} − \epsilon_{j,x}|} \right|.    (15)

Denote by α1 and α2 the angle between ci,j,x and i,x − j,x , and the angle between ˆi,x − ˆj,x and i,x − j,x , respectively. It is easy to see that 0 ≤ α1 ≤ α2 ≤ π, and hence cos α1 ≥ cos α2 . By cosine formula, 2 2 ci,j,x ˆi,x − ˆj,x  −   −  i,x j,x i,x j,x |ci,j,x | − |i,x − j,x | = 2 − 2 cos α1 , and |ˆi,x − ˆj,x | − |i,x − j,x | = 2 − 2 cos α2 . Therefore, (15) is proved and it remains to show that ˆi,x − ˆj,x 4 i,x − j,x ˆ |ˆi,x − ˆj,x | − |i,x − j,x | ≤ |i,x − j,x | kBx − Bx kF

max |fi |.

(16)

i∈{1,··· ,n}

Left hand side of (16) can be rewritten as ˆi,x − ˆj,x i,x − j,x |ˆi,x − ˆj,x | − |i,x − j,x | [(ˆi,x − ˆj,x ) − (i,x − j,x )]|ˆi,x − ˆj,x | − (|ˆi,x − ˆj,x | − |i,x − j,x |)(ˆi,x − ˆj,x ) = |ˆi,x − ˆj,x ||i,x − j,x | 1 (|(ˆi,x − ˆj,x ) − (i,x − j,x )| + ||ˆi,x − ˆj,x | − |i,x − j,x ||) ≤ |i,x − j,x | 2 4 ˆ x kF |fi − fj | ≤ ˆ x kF max |fi |. ≤ kBx − B kBx − B i∈{1,··· ,n} |i,x − j,x | |i,x − j,x | Combining (15) and (16), the lemma is proved. Lemma 6. Under Conditions 1 and 2, and the null hypothesis that x ⊥ ⊥ y , for any γ > 0, " # n 1 1 X 1 P → 0, γ 2 n log n n i,j=1 |i,x − j,x |

" # n 1 1 X 1 P → 0. γ 2 n log n n i,j=1 |i,y − j,y |


Proof. We will only show the first result involving x with the other one follows similarly. For any δ > 0, let n n 1 X 1 1 X 1 ¯ Rn = 2 , Rn = 2 ∧ n2+δ . n i,j=1 |i,x − j,x | n i,j=1 |i − j |

Then for any  > 0, ¯ n | > ] ≤ n2 P[|i,x − j,x | < n−2−δ ] ≤ Cn2 n−2−δ = Cn−δ , P[|Rn − R

(17)

due to the Condition 2 that the density function of |i,x − j,x | is pointwise bounded. P ¯n| → Therefore, |Rn − R 0, which leads to ¯n P Rn R nγ log n − nγ log n → 0.

(18)

On the other hand, Z ∞ 1 1 1 1 1 1 2+δ 2+δ 2+δ E[ ∧n ] = P( > n )n + h| − | (t)dt log n |i,x − j,x | log n |i,x − j,x | log n n−2−δ t i,x j,x Z 1 Z ∞ 1 C 1 1 1 ≤ + h|i,x −j,x | (t)dt + h| − | (t)dt log n log n n−2−δ t log n 1 t i,x j,x Z 1 C C 1 1 ≤ + dt+ P(|i,x − j,x | > 1) log n log n n−2−δ t log n C C 1 ≤ + log(n2+δ )+ log n log n log n 1 C +C 0 + , ≤ log n log n (19) where h|i,x −j,x | is the density of |i,x − j,x |. In the above derivation, the first inequality can be easily seen from (17) and the second inequality utilizes Condition 2. ¯ n /log n is bounded in L1 and since nγ → ∞, R ¯ n /[nγ log(n)] converges to Therefore, R 0 in L1 and hence in probability, i.e., ¯n R P → 0. nγ log(n)

(20)

Rn P → 0. log(n)

(21)

This combined with (18) leads to nγ


Lemma 7. Under Conditions 1 and 2, and the null hypothesis that x ⊥ ⊥ y , for any γ > 0, " # n 1 1 X 1 P → 0. nγ (log n)2 n2 i,j=1 |i,x − j,x ||i,y − j,y | Proof. Similar to Lemma 6, for any δ > 0, let   n n  X 1 X 1 1 1 1 2+δ 2+δ Un = 2 , U¯n = 2 ∧n ∧n . n i,j=1 |i,x − j,x ||i,y − j,y | n i,j=1 |i,x − j,x | |i,y − j,y | Then we know for any  > 0, P[|Un −U¯n | > ] ≤ n2 P[|i,x −j,x | < n−2−δ ]+n2 P[|i,y −j,y | < n−2−δ ] ≤ 2Cn2 n−2−δ = 2Cn−δ , due to the Condition 2 that the density functions of |i,x −j,x | and |i,y −j,y | are pointwise bounded. As a result,

¯n P U U n − nγ (log n)2 nγ (log n)2 → 0.

(22)

On the other hand,   1 1 1 2+δ 2+δ E ∧n ∧n (log n)2 |i,x − j,x | |i,y − j,y |     1 1 1 1 2+δ 2+δ ∧n ∧n =E E ≤ C 00 , log n |i,x − j,x | log n |i,y − j,y | where the first equality is due to the independence between x and y and the second inequality can be derived similarly as in (19). Therefore, U¯n /(log n)2 is bounded in L1 P and for any γ > 0, U¯n /[nγ (log n)2 ] → 0. Together with (22), we proved

Un P → γ 2 n (log n)

0.

Lemma 8. Under Conditions 1 and 2, and the null hypothesis that x ⊥ ⊥ y , for any γ > 0, " # n n 1 XX 1 1 P → 0. nγ (log n)2 n3 i=1 j,k=1 |i,x − j,x ||i,y − k,y | Proof. Similar to Lemma 7, for any δ > 0, let   n n  X 1 X 1 1 1 1 3+δ 3+δ Vn = 3 , V¯n = 3 ∧n ∧n . n i,j,k=1|i,x − j,x ||i,y − k,y | n i,j,k=1 |i,x − j,x | |i,y − k,y | 23

Then we know for any  > 0, P[|Vn − V¯n | > ] ≤ n3 P[|i,x −j,x | < n−3−δ ]+n3 P[|i,y −j,y | < n−3−δ ] ≤ 2Cn3 n−3−δ = 2Cn−δ , due to the Condition 2 that the density functions of |i,x −j,x | and |i,y −k,y | are bounded pointwisely. Thus

¯n P V V n − nγ (log n)2 nγ (log n)2 → 0.

(23)

On the other hand,   1 1 1 3+δ 3+δ E ∧n ∧n (log n)2 |i,x − j,x | |i,y − k,y |     1 1 1 1 3+δ 3+δ =E ∧n E ∧n ≤ C 00 , log n |i,x − j,x | log n |i,y − k,y | due to the independence between x and y , and similar reasoning as in (19). Therefore, P V¯n /(log n)2 is bounded in L1 and for any γ > 0, V¯n /[nγ (log n)2 ] → 0. Together with (23),

we proved

V_n / [n^\gamma (\log n)^2] \xrightarrow{P} 0.

A.2 Proof of Theorem 2

Recall the notation used in the proof of Theorem 1: V_n^2(\hat\epsilon_x, \hat\epsilon_y) − V_n^2(\epsilon_x, \epsilon_y) = T_1 + T_2 + T_3. By Propositions 1 and 2 (to be introduced next), we have, for any γ > 0,

n(T_1 + T_2 + T_3) = O_p(n^{-1/2}) + o_p((\log n)^{3/r_2+1} n^{-1/2+\gamma}).

Combined with Lemma 2, the theorem is proved.

Proposition 1. Under Conditions 1 and 2, and the null hypothesis that \epsilon_x ⊥⊥ \epsilon_y,

T_1 = O_p(n^{-3/2}) \quad and \quad T_2 = O_p(n^{-3/2}).

Proof. From (10), we rewrite T1 as ! n n n n 1 X 1 X 1X 1X T1 = 2 |i,y − j,y | Di,j,x + 2 Dk,l,x − Di,k,x − Dj,k,x n i,j=1 n k,l=1 n k=1 n k=1 # " n n n n X X X X 1 1 1 ˆ x) 1 |i,y − j,y |(Ai,j,x + 2 Ak,l,x − Ai,k,x − Aj,k,x ) = Tr (Bx − B 2 n i,j=1 n k,l=1 n k=1 n k=1 # " n h i 1 X . . ˆ x )M , ˆ |i,y − j,y |T i,j,x = Tr (Bx − B = Tr (Bx − Bx ) 2 n i,j=1 where A

i,j,x

c> i,j,x = (fi − fj ) , |ci,j,x |

and T i,j,x , M are self-defined by the above equation. Let us consider any single element M (t, s), the element at t-th row and s-th column in matrix M . It is easy to see that Ai,j,x are identically distributed with respect to different pairs of (i, j), and E|Ai,j,x (t, s)|2 ≤ E|fi − fj |2 ≤ C for some constant C by Condition 1. Therefore, with E(|i,y − j,y ||k,y − l,y |) = c1 , for any i 6= j 6= k 6= l, E(|i,y − j,y ||i,y − k,y |) = c2 for any i 6= j 6= k and E(|i,y − j,y |2 ) = c3 for any i 6= j, we know !

n X

var(nM (t, s)) = var n−1

|i,y − j,y |T i,j,x (t, s)

i,j=1

≤E

X n

1 n2 "

|i,y − j,y |T

i,j,x

2 ! (t, s)

i,j=1

1 =E E 2 n

X n

#! 2 |i,y − j,y |T i,j,x (t, s) |{T i,j,x (t, s)}i,j=1,··· ,n

i,j=1

c1 X c2 X i,j,x i,j,x k,l,x T (t, s)T (t, s) + T (t, s)T i,l,x (t, s) n2 i6=j6=k6=l n2 i6=j6=l ! c3 X i,j,x (T (t, s))2 + 2 n i6=j

=E

Since Ai,j,x (t, s) is uniformly bounded in L2 as stated above, we know T i,j,x (t, s) is uniformly bounded in L2 as well and E|T i,j,x (t, s)| = O(1), E(

1 X i,j,x (T (t, s))2 ) = O(1). 2 n i6=j


Also, we observe that X

T

i,j,x

(t, s)T

i,l,x

Pn

j=1

T i,j,x (t, s) = 0 by definition and Di,j,x = Dj,i,x , so we have

n X n n X n n X n X X X i,j,x 2 i,j,x 2 (t, s) = ( T (t, s)) − (T (t, s)) = − (T i,j,x (t, s))2 . i=1 j=1

i6=j6=l

Therefore, E( n12

i=1 j=1

i=1 j=1

T i,j,x (t, s)T i,l,x (t, s)) = O(1). Finally, since

P

i6=j6=l

Pn Pn i=1

j=1

T i,j,x (t, s) =

0, X

T i,j,x (t, s)T k,l,x (t, s)

i6=j6=k6=l n X n X

=(

T

i,j,x

2

(t, s)) −

i=1 j=1

=−

X

X

T

i,j,x

(t, s)T

i,l,x

(t, s) −

i6=j6=k

T i,j,x (t, s)T i,l,x (t, s) −

i6=j6=k

X

(T i,j,x (t, s))2 −

X

(T

i,j,x

2

(t, s)) −

n X

(T i,i,x (t, s))2

i=1

i6=j n X

(T i,i,x (t, s))2 .

i=1

i6=j

This combined with our previous calculations leads to E( n12

P

i6=j6=k6=l

T i,j,x (t, s)T k,l,x (t, s)) =

O(1). As a result, we have E(nM(t, s))^2 = O(1). Together with Chebyshev's inequality, we know nM(t, s) = O_p(1) and, equivalently, M(t, s) = O_p(n^{-1}) for any entry M(t, s) of M. Hence, T_1 \le pK\, O_p(n^{-3/2}) by Lemma 3. Similarly, we can show that T_2 = O_p(n^{-3/2}).

Proposition 2. Under Conditions 1 and 2, and the null hypothesis that \epsilon_x ⊥⊥ \epsilon_y,

T_3 = o_p((\log n)^{3/r_2+1} n^{-3/2+\gamma}).

Proof. Recall that

T_3 = \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} D_{i,j,y} + \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,x} \cdot \frac{1}{n^2}\sum_{i,j=1}^n D_{i,j,y} − \frac{2}{n^3}\sum_{i=1}^n\sum_{j,k=1}^n D_{i,j,x} D_{i,k,y}.

Here, we write Di,j,x in another form as a sum of two terms and bound them separately.   (i,x − j,x )> c  −  i,j,x i,x j,x ˆ x )(fi − fj ) + ˆ x )(fi − fj ) Di,j,x = (Bx − B − (Bx − B |i,x − j,x | |ci,j,x | |i,x − j,x | . (i,x − j,x )> ˆ x )(fi − fj ) + di,j,x . = (Bx − B |i,x − j,x | (24) By Lemma 5, we know ˆ x k2 |di,j,x | ≤ ( max |fi |)2 kBx − B F i∈{1,··· ,n}


4 . |i,x − j,x |

(25)

We approach the three terms in T3 separately and by showing the following three statements, we will prove this proposition. n 3 1 X . Di,j,x Di,j,y = op ((log n)3/r2 +1 n− 2 +γ ) = op (an ), 2 n i,j=1

(26)

n n 1 X 1 X . Di,j,x 2 Di,j,y = op ((log n)4/r2 +2 n−2+2γ ) = op (bn ), 2 n i,j=1 n i,j=1

(27)

n n 1 XX Di,j,x Di,k,y = op (an ). n3 i=1 j,k=1

(28)

Step 1: We will show (26). By (24), we have: n 1 X Di,j,x Di,j,y n2 i,j=1   n  > 1 X ( −  ) (i,x − j,x )> i,y j,y ˆ x )(fi − fj ) di,j,y + ˆ y )(fi − fj ) = 2 (Bx − B (By − B di,j,x + n i,j=1 |i,x − j,x | |i,y − j,y | n n 1 X 1 X (i,y − j,y )> ˆ y )(fi − fj ) = 2 di,j,x di,j,y + 2 di,j,x (By − B n i,j=1 n i,j=1 |i,y − j,y | n 1 X (i,x − j,x )> ˆ x )(fi − fj ) + 2 di,j,y (Bx − B n i,j=1 |i,x − j,x | n > 1 X (i,x − j,x )> ˆ x )(fi − fj ) (i,y − j,y ) (By − B ˆ y )(fi − fj ). + 2 (Bx − B n i,j=1 |i,x − j,x | |i,y − j,y |

(29)

We look at them term by term. n n 1 X 1 X di,j,x di,j,y ≤ 2 |di,j,x ||di,j,y | n2 i,j=1 n i,j=1 n X 1 2 2 4 ˆ ˆ ≤ (max |fi |) kBx − Bx kF kBy − By kF 2 i n i,j=1 |i,x − j,x ||i,y − j,y | 4

≤ Op ((log n)4/r2 +2 n−2+γ )op (1), where the second inequality is due to (25) and the last inequality is by Lemma 3, Lemma 4 and Lemma 7. n 1 X (i,y − j,y )> ˆ y )(fi − fj ) d (By − B i,j,x n2 i,j=1 |i,y − j,y |


n 4 X 1 2 ˆ ˆ ≤(max |fi |) kBx − Bx kF kBy − By kF 2 i n i,j=1 |i,x − j,x | 3

3

≤Op ((log n)3/r2 +1 n− 2 +γ )op (1) = op (an ), where the inequalities are due to Lemma 3, Lemma 4 and Lemma 6. It is easy to see that the third term in (29) has the rate op (an ). So till now, we know by (29), n 1 X Di,j,x Di,j,y n2 i,j=1 n > 1 X (i,x − j,x )> ˆ x )(fi − fj ) (i,y − j,y ) (By − B ˆ y )(fi − fj ) (Bx − B =op (an ) + 2 n i,j=1 |i,x − j,x | |i,y − j,y |

. =op (an ) + T31 . Note that the rate of first term in (29) is omitted because it is a small order of an . Now we only need to consider T31 . By writing this scalar in form of trace, we rearrange the ˆ x as a common term. terms and make Bx − B " # n > X ( −  ) 1 ( −  ) i,x j,x i,y j,y ˆ x )> ˆ y )(fi − fj )(fi − fj )> T31 = Tr (Bx − B (By − B n2 i,j=1 |i,x − j,x | |i,y − j,y | n X . > 1 ˆ ˆ y )J i,j ] = Tr[(Bx − Bx ) 2 H i,j (By − B n i,j=1 n X . > 1 ˆ = Tr[(Bx − Bx ) 2 H i,j LJ i,j ] n i,j=1

. ˆ x )> M1 ], = Tr[(Bx − B (30) where matrices H i,j , J i,j , L and M1 are self-defined. We will look at the order of any single element in M1 , for instance, consider the element M1 (t, s) at the t-th row and s-th column of M1 . Since i,x and j,x are mutually independent of y and f with any observation indices, and the fact that E[(i,x − j,x )/|i,x − j,x |] = 0, we have EM1 = 0, and it follows that E(M1 (t, s)) = E(nM1 (t, s)) = 0. On the other hand, notice p K p K n X X 1 X i,j 1 X XX i,j i,j H (t, q)L(q, r)J (r, s) = L(q, r)( 2 H (t, q)J i,j (r, s)). M1 (t, s) = 2 n i,j=1 q=1 r=1 n i,j=1 l=1 r=1

(31)

Algebraic expansion gives us: n n 1 X i,j 1 X i,j 2 E( 2 H (t, q)J (r, s)) = E 4 H i,j (t, q)J i,j (r, s)H k,l (t, q)J k,l (r, s). n i,j=1 n i,j,k,l=1

(32)

We split the index sets (i, j) and (k, l) into three groups G1, G2 and G3 with • G1 = {i, j, k, l : (i, j) = (k, l)}, and card(G1) = O(n2 ), • G2 = (G1 ∪ G3)c , and card(G2) = O(n3 ), • G3 = {i, j, k, l : (i, j) ∩ (k, l) = ∅}, and card(G3) = O(n4 ). Since i,x − j,x is independent of k,x − l,x in G3, and they are both independent of y and f with any sample index, we know the expectation of terms with index {i, j, k, l} ∈ G3 is 0. Also, we observe that each single term in (32) is uniformly bounded in expectation. P With all these elements, we know E( n12 ni,j=1 H i,j (t, q)J i,j (r, s))2 = O(n−1 ), which implies Pn 1 − 12 i,j i,j ). Combined with Lemma 3 and (31), we showed 2 i,j=1 H (t, q)J (r, s) = Op (n n that M1 (t, s) = KpOP (n−1 ). In addition, we bound T31 in (30) by 3

ˆ x )> M1 ] ≤ K 2 p2 Op (n− 2 ). T31 = Tr[(Bx − B It follows that

1 n2

Pn

i,j=1

Di,j,x Di,j,y = op (an ), and thus (26) is proved.

Step 2: We will show (27). By expression (24), we have n n n n 1 X 1 X 1 X (i,x − j,x )> . 1 X ˆ Di,j,x = 2 di,j,x + 2 (Bx − Bx )(fi − fj ) = 2 di,j,x + T32 . n2 i,j=1 n i,j=1 n i,j=1 |i,x − j,x | n i,j=1

(33) Looking at the first term in (33), we know n n X 2 1 X 1 +1 −1+γ 2 2 2 r2 ˆ d ≤ (max |f |) kB − B k = o ((log n) n ), i,j,x i x x p F 2 2 i n i,j=1 n i,j=1 |i,x − j,x |

by Lemma 3, Lemma 4 and Lemma 6. For the second term in (33), we again write T32 in form of trace: T32

n 1 X (i,x − j,x )> . ˆ ˆ x )M2 ], = Tr[(Bx − Bx ) 2 (fi − fj ) ] = Tr[(Bx − B n i,j=1 |i,x − j,x |


where M2 is self-defined in the above equation. Similar to previous arguments, for any √ entry M2 (t, s) in M2 , since E(M2 ) = 0, we know E( nM2 (t, s)) = E(M2 (t, s)) = 0. Define (M2 M2> )G =

1 n4

X (i,j,k,l)∈G

(fi − fj )

(i,x − j,x )> (k,x − l,x ) (fk − fl )> . |i,x − j,x | |k,x − l,x |

√ To bound var( nM2 (t, s)), the following inequalities follow: var(M2 (t, s)) = E(M22 (t, s)) ≤ E Tr(M2 M2> ) = E[Tr((M2 M2> )G1 ) + Tr((M2 M2> )G2 ) + Tr((M2 M2> )G3 )] = E[Tr((M2 M2> )G1 ) + Tr((M2 M2> )G2 )] ≤ KE[k(M2 M2> )G1 k + k(M2 M2> )G2 )k]     Kn3 Kn2 + Op = O(n−1 ). =O 4 4 n n In the above derivation, the third equality follows similarly as in Step 1. The last equality is a direct application of Condition 1: Ek(fi − fj )

(i − j )> i − j (fi − fj )> k = E|fi − fj |2 = O(1). |i − j | |i − j |

As a result, M2 (t, s) = Op (n−1/2 ) for any (t, s) and T32 = Op (n−1 ) by its expression in P P trace form. In (33), we know n12 ni,j=1 Di,j,x = op ((log n)2/r2 +1 n−1+γ ) and n12 ni,j=1 Di,j,y = op ((log n)2/r2 +1 n−1+γ ). Step 2 is completed by n n 1 X 1 X D Di,j,y = op ((log n)4/r2 +2 n−2+2γ ) = op (bn ). i,j,x 2 n2 i,j=1 n i,j=1

Step 3 We will show (28). n n 1 XX Di,j,x Di,k,y n3 i=1 j,k=1   n n  1 XX (i,x − j,x )> (i,y − k,y )> ˆ ˆ di,j,x + = 3 (Bx − Bx )(fi −fj ) di,k,y + (By − By )(fi −fk ) n i=1 j,k=1 |i,x − j,x | |i,y − k,y | n n n n 1 XX (i,y − k,y )> 1 XX ˆ y )(fi − fk ) di,j,x di,k,y + 3 di,j,x = 3 (By − B n i=1 j,k=1 n i=1 j,k=1 |i,y − k,y | n n 1 XX (i,x − j,x )> ˆ x )(fi − fj ) + 3 di,k,y (Bx − B n i=1 j,k=1 |i,x − j,x |


# n > X  −  ( −  ) 1 i,x j,x i,y k,y ˆ y )(fi − fk )(fi − fj )> . ˆ x )> (By − B + Tr (Bx − B n3 i,j,k=1 |i,x − j,x | |i,y − k,y | "

(34)

Let n 1 X i,x − j,x (i,y − k,y )> ˆ y )(fi − fk )(fi − fj )> , M3 = 3 (By − B n i,j,k=1 |i,x − j,x | |i,y − k,y | h i > ˆ T33 = Tr (Bx − Bx ) M3 .

Let us look at (34) term by term. n n n X n X 1 XX 1 4 2 2 1 ˆ ˆ di,j,x di,k,y ≤ ( max |fi |) kBx − Bx kF kBy − By kF 3 3 i=1,···n n i=1 j,k=1 n i=1 j,k=1 |i,x − j,x ||i,y − k,y | 4

= op ((log n) r2

+2 −2+γ

n

),

due to Lemma 3, Lemma 4 and Lemma 8. The second term has order n n (i,y − k,y )> 1 XX ˆ y )(fi − fk ) = op (an ), d (By − B i,j,x n3 i=1 j,k=1 |i,y − k,y |

by the same reasoning as in Step 1. The third term has the same order op (an ). Now let us investigate the last term T33 . For any entry M3 (t, s), E(nM3 (t, s)) = E(M3 (t, s)) = 0 by similar argument as before and we only need to bound var(nM3 (t, s)). This time, when cross-terms appear, let the indices be (i1 , j1 , k1 ) and (i2 , j2 , k2 ). We redefine our index groups as follows f = {i1 , j1 , k1 , i2 , j2 , k2 : (i1 , j1 ) = (i2 , j2 )}, and card(G1) f = O(n4 ), • G1 f = (G1 f ∪ G3) f c , and card(G2) f = O(n5 ), • G2 f = {i1 , j1 , k1 , i2 , j2 , k2 : (i1 , j1 ) ∩ (i2 , j2 ) = ∅}, and card(G3) f = O(n6 ). • G3 f is 0, we Following the same arguments as Step 1, as expectation over sums in group G3 have M3 (t, s) = Op (n−1 ) for any entry (t, s) in M3 . This combined with (34) leads to P P 3 T33 = Op (n− 2 ) and n13 ni=1 nj,k=1 Di,j,x Di,k,y = op (an ). Thus Step 3 is completed. Combining Steps 1, 2 and 3, this proposition is proved.


A.3 Proof of Corollary 2

It follows from the proofs of Theorems 1 and 2 and Slutsky’s Lemma.

A.4 Proof of Theorem 3

From Corollary 2, the limiting distribution of nV_n^2(\hat\epsilon_x, \hat\epsilon_y)/S_2(\hat\epsilon_x, \hat\epsilon_y) is the same as that of nV_n^2(\epsilon_x, \epsilon_y)/S_2(\epsilon_x, \epsilon_y). Combining this with Theorem 6 in Székely et al. (2007), the results follow.

References

Anderson, T. W. (1958), An Introduction to Multivariate Statistical Analysis, Wiley.

Blomqvist, N. (1950), "On a measure of dependence between two random variables," The Annals of Mathematical Statistics, 593-600.

Blum, J., Kiefer, J., and Rosenblatt, M. (1961), "Distribution free tests of independence based on the sample distribution function," The Annals of Mathematical Statistics, 485-498.

Cai, T., Liu, W., and Luo, X. (2011), "A constrained L1 minimization approach to sparse precision matrix estimation," Journal of the American Statistical Association, 106, 594-607.

Cai, T. T., Li, H., Liu, W., and Xie, J. (2013), "Covariate-adjusted precision matrix estimation with an application in genetical genomics," Biometrika, 100, 139-156.

Dempster, A. P. (1972), "Covariance selection," Biometrics, 157-175.

Drton, M. and Perlman, M. D. (2004), "Model selection for Gaussian concentration graphs," Biometrika, 91, 591-602.

Edwards, D. (2000), Introduction to Graphical Modelling, Springer.

Fama, E. F. and French, K. R. (1993), "Common risk factors in the returns on stocks and bonds," Journal of Financial Economics, 33, 3-56.

Fan, J., Feng, Y., and Wu, Y. (2009), "Network exploration via the adaptive LASSO and SCAD penalties," The Annals of Applied Statistics, 3, 521.

Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348-1360.

Fan, J., Liao, Y., and Mincheva, M. (2011), "High dimensional covariance matrix estimation in approximate factor models," Annals of Statistics, 39, 3320.

Friedman, J., Hastie, T., and Tibshirani, R. (2008), "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 9, 432-441.

Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia, J. A., Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J., et al. (2006), "Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer," Journal of Clinical Oncology, 24, 4236-4244.

Hoeffding, W. (1948), "A non-parametric test of independence," The Annals of Mathematical Statistics, 546-557.

Hollander, M., Wolfe, D. A., and Chicken, E. (2013), Nonparametric Statistical Methods, vol. 751, John Wiley & Sons.

Kaufmann, M., Hortobagyi, G. N., Goldhirsch, A., Scholl, S., Makris, A., Valagussa, P., Blohmer, J.-U., Eiermann, W., Jackesz, R., Jonat, W., et al. (2006), "Recommendations from an international expert panel on the use of neoadjuvant (primary) systemic treatment of operable breast cancer: an update," Journal of Clinical Oncology, 24, 1940-1949.

Kuerer, H. M., Newman, L. A., Smith, T. L., Ames, F. C., Hunt, K. K., Dhingra, K., Theriault, R. L., Singh, G., Binkley, S. M., Sneige, N., et al. (1999), "Clinical course of breast cancer patients with complete pathologic primary tumor and axillary lymph node response to doxorubicin-based neoadjuvant chemotherapy," Journal of Clinical Oncology, 17, 460-460.

Lauritzen, S. L. (1996), Graphical Models, Oxford University Press.

Linton, O. and Gozalo, P. (1996), "Conditional independence restrictions: Testing and estimation," Cowles Foundation Discussion Paper.

Liu, H., Lafferty, J., and Wasserman, L. (2009), "The nonparanormal: Semiparametric estimation of high dimensional undirected graphs," The Journal of Machine Learning Research, 10, 2295-2328.

Meinshausen, N. and Bühlmann, P. (2006), "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, 1436-1462.

Puri, M. and Sen, P. (1971), Nonparametric Methods in Multivariate Analysis.

Schmidt, M. (2010), "UGM: A Matlab toolbox for probabilistic undirected graphical models."

Su, L. and White, H. (2007), "A consistent characteristic function-based test for conditional independence," Journal of Econometrics, 141, 807-834.

Su, L. and White, H. (2008), "A nonparametric Hellinger metric test for conditional independence," Econometric Theory, 24, 829-864.

Su, L. and White, H. (2014), "Testing conditional independence via empirical likelihood," Journal of Econometrics.

Székely, G. J., Rizzo, M. L., Bakirov, N. K., et al. (2007), "Measuring and testing dependence by correlation of distances," The Annals of Statistics, 35, 2769-2794.

Wainwright, M. J. and Jordan, M. I. (2008), "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, 1, 1-305.

Wilks, S. (1935), "On the independence of k sets of normally distributed statistical variables," Econometrica, Journal of the Econometric Society, 309-326.

Xue, L., Zou, H., et al. (2012), "Regularized rank-based estimation of high-dimensional nonparanormal graphical models," The Annals of Statistics, 40, 2541-2571.

Yin, J. and Li, H. (2011), "A sparse conditional Gaussian graphical model for analysis of genetical genomics data," The Annals of Applied Statistics, 5, 26-30.
