PENALIZED EUCLIDEAN DISTANCE REGRESSION

arXiv:1405.4578v1 [math.ST] 19 May 2014

D. VASILIU, T. DEY, AND I.L. DRYDEN

ABSTRACT. A new method is proposed for variable screening, variable selection and prediction in linear regression problems where the number of predictors can be much larger than the number of observations. The method involves minimizing a penalized Euclidean distance, where the penalty is the geometric mean of the ℓ1 and ℓ2 norms of the regression coefficients. This particular formulation exhibits a grouping effect, which is useful for screening out predictors in higher or ultra-high dimensional problems. Also, an important result is a signal recovery theorem, which does not require an estimate of the noise standard deviation. The practical performance of variable selection and prediction is evaluated through simulation studies and the analysis of a dataset of mass spectrometry scans from melanoma patients, where excellent predictive performance is obtained.

Keywords. Euclidean distance; Grouping; Penalization; Prediction; Regularization; Sparsity; Ultra-high dimension; Variable screening.

1. INTRODUCTION

High dimensional regression problems are of great interest in a wide range of applications, for example in analysing microarrays (Hastie et al., 2008; Fan et al., 2009), functional magnetic resonance images (Caballero Gaudes et al., 2013) and mass spectrometry data (Tibshirani et al., 2005). We consider the problem of predicting a single response Y from a set of p predictors X_1, ..., X_p, where p can be much larger than the number of observations n of each variable. If p > n, commonly used methods include regularization, by adding a penalty to the least squares objective function, or variable selection of the most important predictors. A wide range of methods is available for achieving one or both of the essential goals in linear regression: attaining predictive accuracy and identifying pertinent predictive variables. Current applications motivate the need for screening predictors in the challenging setting of higher and ultra-high dimensional problems, where fast and efficient screening algorithms are used to reduce the number of predictors before more moderate dimensional methods are applied.

There is a very large literature on high-dimensional regression methods; for example, introductions to the area are given by Hastie et al. (2008) and James et al. (2013). Methods for high-dimensional regression include ridge regression (Hoerl and Kennard, 1970a,b), the LASSO (Tibshirani, 1996), the Elastic Net (Zou and Hastie, 2005), the Dantzig selector (Candès and Tao, 2007), Sure Independence Screening (Fan and Lv, 2008) and the Square Root LASSO (Belloni et al., 2011), among many others. The Square Root LASSO involves minimizing the square root of the least squares objective function (a Euclidean distance) plus an ℓ1 penalty, and the authors give a rationale for choosing the regularization parameter using a property called pivotal recovery, without requiring an estimate of the noise


standard deviation. We also use a Euclidean distance objective function in our method, plus a new norm based on the geometric mean of the ℓ1 and ℓ2 norms of the regression parameters. The advantage of our approach is that we are also able to provide the pivotal recovery property, while in addition gaining the grouping property of the Elastic Net. The resulting penalized Euclidean distance method is shown to work well in a variety of settings, and a strong feature of the method is a low false positive rate (i.e. the non-zero coefficients that are detected are very likely to be genuine). Low false positive rates are desirable in many applications, e.g. in genomics, where expensive follow-up studies on detected genes should be carried out only when necessary.

The paper is organized as follows. Section 2 introduces the penalized Euclidean distance objective function and the idea of grouping of irrelevant predictors. Several theoretical results are presented in Section 3 to show how the proposed method works. Section 4 introduces the practical algorithm used to implement our method. Section 5.1 illustrates the methodology with simulation studies, and in Section 5.2 we apply it to an analysis of mass spectrometry data in the study of melanoma. The paper concludes with a brief discussion in Section 6. All proofs of the results are placed in the Appendix.

2. PENALIZED EUCLIDEAN DISTANCE

We assume that the data are organized as an n × p design matrix X and an n-dimensional response vector Y, where n is the number of observations and p is the number of variables. The columns of the matrix X are col_j(X) := (x_{1,j}, x_{2,j}, ..., x_{n,j})^T, j = 1, ..., p, and the regression parameters are β = (β_1, ..., β_p)^T. The usual linear model is Y = Xβ + σε, where E[ε] = 0 and var(ε) = I_n, with I_n the n × n identity matrix. We shall assume that the expectation of the response Y = (y_1, y_2, ..., y_n)^T depends only on a few variables, and so

(1)   X\beta = \tilde X \tilde\beta,

where the columns of the matrix X̃ are a subset of the columns of the entire design matrix X, so X̃ is associated with a subset of indices J̃ ⊂ {1, 2, ..., p}, and β̃ is a vector whose dimension is equal to the cardinality of J̃. We call the columns in X̃ the "drivers" of the model. We note that in general J̃ may not be unique, since an underdetermined system could have solutions with different sparsity patterns, even if the degree of the optimal sparsity (model size) is the same. The cardinality of J̃ is assumed to be less than the number of observations, and it is desirable that the columns of X̃ are linearly independent. When p is much greater than |J̃| a huge challenge is to detect the set of irrelevant columns, i.e. the variables that are not needed for efficiently controlling the response Y. With minimal assumptions about the given data, our goal is to develop a mathematical criterion for detecting irrelevancy inside the parameter space.

FIGURE 1. Overcrowding on the unit hypersphere, relative to a tessellation. The larger group of similar points on the sphere (in the brown region) will have similar estimated regression coefficients in the model.

By applying a location transformation, both the design matrix X and the response vector Y can be centred, and we also scale the predictors so that

(2)   \sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{i,j} = 0, \qquad \sum_{i=1}^{n} x_{i,j}^2 = 1, \qquad j = 1, ..., p.
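For concreteness, the centring and scaling in (2) can be carried out as in the following Python sketch. This is only an illustration with a randomly generated design matrix; the function name and variables are ours, not the paper's.

```python
import numpy as np

def standardize(X, Y):
    """Centre Y and each column of X, then scale each column of X to have
    unit Euclidean norm, as in equation (2)."""
    Y = Y - Y.mean()
    X = X - X.mean(axis=0)                 # column sums are now 0
    X = X / np.linalg.norm(X, axis=0)      # column sums of squares are now 1
    return X, Y

rng = np.random.default_rng(0)
X, Y = standardize(rng.normal(size=(50, 200)), rng.normal(size=50))
print(np.allclose(X.sum(axis=0), 0), np.allclose((X ** 2).sum(axis=0), 1))
```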

Note that each predictor can be regarded as a point on the unit hypersphere S^{n-1} with a centring constraint. Alternatively we can use an isometry and regard the predictors as points on S^{n-2} without the centring constraint. This can be seen by pre-multiplying col_j(X) by the (n-1) × n Helmert sub-matrix H (Dryden and Mardia, 1998, p. 34), which has jth row given by (h_j, ..., h_j, -j h_j, 0, ..., 0), with h_j = -{j(j+1)}^{-1/2} repeated j times, followed by -j h_j and then n - j - 1 zeros, j = 1, ..., n-1. After pre-multiplication we have the (n-1)-vector x*_j = H col_j(X), which has ‖x*_j‖ = 1 because HH^T = I_{n-1}, and hence x*_j ∈ S^{n-2}. Also, if we let y* = H(y_1, ..., y_n)^T = HY, then the angles between y* and x*_j are the same as those between Y and col_j(X), j = 1, ..., p. Angles between predictors are also preserved, and so we have an isometry. To return to the centred vectors we can pre-multiply by H^T, since H^T H = I_n - (1/n) 1_n 1_n^T is the n × n centring matrix. The observation that the points are on a hypersphere (either S^{n-1} with a centring constraint or S^{n-2} with no constraint) is useful in the sequel.

We assume that the global minimum of ‖Y - X̃β̃‖ is very small, but nonetheless positive,

\min_{\tilde\beta \in R^{|\tilde J|}} \|Y - \tilde X \tilde\beta\| \ge c > 0,

obtained in the presence of noise and when the number of observations is larger than |J̃|, where X̃, β̃ and |J̃| were defined after (1). Given this situation, and since we define the objective function through the sparsity of the solution and the minimization of the norm of the residual, we exclude the possibility of exact solutions.
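As a small numerical check on the isometry argument above, the sketch below builds the (n-1) × n Helmert sub-matrix row by row and verifies that HH^T = I_{n-1} and that the angle between a centred predictor and a centred response is unchanged after pre-multiplication. The helper function is our own illustration, not code from the paper.

```python
import numpy as np

def helmert_submatrix(n):
    """(n-1) x n Helmert sub-matrix: row j is (h_j, ..., h_j, -j*h_j, 0, ..., 0)
    with h_j = -1/sqrt(j*(j+1)) repeated j times."""
    H = np.zeros((n - 1, n))
    for j in range(1, n):
        h = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = h
        H[j - 1, j] = -j * h
    return H

n = 6
H = helmert_submatrix(n)
print(np.allclose(H @ H.T, np.eye(n - 1)))          # HH^T = I_{n-1}

rng = np.random.default_rng(1)
x = rng.normal(size=n); x -= x.mean(); x /= np.linalg.norm(x)
y = rng.normal(size=n); y -= y.mean()
cos_before = x @ y / np.linalg.norm(y)               # x has unit norm
xs, ys = H @ x, H @ y
cos_after = xs @ ys / (np.linalg.norm(xs) * np.linalg.norm(ys))
print(np.isclose(cos_before, cos_after))             # angle is preserved
```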


For any vector β ∈ R^p, we denote by θ_j the angle between col_j(X) and Y - Xβ. Thus for a vector β̃ that achieves the global minimum of ‖Y - X̃β̃‖ we must have θ_j = π/2 for j ∈ J̃, and we can define the set of solutions as

S = \{ \beta \in R^p \mid \theta_j = \pi/2 \ \forall j \in \tilde J \ \text{and} \ \beta_j = 0 \ \forall j \in \tilde J^c \},

where J̃^c denotes the complement of J̃ in the set of all indices {1, 2, ..., p}. We also assume that S is bounded away from 0_{R^p} (i.e. β = 0_{R^p} cannot be a solution for minimizing ‖Y - Xβ‖). Given these minimal and common-sense assumptions, our goal is to build an estimator of an element of the set S. Also, thinking of covariates as vectors on the unit hypersphere S^{n-1} (or S^{n-2} if using an isometry), we would like to build an objective function that facilitates automatic detection of "overcrowding" situations as illustrated in Figure 1. A group of close observations on the hypersphere, where the great circle distances are small within the group, will correspond to highly correlated predictors, which in turn will have similar estimated regression parameters. In order to achieve these goals, we propose the following objective function

(3)   L_{PED}(\lambda, \beta) = \|Y - X\beta\| + \lambda \sqrt{\|\beta\| \cdot |\beta|_1},

where λ is a scalar regularization parameter, β = (β_1, β_2, ..., β_p)^T is a vector in R^p, ‖β‖² = Σ_{j=1}^p β_j² and |β|_1 = Σ_{j=1}^p |β_j|.

The penalized Euclidean distance (PED) estimator β̂ is defined as the minimizer of the objective function (3), i.e. β̂ = (β̂_1, β̂_2, ..., β̂_p)^T and

(4)   \hat\beta(\lambda) = \arg\min_{\beta \in R^p} \{ L_{PED}(\lambda, \beta) \}.
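A direct, if naive, way to make the PED estimate concrete is to minimize (3) numerically with a general-purpose derivative-free optimizer, as in the sketch below. This is only an illustration of the objective function; it is not the authors' fitting algorithm (their algorithms are described later in the paper), and the value of λ and the toy data are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def ped_objective(beta, X, Y, lam):
    """L_PED(lambda, beta) = ||Y - X beta|| + lam * sqrt(||beta|| * |beta|_1)."""
    resid = np.linalg.norm(Y - X @ beta)
    penalty = np.sqrt(np.linalg.norm(beta) * np.abs(beta).sum())
    return resid + lam * penalty

# Small toy problem (n = 30, p = 10) with two true non-zero coefficients.
rng = np.random.default_rng(2)
n, p = 30, 10
X = rng.normal(size=(n, p))
X -= X.mean(axis=0); X /= np.linalg.norm(X, axis=0)
beta_true = np.zeros(p); beta_true[:2] = [2.0, -1.5]
Y = X @ beta_true + 0.1 * rng.normal(size=n); Y -= Y.mean()

lam = 0.5   # arbitrary illustrative value
fit = minimize(ped_objective, x0=np.zeros(p), args=(X, Y, lam),
               method="Nelder-Mead", options={"maxiter": 20000, "xatol": 1e-8})
print(np.round(fit.x, 3))
```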

An element of novelty is that the penalization is defined as the geometric mean of the ℓ1 and ℓ2 norms and has only one control parameter, λ. A well-established method that combines ℓ1 and ℓ2 penalties in a linear manner is the Elastic Net, which is based on the naïve Elastic Net criterion

L_{nen}(\lambda_1, \lambda_2, \beta) = \|Y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 |\beta|_1,

where β̂_en = (1 + λ_2) β̂_nen and β̂_nen = arg min_β {L_nen(λ_1, λ_2, β)}. The LASSO is a special case with λ_1 > 0, λ_2 = 0, and ridge regression has λ_1 = 0, λ_2 > 0, so the Elastic Net combines the two methods, as does our PED method but in a radically different way. A visual comparison between the isoclines produced by the different penalties is given in Figure 2; in particular, note that the PED function is linear along the axes. The PED penalty is identical to the LASSO penalty for a single non-zero β_i, and so for very sparse models behaviour like the Square Root LASSO is envisaged. We shall show that there is a grouping effect for correlated variables, which is a property shared by the Elastic Net.
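The following snippet simply evaluates the different penalty terms at two points to make the comparison concrete: for a vector with a single non-zero entry the PED penalty √(‖β‖ |β|_1) coincides with the ℓ1 (LASSO) penalty, while for a spread-out vector it is the geometric mean of the ℓ1 and ℓ2 norms and lies between them. This is our own illustration, not a computation from the paper.

```python
import numpy as np

def lasso_pen(b): return np.abs(b).sum()          # |b|_1
def l2_norm(b):   return np.linalg.norm(b)        # ||b||  (ridge penalizes ||b||^2)
def ped_pen(b):   return np.sqrt(l2_norm(b) * lasso_pen(b))

single = np.array([3.0, 0.0, 0.0, 0.0])
spread = np.array([1.0, 1.0, 1.0, 1.0])
for b in (single, spread):
    print(lasso_pen(b), l2_norm(b), ped_pen(b))
# single: 3.0 3.0 3.0   -> PED penalty equals the LASSO penalty
# spread: 4.0 2.0 ~2.83 -> PED penalty is the geometric mean of the two
```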

FIGURE 2. On the left we have a 3D representation for the norm √(‖·‖ · |·|_1), and the penalty from the Elastic Net on the right.

3. THEORETICAL RESULTS

3.1. Geometric mean norm and grouping. Our goal is to build a well-behaved objective function that will carry out variable selection without imposing restrictive conditions on the data. The concept is based on the simple fact that the sum of the squares of the relative sizes of vector components is always equal to 1. For any vector in R^p, if there are components that have relative size larger than 1/√p then the other components must have relative size falling under this value. In addition, if many components have similar relative size due to a grouping effect, then the relative size of those components must be small. We propose a new penalty which is actually a non-standard norm. In Figure 3 we have contour plots in 2-D and 3-D for |β|_1 = 1, ‖β‖ = 1 and √(|β|_1 ‖β‖) = 1.

Lemma 1. Given any two p-norms f, g : R^n → [0, +∞), i.e. for some p_1, p_2 ≥ 1,

f(\beta) = \Big( \sum_{i=1}^n |\beta_i|^{p_1} \Big)^{1/p_1}, \qquad g(\beta) = \Big( \sum_{i=1}^n |\beta_i|^{p_2} \Big)^{1/p_2},

we have that √(f · g) is a norm.

The following two theorems demonstrate the grouping effect achieved by a minimizer of the penalized Euclidean distance.

Theorem 1. Let β̂(λ) = arg min_β {L_PED(λ, β)}. If col_i(X) = col_j(X) then β̂_i(λ) = β̂_j(λ).


FIGURE 3. (a) In 2-D, in blue is the contour given by √(‖β‖ · |β|_1) = 1, between |β|_1 = 1 in purple and ‖β‖ = 1 in gold. (b) The 3-D representation of the ball of radius 1 given by the geometric mean of the ℓ1 and ℓ2 norms.

Considering the set-up of very large p compared to n, selecting and grouping variables is an important priority. Theorem 2 below supports the idea of obtaining groups of highly correlated variables, based on the relative size of the corresponding components of minimizers of the penalized Euclidean distance objective function.

Theorem 2. Assume we have a standardized data matrix X, and Y is a centred response vector, as in (2). Let β̂ be the PED estimate (i.e. β̂(λ) = arg min_β {L_PED(λ, β)} for some λ > 0). Define

D_\lambda(i, j) = \frac{1}{\|\hat\beta(\lambda)\|} \, |\hat\beta_i(\lambda) - \hat\beta_j(\lambda)|;

then

D_\lambda(i, j) \le \frac{2\theta_{ij}}{\lambda},

where θ_ij is the angle between col_i(X) and col_j(X), 0 ≤ θ_ij ≤ π/2.

This special grouping effect is facilitated by our particular choice for the objective function. Strong overcrowding on the unit hypersphere around an “irrelevant" column would be detected by a dramatic drop in the relative size of the corresponding components of the solution to our objective function.
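The grouping effect of Theorems 1 and 2 can be observed numerically. In the hedged sketch below (again using a generic derivative-free optimizer rather than the authors' algorithm, so the fit is only approximate), two nearly identical columns receive very similar coefficients, and the difference scaled by ‖β̂(λ)‖ stays within the Theorem 2 bound 2θ_ij/λ.

```python
import numpy as np
from scipy.optimize import minimize

def ped_objective(beta, X, Y, lam):
    return np.linalg.norm(Y - X @ beta) + lam * np.sqrt(
        np.linalg.norm(beta) * np.abs(beta).sum())

rng = np.random.default_rng(3)
n, p = 40, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)     # columns 0 and 1 nearly identical
X -= X.mean(axis=0); X /= np.linalg.norm(X, axis=0)
Y = X @ np.array([1.5, 1.5, 0, 0, 0, 0.0]) + 0.1 * rng.normal(size=n)
Y -= Y.mean()

lam = 0.5
bhat = minimize(ped_objective, np.zeros(p), args=(X, Y, lam),
                method="Nelder-Mead", options={"maxiter": 50000}).x
theta01 = np.arccos(np.clip(X[:, 0] @ X[:, 1], -1, 1))
D01 = abs(bhat[0] - bhat[1]) / np.linalg.norm(bhat)
print(D01, 2 * theta01 / lam)    # D01 should respect the Theorem 2 bound
```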

3.2. Sparsity Properties. It is important and of great interest to consider the case when the number of variables by far exceeds the number of drivers of the optimal sparse model. Therefore the cardinality of the set S is infinite, and the challenge is to find a sparse solution in it. The starting point of our analysis will be a solution of the penalized Euclidean distance problem defined by (4). As before, we let θ̂_j represent the angle between the vectors X_j and Y - Xβ̂.


We note that the angle θ̂_j can be defined as

\hat\theta_j = \frac{\pi}{2} - \arcsin\left( \frac{\mathrm{col}_j(X)^T (Y - X\hat\beta)}{\|Y - X\hat\beta\|} \right), \qquad 0 \le \hat\theta_j < \pi,

whenever ‖Y - Xβ̂‖ ≠ 0. Also, let k̂ = √(‖β̂‖ / |β̂|_1). It is well known that 1/p^{1/4} ≤ k̂ ≤ 1 as long as β̂ ≠ 0_{R^p}. Thus a common-sense and not very restrictive assumption is that 0_{R^p} should not be a minimizer of ‖Y - Xβ‖.

Lemma 2. If β̂(λ) is a solution of (4) then we have that

(5)   \frac{\hat\beta_j(\lambda)}{\|\hat\beta(\lambda)\|} = \hat k \left( \frac{2\cos(\hat\theta_j)}{\lambda} - \hat k \right) \mathrm{sgn}(\hat\beta_j(\lambda)) \quad \text{if } \hat\beta_j(\lambda) \ne 0.

Result 1. We have

|\cos(\hat\theta_j)| \le \frac{\lambda \hat k}{2}

if and only if β̂_j(λ) = 0.

Result 2. If β̂ is the solution of (4) and its j-th component is nonzero (i.e. β̂_j ≠ 0) then

(6)   \mathrm{sgn}(\hat\beta_j) = \mathrm{sgn}(X_j^T(Y - X\hat\beta)) = \mathrm{sgn}\left( \frac{\pi}{2} - \hat\theta_j \right).
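Given any candidate fit, the quantities appearing in Lemma 2 and Result 1 are easy to compute. The following helper is our own illustration: it evaluates cos(θ̂_j) and k̂ for a fitted coefficient vector (assumed not identically zero) and reports which coefficients the criterion of Result 1 would set to zero.

```python
import numpy as np

def ped_zero_criterion(beta_hat, X, Y, lam):
    """Return cos(theta_hat_j) for each j, k_hat, and the indices for which
    |cos(theta_hat_j)| <= lam * k_hat / 2 (Result 1's characterization of zeros).
    Assumes the columns of X have unit norm and beta_hat is not identically zero."""
    resid = Y - X @ beta_hat
    cos_theta = X.T @ resid / np.linalg.norm(resid)
    k_hat = np.sqrt(np.linalg.norm(beta_hat) / np.abs(beta_hat).sum())
    zero_set = np.where(np.abs(cos_theta) <= lam * k_hat / 2)[0]
    return cos_theta, k_hat, zero_set
```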

The following two results demonstrate the existence of a minimizing sequence whose terms have the grouping effect property for the relative size of their components.

Lemma 3. If β̂ is the solution of (4), we have |β̂_j| / ‖β̂(λ)‖ < M ≤ 1 if and only if

|\cos(\hat\theta_j)| \le \frac{\lambda}{2}\left( \hat k + \frac{M}{\hat k} \right),

where M is a constant.

Result 3. The solution of the penalized Euclidean distance problem, β̂(λ), converges to a minimizer of the norm of the residual as λ approaches 0.

3.3. Oracle Property. In this section we demonstrate that PED is also able to recover sparse signals without (pre-)estimates of the noise standard deviation or any knowledge about the signal. In Belloni et al. (2011) this property is referred to as "pivotal recovery". An important aspect is that the development of an oracle theorem also brings a solid theoretical justification for the choice of the parameter λ.

Let Y = Xβ* + σε, where σ is the standard deviation of the noise and ε_i, i = 1, ..., n, are independent and identically distributed with a law F* such that E_{F*}(ε_i) = 0 and E_{F*}(ε_i²) = 1. Let J̃ = supp(β*). For any candidate solution β̂ we can use the notation L for the plain Euclidean loss, i.e. L(β̂) = ‖Y - Xβ̂‖, and the newly introduced norm is denoted by ‖β‖_{(1,2)}, i.e. ‖β‖_{(1,2)} = √(|β|_1 ‖β‖). The idea behind the following considerations is the possibility of estimating ‖X^Tε‖_∞ / ‖ε‖ in probability by using the law F*. Such general estimation is explained in Belloni et al. (2011) as Lemma 1, under the assumptions of their Condition 1. We can use the same general result to show that the method we propose is also capable of producing "pivotal" recoveries of sparse signals.

Before stating the main theorem we introduce some more notation and definitions. The solution of the PED objective function is denoted by β̂(λ). Let u = β* - β̂(λ), ‖u‖_X = ‖Xu‖, p* the cardinality of J̃, M* = ‖β*‖, S = ‖X^Tε‖_∞ / ‖ε‖, c > 1 and, for brevity, c̄ = (c+1)/(c-1). Also, we write ũ for the vector of components of u that correspond to the non-zero elements of β*, i.e. with indices in J̃, and ũ^c for the vector of components of u that correspond to the zero elements of β*, i.e. with indices in the complement of J̃. Consider

(7)   \Delta_{\bar c} = \left\{ u \in R^p : |\tilde u^c|_1 \le \bar c\, |\tilde u|_1 + \frac{c \sqrt[4]{p}}{c-1} \sqrt[4]{p^*}\, M^* \right\}.

Assume that

k_{\bar c}^* = \left(1 - \frac{1}{c}\right) \min_{u \in \Delta_{\bar c}} \frac{\sqrt{p^*}\, \|u\|_X}{2|\tilde u|_1 + \sqrt[4]{p}\, \sqrt[4]{p^*}\, M^*}

is bounded away from 0, i.e. k*_c̄ > k > 0. We also assume the same property for

\bar k_{\bar c}^* = \min_{u \in \Delta_{\bar c}} \frac{1}{\sqrt{n}} \frac{\|u\|_X}{\|u\|}.

In line with the terminology introduced by Bickel et al. (2009) we refer to k*_c̄ and k̄*_c̄ as restricted eigenvalues.

As stated before, our oracle theorem is based on estimation of ‖X^Tε‖_∞ / ‖ε‖. In the case when the law F* = Φ_0 is normal, directly following from Lemma 1 of Belloni et al. (2011), we have:

Lemma 4. Given 0 < α < 1 and some constant c > 1, the choice of parametrization

\lambda = \frac{c \sqrt[4]{p}}{\sqrt{n}}\, \Phi_0^{-1}\!\left( 1 - \frac{\alpha}{2p} \right)

satisfies λ ≥ c ⁴√p S with probability 1 - α.
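In practice the λ of Lemma 4 is a one-line computation; the sketch below evaluates it for the dimensions used later in the paper (n = 800, p = 20000), with α and c chosen arbitrarily for illustration.

```python
from scipy.stats import norm

def lemma4_lambda(n, p, alpha=0.05, c=1.1):
    """lambda = (c * p**(1/4) / sqrt(n)) * Phi0^{-1}(1 - alpha / (2p)),
    the parametrization suggested by Lemma 4 for Gaussian noise."""
    return c * p ** 0.25 / n ** 0.5 * norm.ppf(1 - alpha / (2 * p))

print(lemma4_lambda(n=800, p=20000))
```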



Now we are ready to state the main result.

Theorem 3. (Signal Recovery) Assume that

\lambda \le \frac{\rho \sqrt[4]{p}\, k_{\bar c}^*}{\sqrt{p^*}}

for some 0 < ρ < 1. If also λ ≥ c ⁴√p S then we have

(8)   (1 - \rho^2)\, \|u\|_X \le 2\rho\, L(\beta^*).

A direct consequence is the following oracle inequality:

(9)   \bar k_{\bar c}^*\, \|\hat\beta(\lambda) - \beta^*\| \le \frac{2\rho L(\beta^*)}{(1 - \rho^2)} \frac{1}{\sqrt{n}},

and hence as n → ∞ we have β̂(λ) → β*.

We can use the value of λ in Lemma 4 for practical implementation in order to ensure that λ ≥ c ⁴√p S holds with probability 1 - α. There are also some circumstances when we can consider lower values of λ, as shown in the Corollary.


Corollary 1. Let 0 < ζ < 1 and

\Delta_\zeta = \left\{ u \in R^p : |\tilde u^c|_1\, \zeta \le |\tilde u|_1 \left( \frac{2\sqrt{n}}{\sqrt[4]{p}} - \zeta \right) + |\beta^*|_1 \left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right) \right\}.

If

k_\zeta^* = \min_{u \in \Delta_\zeta} \frac{ \dfrac{\sqrt{p^*}}{\sqrt{n}} \|u\|_X }{ \dfrac{2\sqrt{n}}{\zeta \sqrt[4]{p}} |\tilde u|_1 + \dfrac{|\beta^*|_1}{\zeta} \left( 1 - \dfrac{\sqrt{n}}{\sqrt[4]{p}} \right) } > k > 0

and, for

\lambda = c\, \Phi_0^{-1}\!\left( 1 - \frac{\alpha}{2p} \right) \frac{\sqrt[4]{p}}{n}

with c > 1, we can check that

\sqrt{ \frac{\|\hat\beta(\lambda)\|}{|\hat\beta(\lambda)|_1} } - \frac{\sqrt{n}}{c\sqrt[4]{p}} \ge \zeta > 0,

and, at the same time, we assume

\lambda \le \frac{\rho \sqrt[4]{p}\, k_\zeta^*}{\sqrt{n}\, \sqrt{p^*}}

for some 0 < ρ < 1, then we also have an oracle property, i.e.

\|\hat\beta(\lambda) - \beta^*\| \le \frac{\text{const.}}{\sqrt{n}}

with probability 1 - α.

From the signal recovery theorem and corollary we obtain that ‖β̂(λ) - β*‖ ≤ C/√n with probability 1 - α, where C > 0 is a constant. Thus, if j is an index where there is no signal, i.e. β*_j = 0, then from the previous inequality we have |β̂_j(λ)| ≤ ‖β̂(λ) - β*‖ ≤ C/√n. If ‖β̂(λ)‖ ≠ 0 we can divide the constant by ‖β̂(λ)‖ and get

(10)   \frac{|\hat\beta_j(\lambda)|}{\|\hat\beta(\lambda)\|} < \frac{C}{\sqrt{n}}.

We will use (10) to inform a threshold choice as part of the PED fitting algorithm. As well as the dependence on n, we also investigate the effect of p on the relative size of the components. Note that the components of β̂(λ) whose relative size (i.e. |β̂_j(λ)| / ‖β̂(λ)‖) is small belong to columns of X that make an angle with Y - Xβ̂(λ) much closer to π/2 than the rest of the columns. For example, since 1/p^{1/4} ≤ k̂ ≤ 1, if |β̂_j(λ)| / ‖β̂(λ)‖ < δ/√(np) for some δ > 0, then from equation (5) we have

(11)   \cos(\hat\theta_j) < \frac{\lambda}{2}\left( \hat k + \frac{\delta}{\hat k \sqrt{np}} \right) \le \frac{\lambda}{2}\left( \hat k + \frac{\delta}{n^{1/2} p^{1/4}} \right),

and remember that λ = O(⁴√p / √n). For a practical method we implement the detection of a set I(λ, δ) of "irrelevant" indices (with zero parameter estimates) as follows:

(12)   I(\lambda, \delta) = \left\{ j : \frac{|\hat\beta_j(\lambda)|}{\|\hat\beta(\lambda)\|} < \frac{\delta}{\sqrt{np}} \right\}.
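A direct implementation of the screening rule is sketched below, under our reading of (12) as a relative-size threshold δ/√(np). The functions are our own; they assume an estimate β̂(λ) has already been obtained by some fitting procedure.

```python
import numpy as np

def irrelevant_set(beta_hat, n, delta):
    """Return the index set I(lambda, delta) of (12): components whose relative
    size |beta_hat_j| / ||beta_hat|| falls below delta / sqrt(n * p)."""
    p = len(beta_hat)
    rel_size = np.abs(beta_hat) / np.linalg.norm(beta_hat)
    return np.where(rel_size < delta / np.sqrt(n * p))[0]

def screen(beta_hat, n, delta):
    """Set the flagged components to zero, keeping the remaining ones."""
    out = beta_hat.copy()
    out[irrelevant_set(beta_hat, n, delta)] = 0.0
    return out
```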

We tested c = 20, so λ = 0.2973, and in this case we could verify the condition from the corollary for the n = 800, p = 20000 datasets. Hence we take a reasonable choice as λ_0 = 1/2, based on the result of Corollary 1. For this particular example for PED we have used Algorithm C, which is an iterative algorithm with N = 3 steps.

Model         Size             ℓ1            ℓ2            ZAZ                NZAZ
Truth         14               --            --            19986              0
SIS           20.72 (3.89)     6.35 (2.84)   2.15 (0.77)   19976.96 (4.44)    2.32 (1.28)
ISIS          28.57 (2.12)     4.16 (2.04)   0.90 (0.97)   19976.97 (4.45)    2.29 (1.30)
Elastic Net   144.23 (55.98)   8.02 (1.98)   2.47 (0.65)   19855.77 (55.97)   0.00 (0.00)
PED           12.96 (1.07)     3.51 (1.27)   1.48 (0.48)   19985.63 (0.72)    1.41 (0.81)

TABLE 1. Simulation results based on Example 2 (n = 800, p = 20000, true model size d = 14). Results are reported based on averaging over 100 data sets. The best performances for each criterion are in bold. The standard errors of the estimates are presented within parentheses.

Table 1 summarizes results from the simulation study of Example 2. The output illustrates that the PED method performs well with respect to all criteria. As noticed in earlier examples, the estimated model sizes are very close to the true model size. Note that the predictors are dependent in nature, so a lower estimated model size is more desirable than selecting larger models. The Elastic Net selects much larger models than the others. The estimation errors with respect to the ℓ1 and ℓ2 distances are much lower for PED than for SIS, and generally lower than for ISIS and the Elastic Net (except ISIS for ℓ2). As mentioned earlier, a strong feature of the PED method is that it identifies true zero variables to be zero. The number of estimated zero variables in PED is very close to the number of true zero variables, whereas SIS/ISIS do not perform quite as well, and the Elastic Net is much worse. Also, the false positive rate is notably smaller for PED than for SIS/ISIS. The Elastic Net has the very lowest NZAZ rate, but this is at the expense of selecting large models and a lower ZAZ value.

5.2. Mass spectrometry data from melanoma patients. We consider an application of the method to a proteomics dataset from the study of melanoma (skin cancer). The mass spectrometry dataset was described by Mian et al. (2005) and further analysed by Browne et al. (2010). The data consist of mass spectrometry scans from serum samples of 205 patients, with 101 patients with Stage I melanoma (least severe) and 104 patients with Stage IV melanoma (most severe). Each mass spectrometry scan consists of an intensity for 13951 mass over charge (m/z) values between 2000 and 30000 Daltons. It is of interest to find which m/z values could be associated with the stage of the disease, which could point to potential proteins for use as biomarkers. We first fit a set of 500 important peaks

to the overall mean of the scans using the deterministic peak finding algorithm of Browne et al. (2010) to obtain 500 m/z values at peak locations. We consider the disease stage to be the response, with Y = -1 for Stage I and Y = 1 for Stage IV. We fit the PED regression model versus the intensities at the 500 peak locations. We also include a column of ones in the design matrix X, which is of size n = 205 by p = 501. The data are available at http://www.maths.nottingham.ac.uk/~ild/mass-spec

Here we illustrate the one-pass PED method with Algorithm B. A plot of the fitted values Ŷ is given in Figure 6, and we see that the fitted model predicts the stage of the disease quite well.

FIGURE 6. The predicted score Ŷ from the fitted PED model. The Stage I patients are labelled 1-101 on the x-axis and then the Stage IV patients are labelled 102-205.

In the PED fitting algorithm the non-zero regression coefficients occur at 20 m/z values out of the 500 fitted peak locations. Browne et al. (2010) also considered a mixed effects Gaussian mixture model and a two-stage t-test for detecting significant peaks (where just the first 200 peaks were considered). Of the 11 largest peaks selected by PED, 10 were also selected by the two-stage method and 9 were selected by the linear mixed model. Hence, it is reassuring that so many important peaks are chosen by all the methods.

It is also of interest to consider classification of the scans into the two groups, as discussed by Mian et al. (2005). We consider 10-fold cross-validation (using 10 approximately equal random splits of the data) and measure the success of the classification by the overall classification error, together with the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), with Stage I treated as Negative and Stage IV as Positive. The PED method classifies an observation as Positive if Ŷ > 0 and otherwise Negative. We compared the performance of PED with one-pass Algorithm B, PED with iterative Algorithm D, PED-AIC (using Algorithm B except AIC to

choose δ), PED-BIC (using Algorithm B except BIC to choose δ), SIS, ISIS (using the R package SIS) and the Elastic Net (using the R package glmnet). The results are presented in Table 2.

Method              Error (%)   p̂      Sensitivity (%)   Specificity (%)   PPV (%)   NPV (%)
PED (Algorithm B)   11.66       15.9    93.79             84.35             85.64     92.56
PED (Algorithm D)   11.14       18.1    95.18             82.27             86.71     94.72
PED.AIC             12.23       10.6    90.46             86.91             86.19     88.81
PED.BIC             18.09        8.0    89.41             75.87             78.75     85.87
SIS                 16.22        5.9    88.83             76.22             82.16     86.12
ISIS                16.04        8      87.90             79.58             83.01     87.85
Elastic net         16.12       38      92.66             80.14             80.61     88.32

TABLE 2. Classification performance using 10-fold cross-validation in the melanoma dataset. Error is the average classification error from 10-fold cross-validation, p̂ is the average model size. Sensitivity = TP/(TP+FN), Specificity = TN/(FP+TN), PPV = TP/(TP+FP), NPV = TN/(FN+TN), where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.

FIGURE 7. Sensitivity analysis of the melanoma data.
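The classification rule and the summary measures of Table 2 amount to only a few lines of code. The sketch below is our own illustration, under the stated convention that Stage IV is Positive and an observation is called Positive when Ŷ > 0; the cross-validation loop that produces Ŷ is omitted.

```python
import numpy as np

def classification_summary(y_true, y_hat):
    """y_true in {-1, +1} (Stage I = -1 = Negative, Stage IV = +1 = Positive);
    predict Positive when y_hat > 0, then report the Table 2 measures."""
    pred = np.where(y_hat > 0, 1, -1)
    TP = np.sum((pred == 1) & (y_true == 1))
    TN = np.sum((pred == -1) & (y_true == -1))
    FP = np.sum((pred == 1) & (y_true == -1))
    FN = np.sum((pred == -1) & (y_true == 1))
    return {
        "error": (FP + FN) / len(y_true),
        "sensitivity": TP / (TP + FN),
        "specificity": TN / (FP + TN),
        "PPV": TP / (TP + FP),
        "NPV": TN / (FN + TN),
    }
```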

From Table 2 we see that the PED methods work very well here. The iterative PED method using Algorithm D has the best classification performance and the highest sensitivity, PPV and NPV values. Also, PED-AIC has the highest specificity of the methods. We also carry out a sensitivity analysis in Figure 7 by investigating the cross-validation error for different choices of λ and δ. We see that, for a given choice of λ, choosing too large a δ can have a very detrimental effect. The choice of λ is less sensitive than the choice of δ. For our example λ = 6p^{1/4}/n^{1/2} = 1.9816, and so the choice of δ = 33 is reasonable for the one-pass method.


6. CONCLUSIONS

We have introduced a sparse linear regression method which does not make use of existing sparse regression algorithms. Backed up by a concrete and relatively simple theoretical explanation, we show that the proposed methodology is useful and practical to implement without imposing any assumptions that in general could be infeasible to verify. A notable aspect is that we did not assume any conditions between the number of observations and the number of predictors, and so the methodology is generally applicable. Further extensions of the work will follow in a natural way, for example as Fan et al. (2009) have extended ISIS to generalized linear models and classification problems. It will be interesting to analyse the mass spectrometry data using PED for binary response models, and to compare this with the PED regression approach of our paper.

The ultra-high dimensional situation requires a few layers of methodological and computational implementation. Using several examples, we demonstrated that the proposed method can be effective for many instances of variable screening, variable selection or prediction for p ≥ n and p ≫ n. The performance of the proposed method has been compared with a number of well known variable selection methods and has performed favourably. Further, the predictive performance of the proposed method has been compared to some current state-of-the-art predictive techniques and it seems very effective in higher dimensional situations. The simulation study and the data analysis reveal that the proposed method is very resilient to the so-called "curse of dimensionality". In the simulation studies, moving from high to ultra-high dimensions, we saw how the proposed method consistently performed well in discarding the irrelevant information in the data. In conclusion, the method appears to be a promising addition to the sparse regression toolbox.

REFERENCES

Ando, T. and Li, K.-C. (2013). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, DOI: 10.1080/01621459.2013.838168.
Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 791-806.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4), 1705-1732.
Bonnans, J. F., Gilbert, J. C., Lemaréchal, C. and Sagastizábal, C. A. (2006). Numerical Optimization: Theoretical and Practical Aspects. Berlin: Springer-Verlag.
Browne, W. J., Dryden, I. L., Handley, K., Mian, S. and Schadendorf, D. (2010). Mixed effect modelling of proteomic mass spectrometry data using Gaussian mixtures. Journal of the Royal Statistical Society, Series C (Applied Statistics), 59, 617-633.


Caballero Gaudes, C., Petridou, N., Francis, S., Dryden, I. L. and Gowland, P. (2013). Paradigm Free Mapping with sparse regression automatically detects single-trial fMRI BOLD responses. Human Brain Mapping, 34, 501-518.
Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313-2351.
Dryden, I. L. and Mardia, K. V. (1998). Statistical Shape Analysis. John Wiley, Chichester.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32, 407-499.
Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space. Journal of the Royal Statistical Society, Series B, 70, 849-911.
Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10, 2013-2038.
Hastie, T., Tibshirani, R. and Friedman, J. (2008). The Elements of Statistical Learning (2nd edition). Springer, New York.
Hoerl, A. E. and Kennard, R. W. (1970a). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
Hoerl, A. E. and Kennard, R. W. (1970b). Ridge regression: applications to nonorthogonal problems (Corr: V12 p723). Technometrics, 12, 69-82.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer, New York.
Lewis, A. S. and Overton, M. L. (2008). Nonsmooth optimization via BFGS. Technical Report, New York University.
Mian, S., Ugurel, S., Parkinson, E., Schlenzka, I., Dryden, I. L., Lancashire, L., Ball, G., Creaser, C., Rees, R. and Schadendorf, D. (2005). Serum proteomic fingerprinting discriminates between clinical stages and predicts disease progression in melanoma patients. Journal of Clinical Oncology, 33, 5088-5093.
Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151), 773-782.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67, 91-108.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67, 301-320.


APPENDIX A: PROOFS

In this section we prove the results.

Proof. (Lemma 1.) Let

C = \{ \beta \in R^n \mid \sqrt{f(\beta) g(\beta)} \le 1 \}.

We notice that C ≡ {β ∈ R^n : [f(β)g(β)]^{2 p_1 p_2} ≤ 1} and therefore C is a bounded, closed and convex subset of R^n which contains the origin. Let Epi(h) denote the epigraph of a function h : R^n → R, i.e.

\mathrm{Epi}(h) = \{ (\beta, t) \in R^{n+1} \mid h(\beta) \le t \}.

We see that in our case Epi(√(f · g)) = {t(C, 1) | t ∈ [0, +∞)}, and therefore Epi(√(f · g)) is a convex cone in R^{n+1} since C is a convex set in R^n. This shows that √(f · g) is a convex function. Because √(f · g) is convex and homogeneous of degree 1 it follows that it must also satisfy the triangle inequality. Therefore √(f · g) is a norm on R^n. What we have shown is that the geometric mean of two p-norms is also a norm on a finite dimensional vector space. □

Proof. (Theorem 1.) We follow a similar idea to that in Zou and Hastie (2005). Let β̂ = arg min_β {L_PED(λ, β)} and assume that col_i(X) = col_j(X). Let J(β) = λ√(‖β‖ |β|_1). If β̂_i ≠ β̂_j, consider

\hat\beta_k^* = \begin{cases} \hat\beta_k & \text{if } k \ne i \text{ and } k \ne j, \\ \tfrac{1}{2}(\hat\beta_i + \hat\beta_j) & \text{if } k = i \text{ or } k = j. \end{cases}

We have ‖Y - Xβ̂‖² = ‖Y - Xβ̂*‖². We also consider

J(\hat\beta) = \lambda \sqrt{ \Big( \sum_{k=1}^p |\hat\beta_k| \Big) \sqrt{ \sum_{k=1}^p \hat\beta_k^2 } }

and notice that

J(\hat\beta^*) = \lambda \sqrt{ \Big( \sum_{k \ne i,j} |\hat\beta_k| + |\hat\beta_i + \hat\beta_j| \Big) \sqrt{ \sum_{k \ne i,j} \hat\beta_k^2 + \tfrac{1}{2}|\hat\beta_i + \hat\beta_j|^2 } }.

It is clear that

\Big( \sum_{k \ne i,j} |\hat\beta_k| + |\hat\beta_i + \hat\beta_j| \Big) \sqrt{ \sum_{k \ne i,j} \hat\beta_k^2 + \tfrac{1}{2}|\hat\beta_i + \hat\beta_j|^2 } < \Big( \sum_{k=1}^p |\hat\beta_k| \Big) \sqrt{ \sum_{k=1}^p \hat\beta_k^2 },

which implies J(β̂*) < J(β̂), a contradiction with the fact that β̂ = arg min_β {L_PED(λ, β)}. □


Proof. (Theorem 2.) If col_i(X) = col_j(X) then β̂_i = β̂_j, which means that the proposed regression method assigns identical coefficients to identical variables. Since β̂(λ) = arg min_β {L_PED(λ, β)} we have

(13)   \frac{\partial L_{PED}(\lambda, \beta)}{\partial \beta_k}\bigg|_{\beta = \hat\beta(\lambda)} = 0 \quad \text{for every } k = 1, 2, ..., p,

unless β̂_k(λ) = 0. Thus, if β̂_k(λ) ≠ 0 we have

(14)   \frac{\lambda}{2} \frac{\hat\beta_k(\lambda)\, |\hat\beta(\lambda)|_1 / \|\hat\beta(\lambda)\|}{\sqrt{\|\hat\beta(\lambda)\| \cdot |\hat\beta(\lambda)|_1}} - \frac{\mathrm{col}_k(X)^T [Y - X\hat\beta(\lambda)]}{\|Y - X\hat\beta(\lambda)\|} + \frac{\lambda}{2} \frac{\mathrm{sgn}\{\hat\beta_k(\lambda)\}\, \|\hat\beta(\lambda)\|}{\sqrt{\|\hat\beta(\lambda)\| \cdot |\hat\beta(\lambda)|_1}} = 0.

If we take k = i and k = j, after subtraction we obtain

(15)   \frac{[\mathrm{col}_j(X)^T - \mathrm{col}_i(X)^T][Y - X\hat\beta(\lambda)]}{\|Y - X\hat\beta(\lambda)\|} + \frac{\lambda}{2} \frac{[\hat\beta_i(\lambda) - \hat\beta_j(\lambda)]\, |\hat\beta(\lambda)|_1}{\sqrt{\|\hat\beta(\lambda)\|^3 \cdot |\hat\beta(\lambda)|_1}} = 0,

since sgn{β̂_i(λ)} = sgn{β̂_j(λ)}. Thus we get

(16)   \frac{\hat\beta_i(\lambda) - \hat\beta_j(\lambda)}{\|\hat\beta(\lambda)\|} = \frac{2}{\lambda} \frac{\sqrt{\|\hat\beta(\lambda)\| \cdot |\hat\beta(\lambda)|_1}}{|\hat\beta(\lambda)|_1}\, [\mathrm{col}_j(X)^T - \mathrm{col}_i(X)^T]\, \hat r(\lambda),

where r̂(λ) = (Y - Xβ̂(λ)) / ‖Y - Xβ̂(λ)‖ and ‖col_j(X) - col_i(X)‖² = 2(1 - ρ) since X is standardized, with ρ = cos(θ_ij). We have √(‖β‖ · |β|_1)/|β|_1 ≤ 1 for any nonzero vector β in R^p and ‖r̂(λ)‖ ≤ 1. Thus equation (16) implies that

(17)   D_\lambda(i, j) \le \frac{2\|\hat r(\lambda)\|}{\lambda} \|\mathrm{col}_i(X) - \mathrm{col}_j(X)\| \le \frac{2}{\lambda}\sqrt{2(1-\rho)} < \frac{2\theta_{ij}}{\lambda},

which proves the grouping effect property for the proposed method. □

Proof. (Result 1.) Here we prove the "if" part, as the "only if" implication follows directly from the previous Lemma. Let us assume that

\hat\beta(\lambda) = (\hat\beta_1(\lambda), ..., \hat\beta_{j-1}(\lambda), 0, \hat\beta_{j+1}(\lambda), ..., \hat\beta_p(\lambda)) = \arg\min_\beta \{ L_{PED}(\lambda, \beta) \}

for a given λ > 0. Here we can fix λ and, for brevity, omit it from the notation in the course of this proof. For any t > 0 we have

\frac{L_{PED}(\hat\beta_1, ..., \hat\beta_{j-1}, t, \hat\beta_{j+1}, ..., \hat\beta_p) - L_{PED}(\hat\beta_1, ..., \hat\beta_{j-1}, 0, \hat\beta_{j+1}, ..., \hat\beta_p)}{t} \ge 0.

Again, for brevity we denote β̂_{t@j} = (β̂_1, ..., β̂_{j-1}, t, β̂_{j+1}, ..., β̂_p)^T and let θ̂_{t@j} be the angle between col_j(X) and Y - Xβ̂_{t@j}. By the mean value theorem (Lagrange), there exists 0 < t* < t such that

\frac{L_{PED}(\hat\beta_{t@j}) - L_{PED}(\hat\beta_{0@j})}{t} = -\cos(\hat\theta_{t^*@j}) + \lambda\, \frac{\sqrt{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1} - \sqrt{\|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}}{t}.

If we rationalize the numerator of the second fraction in the previous equation, we get

\frac{L_{PED}(\hat\beta_{t@j}) - L_{PED}(\hat\beta_{0@j})}{t} = -\cos(\hat\theta_{t^*@j}) + \lambda\, \frac{\dfrac{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1 - \|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}{t}}{\sqrt{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1} + \sqrt{\|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}},

and thus

\cos(\hat\theta_{t^*@j}) \le \lambda\, \frac{\dfrac{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1 - \|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}{t}}{\sqrt{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1} + \sqrt{\|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}}.

Also

\frac{\|\hat\beta_{t@j}\| \cdot |\hat\beta_{t@j}|_1 - \|\hat\beta_{0@j}\| \cdot |\hat\beta_{0@j}|_1}{t} = \frac{\|\hat\beta_{t@j}\| - \|\hat\beta_{0@j}\|}{t} |\hat\beta_{t@j}|_1 + \|\hat\beta_{0@j}\| \frac{|\hat\beta_{t@j}|_1 - |\hat\beta_{0@j}|_1}{t},

and we notice that (|β̂_{t@j}|_1 - |β̂_{0@j}|_1)/t = 1 for any t > 0. Letting t → 0 we obtain

\cos(\hat\theta_{0@j}) \le \frac{\lambda}{2} \sqrt{\frac{\|\hat\beta_{0@j}\|}{|\hat\beta_{0@j}|_1}} = \frac{\lambda \hat k}{2}.

Analogously, by starting with t < 0, we can show that

\cos(\hat\theta_{0@j}) \ge -\frac{\lambda}{2} \sqrt{\frac{\|\hat\beta_{0@j}\|}{|\hat\beta_{0@j}|_1}} = -\frac{\lambda \hat k}{2}. \qquad \square

Proof. (Result 2.) By writing the necessary conditions for optimality for problem (4) we have

\mathrm{sgn}(X_j^T(Y - X\hat\beta)) = \mathrm{sgn}\left( \frac{\pi}{2} - \hat\theta_j \right)

and

\frac{\hat\beta_j(\lambda)}{\|\hat\beta(\lambda)\|} = \hat k \left( \frac{2 X_j^T (Y - X\hat\beta)}{\lambda \|Y - X\hat\beta\|} - \mathrm{sgn}(\hat\beta_j(\lambda))\, \hat k \right)

if β̂_j(λ) ≠ 0. Since k̂ > 0 we have sgn(β̂_j) = sgn(X_j^T(Y - Xβ̂)) = sgn(cos(θ̂_j)). □

Proof. (Lemma 3.) The proof follows directly from (5) and (6). □

Proof. (Result 3.) If β̂(λ) is a solution of (4) we have

\cos(\hat\theta_j) \le \frac{\lambda}{2}\left( \hat k + \frac{M}{\hat k} \right),

and therefore cos(θ̂_j) → 0 when λ → 0, since M ≤ 1 and p^{-1/4} ≤ k̂ ≤ 1. □


Proof. (Theorem 3.) The proof follows a similar method to that of Theorem 1 in Belloni et al. (2011). Given that β̂(λ) is a minimizer of the PED objective function for a given λ, we have

L(\hat\beta(\lambda)) - L(\beta^*) \le \lambda |\beta^*|_1 - \lambda \|\hat\beta(\lambda)\|_{(1,2)} \le \lambda |\beta^*|_1 - \frac{\lambda}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1.

We obtain

L(\hat\beta(\lambda)) - L(\beta^*) \le \frac{\lambda}{\sqrt[4]{p}} |\beta^*|_1 - \frac{\lambda}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 + \sqrt[4]{p^*}\, \lambda M^* \le \frac{\lambda}{\sqrt[4]{p}} (|\tilde u|_1 - |\tilde u^c|_1) + \lambda M^* \sqrt[4]{p^*}.

At the same time, due to the convexity of L, we have

L(\hat\beta(\lambda)) - L(\beta^*) \ge (\nabla L(\beta^*))^T u \ge -\frac{\|X^T \epsilon\|_\infty}{\|\epsilon\|} |u|_1 \ge -\frac{\lambda}{c\sqrt[4]{p}} (|\tilde u|_1 + |\tilde u^c|_1)

if λ ≥ c ⁴√p S, where S = ‖X^Tε‖_∞ / ‖ε‖. Thus we have

|\tilde u^c|_1 \le \frac{c+1}{c-1} |\tilde u|_1 + \frac{c\sqrt[4]{p}}{c-1} \sqrt[4]{p^*}\, M^*

and also

|u|_1 \le \frac{2c}{c-1} |\tilde u|_1 + \frac{c\sqrt[4]{p}}{c-1} \sqrt[4]{p^*}\, M^*.

Now

L(\hat\beta(\lambda)) - L(\beta^*) \le |L(\hat\beta(\lambda)) - L(\beta^*)| \le \frac{\lambda}{c\sqrt[4]{p}} (|\tilde u|_1 + |\tilde u^c|_1) \le \frac{\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_{\bar c}^*}.

Considering the identity

L^2(\hat\beta(\lambda)) - L^2(\beta^*) = \|u\|_X^2 - 2\sigma(\epsilon^T X u)

along with

L^2(\hat\beta(\lambda)) - L^2(\beta^*) = (L(\hat\beta(\lambda)) - L(\beta^*))(L(\hat\beta(\lambda)) + L(\beta^*))

and the fact that 2|σε^T X u| ≤ 2 L(β*) S |u|_1, we deduce

\|u\|_X^2 \le \frac{\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_{\bar c}^*} \left( L(\beta^*) + \frac{\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_{\bar c}^*} \right) + L(\beta^*)\, \frac{\lambda \sqrt{p^*}\, \|u\|_X}{c\sqrt[4]{p}\, k_{\bar c}^*} \le \left( \frac{\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_{\bar c}^*} \right)^2 + \frac{2\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_{\bar c}^*} L(\beta^*).

Thus we have

\left[ 1 - \left( \frac{\lambda \sqrt{p^*}}{\sqrt[4]{p}\, k_{\bar c}^*} \right)^2 \right] \|u\|_X^2 \le \frac{2\lambda \sqrt{p^*}}{\sqrt[4]{p}\, k_{\bar c}^*} L(\beta^*)\, \|u\|_X,

and if λ√(p*)/(⁴√p k*_c̄) < ρ < 1, we obtain (1 - ρ²)‖u‖_X ≤ 2ρL(β*). □

Proof. (Corollary 1.) We have

L(\hat\beta(\lambda)) - L(\beta^*) \le \frac{\lambda\sqrt{n}}{\sqrt[4]{p}} |\beta^*|_1 - \frac{\lambda\sqrt{n}}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 + \lambda\left( \frac{\sqrt{n}}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 - \|\hat\beta(\lambda)\|_{(1,2)} \right) + \lambda |\beta^*|_1 \left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right).

If λ√n/(c⁴√p) ≥ S with probability 1 - α for some c > 1, we obtain

(18)   -\frac{\sqrt{n}}{c\sqrt[4]{p}} (|\tilde u|_1 + |\tilde u^c|_1) \le \frac{\sqrt{n}}{\sqrt[4]{p}} |\tilde u|_1 - \frac{\sqrt{n}}{\sqrt[4]{p}} |\tilde u^c|_1 + \left( \frac{\sqrt{n}}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 - \|\hat\beta(\lambda)\|_{(1,2)} \right) + |\beta^*|_1 \left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right).

At the same time we can write

\frac{\sqrt{n}}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 - \|\hat\beta(\lambda)\|_{(1,2)} + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right) = |\hat\beta(\lambda)|_1 \left( \frac{\sqrt{n}}{\sqrt[4]{p}} - \sqrt{\frac{\|\hat\beta(\lambda)\|}{|\hat\beta(\lambda)|_1}} \right) + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right),

and thus we have

\frac{\sqrt{n}}{\sqrt[4]{p}} |\hat\beta(\lambda)|_1 - \|\hat\beta(\lambda)\|_{(1,2)} + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right) \le |u|_1 \left( \frac{\sqrt{n}}{\sqrt[4]{p}} - \sqrt{\frac{\|\hat\beta(\lambda)\|}{|\hat\beta(\lambda)|_1}} \right) + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right).

By combining with inequality (18) we obtain

\frac{\sqrt{n}}{\sqrt[4]{p}}\left( 1 - \frac{1}{c} \right) |\tilde u^c|_1 \le \frac{\sqrt{n}}{\sqrt[4]{p}}\left( 1 + \frac{1}{c} \right) |\tilde u|_1 + |u|_1 \left( \frac{\sqrt{n}}{\sqrt[4]{p}} - \sqrt{\frac{\|\hat\beta(\lambda)\|}{|\hat\beta(\lambda)|_1}} \right) + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right),

and it immediately follows that

|u|_1 \left( \sqrt{\frac{\|\hat\beta(\lambda)\|}{|\hat\beta(\lambda)|_1}} - \frac{\sqrt{n}}{c\sqrt[4]{p}} \right) \le \frac{2\sqrt{n}}{\sqrt[4]{p}} |\tilde u|_1 + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right).

If √(‖β̂(λ)‖ / |β̂(λ)|_1) - √n/(c⁴√p) ≥ ζ > 0 we have

|u|_1 \le \frac{2\sqrt{n}}{\zeta\sqrt[4]{p}} |\tilde u|_1 + \frac{|\beta^*|_1}{\zeta}\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right),

and equivalently

|\tilde u^c|_1\, \zeta \le |\tilde u|_1\left( \frac{2\sqrt{n}}{\sqrt[4]{p}} - \zeta \right) + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right).

Considering

\Delta_\zeta = \left\{ u \in R^p : |\tilde u^c|_1\, \zeta \le |\tilde u|_1\left( \frac{2\sqrt{n}}{\sqrt[4]{p}} - \zeta \right) + |\beta^*|_1\left( 1 - \frac{\sqrt{n}}{\sqrt[4]{p}} \right) \right\}

and assuming

k_\zeta^* = \min_{u \in \Delta_\zeta} \frac{ \dfrac{\sqrt{p^*}}{\sqrt{n}} \|u\|_X }{ \dfrac{2\sqrt{n}}{\zeta\sqrt[4]{p}} |\tilde u|_1 + \dfrac{|\beta^*|_1}{\zeta}\left( 1 - \dfrac{\sqrt{n}}{\sqrt[4]{p}} \right) } > k > 0,

we get

L(\hat\beta(\lambda)) - L(\beta^*) \le |L(\hat\beta(\lambda)) - L(\beta^*)| \le \frac{\lambda\sqrt{n}}{c\sqrt[4]{p}} (|\tilde u|_1 + |\tilde u^c|_1) = \frac{\lambda\sqrt{n}\, |u|_1}{c\sqrt[4]{p}} \le \frac{\lambda \sqrt{p^*}\, \|u\|_X}{\sqrt[4]{p}\, k_\zeta^*}.

The rest of the proof is virtually identical with the last part of the argument detailed for Theorem 3. □

DEPARTMENT OF MATHEMATICS, COLLEGE OF WILLIAM & MARY, WILLIAMSBURG, VIRGINIA, 23185, USA
E-mail address: [email protected]

DEPARTMENT OF MATHEMATICS, COLLEGE OF WILLIAM & MARY, WILLIAMSBURG, VIRGINIA, 23185, USA
E-mail address: [email protected]

SCHOOL OF MATHEMATICAL SCIENCES, THE UNIVERSITY OF NOTTINGHAM, NOTTINGHAM, NG7 2RD, UK
E-mail address: [email protected]
