Interquantile Shrinkage in Regression Models

Liewen Jiang, Huixia Judy Wang, Howard D. Bondell
Abstract

Conventional analysis using quantile regression typically fits the regression model at different quantiles separately. However, in situations where the quantile coefficients share some common feature, joint modeling of multiple quantiles to accommodate the commonality often leads to more efficient estimation. One example of a common feature is that a predictor may have a constant effect over one region of quantile levels but varying effects in other regions. To perform estimation and detection of interquantile commonality automatically, we develop two penalization methods. When the quantile slope coefficients indeed do not change across quantile levels, the proposed methods shrink the slopes towards constant and thus improve estimation efficiency. We establish the oracle properties of the two proposed penalization methods. Through numerical investigations, we demonstrate that the proposed methods lead to estimates with efficiency competitive with or higher than that of standard quantile regression estimation in finite samples. Supplemental materials for the article are available online.
Key Words: Fused lasso; Non-crossing; Oracle; Quantile regression; Smoothing; Sup-norm.
1 Introduction
Quantile regression has attracted an increasing amount of attention since its introduction by Koenker and Bassett (1978). One major advantage of quantile regression over classical mean regression is its flexibility in assessing the effect of predictors on different locations of the response distribution. Regression at multiple quantiles provides a more comprehensive statistical view than analysis at the mean or at a single quantile level. The standard approach to multiple-quantile regression is to fit the regression model at each quantile separately. However, in applications where the quantile coefficients enjoy some common features across quantile levels, joint modeling of multiple quantiles can lead to more efficient estimation. One special case is the regression model with independent and identically distributed (i.i.d.) errors, so that the quantile slope coefficients are constant across all quantile levels. Under this assumption, Zou and Yuan (2008a) proposed the Composite Quantile Regression (CQR) method, which combines the objective functions at multiple quantiles to estimate the common slopes. For models with i.i.d. errors, the composite estimator is more efficient than the conventional estimator obtained at a single quantile level. However, in practice, the i.i.d. error assumption is restrictive and needs to be verified. It is likely that the quantile slope coefficients appear constant in certain quantile regions but vary in others. For instance, the analysis of international economic growth data shows that the impact of political instability on GDP growth tends to be constant for τ ∈ [0, 0.5] but varies at upper quantiles; see Figure 3 in Section 5. In order to borrow information across quantiles to improve efficiency, the common structure of the quantile slopes has to be determined.

One way to identify the commonality of quantile slopes at multiple quantile levels is through hypothesis testing. Koenker (2005) described the Wald-type test through direct estimation of the asymptotic covariance matrix of the quantile coefficient estimates at multiple quantiles. The Wald-type test can be used to test the equality of quantile slopes at a given set of quantile levels, and it is implemented in the function 'anova.rq' in the R package quantreg. The testing approach is feasible if we only test the equality of slopes at a few given
quantile levels. However, to identify the complete quantile regions with constant quantile slopes, at least $2^{p(K-1)}$ tests have to be conducted, where K is the total number of quantile levels and p is the number of predictors. This makes the testing procedure complicated, especially when K and p are large. To overcome this drawback, we propose penalization approaches that allow simultaneous estimation and automatic shrinkage of the interquantile differences of the slope coefficients.

Penalization methods are useful tools for variable selection. In conditional mean regression, various penalties have been introduced to produce sparse models. Tibshirani (1996) employed the $L_1$-norm in the Least Absolute Shrinkage and Selection Operator (lasso) for variable selection. Fan and Li (2001) proposed the Smoothly Clipped Absolute Deviation (SCAD) penalty, a nonconcave penalty defined through its first-order derivative. Zou (2006) extended the $L_1$-penalty by including adaptive weights, referred to as the adaptive lasso penalty. Yuan and Lin (2006) introduced the group lasso to identify significant factors represented as groups of predictors. In applications where the predictors have a natural ordering, Tibshirani et al. (2005) introduced the fused lasso, which penalizes the $L_1$-norm of both the coefficients and their successive differences.

The penalization idea has also been adopted for quantile regression models in various contexts. Koenker (2004) employed the lasso penalty to shrink random subject effects towards constant in studying conditional quantiles of longitudinal data. Li and Zhu (2008) studied $L_1$-norm quantile regression and computed the entire solution path; further work on this problem, with improved algorithms, can be found in Kato (2010) and Osborne and Turlach (2011). Wu and Liu (2009a) further discussed SCAD and adaptive lasso methods in quantile regression and demonstrated their oracle properties. Zou and Yuan (2008b) adopted a group $F_\infty$-norm penalty to eliminate covariates that have no impact at any quantile level. Li and Zhu (2007) analyzed the quantiles of Comparative Genomic Hybridization (CGH) data using fused quantile regression. Wang and Hu (2010) proposed a fused adaptive lasso to accommodate the spatial dependence in studying multiple array CGH samples from two different groups.

In this work, we adopt the fusion idea to shrink the differences of quantile slopes at two adjacent quantile levels towards zero. Therefore, the quantile regions with constant quantile slopes can be identified automatically and all the coefficients can be estimated simultaneously. We develop two types of fusion penalties in the multiple-quantile regression model, fused lasso and fused sup-norm, along with their adaptively weighted counterparts. An additional benefit of our approach is that it is straightforward to include restrictions that enforce non-crossing of the quantile curves on any given domain of interest. The crossing issue refers to the fact that in typical quantile regression estimation it is possible, and often the case, that for a particular region of the covariate space a more extreme quantile of the response is estimated to be smaller than a less extreme one. This is of course impossible for any valid distribution, but it is a common issue plaguing quantile regression estimation, and crossing is often seen in finite samples, especially in multiple-quantile regression. The quantile crossing issue has been studied in He (1997), Takeuchi et al. (2006), Wu and Liu (2009b), and Bondell, Reich and Wang (2010).

The remainder of this paper is organized as follows. In Section 2, we present the proposed methods and discuss computational issues. In Section 3, we discuss the asymptotic properties of the proposed penalization estimators. A simulation study is conducted to assess the numerical performance of the proposed estimators in Section 4. We apply the proposed methods to analyze international economic growth data in Section 5. All technical details are provided in the Appendix.
2 Proposed Method

2.1 Model Setup
Let Y be the response variable and $X \in \mathbb{R}^p$ be the corresponding covariate vector. Suppose we are interested in regression at K quantile levels $0 < \tau_1 < \cdots < \tau_K < 1$, where K is a finite integer. Denote $Q_{\tau_k}(x)$ as the $\tau_k$th conditional quantile function of Y given X = x, that is, $P\{Y \le Q_{\tau_k}(x) \mid X = x\} = \tau_k$, for $k = 1,\ldots,K$. We consider the linear quantile regression model
$$Q_{\tau_k}(x) = \alpha_k + x^T \beta_k, \qquad (1)$$
where $\alpha_k \in \mathbb{R}$ is the intercept and $\beta_k \in \mathbb{R}^p$ is the slope vector at the quantile level $\tau_k$. Let $\{y_i, x_i\}$, $i = 1,\ldots,n$, be an observed sample. At a given quantile level $\tau_k$, the conventional quantile regression method estimates $(\alpha_k, \beta_k^T)^T$ by $(\tilde\alpha_k, \tilde\beta_k^T)^T$, the minimizer of the quantile-specific objective function
$$\sum_{i=1}^n \rho_{\tau_k}(y_i - \alpha_k - x_i^T\beta_k),$$
where $\rho_\tau(r) = \tau r I(r > 0) + (\tau - 1) r I(r \le 0)$ is the quantile check function and $I(\cdot)$ is the indicator function; see Koenker and Bassett (1978). Minimizing the quantile loss function at each quantile level separately is equivalent to minimizing the following combined loss function
$$\sum_{k=1}^K \sum_{i=1}^n \rho_{\tau_k}(y_i - \alpha_k - x_i^T\beta_k). \qquad (2)$$
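To make this equivalence concrete, here is a minimal R sketch (assuming the quantreg package; the data objects y and x are hypothetical) that fits all K quantile levels with a single call. Because no information is shared across levels, the resulting coefficients coincide with K separate single-level fits.

    library(quantreg)
    taus <- seq(0.1, 0.9, by = 0.1)    # K = 9 quantile levels
    fit_all <- rq(y ~ x, tau = taus)   # minimizes (2); same as 9 separate fits
    coef(fit_all)                      # (p + 1) x K matrix of (alpha_k, beta_k^T)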
In some applications, however, it is likely that the quantile coefficients share some commonality across quantile levels. For example, the quantile slope may be constant in certain quantile regions for some predictors. Separate estimation at each quantile level ignores such common features and thus loses efficiency. An alternative strategy is to model multiple quantiles jointly by borrowing information from neighboring quantiles. Zou and Yuan (2008a) proposed a composite quantile estimator by assuming that the quantile slope is constant across all quantiles. Such an assumption is restrictive, and it requires the model structure to be known beforehand, which is hard to determine in practice.

In the following sections, we denote $\beta_{k,l}$ as the slope corresponding to the $l$th predictor at the quantile level $\tau_k$, and $d_{k,l} = \beta_{k,l} - \beta_{k-1,l}$ as the slope difference at two neighboring quantiles $\tau_{k-1}$ and $\tau_k$, with $k = 2,\ldots,K$ and $d_{1,l} = \beta_{1,l}$ for $l = 1,\ldots,p$. Let $\theta = (\alpha^T, d_1^T, \ldots, d_K^T)^T$ denote the collection of unknown parameters, where $\alpha = (\alpha_1,\ldots,\alpha_K)^T$ and $d_k = (d_{k,1},\ldots,d_{k,p})^T$ for $k = 1,\ldots,K$. Therefore, the $\tau_k$th quantile coefficient vector can be written as $(\alpha_k, \beta_k^T)^T = T_k\theta$, where $T_k = (D_{k,0}, D_{k,1}, D_{k,2}) \in \mathbb{R}^{(p+1)\times(p+1)K}$, $D_{k,0}$ is a $(p+1)\times K$ matrix with 1 in the first row and the $k$th column but zero elsewhere, $D_{k,1} = 1_k^T \otimes (0_p, I_p)^T \in \mathbb{R}^{(p+1)\times pk}$, and $D_{k,2}$ is a $(p+1)\times(K-k)p$ zero matrix. Here $0_p$ is a $p\times 1$ zero vector, $I_p$ is a $p\times p$ identity matrix, and $1_k$ is a $k\times 1$ vector of ones. Define $z_{ik}^T = (1, x_i^T)T_k \in \mathbb{R}^{1\times(p+1)K}$. With these reparameterizations, the combined quantile objective function (2) can be rewritten as
$$\sum_{k=1}^K \sum_{i=1}^n \rho_{\tau_k}(y_i - z_{ik}^T\theta). \qquad (3)$$
In order to capture the possible feature that some quantile slope coefficients are constant in some quantile regions, we propose to shrink the interquantile differences $\{d_{k,l}: k = 2,\ldots,K,\ l = 1,\ldots,p\}$ towards zero, thus inducing smoothing across quantiles.
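As a small illustration of this reparameterization, the following R sketch (with a hypothetical K x p matrix d of interquantile differences, filled with placeholder values) recovers the slope vectors by cumulative summation.

    K <- 9; p <- 2
    d <- matrix(rnorm(K * p), K, p)   # row 1 is beta_1; rows 2..K are d_k = beta_k - beta_{k-1}
    beta <- apply(d, 2, cumsum)       # row k recovers beta_k = d_1 + ... + d_k
    # Shrinking d[k, l] to zero for some k >= 2 forces beta[k, l] == beta[k - 1, l],
    # i.e., a constant l-th slope between the two adjacent quantile levels.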
2.2 Penalized Joint Quantile Estimators
We propose an adaptive fused penalization approach to shrink interquantile slope differences towards zero. The proposed adaptive fused (AF) penalization estimator is defined as
$$\hat\theta^{AF} = \arg\min_\theta Q(\theta), \quad \text{where } Q(\theta) = \sum_{k=1}^K \sum_{i=1}^n \rho_{\tau_k}(y_i - z_{ik}^T\theta) + \lambda \sum_{l=1}^p \|\mathrm{Diag}(\tilde w_{k,l})\,\theta_{(l)}\|_\nu. \qquad (4)$$
Here $\lambda \ge 0$ is a tuning parameter controlling the degree of penalization, $\mathrm{Diag}(\tilde w_{k,l})$ is a diagonal matrix with elements $\tilde w_{2,l},\ldots,\tilde w_{K,l}$ on the diagonal, $\tilde w_{k,l}$ is the adaptive weight for $d_{k,l}$, and $\theta_{(l)} = (d_{2,l},\ldots,d_{K,l})^T$ can be regarded as a group of parameters corresponding to the $l$th predictor.
In this paper, we consider two choices of ν, ν = 1 and ν = ∞, corresponding to the Fused Adaptive Lasso (FAL) and Fused Adaptive Sup-norm (FAS) penalization approaches, respectively. Let $\tilde d_{k,l}$ be the initial estimate obtained from the conventional quantile regression. For fused adaptive lasso, we set the adaptive weights $\tilde w_{k,l} = |\tilde d_{k,l}|^{-1}$. For fused adaptive sup-norm, we let $\tilde w_{k,l} = 1/\max\{|\tilde d_{k,l}|, k = 2,\ldots,K\}$ be the group-wise weights (a short R sketch of this weight construction appears at the end of this subsection). As a special case, when all the adaptive weights $\tilde w_{k,l} = 1$, fused adaptive lasso and fused adaptive sup-norm reduce to Fused Lasso (FL) and Fused Sup-norm (FS), respectively. Notice that for fused adaptive lasso the slope differences are penalized individually, leading to piecewise-constant quantile slope coefficients, whereas for fused adaptive sup-norm the slope differences are penalized as a group, so either all elements of $\theta_{(l)}$ are shrunk to zero or none are. In this paper, we focus on the shrinkage of interquantile coefficient differences to achieve higher estimation efficiency; if one desires to remove a predictor from the model for variable selection purposes, additional penalization should be considered.

For regression at multiple quantiles, the estimated quantiles may cross, that is, a lower conditional quantile may be estimated to be larger than an upper one. To avoid quantile crossing, we add constraints by adopting the idea of Bondell, Reich and Wang (2010). Without loss of generality, we assume that the desired domain for non-crossing is the region $[0, 1]^p$. We propose to minimize (4) subject to the following non-crossing constraint:
$$(d_\alpha)_k - \sum_{l=1}^p d_{k,l}^- \ge 0, \quad k = 2,\ldots,K, \qquad (5)$$
where $(d_\alpha)_k = \alpha_k - \alpha_{k-1}$ and $d_{k,l}^-$ is the negative part of $d_{k,l} = \beta_{k,l} - \beta_{k-1,l}$. As discussed in Bondell, Reich and Wang (2010), the non-crossing constraint (5) ensures that the estimated conditional quantiles $Q_\tau(y|x)$ are nondecreasing in τ for any $x \in [0, 1]^p$. In addition, the asymptotic properties of the penalized estimators are not affected by adding the constraint (5).
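As referenced above, the following is a minimal R sketch of the adaptive weight construction, assuming quantreg and hypothetical data objects y and x; the initial estimates are conventional (unpenalized) RQ fits.

    library(quantreg)
    taus <- seq(0.1, 0.9, by = 0.1)
    b_init <- t(coef(rq(y ~ x, tau = taus)))[, -1, drop = FALSE]  # K x p initial slopes
    d_init <- diff(b_init)                     # (K - 1) x p differences d~_{k,l}
    w_fal  <- 1 / abs(d_init)                  # component-wise FAL weights
    w_fas  <- 1 / apply(abs(d_init), 2, max)   # one group-wise FAS weight per predictor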
2.3 Computation
For a given t, the minimization can be formulated as a linear programming problem with linear constraints, and thus can be solved by using any existing linear programming software. In our numerical studies, we use the R function “rq.fit.sfn” in the package quantreg. This function adopts the sparse Frisch-Newton interior-point algorithm so that the computational time is proportional to the number of nonzero entries in the design matrix. Note that minimizing (4) with non-crossing constraint (5) is equivalent to solving
$$\min_\theta \sum_{k=1}^K \sum_{i=1}^n \rho_{\tau_k}(y_i - z_{ik}^T\theta), \quad \text{s.t.} \quad \sum_{l=1}^p \|\mathrm{Diag}(\tilde w_{k,l})\,\theta_{(l)}\|_\nu \le t \ \text{ and } \ (d_\alpha)_k - \sum_{l=1}^p d_{k,l}^- \ge 0, \qquad (6)$$
where t > 0 is a tuning parameter that plays a similar role to λ. Consider the point where $d_{k,l} = 0$, $l = 1,\ldots,p$, $k = 2,\ldots,K$, that is, all slopes constant across quantile levels. This point satisfies both inequality constraints in (6), hence a solution to (6) always exists. Adopting this constraint formulation gives a natural range for the tuning parameter, $t \in [0, t_{\max}]$, where $t_{\max} = \sum_{l=1}^p \|\mathrm{Diag}(\tilde w_{k,l})\,\tilde\theta_{(l)}\|_\nu$ with $\tilde\theta_{(l)}$ the conventional RQ estimate. The tuning parameter t can be selected by minimizing the Akaike Information Criterion (AIC) (Akaike, 1974):
$$\mathrm{AIC}(t) = \sum_{k=1}^K \log\left[\sum_{i=1}^n \frac{1}{n}\,\rho_{\tau_k}\left\{y_i - z_{ik}^T\hat\theta(t)\right\}\right] + edf(t), \qquad (7)$$
where the first term measures the goodness of fit (see Bondell, Reich and Wang 2010 for a similar measure for joint quantile regression), $\hat\theta(t)$ is the solution to (6) with tuning parameter value t, and $edf(t)$ is the effective degrees of freedom associated with t. We set edf to the number of nonzero d's for fused adaptive lasso, and to the number of unique d's for fused adaptive sup-norm (Zhao, Rocha and Yu, 2009).
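A minimal R sketch of criterion (7), assuming joint estimates alpha (length K) and beta (K x p) obtained at tuning value t, data y and x, quantile levels taus, and an edf value counted from the fitted interquantile differences as described above; all object names here are hypothetical.

    check_loss <- function(r, tau) sum(r * (tau - (r < 0)))   # sum of rho_tau(r)

    aic_joint <- function(y, x, alpha, beta, taus, edf) {
      n <- length(y)
      fit_term <- sapply(seq_along(taus), function(k)
        log(check_loss(y - alpha[k] - x %*% beta[k, ], taus[k]) / n))
      sum(fit_term) + edf   # criterion (7); minimized over the grid of t values
    }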
3 Asymptotic Properties
Define $F_i$ as the conditional cumulative distribution function of Y given $X = x_i$. To establish the asymptotic properties of the proposed fused adaptive lasso and fused adaptive sup-norm estimators, we assume the following three regularity conditions.

(A1) For $k = 1,\ldots,K$ and $i = 1,\ldots,n$, the conditional density function of Y given $X = x_i$, denoted $f_i$, is continuous and has a bounded first derivative, and $f_i\{Q_{\tau_k}(x_i)\}$ is uniformly bounded away from zero and infinity.

(A2) $\max_{1\le i\le n} \|x_i\| = o(n^{1/2})$.

(A3) For $1 \le k \le K$, there exist positive definite matrices $\Gamma_k$ and $\Omega_k$ such that $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n z_{ik} z_{ik}^T = \Gamma_k$ and $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n f_i\{Q_{\tau_k}(x_i)\} z_{ik} z_{ik}^T = \Omega_k$.

In this paper, we focus on regression at multiple quantiles. In situations where the quantile slopes of a particular predictor are not constant across quantile levels, boundedness of the support in that predictor direction is needed to ensure the validity of the linear quantile regression model (1): if the support were unbounded in that direction, the slopes would have to be constant across quantiles, since otherwise the quantile planes would cross and the model itself would be invalid. We note that partially (or fully) bounding the support is a special case of (A2).
3.1 Fused Adaptive Lasso Estimator
For ease of illustration, we consider p = 1 in this subsection. The notation can then be simplified by letting $\theta_j$ be the $j$th element of θ and $\tilde w_j = |\tilde\theta_{K+j}|^{-1}$ for $j = 2,\ldots,K$. Denote $\theta_0 = (\theta_{j,0}, j = 1,\ldots,2K)$ as the true value of θ. Let the index sets be $\mathcal{A}_1 = \{1,\ldots,K\}$, $\mathcal{A}_2 = \{j: \theta_{j,0} \ne 0, j = K+1,\ldots,2K\}$, and $\mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2$. We write $\theta_A = (\theta_j : j \in \mathcal{A})^T$, and its true value as $\theta_{A,0} = (\theta_{j,0} : j \in \mathcal{A})^T$. Before discussing the asymptotic properties of the fused adaptive lasso estimator, we first examine the oracle estimator in this setting with p = 1. Without loss of generality, we
assume that the quantile slopes $\beta_k$ vary for the first s < K quantiles, but remain constant for the remaining (K − s) quantile levels. The properties of the oracle and fused adaptive lasso estimators in other, more general cases follow a similar exposition, but with more complicated notation. Supposing the model structure is known, the oracle estimator $\hat\theta_A \in \mathbb{R}^{K+s}$ can be obtained by
$$\hat\theta_A = \arg\min_{\theta_A} \sum_{k=1}^K \sum_{i=1}^n \rho_{\tau_k}(y_i - z_{ik,A}^T \theta_A),$$
where $z_{ik,A} \in \mathbb{R}^{K+s}$ contains the first $K+s$ elements of $z_{ik}$.

Proposition 1. Under conditions (A1)-(A3), we have
$$n^{1/2}(\hat\theta_A - \theta_{A,0}) \xrightarrow{d} N(0, \Sigma_A), \quad \text{as } n \to \infty,$$
where $\Sigma_A = \left(\sum_{k=1}^K \Omega_{k,A}\right)^{-1}\left\{\sum_{k=1}^K \tau_k(1-\tau_k)\Gamma_{k,A}\right\}\left(\sum_{k=1}^K \Omega_{k,A}\right)^{-1}$, and $\Omega_{k,A}$ and $\Gamma_{k,A}$ are the top-left $(K+s)\times(K+s)$ submatrices of $\Omega_k$ and $\Gamma_k$, respectively.

However, in practice the true model structure is usually unknown beforehand. Hence, we consider the full parameter vector θ and estimate it as $\hat\theta_{FAL} = \arg\min_\theta Q(\theta)$, where $Q(\theta)$ is defined in (4) with ν = 1. We show that $\hat\theta_{FAL}$ has the following oracle property.

Theorem 1. Suppose that conditions (A1)-(A3) hold. If $n^{1/2}\lambda_n \to 0$ and $n\lambda_n \to \infty$ as $n \to \infty$, we have

1. sparsity: $\Pr\left[\{j: \hat\theta_{j,FAL} \ne 0, j = K+1,\ldots,2K\} = \mathcal{A}_2\right] \to 1$;

2. asymptotic normality: $n^{1/2}(\hat\theta_{A,FAL} - \theta_{A,0}) \xrightarrow{d} N(0, \Sigma_A)$, where $\Sigma_A$ is the covariance matrix of the oracle estimator given in Proposition 1.
3.2 Fused Adaptive Sup-norm Estimator
To illustrate the asymptotic properties of the fused adaptive sup-norm method, we consider the general case with p predictors. Let $\theta_{(0)} = (\alpha_1,\ldots,\alpha_K)^T \in \mathbb{R}^K$ be the vector of intercepts, $\theta_{(-1)} = d_1 = (\beta_{1,1},\ldots,\beta_{1,p})^T \in \mathbb{R}^p$ be the slope coefficient vector at $\tau_1$, and $\theta_{(l)} = (d_{2,l},\ldots,d_{K,l})^T \in \mathbb{R}^{K-1}$ be the interquantile slope differences associated with the $l$th predictor. For notational convenience, we reorder θ and define the new parameter vector $\theta = (\theta_{(-1)}^T, \theta_{(0)}^T, \ldots, \theta_{(p)}^T)^T$; the vector $z_{ik}$ is updated accordingly so that the order of its elements corresponds to the new θ. Define the index sets $\mathcal{B}_1 = \{-1, 0\}$, $\mathcal{B}_2 = \{l: \|\theta_{(l)}\| \ne 0, l = 1,\ldots,p\}$, and $\mathcal{B} = \mathcal{B}_1 \cup \mathcal{B}_2$. Hence $\theta_B = (\theta_{(l)}^T : l \in \mathcal{B})^T$ is the non-null subset of θ, with true value $\theta_{B,0} = (\theta_{(l),0}^T : l \in \mathcal{B})^T$. Without loss of generality, we assume that the quantile slopes vary across quantiles for the first $g \ge 0$ predictors and remain constant for the others, that is, $\|\theta_{(l)}\| = 0$ for $l = g+1,\ldots,p$.

Proposition 2. Let $\hat\theta_B$ be the oracle estimator of $\theta_{B,0}$ obtained by knowing the true structure. Assuming conditions (A1)-(A3) hold, we have
$$n^{1/2}(\hat\theta_B - \theta_{B,0}) \xrightarrow{d} N(0, \Sigma_B), \quad \text{as } n \to \infty,$$
where $\Sigma_B = \left(\sum_{k=1}^K \Omega_{k,B}\right)^{-1}\left\{\sum_{k=1}^K \tau_k(1-\tau_k)\Gamma_{k,B}\right\}\left(\sum_{k=1}^K \Omega_{k,B}\right)^{-1}$, and $\Omega_{k,B}$ and $\Gamma_{k,B}$ are the top-left $m \times m$ submatrices of $\Omega_k$ and $\Gamma_k$, respectively, where $m = K + p + g(K-1)$.

Theorem 2 shows that when the true model structure is unknown, the fused adaptive sup-norm penalized estimator of θ, denoted as $\hat\theta_{FAS}$, has the following oracle property.

Theorem 2. Suppose that conditions (A1)-(A3) hold. If $n^{1/2}\lambda_n \to 0$ and $n\lambda_n \to \infty$ as $n \to \infty$, we have

1. sparsity: $\Pr\left[\{l: \|\hat\theta_{(l),FAS}\| \ne 0, l = 1,\ldots,p\} = \mathcal{B}_2\right] \to 1$;

2. asymptotic normality: $n^{1/2}(\hat\theta_{B,FAS} - \theta_{B,0}) \xrightarrow{d} N(0, \Sigma_B)$, where $\Sigma_B$ is the covariance matrix of the oracle estimator given in Proposition 2.
4 Simulation Study
We consider three different examples to assess the finite-sample performance of the proposed methods. In each example, the simulation is repeated 500 times, with the nine quantile levels τ = {0.1, 0.2, . . . , 0.9} being considered. We compare the following approaches: the fused adaptive lasso (FAL) method, the fused lasso method without adaptive weights (FL), the fused adaptive sup-norm (FAS) method, the fused sup-norm method without adaptive weights (FS), and the conventional quantile regression method (RQ).

To evaluate the various approaches, we examine three performance measures. The first is the mean integrated squared error (MISE), defined as the average of ISE over the 500 simulations, where
$$\mathrm{ISE} = \frac{1}{n}\sum_{i=1}^n \left\{(\alpha_k + x_i^T\beta_k) - (\hat\alpha_k + x_i^T\hat\beta_k)\right\}^2.$$
Here, $(\alpha_k + x_i^T\beta_k)$ and $(\hat\alpha_k + x_i^T\hat\beta_k)$ are the true and estimated $\tau_k$th conditional quantiles of Y given $x_i$. The MISE assesses estimation efficiency. The second measure is the overall oracle proportion (ORACLE), defined as the proportion of times the true model is selected correctly, that is, all nonzero interquantile slope differences are estimated as nonzero and all zero ones are shrunk exactly to zero; the oracle proportion measures the structure identification ability. The third measure is the true proportion (TP), defined as the proportion of times the individual difference of quantile slopes at two adjacent quantile levels is correctly estimated as either zero or nonzero.
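A minimal R sketch of the ISE computation for one replicate, with hypothetical true and estimated coefficient objects at level $\tau_k$ and an n x p design matrix x; MISE is then the average of this quantity over the 500 replicates.

    ise <- function(alpha_k, beta_k, alpha_hat, beta_hat, x) {
      q_true <- alpha_k + x %*% beta_k       # true tau_k-th conditional quantiles
      q_hat  <- alpha_hat + x %*% beta_hat   # estimated counterparts
      mean((q_true - q_hat)^2)
    }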
Example 1. This example corresponds to a model with a univariate predictor. The data are generated from
$$y_i = \alpha + \beta x_i + (1 + \gamma x_i)e_i, \quad i = 1,\ldots,200, \qquad (8)$$
where $x_i \overset{\mathrm{i.i.d.}}{\sim} U(0,1)$, $e_i \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$, α = 1, β = 3, and γ ≥ 0 controls the degree of heteroscedasticity. Under model (8), the τth conditional quantile of Y given x is $Q_\tau(x) = \alpha(\tau) + \beta(\tau)x$, where $\alpha(\tau) = \alpha + \Phi^{-1}(\tau)$, $\beta(\tau) = \beta + \gamma\Phi^{-1}(\tau)$, and $\Phi^{-1}(\tau)$ is the τth quantile of N(0, 1). If γ = 0, (8) is a homoscedastic model with constant quantile slope β(τ) = β; if γ ≠ 0, (8) is a heteroscedastic model with quantile slope β(τ) varying in τ. For this example with p = 1, fused adaptive sup-norm is equivalent to fused sup-norm, since the penalty only involves one group-wise weight, which can be incorporated into the penalization parameter λ.
Table 1: The MISE and ORACLE proportions of different methods in Example 1. The values in parentheses are the standard errors of 10^3 × MISE.

                                  10^3 × MISE
 tau      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9   ORACLE
 gamma = 2
 FL    120.96   79.50   66.26   61.34   62.79   61.57   68.28   83.26  136.00    0.31
       (6.46)  (4.03)  (3.37)  (3.03)  (3.13)  (3.46)  (3.65)  (4.27)  (6.18)
 FAL   121.36   84.06   70.76   63.78   66.75   65.93   74.43   88.69  141.40    0.018
       (6.53)  (4.30)  (3.62)  (3.15)  (3.24)  (3.82)  (3.95)  (4.53)  (6.45)
 FS    118.50   75.53   61.15   56.02   55.27   56.04   61.63   77.05  131.88    1
       (6.26)  (3.77)  (3.07)  (2.79)  (2.77)  (3.04)  (3.33)  (3.85)  (5.97)
 RQ    117.51   81.03   67.11   62.17   63.93   62.47   69.10   84.67  129.29    --
       (6.58)  (4.07)  (3.39)  (3.06)  (3.14)  (3.50)  (3.73)  (4.45)  (6.22)
 gamma = 0
 FL     22.77   17.14   14.70   13.73   13.99   13.99   14.88   17.59   24.51    0.418
       (1.15)  (0.79)  (0.67)  (0.62)  (0.64)  (0.70)  (0.69)  (0.80)  (1.08)
 FAL    24.24   17.50   15.00   13.94   14.16   14.14   15.07   17.90   25.74    0.298
       (1.22)  (0.79)  (0.67)  (0.63)  (0.65)  (0.71)  (0.70)  (0.84)  (1.13)
 FS     22.57   16.86   14.92   14.03   13.81   13.95   14.98   17.44   23.74    0.364
       (1.14)  (0.76)  (0.69)  (0.63)  (0.64)  (0.70)  (0.68)  (0.79)  (1.03)
 RQ     28.45   20.10   16.80   15.93   15.78   15.62   16.73   20.66   30.81    --
       (1.36)  (0.94)  (0.75)  (0.71)  (0.71)  (0.76)  (0.76)  (0.93)  (1.31)

MISE: Mean of Integrated Squared Errors; ORACLE: the proportion of times the model is selected correctly among 500 simulations; FL: Fused Lasso; FAL: Fused Adaptive Lasso; FS: Fused Sup-norm; FAS: Fused Adaptive Sup-norm; RQ: Regular Quantile Regression.
Table 1 and Figure 1 contain the simulation results for Example 1 with γ = 2 and γ = 0. In the case γ = 2, where the slope coefficients β(τ) vary across quantile levels, Figure 1 shows that the estimation efficiency of the proposed estimators is not significantly worse than that of the RQ estimator, which is the oracle estimator under the model with no sparsity. Table 1 further shows that the group-wise shrinkage method fused sup-norm performs best, with the smallest MISE and the highest ORACLE. When γ = 0, the slope coefficients remain invariant across quantile levels; by shrinking the quantile slope coefficients to be constant, the proposed methods return significantly smaller mean squared error (MSE) and MISE over τ compared to RQ (see Figure 1).

Asymptotically, fused adaptive lasso has the oracle property thanks to its adaptive weights, while fused lasso does not, except in special cases: when the quantile slope coefficients are all constant, under the condition $\sqrt{n}\lambda_n \to \infty$, or when there are no common quantile slope coefficients at all, with $\sqrt{n}\lambda_n \to 0$. In Example 1, the slope coefficients either remain invariant across all quantile levels or vary at all quantile levels, so the ideal weighting scheme assigns equal weights to all interquantile differences. The fused lasso method assigns no weights and is thus equivalent to assigning equal component-wise weights, whereas fused adaptive lasso adopts weights calculated from the RQ estimates, which contain variability, particularly in the tails. Consequently, fused lasso performs slightly better than fused adaptive lasso for both γ = 0 and γ = 2.

Example 2. To further demonstrate the role of the adaptive component-wise weights, we consider a more complex univariate example. Let $x_i \sim U(0,1)$ for $i = 1,\ldots,500$. Assume $Q_\tau(x_i) = \alpha(\tau) + \beta(\tau)x_i$ for any 0 < τ < 1, where $\alpha(\tau) = \alpha + \Phi^{-1}(\tau)$ and
$$\beta(\tau) = \begin{cases} \beta - \gamma\Phi^{-1}(0.49) + \gamma\Phi^{-1}(\tau) & \text{if } 0 < \tau < 0.49, \\ \beta & \text{if } 0.49 \le \tau < 1, \end{cases}$$
with α = 0, β = 3, and γ = 15. To generate data, we first generate a quantile level $u_i \sim U(0,1)$ and let $y_i = \alpha(u_i) + \beta(u_i)x_i$. Therefore, over the quantiles τ = {0.1, 0.2, . . . , 0.9}, β(τ) varies for τ = 0.1, . . . , 0.4 but remains constant for τ = 0.5, . . . , 0.9. In this setting, β(τ) has different patterns across τ, and adaptive weights are expected to yield more efficient shrinkage by controlling the shrinkage speeds of the different coefficient differences $d_k = \beta(\tau_k) - \beta(\tau_{k-1})$, $k = 2,\ldots,9$. By assigning larger weights to the upper quantiles, at which β(τ) is constant, fused adaptive lasso achieves much higher ORACLE and TP at the upper five quantiles than fused lasso (see Table 2). This suggests that when the interquantile slope differences are well distinguished, employing adaptive weights can effectively improve the structure identification accuracy.
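A minimal R sketch of the Example 2 data generation (object names illustrative): because $Q_\tau(x)$ is nondecreasing in τ for x ∈ [0, 1], evaluating it at a random level $u_i \sim U(0,1)$ produces a response whose conditional quantile process is exactly the one stated above.

    set.seed(2)
    n <- 500; alpha <- 0; beta <- 3; gamma <- 15
    x <- runif(n); u <- runif(n)                  # u_i is the random quantile level
    beta_tau <- function(tau)                     # piecewise slope function beta(tau)
      ifelse(tau < 0.49, beta - gamma * qnorm(0.49) + gamma * qnorm(tau), beta)
    y <- alpha + qnorm(u) + beta_tau(u) * x       # y_i = alpha(u_i) + beta(u_i) x_i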
Table 2: Percentage of correctly identifying $d_k = \beta(\tau_k) - \beta(\tau_{k-1})$ over 500 simulations in Examples 1-2.

 Example 1, gamma = 2 (TP)
        d2≠0   d3≠0   d4≠0   d5≠0   d6≠0   d7≠0   d8≠0   d9≠0   ORACLE
 FL     0.69   0.89   0.93   0.91   0.94   0.92   0.89   0.67     0.31
 FAL    0.65   0.66   0.67   0.65   0.65   0.67   0.67   0.62     0.02

 Example 1, gamma = 0 (TP)
        d2=0   d3=0   d4=0   d5=0   d6=0   d7=0   d8=0   d9=0   ORACLE
 FL     0.91   0.82   0.81   0.78   0.79   0.77   0.82   0.88     0.42
 FAL    0.84   0.81   0.84   0.85   0.87   0.81   0.83   0.81     0.30

 Example 2 (TP)
        d2≠0   d3≠0   d4≠0   d5≠0   d6=0   d7=0   d8=0   d9=0   ORACLE
 FL     1.00   1.00   1.00   0.99   0.50   0.66   0.67   0.81     0.25
 FAL    0.96   0.97   0.97   0.84   0.78   0.98   0.98   0.96     0.52

TP: percentage of correctly identifying each $d_k = \beta(\tau_k) - \beta(\tau_{k-1})$, $k = 2,\ldots,9$, over 500 simulations. ORACLE: overall percentage of correctly shrinking all zero $d_k$ to zero and keeping all nonzero $d_k$ nonzero.
Example 3. In this example, we consider a bivariate case with p = 2. The data are generated from
$$y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x_{i,2} + (1 + \gamma x_{i,2})e_i, \quad i = 1,\ldots,200,$$
where $x_{i,1} \overset{\mathrm{i.i.d.}}{\sim} U(0,1)$, $x_{i,2} \overset{\mathrm{i.i.d.}}{\sim} U(0,1)$, $e_i \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$, α = 1, and $\beta_1 = \beta_2 = 3$. Therefore, the τth conditional quantile of Y given $x_1$ and $x_2$ is
$$Q_\tau(x_1, x_2) = \alpha(\tau) + \beta_1(\tau)x_1 + \beta_2(\tau)x_2,$$
where $\alpha(\tau) = \alpha + \Phi^{-1}(\tau)$, $\beta_1(\tau) = \beta_1$, and $\beta_2(\tau) = \beta_2 + \gamma\Phi^{-1}(\tau)$. Unlike $\beta_1(\tau)$, which stays invariant for all τ, $\beta_2(\tau)$ is constant when γ = 0 but varies across τ when γ ≠ 0.

Figure 2 reveals that in the bivariate case with γ = 2, where the true $\beta_1(\tau)$ stays invariant across τ, the proposed approaches yield significantly smaller MSE of the estimated $\hat\beta_1(\tau)$ over τ than the RQ method. On the other hand, the true $\beta_2(\tau)$ varies across quantiles; here the proposed
fused lasso and fused adaptive lasso approaches result in estimation efficiency comparable to RQ, while fused sup-norm and fused adaptive sup-norm yield smaller MSE of the estimated $\hat\beta_2(\tau)$, especially at the middle quantiles. This is largely due to the nature of the group-wise shrinkage methods: fused sup-norm and fused adaptive sup-norm either shrink all interquantile slope differences $d_{k,l} = \beta_{k,l} - \beta_{k-1,l}$, $l = 1, 2$, $k = 2,\ldots,9$, to be exactly zero, or none of them. It is very likely that fused sup-norm and fused adaptive sup-norm induce some smoothness on the quantile slope coefficients for the second predictor without setting any of them exactly equal. Table 3 also shows that fused adaptive sup-norm has higher ORACLE than fused lasso and fused adaptive lasso. By imposing two distinct group-wise weights on the two groups of interquantile slope differences, fused adaptive sup-norm leads to better model selection results than fused sup-norm.

We compare fused lasso and fused adaptive lasso in Table 4. When the true d's are indeed zero, fused adaptive lasso has higher TP than fused lasso (see $d_{2,1},\ldots,d_{9,1}$ with γ = 2 in Table 4), but when the true d's are nonzero, fused adaptive lasso has lower TP than fused lasso (see $d_{2,2},\ldots,d_{9,2}$ with γ = 2 in Table 4). This implies that fused adaptive lasso may suffer from over-shrinkage in the varying regions.

Table 3 shows the simulation results when γ = 0. As expected, all the proposed shrinkage methods yield significantly smaller MSE and MISE than RQ when the slope coefficients are constant for each predictor (see Figure 2). However, similar to Example 1, the quantile slope coefficients within each group are too close to be separated, which prevents the component-wise adaptive weights from playing an effective role. Thus, fused adaptive lasso has lower estimation efficiency (higher MISE) and worse selection accuracy (lower ORACLE) than fused lasso, especially in the tails, where the initial quantile estimates are more variable.
Table 3: The MISE and ORACLE proportion of different methods in Example 3 with gamma = 2 and gamma = 0, respectively.

                                  10^3 × MISE
 tau      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9   ORACLE
 gamma = 2
 FL    178.82  114.03  100.65   89.29   94.28   90.22   96.94  115.04  176.77    0.006
       (7.68)  (4.64)  (4.22)  (4.00)  (4.30)  (3.89)  (3.93)  (4.68)  (7.43)
 FAL   175.74  119.45  104.15   92.23   99.77   95.69  101.74  122.40  175.00    0
       (7.65)  (5.10)  (4.52)  (4.20)  (4.61)  (4.13)  (4.31)  (5.05)  (7.63)
 FS    156.21  107.46   91.09   80.49   79.45   81.17   88.78  106.70  157.90    0.264
       (6.68)  (4.41)  (3.88)  (3.54)  (3.45)  (3.63)  (3.57)  (4.26)  (6.19)
 FAS   154.74  106.85   91.10   81.15   82.30   82.73   88.97  106.65  154.94    0.402
       (6.71)  (4.43)  (3.93)  (3.52)  (3.59)  (3.67)  (3.56)  (4.30)  (6.21)
 RQ    181.88  122.62  106.27   94.27   98.19   97.62  103.84  122.58  173.38    --
       (7.34)  (4.70)  (4.39)  (4.04)  (4.37)  (4.17)  (4.08)  (4.87)  (7.09)
 gamma = 0
 FL     31.00   25.03   22.58   20.53   20.84   21.04   22.55   25.68   30.99    0.276
       (1.19)  (0.94)  (0.83)  (0.79)  (0.80)  (0.78)  (0.82)  (0.98)  (1.22)
 FAL    35.43   26.16   22.81   21.15   21.39   21.68   23.38   26.51   34.67    0.154
       (1.37)  (0.96)  (0.85)  (0.81)  (0.81)  (0.79)  (0.85)  (1.01)  (1.35)
 FS     31.75   25.15   22.51   20.43   20.84   20.78   22.25   25.30   30.67    0.216
       (1.21)  (0.93)  (0.82)  (0.79)  (0.79)  (0.79)  (0.81)  (0.94)  (1.19)
 FAS    31.81   25.13   22.56   20.57   20.74   20.65   22.12   25.35   31.09    0.228
       (1.22)  (0.93)  (0.82)  (0.81)  (0.80)  (0.78)  (0.80)  (0.94)  (1.21)
 RQ     47.26   31.71   26.36   24.25   25.21   25.11   27.30   31.84   45.02    --
       (1.74)  (1.13)  (0.94)  (0.91)  (0.95)  (0.93)  (0.96)  (1.16)  (1.67)
Table 4: True Proportion (TP) for each interquantile difference $d_{k,l}$ in Example 3 with gamma = 2 and gamma = 0, respectively, where $d_{k,l} = \beta_{k,l} - \beta_{k-1,l}$, l = 1, 2, k = 2, ..., 9. For gamma = 2, the true coefficients $d_{k,1} = 0$ but $d_{k,2} \ne 0$ for all k; for gamma = 0, $d_{k,l} = 0$ for all k and l.

 gamma = 2
        d2,1   d3,1   d4,1   d5,1   d6,1   d7,1   d8,1   d9,1    d2,2   d3,2   d4,2   d5,2   d6,2   d7,2   d8,2   d9,2
 FL     0.79   0.69   0.62   0.58   0.59   0.62   0.79   0.61    0.85   0.92   0.92   0.92   0.92   0.89   0.87   0.58
 FAL    0.79   0.80   0.81   0.77   0.82   0.79   0.80   0.79    0.63   0.67   0.67   0.66   0.69   0.65   0.69   0.58
 gamma = 0
 FL     0.93   0.83   0.80   0.75   0.76   0.78   0.83   0.91    0.92   0.85   0.83   0.80   0.78   0.78   0.85   0.91
 FAL    0.848  0.808  0.85   0.82   0.84   0.82   0.82   0.83    0.81   0.82   0.85   0.82   0.84   0.83   0.83   0.80
[Figure 1 about here. Panels show the MSE of the estimated intercept and slope coefficients and the MISE over τ, for γ = 2 and γ = 0.]
Figure 1: Mean squared error (MSE) of the coefficient estimates and the mean integrated squared error (MISE) over τ for Example 1. When γ = 2, the slope coefficients vary across τ; when γ = 0, they are constant over τ. The solid line is RQ, the line with stars is fused lasso, the dashed line is fused adaptive lasso, and the dotted line is the fused sup-norm approach. The shaded area is the 90% confidence band calculated from the RQ method.
[Figure 2 about here. Panels show the MSE of the estimated intercept, $\hat\beta_1(\tau)$, and $\hat\beta_2(\tau)$, and the MISE over τ, for γ = 2 and γ = 0.]
Figure 2: Mean squared error (MSE) of the coefficient estimates and the mean integrated squared error (MISE) over τ for Example 3. When γ = 2, the slope coefficients are constant for the first predictor but vary across quantiles for the second; when γ = 0, the slope coefficients are constant for both predictors. The solid line is RQ, the line with stars is fused lasso, the dashed line is fused adaptive lasso, and the dotted line is the fused sup-norm approach. The shaded area is the 90% confidence band calculated from the RQ method.
5 Analysis of Economic Growth Data
We apply the proposed methods to analyze the international economic growth data, originally from Barro and Lee (1994) and later studied by Koenker and Machado (1999). The data set contains 161 observations. The first 71 observations correspond to 71 countries, whose average annual growth percentages of per capita Gross Domestic Product (GDP growth) over the period 1965-1975 are recorded; the remaining 90 observations are for the period 1975-1985, and some countries appear in both periods. There are 13 covariates in total: initial per capita GDP (igdp), male middle school education (mse), female middle school education (fse), female higher education (fhe), male higher education (mhe), life expectancy (lexp), human capital (hcap), the ratio of education and GDP growth (edu), the ratio of investment and GDP growth (ivst), the ratio of public consumption and GDP growth (pcon), black market premium (blakp), political instability (pol), and growth rate of terms of trade (ttrad). All covariates are standardized to lie in the interval [0, 1] before the analysis, and we focus on τ = {0.1, 0.2, . . . , 0.9}. Our purpose is to investigate the effects of the covariates on multiple conditional quantiles of the GDP growth. In Figure 3, we plot three of the predictors as examples.

Koenker and Machado (1999) studied the effects of the covariates on certain conditional quantiles of the GDP growth using the conventional quantile regression method (RQ). In this study, we consider simultaneous regression at multiple quantiles by employing the proposed adaptively weighted penalization methods, fused adaptive lasso and fused adaptive sup-norm, and we select the penalization parameter by minimizing the AIC value as described in Section 2.3. Figure 4 shows the quantile slope estimates from fused adaptive lasso, fused adaptive sup-norm, and RQ across the nine quantile levels for 4 of the 13 covariates. In each plot, the shaded area is the 90% pointwise confidence band constructed from the inversion of the rank score test described in Koenker (2005). The points connected by solid lines correspond to the estimated quantile coefficients from RQ, the points connected by dashed lines are the fused adaptive lasso estimates, and the stars with dashed lines are the fused adaptive sup-norm estimates. Fused adaptive sup-norm shrinks the slope coefficients of mhe, ivst,
and blakp to be constant over τ but lets them vary with τ for pol, while fused adaptive lasso tends to shrink neighboring quantile coefficients to be equal, resulting in piecewise-constant quantile coefficients. In contrast, the solid lines (RQ) are more variable than the dashed lines (fused adaptive lasso and fused adaptive sup-norm), since RQ performs no shrinkage.

To further verify the shrinkage results, we conduct hypothesis tests for the constancy of the slope coefficients using the R function "anova.rqlist" in the quantreg package. This function is based on the Wald-type test described in Koenker (2005) and can be used to test whether the slope coefficients are identical at some pre-specified quantiles. Take the covariate ivst as an example: the Wald-type test for equality of the quantile coefficients at τ = 0.1, . . . , 0.9 returns a p-value of 0.9324, implying that the effect of ivst does not vary significantly across the nine quantile levels. This agrees with the results from fused adaptive lasso and fused adaptive sup-norm, where the nine quantile coefficients are shrunk to a constant. On the other hand, for the covariate pol, the equality test on the quantile coefficients at τ = 0.1, . . . , 0.9 returns a p-value of 0.000996, showing that the effect of pol varies across the nine quantile levels. The equality test at τ = 0.5, . . . , 0.9 also shows a significant difference, but for the equality test at τ = 0.1, . . . , 0.5 the null hypothesis cannot be rejected. This agrees with the results from fused adaptive lasso, which shrinks the first five quantile coefficients to a constant but lets the upper quantile coefficients vary.

To evaluate the prediction accuracy of the different methods, we performed a cross-validation study. The data are randomly split into a testing set of 50 observations and a training set of the remaining 111 observations. For each method, we estimate the quantile coefficients based on the training set, denoted as $\hat\beta(\tau_j)$, $j = 1,\ldots,9$, and predict the $\tau_j$th conditional quantile of the GDP growth on the testing set. The prediction error (PE), used to assess the prediction accuracy, is defined as $\mathrm{PE} = \sum_{k=1}^9 \sum_{j=1}^{50} \rho_{\tau_k}\{y_j - x_j^T\hat\beta(\tau_k)\}$, where $\{(y_j, x_j), j = 1,\ldots,50\}$ form the testing set. We repeat the cross-validation 200 times and take the average of PE. For fused adaptive lasso, the mean PE is 252.96 (s.e. = 1.66), while it is 252.73 (s.e. = 1.97) for fused adaptive sup-norm and 258.94 (s.e. = 1.65) for RQ. Both proposed penalization approaches thus yield higher prediction accuracy than RQ.
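As a rough illustration of the equality test and the prediction-error computation described above, the following R sketch uses hypothetical object names (dat, gdp, ivst, train, test, with gdp assumed to be the first column of dat); it is a sketch under these assumptions, not the exact code behind the reported numbers.

    library(quantreg)
    taus <- seq(0.1, 0.9, by = 0.1)

    # Wald-type test of equal ivst slopes across the nine quantile levels:
    # anova() on several single-tau rq fits dispatches to anova.rqlist.
    fits <- lapply(taus, function(tau) rq(gdp ~ ivst, tau = tau, data = dat))
    do.call(anova, fits)

    # Prediction error on a held-out test set for one train/test split,
    # summing the check loss over quantile levels and test observations.
    check_loss <- function(r, tau) sum(r * (tau - (r < 0)))
    fit  <- rq(gdp ~ ., tau = taus, data = dat[train, ])
    pred <- cbind(1, as.matrix(dat[test, -1])) %*% coef(fit)   # 50 x 9 predictions
    pe   <- sum(sapply(seq_along(taus), function(k)
      check_loss(dat$gdp[test] - pred[, k], taus[k])))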
[Figure 3 about here. Panels: edu, ivst, and pol; each panel plots the estimated coefficients against τ.]
Figure 3: The effects of edu (the ratio of education and GDP growth), ivst (the ratio of investment and GDP growth) and pol (political instability) on different quantiles of the GDP growth.
[Figure 4 about here. Panels: mhe, ivst, blakp, and pol; each panel plots the estimated coefficients against τ.]
Figure 4: Estimated quantile coefficients for the four covariates mhe (male higher education), ivst (the ratio of investment and GDP growth), blakp (black market premium) and pol (political instability) from RQ (solid line), FAL (dashed line with dots), FAS (dashed line with stars) methods. Shaded areas are the 90% confidence bands from regular RQ method.
Supplemental Materials

R Code: R code to perform the methods in this article, along with a help file. (codeQR.zip)

Appendix: The Appendix gives all proofs. (appendixQR.pdf)
6 Acknowledgement
The authors are grateful to the editor, an associate editor, and two anonymous referees for their valuable comments. Wang’s research was supported in part by NSF grant DMS-1007420. Bondell’s research was supported in part by NSF grant DMS-1005612 and NIH grant P01CA-142538.
References

Akaike, H. (1974), A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19, 716-723.

Barro, R. and Lee, J. (1994), Data Set for a Panel of 138 Countries. Cambridge, Mass.: National Bureau of Economic Research.

Bondell, H., Reich, B. and Wang, H. (2010), Non-crossing Quantile Regression Curve Estimation. Biometrika 97, 825-838.

Fan, J. and Li, R. (2001), Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association 96, 1348-1360.

Geyer, C. (1994), On the Asymptotics of Constrained M-estimation. Annals of Statistics 22, 1993-2010.

He, X. (1997), Quantile Curves Without Crossing. The American Statistician 51, 186-192.

Kato, K. (2010), Solving l1 Regularization Problems with Piecewise Linear Losses. Journal of Computational and Graphical Statistics 19, 1024-1040.

Koenker, R. and Bassett, G. (1978), Regression Quantiles. Econometrica 46, 33-50.

Koenker, R. (2004), Quantile Regression for Longitudinal Data. Journal of Multivariate Analysis 91, 74-89.

Koenker, R. (2005), Quantile Regression. Cambridge: Cambridge University Press.

Koenker, R. and Machado, J. (1999), Goodness of Fit and Related Inference Processes for Quantile Regression. Journal of the American Statistical Association 94, 1296-1310.

Li, Y. and Zhu, J. (2007), Analysis of Array CGH Data for Cancer Studies Using Fused Quantile Regression. Bioinformatics 23, 2470-2476.

Li, Y. and Zhu, J. (2008), L1-norm Quantile Regression. Journal of Computational and Graphical Statistics 17, 163-185.

Osborne, M.R. and Turlach, B.A. (2011), A Homotopy Algorithm for the Quantile Regression Lasso and Related Piecewise Linear Problems. Journal of Computational and Graphical Statistics 20, 972-987.

Takeuchi, I., Le, Q., Sears, T. and Smola, A. (2006), Nonparametric Quantile Estimation. Journal of Machine Learning Research 7, 1231-1264.

Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005), Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society, Series B 67, 91-108.

Wang, H. and Hu, J. (2010), Identification of Differential Aberrations in Multiple-sample Array CGH Studies. Biometrics 67, 353-362.

Wu, Y. and Liu, Y. (2009a), Variable Selection in Quantile Regression. Statistica Sinica 19, 801-817.

Wu, Y. and Liu, Y. (2009b), Stepwise Multiple Quantile Regression Estimation Using Non-crossing Constraints. Statistics and Its Interface 2, 299-310.

Yuan, M. and Lin, Y. (2006), Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B 68, 49-67.

Zhao, P., Rocha, G. and Yu, B. (2009), The Composite Absolute Penalties Family for Grouped and Hierarchical Variable Selection. Annals of Statistics 37, 3468-3497.

Zou, H. (2006), The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101, 1418-1429.

Zou, H. and Yuan, M. (2008a), Composite Quantile Regression and the Oracle Model Selection Theory. The Annals of Statistics 36, 1108-1126.

Zou, H. and Yuan, M. (2008b), Regularized Simultaneous Model Selection in Multiple Quantiles Regression. Computational Statistics and Data Analysis 52, 5296-5304.