THE SPECTRAL CONDITION NUMBER PLOT FOR REGULARIZATION PARAMETER DETERMINATION
arXiv:1608.04123v1 [stat.CO] 14 Aug 2016
CAREL F.W. PEETERS*, MARK A. VAN DE WIEL, AND WESSEL N. VAN WIERINGEN
Abstract. Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter selection. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.

Key words: Eigenvalues; High-dimensional covariance (precision) estimation; $\ell_2$-penalization; Matrix condition number
1. Introduction

The covariance matrix $\Sigma$ of a $p$-dimensional random vector $Y_i^T \equiv [Y_{i1}, \ldots, Y_{ip}] \in \mathbb{R}^p$ is of central importance in many statistical analysis procedures. Estimation of $\Sigma$ or its inverse $\Sigma^{-1} \equiv \Omega$ (generally known as the precision matrix) is central to, for example, multivariate regression, time-series analysis, canonical correlation analysis, discriminant analysis, and Gaussian graphical modeling. Let $y_i^T$ be a realization of $Y_i^T$. It is well-known [50] that the sample covariance matrix $\hat{\Sigma} = n^{-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T$ is a poor estimate of $\Sigma$ when $p$ approaches the sample size $n$ or when $p > n$. When $p$ approaches $n$, $\hat{\Sigma}$ will tend to become ill-conditioned. When $p > n$, $\hat{\Sigma}$ is singular, leaving $\hat{\Omega}$ undefined. However, many contemporary applications in fields such as molecular biology, neuroimaging, and finance encounter situations of interest where $p > n$.

The estimation of $\Sigma$ can be improved by shrinking the eigenvalues of $\hat{\Sigma}$ towards a central value [e.g., 50] or by convexly combining $\hat{\Sigma}$ with some well-conditioned target matrix [e.g., 45]. These solutions define a class of estimators that can be caught under the umbrella term 'ridge-type covariance estimation'. Such estimators depend on a penalty parameter determining the rate of shrinkage and choosing its value is of prime importance. Available procedures for choosing the penalty have some (situation dependent) disadvantages in the sense that (a) they can be computationally expensive, (b) they can be restricted to special cases within the class of ridge-type estimators, or (c) they are not guaranteed to result in a meaningful penalty-value. There is thus some demand for a generic and computationally friendly procedure on the basis of which one can (i) heuristically select an acceptable
*Principal corresponding author.
(minimal) value for the penalty parameter and (ii) assess if more formal procedures have indeed proposed an acceptable penalty-value. Here such a tool is provided on the basis of the matrix condition number.

The remainder is organized as follows. Section 2 reviews the class of ridge-type estimators of the covariance matrix. In addition, penalty parameter selection is reviewed and an exposé of the matrix condition number is given. The spectral condition number is central to the introduction of the spectral condition number plot in Section 3. This graphical display is posited as an exploratory tool that may function as a fast and simple heuristic in determining the (minimum) value of the penalty parameter when employing ridge estimators of the covariance or precision matrix. Section 4 illustrates usage of the spectral condition number plot with data from the field of oncogenomics. Section 5 discourses on the software that implements the proposed graphical display. Sections 6 and 7 conclude with discussions.

2. Eigenstructure regularization and the condition number

2.1. Ridge-type shrinkage estimation. Regularization of the covariance matrix goes back to Stein [50, 51], who proposed shrinking the sample eigenvalues towards some central value. This work spurred a large body of literature [see, e.g., 23, 24, 64, 63]. Of late, two encompassing forms of what is referred to as 'ridge-type covariance estimation' have emerged.

The first form considers convex combinations of $\hat{\Sigma}$ and some positive definite (p.d.) target matrix $T$ [12, 34, 35, 36, 45, 18]:

$$\hat{\Sigma}^{I}(\lambda_I) = (1 - \lambda_I)\hat{\Sigma} + \lambda_I T, \qquad (1)$$

with penalty parameter $\lambda_I \in (0, 1]$. Such an estimator can be motivated from the Steinian bias-variance tradeoff as it seeks to balance the high-variance, low-bias matrix $\hat{\Sigma}$ with the lower-variance, higher-bias matrix $T$. The second form is of more recent vintage and considers the ad-hoc estimator [61, 65, 22]:

$$\hat{\Sigma}^{II}(\lambda_{II}) = \hat{\Sigma} + \lambda_{II} I_p, \qquad (2)$$

with $\lambda_{II} \in (0, \infty)$. This second form is motivated, much like how ridge regression was introduced by Hoerl and Kennard [26], as an ad-hoc modification of $\hat{\Sigma}$ in order to deal with singularity in the high-dimensional setting.

van Wieringen and Peeters [58] show that both (1) and (2) can be justified as penalized maximum likelihood estimators [cf. 61]. However, neither (1) nor (2) utilizes a proper $\ell_2$-penalty in that perspective. Starting from the proper ridge-type $\ell_2$-penalty $\frac{\lambda_a}{2} \mathrm{tr}\left[(\Omega - T)^T (\Omega - T)\right]$, van Wieringen and Peeters [58] derive an alternative estimator:

$$\hat{\Sigma}^{a}(\lambda_a) = \left[\lambda_a I_p + \frac{1}{4}(\hat{\Sigma} - \lambda_a T)^2\right]^{1/2} + \frac{1}{2}(\hat{\Sigma} - \lambda_a T), \qquad (3)$$

with $\lambda_a \in (0, \infty)$. van Wieringen and Peeters [58] show that, when considering a p.d. $T$, the estimator (3) is an alternative to (1) with superior behavior in terms of risk. When considering the non-p.d. choice $T = 0$, they show that (3) acts as an alternative to (2), again with superior behavior. Clearly, one may obtain ridge estimators of the precision matrix by considering the inverses of (1), (2), and (3). For comparisons of these estimators see [39, 9, 58].
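As a small illustration of how estimators (1) and (2) repair singularity, the following sketch applies both to a toy singular 2 x 2 matrix. It is a minimal sketch in plain Python, not the implementation used in this paper: the helper names (`ridge_type_I`, `ridge_type_II`, `det2`), the toy matrix, and the penalty-values are assumptions made purely for illustration.

```python
# Minimal sketch of estimators (1) and (2) on a toy 2x2 example.
# All names and numbers below are illustrative assumptions.

def ridge_type_I(S, T, lam):
    """Estimator (1): (1 - lam) * S + lam * T, with lam in (0, 1]."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    p = len(S)
    return [[(1.0 - lam) * S[i][j] + lam * T[i][j] for j in range(p)]
            for i in range(p)]

def ridge_type_II(S, lam):
    """Estimator (2): S + lam * I_p, with lam in (0, inf)."""
    if lam <= 0.0:
        raise ValueError("lam must be strictly positive")
    p = len(S)
    return [[S[i][j] + (lam if i == j else 0.0) for j in range(p)]
            for i in range(p)]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Toy 'sample covariance' with eigenvalues 2 and 0, hence singular.
S_hat = [[1.0, 1.0], [1.0, 1.0]]
T = [[1.0, 0.0], [0.0, 1.0]]  # identity target

est_I = ridge_type_I(S_hat, T, 0.5)
est_II = ridge_type_II(S_hat, 0.1)

# A symmetric 2x2 matrix is p.d. when its first entry and determinant are positive.
print(det2(S_hat), det2(est_I), det2(est_II))
```

Both regularized matrices have a strictly positive determinant (and positive leading entry), so they are invertible, whereas the toy input is not. Estimator (3) is left out here because it requires a matrix square root.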
For expositions of other penalized covariance and precision estimators we confine ourselves to referring to [44].

2.2. Penalty parameter selection. The choice of penalty-value is crucial to the aforementioned estimators. Let $\lambda$ denote a generic penalty. When choosing $\lambda$ too small, an ill-conditioned estimate may ensue when $p > n$ (see Section 2.3). When choosing $\lambda$ too large, relevant data signal may be lost. Many options for choosing $\lambda$ are available. The ridge estimators, in contrast to $\ell_1$-regularized estimators of the covariance or precision matrix [e.g., 19, 3], do not generally produce sparse estimates. This implies that model-selection-consistent methods (such as usage of the BIC) are not appropriate. Rather, for $\ell_2$-type estimators, it is more appropriate to seek loss efficiency.

A generic strategy for determining an optimal value for $\lambda$ that can be used for any ridge-type estimator of Section 2.1 is $k$-fold cross-validation (of the likelihood function). Asymptotically, such an approach can be explained in terms of minimizing Kullback-Leibler divergence. Unfortunately, this strategy is computationally prohibitive for large $p$ and/or large $n$. Lian [38] and Vujačić et al. [60] propose approximate solutions to the leave-one-out cross-validation score. While these approximations imply gains in computational efficiency, they are not guaranteed to propose a reasonable optimal value for $\lambda$.

Ledoit and Wolf [36] propose a strategy to determine analytically an optimal value for $\lambda$ under a modified Frobenius loss for the estimator (1) under certain choices of $T$ [cf. 18]. This optimal value requires information on the unknown population matrix $\Sigma$. The optimal penalty-value thus needs to be approximated with some estimate of $\Sigma$. Ledoit and Wolf [36] utilize an $n$-consistent estimator while Schäfer and Strimmer [45] use the unbiased estimate $\frac{n}{n-1}\hat{\Sigma}$. In practice, this may result in overshrinkage [9] or even negative penalty-values [45].
Given the concerns stated above, there is some demand for a generic and computationally friendly tool for usage in the following situations of interest:

i. When one wants to speedily determine a (minimal) value for $\lambda$ for which $\hat{\Sigma}(\lambda)$ is well-conditioned;
ii. When one wants to determine speedily whether an optimal $\lambda$ proposed by some other (formal) procedure indeed leads to a well-conditioned estimate $\hat{\Sigma}(\lambda)$;
iii. When one wants to determine speedily a reasonable minimal value for $\lambda$ for usage in a search-grid (for an optimal such value) by other, optimization-based, procedures.

In Section 3 we propose such a tool based on the spectral condition number.

2.3. Spectral condition number. The estimators from Section 2.1 are p.d. when their penalty-values are strictly positive. However, they are not necessarily well-conditioned for any strictly positive penalty-value when $p \gtrsim n$, especially when the penalty takes a value in the lower range. We seek to quantify the condition of the estimators w.r.t. a given penalty-value. To this end we utilize a condition number [59, 57]. Consider a nonsingular matrix $A \in \mathbb{R}^{p \times p}$ as well as the matrix norm $\|\cdot\|$. The condition number w.r.t. matrix inversion can be defined as [25]

$$\mathrm{cond}(A) := \lim_{\epsilon \to 0^+} \sup_{\|\delta A\| \leq \epsilon \|A\|} \frac{\|(A + \delta A)^{-1} - A^{-1}\|}{\epsilon \|A^{-1}\|}, \qquad (4)$$
indicating the sensitivity of inversion of $A$ w.r.t. small perturbations $\delta A$. When the norm in (4) is induced by a vector norm the condition number is characterized as [25]:

$$\mathrm{cond}(A) = C(A) := \|A\| \|A^{-1}\|. \qquad (5)$$
For singular matrices $C(A)$ would equal $\infty$. Hence, a high condition number is indicative of near-singularity and quantifies an ill-conditioned matrix. Indeed, the inverse of the condition number gives the relative distance of $A$ to the set of singular matrices $\mathcal{S}$ [11]:

$$\mathrm{dist}(A, \mathcal{S}) = \inf_{S \in \mathcal{S}} \frac{\|A - S\|}{\|A\|} = \frac{1}{C(A)}.$$
Another interpretation can be found in error propagation. A high condition number implies severe loss of accuracy or large propagation of error when performing matrix inversion under finite precision arithmetic. One can expect to lose at least $\lfloor \log_{10} C(A) \rfloor$ digits of accuracy in computing the inverse of $A$ [e.g., Chapter 8 and Section 6.4 of, respectively, 7, 21]. In terms of error propagation, $C(A)$ is also a reasonable sensitivity measure for linear systems $Ax = b$ [25].

We can specify (5) with regard to a particular norm. We have special interest in the $\ell_2$-condition number $C_2(A)$, usually called the spectral condition number for its relation to the spectral decomposition. When $A$ is a symmetric p.d. matrix, it is well-known that [59, 21]

$$C_2(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{d(A)_1}{d(A)_p},$$

where $d(A)_1 \geq \ldots \geq d(A)_p$ are the eigenvalues of $A$. We can connect the machinery of ridge-type regularization to this spectral condition number.

Let $V D(\hat{\Sigma}) V^T$ be the spectral decomposition of $\hat{\Sigma}$ with $D(\hat{\Sigma})$ denoting a diagonal matrix with the eigenvalues of $\hat{\Sigma}$ on the diagonal and where $V$ denotes the matrix that contains the corresponding eigenvectors as columns. Note that the orthogonality of $V$ implies $V V^T = V^T V = I_p$. This decomposition can then be used to show that, like the Stein estimator [50], estimate (2) is rotation equivariant: $\hat{\Sigma}^{II}(\lambda_{II}) = V D(\hat{\Sigma}) V^T + \lambda_{II} V V^T = V [D(\hat{\Sigma}) + \lambda_{II} I_p] V^T$. That is, the estimator leaves the eigenvectors of $\hat{\Sigma}$ intact and thus solely performs shrinkage on the eigenvalues. When choosing $T = \varphi I_p$ with $\varphi \in [0, \infty)$, the estimator (3) also is of the rotation equivariant form, as we may then write:

$$\hat{\Sigma}^{a}(\lambda_a) = V \left\{ \left[ \lambda_a I_p + \frac{1}{4} [D(\hat{\Sigma}) - \lambda_a \varphi I_p]^2 \right]^{1/2} + \frac{1}{2} [D(\hat{\Sigma}) - \lambda_a \varphi I_p] \right\} V^T. \qquad (6)$$

An analogous decomposition can be given for the estimator (1) when choosing $T = \mu I_p$, with $\mu \in (0, \infty)$. The effect of the shrinkage estimators can then, using the rotation equivariance property, also be explained in terms of restricting the spectral condition number. For example, using (6), we have:

$$1 \leq \frac{\sqrt{\lambda_a + [d(\hat{\Sigma})_1 - \lambda_a \varphi]^2/4} + [d(\hat{\Sigma})_1 - \lambda_a \varphi]/2}{\sqrt{\lambda_a + [d(\hat{\Sigma})_p - \lambda_a \varphi]^2/4} + [d(\hat{\Sigma})_p - \lambda_a \varphi]/2} < \frac{d(\hat{\Sigma})_1}{d(\hat{\Sigma})_p} \leq \infty.$$
Similar statements can be made regarding all rotation equivariant versions of the estimators discussed in Section 2.1. Similar statements can also be made when considering a target $T$ for estimators (1) and (3) that is not of the form $\alpha I_p$ whenever this target is well-conditioned (i.e., has a lower condition number than $\hat{\Sigma}$).

Example 1. For clarification consider the following toy example. Assume we have a sample covariance matrix $\hat{\Sigma}$ whose largest eigenvalue $d(\hat{\Sigma})_1 = 3$. Additionally assume that $\hat{\Sigma}$ is estimated in a situation where $p > n$ so that $d(\hat{\Sigma})_p = 0$ and, hence, $d(\hat{\Sigma})_1 / d(\hat{\Sigma})_p = 3/0 = \infty$ (under the IEEE computing Standard for Floating-Point Arithmetic; [27]). Say we are interested in regularization with the estimator (3) using a scalar target matrix with $\varphi = 2$. Even using a very small penalty of $\lambda_a = 1 \times 10^{-10}$ it is then quickly verified that $d[\hat{\Sigma}^{a}(\lambda_a)]_1 / d[\hat{\Sigma}^{a}(\lambda_a)]_p = 300{,}003 < d(\hat{\Sigma})_1 / d(\hat{\Sigma})_p$. Under a large penalty such as $\lambda_a = 10{,}000$ we find that $d[\hat{\Sigma}^{a}(\lambda_a)]_1 / d[\hat{\Sigma}^{a}(\lambda_a)]_p = 1.00015$. Indeed, van Wieringen and Peeters [58] have shown that, in this rotation equivariant setting, $d[\hat{\Sigma}^{a}(\lambda_a)]_j \to 1/\varphi$ as $\lambda_a \to \infty$ for all $j$. Hence, $d[\hat{\Sigma}^{a}(\lambda_a)]_1 / d[\hat{\Sigma}^{a}(\lambda_a)]_p \to 1$ as the penalty value grows very large. Section 1 of the Supplementary Material visualizes this behavior (in the given setting) for various scenarios of interest: $\varphi < 1/d(\hat{\Sigma})_1$, $1/d(\hat{\Sigma})_1 < \varphi < 1$, $\varphi = 1$, $1 < \varphi < d(\hat{\Sigma})_1$, and $\varphi > d(\hat{\Sigma})_1$.

Let $\hat{\Sigma}(\lambda)$ denote a generic ridge-type estimator of the covariance matrix under generic penalty $\lambda$. We thus quantify the conditioning of $\hat{\Sigma}(\lambda)$ for given $\lambda$ (and possibly a given $T$) w.r.t. perturbations $\lambda + \delta\lambda$ with

$$C_2[\hat{\Sigma}(\lambda)] = \|\hat{\Sigma}(\lambda)\|_2 \|\hat{\Omega}(\lambda)\|_2 = \frac{d[\hat{\Sigma}(\lambda)]_1}{d[\hat{\Sigma}(\lambda)]_p}. \qquad (7)$$

A useful property is that $C_2[\hat{\Sigma}(\lambda)] = C_2[\hat{\Omega}(\lambda)]$, i.e., knowing the condition of the covariance matrix implies knowing the condition of the precision matrix (so essential in contemporary topics such as graphical modeling).
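The numbers in Example 1 can be checked with a few lines of code. The sketch below applies the eigenvalue map implied by (6) for a scalar target $T = \varphi I_p$ to the largest and smallest sample eigenvalues. The function names are hypothetical and chosen only for this illustration; plain Python with the standard `math` module is assumed.

```python
import math

def alt_ridge_eigenvalue(d, lam, phi):
    """Eigenvalue of estimator (3) under T = phi * I_p, following (6):
    sqrt(lam + (d - lam*phi)^2 / 4) + (d - lam*phi) / 2."""
    c = d - lam * phi
    return math.sqrt(lam + c * c / 4.0) + c / 2.0

def cond_alt_ridge(d_max, d_min, lam, phi):
    """Spectral condition number: ratio of largest to smallest regularized eigenvalue."""
    return alt_ridge_eigenvalue(d_max, lam, phi) / alt_ridge_eigenvalue(d_min, lam, phi)

# Setting of Example 1: largest eigenvalue 3, smallest eigenvalue 0, phi = 2.
d_max, d_min, phi = 3.0, 0.0, 2.0

cond_small = cond_alt_ridge(d_max, d_min, 1e-10, phi)  # very small penalty
cond_large = cond_alt_ridge(d_max, d_min, 1e4, phi)    # large penalty

print(round(cond_small), round(cond_large, 5))
```

Under these assumptions the two ratios agree with the values of roughly 300,003 and 1.00015 reported in Example 1, and both regularized eigenvalues are close to $1/\varphi = 0.5$ under the large penalty.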
The condition number (7) can be used to construct a simple and computationally friendly visual tool for penalty parameter selection.

3. The spectral condition number plot

3.1. The basic plot. As one may appreciate from the exposition in Section 2.3, when $\hat{\Sigma}(\lambda)$ moves away from near-singularity, small increments in the penalty $\lambda$ can cause dramatic changes in $C_2[\hat{\Sigma}(\lambda)]$. One can expect that at some point along the domain of $\lambda$, its value will be large enough for $C_2[\hat{\Sigma}(\lambda)]$ to stabilize (in some relative sense). We will cast these expectations regarding the behavior in a (loose) definition:

Heuristic Definition. Let $\hat{\Sigma}(\lambda)$ denote a generic ridge-type estimator of the covariance matrix under generic fixed penalty $\lambda$. In addition, let $\Delta\lambda$ indicate a real perturbation in $\lambda$ as opposed to the theoretical perturbation $\delta\lambda$. We will term the estimate $\hat{\Sigma}(\lambda)$ well-conditioned when small increments $\Delta\lambda$ in $\lambda$ translate to (relatively) small changes in $C_2[\hat{\Sigma}(\lambda + \Delta\lambda)]$ vis-à-vis $C_2[\hat{\Sigma}(\lambda)]$.

From experience, when considering ridge-type estimation of $\Sigma$ or its inverse in $p > n$ situations, the point of relative stabilization can be characterized by a leveling-off of the acceleration along the curve when plotting the condition number $C_2[\hat{\Sigma}(\lambda)]$ against the (chosen) domain of $\lambda$.

Figure 1. First example of a spectral condition number plot (y-axis: spectral condition number; x-axis: ln(penalty value)). The ridge estimator used is given in (2). R code to generate the data and produce the plot can be found in Section 4 of the Supplementary Material. A reasonable minimal value for the penalty parameter can be found at the x-axis at approximately $-3$. The exponential of this number signifies a minimal value of the penalty for which the estimate $\hat{\Sigma}(\lambda)$ is well-conditioned according to the Heuristic Definition. The condition number $C_2[\hat{\Sigma}^{II}(\exp(-3))] \approx 184.95$.

Consider Figure 1, which is the first example of what we call the spectral condition number plot. Figure 1 indeed plots (7) against the natural logarithm of $\lambda$. As should be clear from Section 2.3, the spectral condition number displays (mostly) decreasing concave upward behavior in the feasible domain of the penalty parameter with a vertical asymptote at $\ln(0)$ and a horizontal asymptote at $C_2(T)$ (which amounts to 1 in case of a strictly positive scalar target). The logarithm is used on the x-axis as, especially for estimators (2) and (3), it is more natural to consider orders of magnitude for $\lambda$. In addition, usage of the logarithm 'decompresses' the lower domain of $\lambda$, which enhances the visualization of the point of relative stabilization, as it is in the lower domain of the penalty parameter where ill-conditioning usually ensues when $p > n$. Figure 1 uses simulated data (see Section 4 of the Supplementary Material) with $p = 100$ and $n = 25$. The estimator used is the ad-hoc ridge-type estimator given by (2). One can observe relative stabilization of the spectral condition number, in the sense of the Heuristic Definition, at approximately $\exp(-3) \approx .05$. This value can be taken as a reasonable (minimal) value for the penalty parameter. The spectral condition number plot can be a simple visual tool of interest in the situations sketched in Section 2.2, as will be illustrated in Section 4.
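The coordinates of a basic plot can be sketched for estimator (2), whose rotation equivariance gives $C_2[\hat{\Sigma}^{II}(\lambda)] = (d(\hat{\Sigma})_1 + \lambda)/(d(\hat{\Sigma})_p + \lambda)$. The code below is only a toy sketch: the eigenvalues are hypothetical (not the simulated data behind Figure 1), and the helper names `log_grid` and `cond_ridge_II` are assumptions introduced for this illustration.

```python
import math

def log_grid(lam_min, lam_max, steps):
    """Log-equidistant penalty grid from lam_min to lam_max (inclusive)."""
    tau = (math.log(lam_max) - math.log(lam_min)) / (steps - 1)
    return [math.exp(math.log(lam_min) + s * tau) for s in range(steps)]

def cond_ridge_II(eigs, lam):
    """Spectral condition number of estimator (2) given sample eigenvalues."""
    return (max(eigs) + lam) / (min(eigs) + lam)

# Hypothetical sample eigenvalues; the zeros mimic a p > n situation.
eigs = [4.0, 2.5, 1.0, 0.0, 0.0]

grid = log_grid(math.exp(-6), math.exp(2), 9)
path = [(math.log(lam), cond_ridge_II(eigs, lam)) for lam in grid]

# Each row gives one point of the plot: ln(penalty) and condition number.
for ln_lam, c in path:
    print(f"{ln_lam:6.2f}  {c:12.2f}")
```

Plotting the second column against the first would give a curve of the kind shown in Figure 1: very large values near the lower end of the penalty domain and a decrease towards 1 as the penalty grows.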
3.2. Interpretational aids. The basic spectral condition number plot can be amended with interpretational aids. The software (see Section 5) can add two such aids to form a panel of plots. These aids support the heuristic choice for a penalty-value and provide additional information on the basic plot.

The first aid is the visualization of $\lfloor \log_{10} C_2[\hat{\Sigma}(\lambda)] \rfloor$ against the domain of $\lambda$. As stated in Section 2.3, $\lfloor \log_{10} C_2[\hat{\Sigma}(\lambda)] \rfloor$ provides an estimate of the digits of accuracy one can expect to lose (on top of the digit loss due to inherent numerical imprecision) in operations based on $\hat{\Sigma}(\lambda)$. Note that this estimate is dependent on the norm. This aid can support choosing a (minimal) penalty-value on the basis of the error propagation (in terms of approximate loss in digits of accuracy) one finds acceptable. Figure 2 gives an example.

Let $C_{\ln(\lambda) \mapsto C_2[\hat{\Sigma}(\lambda)]}$ denote the curvature (of the basic plot) that maps the natural logarithm of the penalty-value to the condition number of the regularized precision matrix. We seek to approximate the second-order derivative of this curvature (the acceleration) at given penalty-values in the domain $[\lambda_{\min}, \lambda_{\max}]$. The software (see Section 5) requires the specification of $\lambda_{\min}$ and $\lambda_{\max}$, as well as the number of steps one wants to take along the domain $[\lambda_{\min}, \lambda_{\max}]$. Say we take $s = 1, \ldots, S$ steps, such that $\lambda_1 = \lambda_{\min}, \lambda_2, \ldots, \lambda_{S-1}, \lambda_S = \lambda_{\max}$. The implementation takes steps that are log-equidistant, hence $\ln(\lambda_s) - \ln(\lambda_{s-1}) = [\ln(\lambda_{\max}) - \ln(\lambda_{\min})]/(S - 1) \equiv \tau$ for all $s = 2, \ldots, S$. The central finite difference approximation to the second-order derivative [see e.g., 37] of $C_{\ln(\lambda) \mapsto C_2[\hat{\Sigma}(\lambda)]}$ at $\ln(\lambda_s)$ then takes the following form:

$$C''_{\ln(\lambda_s) \mapsto C_2[\hat{\Sigma}(\lambda_s)]} \approx \frac{C_2[\hat{\Sigma}(\lambda_{s+1})] - 2 C_2[\hat{\Sigma}(\lambda_s)] + C_2[\hat{\Sigma}(\lambda_{s-1})]}{\tau^2},$$

which is available for $s = 2, \ldots, S-1$. The second visual aid thus plots $C''_{\ln(\lambda) \mapsto C_2[\hat{\Sigma}(\lambda)]}$ against the feasible domain of $\lambda$ (see Figure 2 for an example).

The behavior of the condition number along the domain of the penalty-value is to always decrease. This decreasing behavior is not consistently concave upward for (1) (under general non-scalar targets) and (3), as there may be parts of the domain where the behavior is concave downward. This aid may then more clearly indicate inflection points in the regularization path of the condition number. In addition, this aid may put perspective on the point of relative stabilization when the y-axis of the basic plot represents a very wide range.

Software implementing the spectral condition number plot is discussed in Section 5. The following section illustrates, using oncogenomics data, the various uses of the spectral condition number plot with regard to covariance or precision matrix regularization. Section 2 of the Supplementary Material contains a second data example to further illustrate usage of the condition number plot. Section 4 of the Supplementary Material contains all R code with which these illustrations can be reproduced (including querying the data).
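Both interpretational aids of Section 3.2 reduce to short computations on a condition number path evaluated on a log-equidistant grid. The following is a hedged sketch, not the rags2ridges implementation: the function names are hypothetical, and the toy path assumes estimator (2) with largest sample eigenvalue 4 and smallest sample eigenvalue 0, so that $C_2 = 1 + 4/\lambda$.

```python
import math

def digits_lost(cond):
    """First aid: floor(log10(C2)), the approximate loss in digits of accuracy."""
    return math.floor(math.log10(cond))

def second_difference(values, tau):
    """Second aid: central finite differences of the path on a grid with
    log-step tau; defined for the interior points s = 2, ..., S-1 only."""
    return [(values[s + 1] - 2.0 * values[s] + values[s - 1]) / tau ** 2
            for s in range(1, len(values) - 1)]

# Log-equidistant grid on ln(lambda) in [-6, 2] with S = 9 points, so tau = 1.
S = 9
ln_min, ln_max = -6.0, 2.0
tau = (ln_max - ln_min) / (S - 1)
ln_lams = [ln_min + s * tau for s in range(S)]

# Toy condition number path (an assumption for illustration): C2 = 1 + 4 / lambda.
conds = [1.0 + 4.0 / math.exp(x) for x in ln_lams]

digits = [digits_lost(c) for c in conds]
accel = second_difference(conds, tau)

print(digits)
print([round(a, 3) for a in accel])
```

For this toy path the approximate acceleration is positive throughout (concave upward) and levels off as the penalty grows, which is the pattern the Heuristic Definition associates with relative stabilization.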
4. Illustration

4.1. Context and data. Various histological variants of kidney cancer are designated with the amalgamation 'renal cell carcinoma' (RCC). Chromophobe RCC (ChRCC) is a rather rare and predominantly sporadic histological variant, accounting for 4-6% of RCC cases [49]. ChRCC originates in the distal convoluted tubule [47], a portion of the nephron (the basic structural unit of the kidney) that serves to maintain electrolyte balance [52]. Often, histological variants of cancer have a
Figure 2. The spectral condition number plot with interpretational aids. The data are the same as for Figure 1 (see Section 4 of the Supplementary Material). The ridge estimator used is given in (3) with target $T = (\hat{\Sigma} \circ I_p)^{-1}$. This estimator exhibits nonlinear shrinkage. The left-hand panels give the basic spectral condition number plot. The middle and right-hand panels exemplify the interpretational aids to the basic plot: the approximate loss in digits of accuracy (middle panel) and the approximation of the acceleration along the curve in the basic plot (right-hand panel). The top panels give the basic condition number plot and its interpretational aids for the domain $\lambda_a \in [1 \times 10^{-5}, 1000]$. The bottom panels zoom in on the boxed areas. The interpretational aids can support the selection of a (minimal) penalty-value and may provide additional information on the basic plot. For example, say we are interested in choosing (approximately) the minimal value for the penalty for which the error propagation (in terms of approximate loss in digits of accuracy) is at most 1. From the middle panels we see that we should then choose the penalty-value to be $\exp(-4.2)$. From the right-hand panels we may infer that the regularization path of the condition number displays decreasing concave downward behavior for penalty-values between approximately $\exp(.2)$ and $\exp(1.8)$.
distinct pathogenesis contingent upon the deregulation of certain molecular pathways. A pathway can be thought of as a collection of molecular features that work interdependently to regulate some biochemical function. Recent evidence suggests that (reactivation of) the Hedgehog (Hh) signaling pathway may support cancer development and progression in clear cell RCC (CCRCC) [13, 8], the most common subtype of RCC. The Hh-signaling pathway is crucial in the sense that it “orchestrates tissue patterning” in embryonic development, making it “critical to normal kidney development, as it regulates the proliferation and differentiation of mesenchymal cells in the metanephric kidney” [8]. Later in life Hh-signaling is largely silenced and constitutive reactivation may elicit and support tumor growth and vascularization [13, 8]. Our goal here is to explore if Hh-signaling might also be reactivated in ChRCC. The exploration will make use of network modeling (see Section 4.3) in which the network is taken as a representation of a biochemical pathway. This exercise hinges upon a well-conditioned precision matrix.

We obtained data on RCC from The Cancer Genome Atlas Research Network [55] as queried through the Cancer Genomics Data Server [6, 20] using the cgdsr R-package [28]. All ChRCC samples were retrieved for which messenger ribonucleic acid (mRNA) data are available, giving a total of $n = 15$ samples. The data stem from the IlluminaHiSeq RNASeqV2 RNA sequencing platform and consist of normalized relative gene expressions. That is, individual gene expressions are given as mRNA z-scores relative to a reference population that consists of all tumors that are diploid for the gene in question. All features were retained that map to the Hh-signaling pathway according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [31], giving a total of $p = 56$ gene features. Regularization of the desired precision matrix is needed as $p > n$.
Even though features from genomic data are often measured on or mapped to (approximately) the same scale, regularization on the standardized scale is often appropriate as the variability of the features may differ substantially when $p > n$; a point also made by [61]. Note that we may use the correlation matrix $R = (\hat{\Sigma} \circ I_p)^{-1/2} \hat{\Sigma} (\hat{\Sigma} \circ I_p)^{-1/2}$ instead of $\hat{\Sigma}$ in equations (1) to (3) without loss of generality.

4.2. Penalty parameter selection. The precision estimator of choice is the inverse of (3). The target matrix is chosen as $T = \varphi I_p$, with $\varphi$ set to the reciprocal of the average eigenvalue of $R$, which equals 1. First, the approximate leave-one-out cross-validation (aLOOCV) procedure [38, 60] is used (on the negative log-likelihood) in finding an optimal value for $\lambda_a$ under the given target and data settings. This procedure searches for the optimal value $\lambda_a^*$ in the domain $\lambda_a \in [1 \times 10^{-5}, 20]$. A relatively fine-grained grid of 10,000 log-equidistant steps along this domain points to $1 \times 10^{-5}$ as being the optimal value for the penalty (in the chosen domain). This value seems low given the $p/n$ ratio of the data. This calls for usage-type ii. of the condition number plot (Section 2.2), where one uses it to determine if an optimal penalty as proposed by some procedure indeed leads to a well-conditioned estimate. The condition number is plotted over the same penalty-domain considered by the aLOOCV procedure. The left-hand panel of Figure 3 depicts this condition number plot. The green vertical line represents the penalty-value that was chosen as optimal by the aLOOCV procedure. Clearly, the precision estimate at $\lambda_a = 1 \times 10^{-5}$ is not well-conditioned in the sense of the Heuristic Definition. This exemplifies that the (essentially large-sample) approximation to the LOOCV score may not be suitable for non-sparse situations and/or for situations in which the $p/n$ ratio grows
more extreme (the negative log-likelihood term then tends to completely dominate the bias term). At this point one could use the condition number plot in accordance with usage-type i., in which one seeks a reasonable minimal penalty-value. This reasonable minimal value (in accordance with the Heuristic Definition) can be found at approximately $\exp(-6)$, at which $C_2[\hat{\Omega}^{a}(\exp(-6))] = C_2[\hat{\Sigma}^{a}(\exp(-6))] \approx 247.66$.

One could worry that the precision estimate retains too much noise under the heuristic minimal penalty-value. To this end, a proper LOOCV procedure is implemented that makes use of the root-finding Brent algorithm [4]. The expectation is that the proper data-driven LOOCV procedure will find an optimal penalty-value in the domain of $\lambda_a$ for which the estimate is well-conditioned. The penalty search space is thus constrained to the region of well-conditionedness for additional speed, exemplifying usage-type iii. of the condition number plot. Hence, the LOOCV procedure is told to search for the optimal value $\lambda_a^*$ in the domain $\lambda_a \in [\exp(-6), 20]$. The optimal penalty-value is indeed found to the right of the heuristic minimal value at 5.2. At this value, indicated by the red vertical line in Figure 3, $C_2[\hat{\Omega}^{a}(5.2)] = C_2[\hat{\Sigma}^{a}(5.2)] \approx 8.76$. The precision estimate at this penalty-value is used in further analysis.

4.3. Further analysis. Biochemical networks are often reconstructed from data through graphical models. Graphical modeling refers to a class of probabilistic models that uses graphs to express conditional (in)dependence relations (i.e., Markov properties) between random variables. Let $V$ denote a finite set of vertices that correspond to a collection of random variables with probability distribution $P$, i.e., $\{Y_1, \ldots, Y_p\} \sim P$. Let $E$ denote a set of edges, where edges are understood to consist of pairs of distinct vertices such that $Y_j$ is connected to $Y_{j'}$, i.e., $Y_j - Y_{j'} \in E$.
We then consider graphs $G = (V, E)$ under the basic assumption $\{Y_1, \ldots, Y_p\} \sim N_p(0, \Sigma)$. In this Gaussian case, conditional independence between a pair of variables corresponds to zero entries in the precision matrix. Indeed, the following relations can be shown to hold for all pairs $\{Y_j, Y_{j'}\} \in V$ with $j \neq j'$ [see, e.g., 62]:

$$(\hat{\Omega})_{jj'} = 0 \iff Y_j \perp\!\!\!\perp Y_{j'} \mid V \setminus \{Y_j, Y_{j'}\} \iff Y_j \not- \, Y_{j'},$$

where $\not-$ indicates the absence of an edge. Hence, the graphical model can be selected by determining the support of the precision matrix. For support determination we resort to a local false discovery rate procedure proposed by [45], retaining only those edges whose empirical posterior probability of being present equals or exceeds .80. The right-hand panel of Figure 3 represents the retrieved Markov network on the basis of $\hat{\Omega}^{a}(5.2)$. The vertex-labels are curated gene names of the Hh-signaling pathway genes. The graph seems to retrieve salient features of the Hh-signaling pathway. The Hh-signaling pathway involves a cascade from the members of the Hh-family (IHH, SHH, and DHH) via the SMO gene to ZIC2 and members of the Wnt-signaling pathway. The largest connected component is indicative of this cascade, giving tentative evidence of reactivation of the Hh-signaling pathway in (at least) rudimentary form in ChRCC.
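The link between zeros in the precision matrix and missing edges can be illustrated on a toy example. The sketch below uses a hypothetical 3 x 3 precision matrix and reads off the graph from its exact zeros; it also computes partial correlations via the standard relation $\rho_{jj'} = -\omega_{jj'}/\sqrt{\omega_{jj}\omega_{j'j'}}$. Note that this exact-zero thresholding is a simplifying assumption for illustration only: ridge estimates are not sparse, which is why the analysis in this section relies on a local false discovery rate procedure instead. All names are hypothetical.

```python
import math

def partial_correlations(omega):
    """Partial correlation matrix implied by a precision matrix omega."""
    p = len(omega)
    return [[1.0 if j == k
             else -omega[j][k] / math.sqrt(omega[j][j] * omega[k][k])
             for k in range(p)] for j in range(p)]

def edges_from_precision(omega, tol=1e-12):
    """Edges (j, k), j < k, for which the precision entry is non-zero."""
    p = len(omega)
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(omega[j][k]) > tol]

# Hypothetical precision matrix of a chain Y1 - Y2 - Y3 (0-based indices below).
omega = [[2.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 2.0]]

pcor = partial_correlations(omega)
edges = edges_from_precision(omega)

print(edges)       # vertex pairs joined by an edge
print(pcor[0][2])  # partial correlation of the non-adjacent pair
```

In this toy chain the first and third variables are conditionally independent given the second, so no edge connects them, while each adjacent pair has a partial correlation of 0.5.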
Figure 3. Left-hand panel: Condition number plot of the Hedgehog (Hh) signaling pathway variables on the chromophobe kidney cancer data of [55]. The green vertical line indicates the value of the penalty parameter that was chosen as optimal by the aLOOCV procedure ($1 \times 10^{-5}$). The red vertical line indicates the value of the penalty that was chosen as optimal by the root-finding LOOCV procedure (5.2). The value indicated by the aLOOCV procedure does not lie in a region where the estimate can be deemed well-conditioned. Right-hand panel: The retrieved Markov network using the optimal penalty-value as indicated by the root-finding LOOCV procedure. The vertex-labels are Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) curated gene names of the Hh-signaling pathway genes. Certain salient features of the Hh-signaling pathway are retrieved, indicating that this pathway may be reactivated in ChRCC.
5. Software
The R-package rags2ridges [43] implements (along with functionalities for graphical modeling) the ridge estimators of Section 2.1 and the condition number plot through the function CNplot. This function outputs visualizations such as Figure 1 and Figure 2. The condition number is determined by exact calculation using the spectral decomposition. For most practical purposes this exact calculation is fast enough, especially in rotation equivariant situations for which only a single spectral decomposition is required to obtain the complete solution path of the condition number. Additional computational speed is achieved by the implementation of core operations in C++ via Rcpp and RcppArmadillo [15, 14]. For example, producing the basic condition number plot for the estimator (3) on the basis of data with dimensions p = 1,000 and n = 200, using a scalar target matrix and 1,000 steps along the penalty-domain, will take approximately 1.2 seconds (see Section 3 of the Supplementary Material for a benchmark exercise). The additional computational cost of the interpretational aids is linear: producing the panel of plots (including interpretational aids) for the situation just given takes approximately 1.3 seconds.
When λ is very small and p ≫ n the exact calculation of the condition number may suffer from rounding problems (much like the imaginary linear system Σx = b), but remains adequate in its indication of ill-conditioning. When exact computation is deemed too costly in terms of floating-point operations, or when one wants more speed in determining the condition number, the CNplot function offers the option to cheaply approximate the ℓ1-condition number, which amounts to
$$ C_1[\hat{\Sigma}(\lambda)] = \|\hat{\Sigma}(\lambda)\|_1 \, \|\hat{\Omega}(\lambda)\|_1 = \max_{j'} \sum_{j} \big|[\hat{\Sigma}(\lambda)]_{jj'}\big| \; \max_{j'} \sum_{j} \big|[\hat{\Omega}(\lambda)]_{jj'}\big| . $$
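To make the definition of the ℓ1-condition number concrete, the following is a minimal Python sketch that evaluates it for a small matrix as the product of the maximum absolute column sums of the matrix and of its inverse. It is an illustration only and not the rags2ridges implementation (which relies on the rcond function of R and LAPACK); the function names are hypothetical, and the explicit 2 × 2 inverse is an assumption made to keep the example self-contained.

```python
def one_norm(a):
    """Matrix 1-norm: the maximum absolute column sum."""
    n_cols = len(a[0])
    return max(sum(abs(row[j]) for row in a) for j in range(n_cols))


def inverse_2x2(a):
    """Explicit inverse of a 2 x 2 matrix (illustration only)."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]


def l1_condition_number(a):
    """Product of the 1-norms of a matrix and of its inverse."""
    return one_norm(a) * one_norm(inverse_2x2(a))


# Hypothetical covariance matrix with eigenvalues 3 and 1.
sigma = [[2.0, 1.0], [1.0, 2.0]]
c1 = l1_condition_number(sigma)
```

For this particular matrix the ℓ1- and ℓ2-condition numbers coincide; in general they differ, which is why the approximation serves as a cheap indication rather than as a replacement for the spectral condition number.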
The ℓ1-condition number is computationally less complex than the calculation of $C_2[\hat{\Sigma}(\lambda)]$ in non-rotation equivariant settings. The machinery of ridge-type regularization is, however, less directly connected to this ℓ1-condition number (in comparison to the ℓ2-condition number). The approximation of $C_1[\hat{\Sigma}(\lambda)]$ uses LAPACK routines [2] and avoids overflow. This approximation is accessed through the rcond function from R [54]. The package rags2ridges is freely available from the Comprehensive R Archive Network (http://cran.r-project.org/) [54].
6. Discussion
The condition number plot is a heuristic tool and heuristics should be handled with care. Below, some cases are presented that serve as notes of caution. They exemplify that the proposed heuristic accompanying the condition number plot should not be applied (as any statistical technique) without proper inspection of the data.
6.1. Artificial ill-conditioning. A first concern is that, when the variables are measured on different scales, artificial ill-conditioning may ensue [see, e.g., 21]. If one worries whether the condition number is an adequate indication of error propagation when using variables on their original scale, one can ensure that the columns (or rows) of the input matrix are on the same scale. This is easily achieved by scaling the input covariance matrix to be the correlation matrix. Another issue is that it is not guaranteed that the condition number plot will give an unequivocal
point of relative stabilization for every data problem (which hinges in part on the chosen domain of the penalty parameter). Such situations can be dealt with by extending the domain of the penalty parameter or by determining the value of λ that corresponds to the loss of $\lfloor \log_{10} C[\hat{\Sigma}(\lambda)] \rfloor$ digits (in the imaginary linear system Σx = b) one finds acceptable.
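The digit-loss rule just mentioned can be sketched as follows. This is a minimal Python illustration under stated assumptions: the condition numbers along the penalty path are taken as already computed, the function names are hypothetical, and the example path is invented for illustration rather than taken from any data in this paper.

```python
import math


def digits_lost(condition_number):
    """Approximate digits of accuracy lost: floor(log10(C))."""
    return math.floor(math.log10(condition_number))


def minimal_penalty(path, max_digits_lost):
    """Smallest penalty on the path whose digit loss is acceptable.

    `path` is a list of (penalty, condition_number) pairs.
    Returns None when no value on the path qualifies.
    """
    ok = [lam for lam, c in path if digits_lost(c) <= max_digits_lost]
    return min(ok) if ok else None


# Hypothetical solution path, for illustration only.
path = [(1e-5, 2.0e6), (1e-3, 4.0e4), (1e-1, 9.5e2), (1.0, 80.0), (10.0, 6.0)]
chosen = minimal_penalty(path, max_digits_lost=2)
```

Here the user accepts a loss of at most two digits, so the smallest penalty with a condition number below 1,000 is returned. Such a value may then serve as a lower bound when the plot itself shows no clear point of relative stabilization.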
6.2. Naturally high condition numbers. Some covariance matrices may have high condition numbers as their ‘natural state’. Consider the following covariance matrix: $\Sigma_{\mathrm{equi}} = (1 - \varrho) I_p + \varrho J_p$, with $-1/(p - 1) < \varrho < 1$ and where $J_p$ denotes the (p × p)-dimensional all-ones matrix. The variates are thus equicorrelated with unit variance. The eigenvalues of this covariance matrix are $p\varrho + (1 - \varrho)$ and (with multiplicity p − 1) $1 - \varrho$. Consequently, for $\varrho \geq 0$, its condition number equals $1 + p\varrho/(1 - \varrho)$. The condition number of $\Sigma_{\mathrm{equi}}$ thus becomes high when the number of variates grows large and/or the (marginal) correlation $\varrho$ approaches one (or $-1/(p - 1)$). The large ratio between the largest and smallest eigenvalues of $\Sigma_{\mathrm{equi}}$ in such situations mimics a high-dimensional setting in which any non-zero eigenvalue of the sample covariance estimate is infinitely larger than the smallest (zero) eigenvalues. However, irrespective of the number of samples, any sample covariance estimate of a $\Sigma_{\mathrm{equi}}$ with large p and $\varrho$ close to unity (or $-1/(p - 1)$) exhibits such a large ratio. Would one estimate $\Sigma_{\mathrm{equi}}$ in penalized fashion (even for reasonable sample sizes) and choose the penalty parameter from the condition number plot as recommended, then one would select a penalty parameter that yields a ‘well-conditioned’ estimate. Effectively, this amounts to limiting the difference between the penalized eigenvalues, which need not give a condition number representative of $\Sigma_{\mathrm{equi}}$. Thus, the recommendation to select the penalty parameter from the well-conditioned domain of the condition number plot may in some (perhaps exotic) cases lead to a choice that crushes too much relevant signal (shrinking the largest eigenvalue too much). For high-dimensional settings this may be unavoidable, but for larger sample sizes this is undesirable.
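How quickly the condition number of the equicorrelation matrix grows can be checked numerically from its closed-form eigenvalues. The following is a minimal Python sketch; the function names are hypothetical, the correlation parameter is written `rho`, and the chosen (p, rho) pairs are arbitrary illustrations.

```python
def equicorrelation_eigenvalues(p, rho):
    """Eigenvalues of (1 - rho) * I_p + rho * J_p.

    One eigenvalue equals p * rho + (1 - rho); the remaining
    p - 1 eigenvalues equal 1 - rho.
    """
    if not (-1.0 / (p - 1) < rho < 1.0):
        raise ValueError("rho must lie in (-1/(p-1), 1)")
    return [p * rho + (1.0 - rho)] + [1.0 - rho] * (p - 1)


def condition_number(eigenvalues):
    """Spectral condition number of a positive definite matrix."""
    return max(eigenvalues) / min(eigenvalues)


# Arbitrary settings: moderate p, large p, and large p with rho near one.
results = {
    (p, rho): condition_number(equicorrelation_eigenvalues(p, rho))
    for p, rho in [(10, 0.5), (100, 0.5), (100, 0.99)]
}
```

For non-negative `rho` the computed values agree with the expression 1 + p·rho/(1 − rho), so a population matrix of this form with p = 100 and rho = 0.99 already has a condition number near 10^4 without any sampling variability.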
6.3. Contamination. Real data is often contaminated with outliers. To illustrate the potential effect of outliers on the usage of the condition number plot, consider data $y_i$ drawn from a contaminated distribution, typically modeled by a mixture distribution: $y_i \sim (1 - \phi) N_p(0, \Sigma) + \phi N_p(0, c I_p)$ for i = 1, . . . , n, some positive constant c > 0, and mixing proportion φ ∈ [0, 1]. Then, the expectation of the sample covariance matrix is $E(y_i y_i^T) = (1 - \phi)\Sigma + c\phi I_p$. Its eigenvalues are: $d[(1 - \phi)\Sigma + c\phi I_p]_j = (1 - \phi) d(\Sigma)_j + c\phi$, for j = 1, . . . , p. In high-dimensional settings with few samples the presence of any outlier corresponds to mixing proportions clearly deviating from zero. In combination with any substantial c the contribution of the outlier(s) to the eigenvalues may be such that the contaminated sample covariance matrix is represented as better conditioned (vis-à-vis its uncontaminated counterpart). It is the outlier(s) that will determine the domain of well-conditionedness in such a situation. Then, when choosing the penalty parameter in accordance with the Heuristic Definition, undershrinkage may ensue. One may opt, in situations in which the results are influenced by outliers, to use a robust estimator for $\hat{\Sigma}$ in producing the condition number plot.
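The eigenvalue relation above makes the effect of contamination easy to inspect numerically. The following Python sketch is an illustration only: the spectrum of Σ, the mixing proportion and the constant c are hypothetical values, and the function names are not part of any package.

```python
def contaminated_eigenvalues(eigenvalues, phi, c):
    """Eigenvalues of (1 - phi) * Sigma + c * phi * I_p, given those of Sigma."""
    if not 0.0 <= phi <= 1.0 or c <= 0.0:
        raise ValueError("need 0 <= phi <= 1 and c > 0")
    return [(1.0 - phi) * d + c * phi for d in eigenvalues]


def condition_number(eigenvalues):
    """Spectral condition number of a positive definite matrix."""
    return max(eigenvalues) / min(eigenvalues)


# Hypothetical spectrum of Sigma with one very small eigenvalue.
d_sigma = [50.0, 5.0, 1.0, 0.01]
clean = condition_number(d_sigma)
mixed = condition_number(contaminated_eigenvalues(d_sigma, phi=0.1, c=4.0))
```

In this example the additive term c·φ lifts the smallest eigenvalue far more (in relative terms) than the largest, so `mixed` is much smaller than `clean`: the contaminated matrix appears better conditioned, which is the mechanism behind the undershrinkage noted in the text.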
7. Conclusion
We have proposed a simple visual display that may be of aid in determining the value of the penalty parameter in ridge-type estimation of the covariance or precision matrix when the number of variables is large relative to the sample size. The visualization we propose plots the spectral condition number against the domain of the penalty parameter. As the value of the penalty parameter increases, the covariance (or precision) matrix will move away from (near) singularity. In the lower end of the domain this will mean that small increments in the value of the penalty parameter will lead to large decreases of the condition number. At some point, the condition number can be expected to stabilize, in the sense that small increments in the value of the penalty have (relatively) little effect on the condition number. The point of relative stabilization may be deemed to indicate a reasonable (minimal) value for the penalty parameter.
Usage of the condition number plot was exemplified in situations concerned with the direct estimation of covariance or precision matrices. The plot may also be of interest in situations in which (scaled versions of) these matrices are conducive to further computational procedures. For example, it may support the ridge approach to the regression problem $x = Y\beta + \epsilon$. We would then assess the conditioning of $Y^T Y + \lambda I_p$ for use in the ridge-solution to the regression coefficients: $\hat{\beta} = (Y^T Y + \lambda I_p)^{-1} Y^T x$.
We explicitly state that we view the proposed condition number plot as a heuristic tool. We emphasize ‘tool’, as it gives easy and fast access to penalty-value determination without proposing an optimal (in some sense) value. Also, in the tradition of exploratory data analysis [56], usage of the condition number plot requires good judgment. As any heuristic method, it is not without flaws.
Notwithstanding these concerns, the condition number plot gives access to a fast and generic (i.e., non-target and non-ridge-type specific) procedure for regularization parameter determination that is of use when analytic solutions are not available and when other procedures fail. In addition, the condition number plot may aid more formal procedures, in terms of assessing if a well-conditioned estimate is indeed obtained, and in terms of proposing a reasonable minimal value for the regularization parameter for usage in a search grid.
Acknowledgements
This research was supported by grant FP7-269553 (EpiRadBio) through the European Community’s Seventh Framework Programme (FP7, 2007-2013).
References
[1] American Cancer Society. URL http://www.cancer.org/cancer/prostatecancer/detailedguide/prostate-cancer-survival-rates. [2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 3rd edition, 1999. [3] J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98:807–820, 2011.
[4] R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal, 14:422–425, 1971. [5] C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association, 103:1438–1456, 2008. [6] E. Cerami, J. Gao, U. Dogrusoz, B. E. Gross, S. O. Sumer, B. A. Aksoy, A. Jacobsen, C. J. Byrne, M. L. Heuer, E. Larsson, Y. Antipin, B. Reva, A. P. Goldberg, C. Sander, and N. Schultz. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2:401–404, 2012. [7] W. Cheney and D. Kincaid. Numerical Computing and Mathematics. Belmont, CA, USA: Thomson Brooks/Cole, 6th edition, 2008. [8] C. D’Amato, R. Rosa, R. Marciano, V. D’Amato, L. Formisano, L. Nappi, L. Raimondo, C. Di Mauro, A. Servetto, F. Fulciniti, A. Cipolletta, C. Bianco, F. Ciardiello, B. M. Veneziani, S. De Placido, and R. Bianco. Inhibition of Hedgehog signalling by NVP-LDE225 (Erismodegib) interferes with growth and invasion of human renal cell carcinoma cells. British Journal of Cancer, 111:1168–1179, 2014. [9] M. J. Daniels and R. E. Kass. Shrinkage estimators for covariance matrices. Biometrics, 57:1173–1184, 2001. [10] A. I. P. M. de Kroon, P. J. Rijken, and C. H. de Smet. Checks and balances in membrane phospholipid class and acyl chain homeostasis, the yeast perspective. Progress in Lipid Research, 52:374–394, 2013. [11] J. W. Demmel. On condition numbers and the distance to the nearest ill-posed problem. Numerische Mathematik, 51:251–289, 1987. [12] S. J. Devlin, R. Gnanadesikan, and J. R. Kettenring. Robust estimation and outlier detection with correlation coefficients. Biometrika, 62:531–545, 1975. [13] V. Dormoy, S. Danilin, V. Lindner, L. Thomas, S. Rothhut, C. Coquard, J. J. Helwig, D. Jacqmin, H. Lang, and T. Massfelder. 
The sonic hedgehog signaling pathway is reactivated in human renal cell carcinoma and plays orchestral role in tumor growth. Molecular Cancer, 8:123, 2009. [14] D. Eddelbuettel. Seamless R and C++ Integration with Rcpp. Springer-Verlag, New York, 2013. [15] D. Eddelbuettel and R. François. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 2011. [16] QIAGEN Ingenuity Target Explorer. URL https://targetexplorer.ingenuity.com. [17] N. Ferrara, H. P. Gerber, and J. LeCouter. The biology of VEGF and its receptors. Nature Medicine, 9:669–676, 2003. [18] T. J. Fisher and X. Sun. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Computational Statistics and Data Analysis, 55:1909–1918, 2011. [19] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008. [20] J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen, R. Sinha, E. Larsson, E. Cerami, C. Sander, and N. Schultz. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6:pl1, 2013.
[21] J. E. Gentle. Matrix Algebra: Theory, Computations, and Applications in Statistics. New York: Springer Verlag, 2007. [22] M. J. Ha and W. Sun. Partial correlation matrix estimation using ridge penalty followed by thresholding and re-estimation. Biometrics, 70:765–773, 2014. [23] L. R. Haff. Empirical Bayes estimation of the multivariate normal covariance matrix. Annals of Statistics, 8:586–597, 1980. [24] L. R. Haff. The variational form of certain Bayes estimators. Annals of Statistics, 19:1163–1190, 1991. [25] D. J. Higham. Condition numbers and their condition numbers. Linear Algebra and its Applications, 214:193–213, 1995. [26] A. E. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970. [27] IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008. [28] A. Jacobsen. cgdsr: R-Based API for Accessing the MSKCC Cancer Genomics Data Server (CGDS), 2015. URL http://CRAN.R-project.org/package=cgdsr. R package version 1.2.5. [29] H. F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23:187–200, 1958. [30] H. F. Kaiser. A second generation little jiffy. Psychometrika, 35:401–415, 1970. [31] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27–30, 2000. [32] KEGG. URL http://www.genome.jp/kegg-bin/show_pathway?hsa04370. [33] M. Kutmon, A. Riutta, N. Nunes, K. Hanspers, E. L. Willighagen, A. Bohler, J. Mélius, A. Waagmeester, S. R. Sinha, R. Miller, S. L. Coort, E. Cirillo, B. Smeets, C. T. Evelo, and A. R. Pico. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research, 44(D1):D488–D494, 2016. [34] O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10:603–621, 2003. [35] O. Ledoit and M. Wolf.
Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 30:110–119, 2004. [36] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88:365–411, 2004. [37] R. J. LeVeque. Finite Difference Methods for Ordinary and Partial Differential Equations: Steady State and Time Dependent Problems. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 2007. [38] H. Lian. Shrinkage tuning parameter selection in precision matrices estimation. Journal of Statistical Planning and Inference, 141:2839–2848, 2011. [39] S. Lin and M. Perlman. A Monte Carlo comparison of four estimators of a covariance matrix. In P. R. Krishnaiah, editor, Multivariate Analysis, pages 411–429. Amsterdam: North Holland, 6 edition, 1985. [40] A. Manukyan, E. Çene, A. Sedef, and I. Demir. Dandelion plot: a method for the visualization of R-mode exploratory factor analyses. Computational Statistics, 29:1769–1791, 2014. [41] Olaf Mersmann. microbenchmark: Accurate Timing Functions, 2014. URL http://CRAN.R-project.org/package=microbenchmark. R package version
1.4-2. [42] L. R. Nunez, S. A. Jesch, M. L. Gaspar, C. Almaguer, M. Villa-Garcia, M. Ruiz-Noriega, J. Patton-Vogt, and S. A. Henry. Cell Wall Integrity MAPK Pathway is essential for lipid homeostasis. Journal of Biological Chemistry, 283:34204–34217, 2008. [43] C. F. W. Peeters, A. E. Bilgrau, and W. N. van Wieringen. rags2ridges: Ridge Estimation of Precision Matrices from High-Dimensional Data, 2016. URL http://cran.r-project.org/package=rags2ridges. R package version 2.1. [44] M. Pourahmadi. High-dimensional covariance estimation. Hoboken NJ: John Wiley & Sons, 2013. [45] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:art. 32, 2005. [46] G. E. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978. [47] B. Shuch, A. Amin, A. J. Armstrong, J. N. Eble, V. Ficarra, A. Lopez-Beltran, G. Martignoni, B. I. Rini, and A. Kutikov. Understanding pathologic variants of renal cell carcinoma: Distilling therapeutic opportunities from biologic complexity. European Urology, 67:85–97, 2015. [48] R. L. Siegel, K. D. Miller, and A. Jemal. Cancer Statistics, 2016. CA: A Cancer Journal for Clinicians, 66:7–30, 2016. [49] R. Stec, B. Grala, M. Mączewski, L. Bodnar, and C. Szczylik. Chromophobe renal cell cancer–review of the literature and potential methods of treating metastatic disease. Journal of Experimental & Clinical Cancer Research, 28:134, 2009. [50] C. Stein. Estimation of a covariance matrix. Rietz Lecture. 39th Annual Meeting IMS. Atlanta, Georgia, 1975. [51] C. Stein. Lectures on the theory of estimation of many parameters. Journal of Mathematical Sciences, 34:1373–1403, 1986. [52] A. R. Subramanya and D. H. Ellison. Distal convoluted tubule. Clinical Journal of the American Society of Nephrology, 9:2147–2163, 2014. [53] B. S. Taylor, N. Schultz, H. Hieronymus, A. Gopalan, Y. Xiao, B. S.
Carver, V. K. Arora, P. Kaushik, E. Cerami, B. Reva, Y. Antipin, N. Mitsiades, T. Landers, I. Dolgalev, J. E. Major, M. Wilson, N. D. Socci, A. E. Lash, A. Heguy, J. A. Eastham, H. I. Scher, V. E. Reuter, P. T. Scardino, C. Sander, C. L. Sawyers, and W. L. Gerald. Integrative genomic profiling of human prostate cancer. Cancer Cell, 18:11–22, 2010. [54] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. URL http://www.R-project.org/. ISBN 3-900051-07-0. [55] The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499:43–49, 2013. [56] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977. [57] A. M. Turing. Rounding-off errors in matrix processes. Quarterly Journal of Mechanics and Applied Mathematics, 1:287–308, 1948. [58] W. N. van Wieringen and C. F. W. Peeters. Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284–303, 2016.
[59] J. Von Neumann and H. H. Goldstine. Numerical inverting of matrices of high order. Bulletin of the American Mathematical Society, 53:1021–1099, 1947. [60] I. Vujačić, A. Abbruzzo, and E. C. Wit. A computationally fast alternative to cross-validation in penalized Gaussian graphical models. Journal of Statistical Computation and Simulation, 85:3628–3640, 2015. [61] D.I. Warton. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103:340–349, 2008. [62] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons Ltd., Chichester, 1990. [63] J. H. Won, J. Lim, S. J. Kim, and B. Rajaratnam. Condition-number-regularized covariance estimation. Journal of the Royal Statistical Society, Series B, 75:427–450, 2013. [64] R. Yang and J. O. Berger. Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22:1195–1211, 1994. [65] K. H. Yuan and W. Chan. Structural equation modeling with near singular covariance matrices. Computational Statistics and Data Analysis, 52:4842–4858, 2008.
(Carel F.W. Peeters) Dept. of Epidemiology & Biostatistics, VU University medical center Amsterdam, Amsterdam, The Netherlands E-mail address:
[email protected] (Mark A. van de Wiel) Dept. of Epidemiology & Biostatistics, VU University medical center Amsterdam, Amsterdam, The Netherlands; and Dept. of Mathematics, VU University Amsterdam, Amsterdam, The Netherlands E-mail address:
[email protected] (Wessel N. van Wieringen) Dept. of Epidemiology & Biostatistics, VU University medical center Amsterdam, Amsterdam, The Netherlands; and Dept. of Mathematics, VU University Amsterdam, Amsterdam, The Netherlands E-mail address:
[email protected]
SUPPLEMENTARY MATERIAL
This supplement contains figures and R code in support of the main text. In addition, this supplement contains a second illustration of (the use of) the spectral condition number plot.
1. Behavior of the largest and smallest eigenvalues
This section contains a visual impression (Figure S1) of the behavior of the largest and smallest eigenvalues along the domain of the penalty parameter. The setting is given in Example 1 of the main text.
Figure S1. Visualization of the behavior of the largest and smallest eigenvalues along the domain of the penalty parameter. The green line represents the smallest eigenvalue while the blue line represents the largest eigenvalue. The panels represent different choices for the scalar ϕ: (a) $\varphi < 1/d(\hat{\Sigma})_1$; (b) $1/d(\hat{\Sigma})_1 < \varphi < 1$; (c) $\varphi = 1$; (d) $1 < \varphi < d(\hat{\Sigma})_1$; and (e) $\varphi > d(\hat{\Sigma})_1$.
2. A second illustration
2.1. Context and data. Prostate cancer refers to an adenocarcinoma in the prostate gland. It is the most common solid tumor diagnosed in western men [48]. Its prognosis is largely determined by metastasis with low survival rates for metastatic forms in comparison to organ-confined forms [1]. Vascular endothelial growth factor (VEGF) is a signal protein that supports vascularization [17]. Vascular endothelial cells are cells that line the interior of blood vessels. The formation of new blood vessels is pivotal to the growth and metastasis of solid tumors. VEGF is actually at the helm of a cascade of signaling pathways that form the VEGF-signaling pathway. Specifically, the activation of VEGF leads to the activation of the mitogen-activated protein kinase (MAPK) and phosphatidylinositol 3’-kinase (PI3K)-AKT signaling pathways [32]. The VEGF-signaling pathway thus largely consists of two subpathways. These subpathways mediate “the proliferation and migration of endothelial cells and” promote “their survival and vascular permeability” [32]. Hence, VEGF overexpression supports tumor growth and metastasis. VEGF-signaling is likely active in metastatic prostate cancer. This contention is supported by recent evidence that changes in the PI3K-pathway are present in metastatic tumor samples [53].
Cancer may be viewed, from a pathway perspective, as consisting of a loss of normal biochemical connections and a gain of abnormal biochemical connections. Severe deregulation would then imply the loss of normal subpathways and/or the gain of irregular subpathways. Our goal here is to explore if the VEGF-signaling pathway in metastatic prostate cancer can still be characterized as consisting of the MAPK and PI3K-AKT subpathways. The exploration will make use of a pathway-based factor analysis in which the retained latent factors are taken to represent biochemical subpathways [cf. 5]. This exercise hinges upon a well-conditioned covariance matrix.
We obtained data on prostate cancer from the Memorial Sloan-Kettering Cancer Center [53] as queried through the Cancer Genomics Data Server [6, 20] using the cgdsr R-package [28]. All metastatic samples were retrieved for which messenger ribonucleic acid (mRNA) data is available, giving a total of n = 19 samples. The data stem from the Affymetrix Human Exon 1.0 ST array platform and consist of log2 whole-transcript mRNA expression values. All Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) curated features were retained that map to the VEGF-signaling pathway according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [31], giving a total of p = 75 gene features. Regularization of the desired covariance matrix is needed as p > n. Regularization is performed (as in the illustration in the main text) on the standardized scale.
2.2. Model. We work with standardized data. Hence, let $z_i \in \mathbb{R}^p$ denote a centered and scaled p-dimensional observation vector available for i = 1, . . . , n persons. The factor model states that
$$ z_i = \Gamma \xi_i + \epsilon_i, $$
where $\xi_i \in \mathbb{R}^m$ denotes an m-dimensional vector of latent variables typically called ‘factors’, and where $\Gamma \in \mathbb{R}^{p \times m}$ is a matrix whose entries $\gamma_{jk}$ denote the loading of the jth variable on the kth factor, j = 1, . . . , p, k = 1, . . . , m. Finally, the $\epsilon_i \in \mathbb{R}^p$ denote error measurements. Hence, the model can be conceived of as a multivariate regression with latent predictors. An important assumption in this model is
that m < p: the dimension of the latent vector is smaller than the dimension of the observation vector. The following additional assumptions are made: (i) The observation vectors are independent; (ii) rank(Γ) = m; (iii) $\xi_i \sim N_m(0, I_m)$; (iv) $\epsilon_i \sim N_p(0, \Psi)$, with Ψ a (p × p)-dimensional diagonal matrix with strictly positive diagonal entries $\psi_j$; and (v) $\xi_i$ and $\epsilon_{i'}$ are independent for all i and i'. The preceding assumptions establish a distributional assumption on the covariance structure of the observed data-vector: (vi) $z_i \sim N_p(0, \Sigma = \Gamma\Gamma^T + \Psi)$.
Hence, to characterize the model, we need to estimate Γ and Ψ on the basis of the sample correlation matrix. As p > n for the data at hand, the sample correlation matrix is singular and standard maximum likelihood estimation (MLE) is not available. Recent efforts deal with such situations by (Bayesian) sparsity modeling of the factor model [e.g., 5], i.e., by imposing sparsity constraints in the loadings matrix. Our approach differs. We first ensure that we have a well-conditioned correlation matrix by using a regularized (essentially a Bayesian) estimator. Afterwards, standard MLE techniques are employed to estimate Γ and Ψ on the basis of this regularized correlation matrix. Hence, we perform MLE on the basis of $\hat{\Sigma}(\lambda^*)$ where $\lambda^*$ is (in some sense) deemed optimal.
2.3. Penalty parameter selection. The estimator of choice is again (3) of the main text where we replace $\hat{\Sigma}$ with the sample correlation matrix R. The target matrix is chosen as $T = I_p$, such that a regularized correlation matrix ensues. Again, the aLOOCV procedure was tried first in finding an optimal value for $\lambda_a$ under the given target and data settings. We searched for the optimal value $\lambda_a^*$ in the domain $\lambda_a \in [1 \times 10^{-5}, 20]$ with 10,000 log-equidistant steps along this domain. Again, the procedure pointed to $1 \times 10^{-5}$ as being the optimal value for the penalty (in the chosen domain), which seems low given the p/n ratio of the data.
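A search grid of the kind just described (10,000 log-equidistant steps between 1 × 10^-5 and 20) can be sketched in a few lines. This Python snippet is an illustration only; the function name is hypothetical and it is not claimed to reproduce how the grid is constructed inside rags2ridges.

```python
import math


def log_equidistant_grid(lower, upper, steps):
    """`steps` points from `lower` to `upper`, equally spaced on the log scale."""
    if lower <= 0.0 or upper <= lower or steps < 2:
        raise ValueError("need 0 < lower < upper and steps >= 2")
    log_lo, log_hi = math.log(lower), math.log(upper)
    width = (log_hi - log_lo) / (steps - 1)
    return [math.exp(log_lo + k * width) for k in range(steps)]


# Penalty domain and number of steps as quoted in the text.
grid = log_equidistant_grid(1e-5, 20.0, 10000)
```

Spacing the grid on the log scale places most candidate values in the lower end of the domain, where the condition number changes fastest with the penalty.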
The condition number plot (covering the same penalty-domain considered by the aLOOCV procedure) indeed indicates that the precision estimate at $\lambda_a = 1 \times 10^{-5}$ is not well-conditioned in the sense of the Heuristic Definition, exhibiting a condition number of approximately 9,456 (Figure S2). A reasonable minimal penalty-value (in accordance with the Heuristic Definition) can be found at approximately exp(−6.5), at which $C_2[\hat{\Sigma}_a(\exp(-6.5))] \approx 785.01$. This reasonable minimal value is subsequently used to constrain the search-domain to the region of well-conditionedness. A root-finding (by the Brent algorithm [4]) LOOCV procedure is then told to search for the optimal value $\lambda_a^*$ in the domain $\lambda_a \in [\exp(-6.5), 20]$. The optimal penalty-value is found at .422. At this value, indicated by the red vertical line in Figure S2, $C_2[\hat{\Sigma}_a(.422)] \approx 62.39$.
The Kaiser-Meyer-Olkin index [30] of .98 indicates that a reasonable proportion of variance among the variables might be common variance and, hence, that $\hat{\Sigma}_a(.422)$ is suitable for a factor analysis. The regularized estimate $\hat{\Sigma}_a(.422)$ is used in further analysis.
2.4. Further analysis. The dimension of the factor solution is unknown. Hence, the optimal factor dimension needs to be determined in conjunction with the estimation of the model. Now, let $\hat{\Sigma}_m = \hat{\Lambda}_m \hat{\Lambda}_m^T + \hat{\Psi}$ denote the MLE solution under m factors. Then, we determine the optimal dimension (contingent upon the MLE solutions) using the Bayesian Information Criterion (BIC; [46]), which, for
Figure S2. The left-hand panels give the basic spectral condition number plot. The middle and right-hand panels exemplify the interpretational aids to the basic plot: the approximate loss in digits of accuracy (middle panel) and the approximation of the acceleration along the curve in the basic plot (right-hand panel). The top panels give the basic condition number plot and its interpretational aids for the domain $\lambda_a \in [1 \times 10^{-5}, 20]$. The bottom panels zoom in on the boxed areas. The boxed areas cover a domain of well-conditionedness according to the heuristically chosen minimal penalty-value: $\lambda_a \in [\exp(-6.5), 20]$. The red vertical line indicates the value of the penalty that was chosen as optimal by the root-finding LOOCV procedure (.422).
the problem at hand, amounts to:
$$ n \left\{ p \ln(2\pi) + \ln|\hat{\Sigma}_m| + \mathrm{tr}\left[ \hat{\Sigma}_m^{-1} \hat{\Sigma}_a(.422) \right] \right\} + \ln(n)\,\eta, $$
where η = p(m + 1) − m(m − 1)/2 indicates the number of free parameters in the model. Figure S3 gives the BIC scores for each dimension allowed by the existence condition p(p + 1)/2 − η ≥ 0. The solution with the lowest BIC score is deemed optimal, indicating that m = 2 is the preferred solution.
The specified FA model is inherently underidentified. Assume $H \in \mathbb{R}^{m \times m}$ is an arbitrary orthogonal matrix. Considering the implied covariance structure of the
[Figure S3, left-hand panel: BIC score (vertical axis) against dimension of latent vector (horizontal axis).]

[Figure S3, right-hand panel:]
m    BIC
1    3,721.762
2    3,642.678
3    3,645.902
4    3,666.222
5    3,691.069
Figure S3. BIC scores for various dimensions of the latent vector. The left-hand panel gives the trace of the BIC scores for all dimensions of the latent vector allowed by the bound p(p + 1)/2 − η ≥ 0. The right-hand panel gives a table with BIC scores for m = 1, . . . , 5. The m = 2 solution is to be preferred according to the BIC.
observed data we may write: $\Gamma\Gamma^T + \Psi = (\Gamma H)(\Gamma H)^T + \Psi$. This equality implies that, given Ψ, there is an infinite number of alternative loading matrices that generate the same covariance structure as Γ. Thus, in any solution, Γ can be made to satisfy m(m − 1)/2 additional conditions, and, hence, the structure of the existence condition given above. Naturally, any estimation method then requires a minimum of m(m − 1)/2 restrictions on Γ to attain uniqueness (up to possibly polarity reversals in the columns of Γ). In MLE this is achieved by requiring that $\Gamma^T \Psi^{-1} \Gamma$ be diagonal along with an order-condition on its diagonal elements. As this is a convenience solution that has no direct interpretational meaning, usually a post-hoc rotation is applied, once estimation is settled, to enhance interpretation. Here, the Varimax [29] rotation to an orthogonal simple structure was ultimately used as oblique rotation (that allows for factor correlations) indicated a near-zero correlation between the two retained latent factors (note that any orthogonal representation has equivalent oblique representations).
The final factor solution is represented in Figure S4. This Dandelion plot [40] visualizes the magnitudes of the elements of $\hat{\Lambda}$ and $\hat{\Psi}$ and their relation to the two retained factors. The latent factors are taken to represent biochemical subpathways. Table S1 characterizes the factors according to their constituting genes that achieve the highest absolute loading. These genes are furthermore associated with VEGF-signaling subpathways as indicated by the WikiPathways database [33]. Factor 1
Figure S4. Dandelion plot of the factor solution. Each central solid line represents a factor and is connected to a star plot (also known as a Kiviat diagram). Each spoke in each of the star plots then represents a variable. The length of the spokes corresponds to the maximum magnitude of each loading (which is unity). The extension of the polygon along each spoke then indicates the magnitude of a factor loading in the obtained solution, with the sign of the corresponding loading represented by color coding: green for a negative loading and blue for a positive loading. The representation suppresses absolute loadings lower than .3 for enhanced visualization. The angle between the solid lines corresponds to the amount of variance explained by the first factor (approximately 28%). The dashed line represents the cumulative percentage of variance explained by all retained factors (approximately 38% in this case). The star plots at the far-right then represent the magnitudes of the communalities $\kappa_j = \sum_k \gamma_{jk}^2$ and the uniquenesses $\psi_j = 1 - \kappa_j$. Variables are represented by index numbers. The table at the far-left gives the corresponding HUGO-curated gene names. See [40] for more information on the Dandelion plot.
Table S1. Top genes per factor according to absolute factor loadings. ‘HUGO’ refers to Human Genome Organization Gene Nomenclature Committee (HGNC) curated gene names. ‘Pathway’ refers to the VEGF-subpathway a certain gene belongs to according to the WikiPathways database [33].
Factor  HUGO Symbol  Pathway
1       CDC42        MAPK-signaling
        MAPK14       MAPK-signaling
        PIK3R5       -
        PLA2G2C      Glycerol phospholipid biosynthesis
        PLA2G3       Glycerol phospholipid biosynthesis
        PLA2G4E      Glycerol phospholipid biosynthesis
        PPP3R1       MAPK-signaling
        PRKCG        MAPK-signaling
        SPHK1        VEGF-signaling
        SPHK2        -
2       AKT1         PI3K/AKT-signaling in cancer
        AKT2         AKT-signaling / VEGF-signaling
        AKT3         AKT-signaling / VEGF-signaling
        BAD          AKT-signaling / Prostate cancer
        MAP2K2       MAPK-signaling / Prostate cancer
        MAPKAPK2     MAPK-signaling / Prostate cancer
        PIK3R2       AKT-signaling / VEGF-signaling
        PLA2G6       Glycerol phospholipid biosynthesis
        PXN          VEGF-signaling / Prostate cancer
        SHC2         VEGF-signaling
consists mostly of MAPK-related genes. In addition, genes related to the glycerol phospholipid biosynthesis pathway load on factor 1. The MAPK pathway has been associated with lipid homeostasis in yeast [42] and “phospholipid composition and synthesis are similar in yeast and mammalian cells” [10]. Genes with pathway association indicated by ‘-’ in Table S1 (PIK3R5 and SPHK2) could not be associated with VEGF-subpathways according to the WikiPathways database. PIK3R5 is, however, related to the MAPK-signaling pathway according to the Ingenuity Target Explorer [16]. Factor 2 consists mostly of genes related to the PI3K/AKT-signaling and prostate cancer pathways. Moreover, this factor also includes general VEGF-signaling pathway genes that are conducive to activating the MAPK and PI3K/AKT cascade. These results suggest that the compositional subpathway structure of the VEGF-signaling pathway in metastatic prostate cancer can still be characterized as consisting of the MAPK and PI3K/AKT cascades. Hence, VEGF-pathway deregulation in metastatic prostate cancer is more likely characterized by intricate changes in its known subpathways than by the loss of these (normal) subpathways or the gain of (abnormal) subpathways.
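The rotational indeterminacy and the post-hoc Varimax rotation discussed in this section can be illustrated numerically. The sketch below is not the analysis code of this Supplement: the small simulated loading matrix, the seed, and all object names are assumptions made purely for illustration. It uses only base R, including the varimax function from the stats package, and checks that an orthogonal rotation leaves both the implied covariance structure and the communalities unchanged.

```r
## Illustrative sketch on simulated loadings (not the data of this Supplement)
set.seed(42)
p <- 6   # number of observed variables (assumed)
m <- 2   # number of latent factors (assumed)

## Loadings with communalities strictly below 1, so uniquenesses are positive
Gamma <- matrix(runif(p * m, min = -0.7, max = 0.7), nrow = p, ncol = m)
Psi   <- diag(1 - rowSums(Gamma^2))

## Arbitrary orthogonal matrix H, obtained from a QR decomposition
H <- qr.Q(qr(matrix(rnorm(m * m), nrow = m, ncol = m)))

## Both loading matrices imply the same covariance structure
Sigma1 <- Gamma %*% t(Gamma) + Psi
Sigma2 <- (Gamma %*% H) %*% t(Gamma %*% H) + Psi
print(all.equal(Sigma1, Sigma2))

## Varimax is one particular orthogonal rotation
rot      <- varimax(Gamma)
GammaRot <- unclass(rot$loadings)

## Communalities (row sums of squared loadings) are rotation invariant
comm_before <- rowSums(Gamma^2)
comm_after  <- rowSums(GammaRot^2)
print(all.equal(comm_before, comm_after))
```

Because the rotation matrix returned by varimax is orthogonal, the choice of rotation affects only the interpretability of the loadings, not the fit of the factor model.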
3. Benchmark

A benchmarking exercise was conducted to get an indication of the execution time of the CNplot function. The sample size n is not a factor in the execution time, as the function operates on the covariance (or precision) matrix. Hence, time complexity hinges on the dimension p and the number of steps S taken along the (specified) domain of the penalty-parameter. Execution time was thus evaluated for various combinations of p and S. Timing evaluations were done for all ridge estimators mentioned in Section 2.1 of the main text and all combinations of the elements in p ∈ {125, 250, 500, 1000} and S ∈ {125, 250, 500, 1000}. The basic condition number plot was produced 50 times (on simulated covariance matrices) for each combination of estimator, p, and S. In addition, both a rotation equivariant situation (using a scalar target) and a rotation non-equivariant situation (using a non-scalar target) were considered for estimators (1) and (3) of the main text (note that estimator (2) is always rotation equivariant). All execution times of the CNplot call in R were evaluated with the microbenchmark package [41]. All timings were carried out on an Intel® Core™ i7-4702MQ 2.2 GHz processor. The results, stated as median runtimes in seconds, are given in Tables S2 to S4. The code used in this benchmarking exercise can be found in Listing 6 in Section 4 of this Supplement.

Tables S2 to S4 indicate that, in general, the condition number plot can be produced quite swiftly. As stated in the main text: in rotation equivariant situations only a single spectral decomposition is required to obtain the complete solution path of the condition number. Hence, the time complexity of the CNplot call in these situations is approximately O(p^3) (worst-case time complexity of a spectral decomposition) as, after the required spectral decomposition, the solution path can be obtained in linear time only.
Increasing the number of steps S along the penalty-domain thus comes at little additional computational cost. For the rotation non-equivariant setting the time complexity is approximately O(Sp^3), as a spectral decomposition is required for all s = 1, ..., S. The runtime of the condition number plot function under Estimator (3) of the main text is then somewhat longer than the corresponding runtime under Estimator (1). This is because the spectral decomposition in the former situation is also used for computing the (relatively expensive) matrix square root. As S acts as a scaling factor in rotation non-equivariant settings, the computation time of the plot can be reduced by coarsening the search-grid along the penalty-domain.

To compare the runtimes for the CNplot call we also conducted timing evaluations for the approximate leave-one-out cross-validation (aLOOCV) and root-finding LOOCV procedures as used in the main text (implemented in rags2ridges). Time complexity for the root-finding LOOCV procedure hinges on the dimension p and the sample size n. In contrast, time complexity for the aLOOCV procedure is dependent on p, n, and S. Timings were done for all combinations of the elements in p ∈ {125, 250}, n ∈ {100, 200} and (when relevant) S ∈ {125, 250}. The basic condition number plot was produced 50 times (on simulated data of dimension n × p) for each combination of p, n and (possibly) S. For the root-finding LOOCV procedure both a rotation equivariant situation and a rotation non-equivariant situation were considered. Ridge estimator (3) of the main text was used for all timing evaluations. The results, stated as median runtimes in seconds, are given in Tables
Table S2. Benchmarking results for various values of p and S for estimator (3) of the main text. The results represent median runtimes in seconds.

              p = 125    p = 250    p = 500    p = 1000
Rotation equivariant setting
S = 125          .023       .044       .222        .998
S = 250          .024       .046       .218        .965
S = 500          .027       .050       .221       1.122
S = 1000         .035       .061       .233       1.188
Rotation non-equivariant setting
S = 125          .689      3.267     18.801     139.982
S = 250         1.315      6.661     39.022     278.454
S = 500         2.703     13.418     76.489     553.812
S = 1000        4.995     26.112    152.427   1,106.002
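As an informal check of the scaling behaviour described in the text, the sketch below takes the median runtimes reported in Table S2 and computes, for each p, the ratio of the runtime at a given S to the runtime at half that number of steps. The helper doubling_ratio and the object names are ours (hypothetical) and are not part of the benchmarking code of this Supplement.

```r
## Median runtimes (seconds) copied from Table S2:
## rows are S = 125, 250, 500, 1000; columns are p = 125, 250, 500, 1000
labels <- list(paste0("S=", c(125, 250, 500, 1000)),
               paste0("p=", c(125, 250, 500, 1000)))

equi <- matrix(c(.023, .044, .222,  .998,
                 .024, .046, .218,  .965,
                 .027, .050, .221, 1.122,
                 .035, .061, .233, 1.188),
               nrow = 4, byrow = TRUE, dimnames = labels)

nonequi <- matrix(c( .689,  3.267,  18.801,  139.982,
                    1.315,  6.661,  39.022,  278.454,
                    2.703, 13.418,  76.489,  553.812,
                    4.995, 26.112, 152.427, 1106.002),
                  nrow = 4, byrow = TRUE, dimnames = labels)

## Ratio of each runtime to the runtime at half the number of steps
doubling_ratio <- function(m) m[-1, , drop = FALSE] / m[-nrow(m), , drop = FALSE]

ratio_equi    <- doubling_ratio(equi)
ratio_nonequi <- doubling_ratio(nonequi)

print(round(ratio_equi, 2))     # stays close to 1: S adds little cost
print(round(ratio_nonequi, 2))  # stays close to 2: runtime scales with S
```

Ratios near 1 in the rotation equivariant setting and near 2 in the non-equivariant setting are what one would expect under the approximate O(p^3) and O(Sp^3) complexities, respectively.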
Table S3. Benchmarking results for various values of p and S for estimator (1) of the main text. The results represent median runtimes in seconds.

              p = 125    p = 250    p = 500    p = 1000
Rotation equivariant setting
S = 125          .023       .044       .214       1.179
S = 250          .024       .048       .218       1.227
S = 500          .027       .054       .219       1.032
S = 1000         .031       .054       .222       1.046
Rotation non-equivariant setting
S = 125          .339      1.448      9.701      65.988
S = 250          .590      2.849     18.128     129.262
S = 500         1.090      5.385     35.188     281.583
S = 1000        2.366     12.199     75.222     522.910
Table S4. Benchmarking results for various values of p and S for estimator (2) of the main text. The results represent median runtimes in seconds.

              p = 125    p = 250    p = 500    p = 1000
Rotation equivariant setting
S = 125          .026       .056       .281       1.606
S = 250          .026       .059       .282       1.597
S = 500          .027       .058       .261       1.600
S = 1000         .030       .060       .280       1.689
Table S5. Benchmarking results for various values of p and n for the root-finding LOOCV procedure. The ridge estimator considered is given in equation (3) of the main text. The results represent median runtimes in seconds.

              p = 125    p = 250
Rotation equivariant setting
n = 100        36.843    186.662
n = 200        83.932    438.080
Rotation non-equivariant setting
n = 100        37.151    186.780
n = 200        84.335    439.607
Table S6. Benchmarking results for various values of p, n, and S for the approximate LOOCV procedure. The ridge estimator considered is given in equation (3) of the main text. The results represent median runtimes in seconds.

              p = 125    p = 250
n = 100, Rotation equivariant setting
S = 125        53.059    393.309
S = 250       100.733    802.741
n = 200, Rotation equivariant setting
S = 125       105.372    779.881
S = 250       212.129  1,561.064
S5 and S6. The code used in this exercise can also be found in Listing 6 in Section 4 of this Supplement.

Tables S5 and S6 indicate that, even under these relatively light settings, the runtimes for the aLOOCV and root-finding LOOCV procedures far exceed the runtime of (corresponding) calls to CNplot. The runtime for the root-finding LOOCV procedure increases (given p) with the sample size. The runtime for the aLOOCV procedure increases (given p) with larger sample sizes, finer search-grids, and (not shown) situations of rotation non-equivariance.
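The difference between the rotation equivariant and non-equivariant settings discussed in this section can be sketched in a few lines of base R. The sketch below is not the CNplot implementation. It assumes, for illustration only, a generic convex-combination ridge form (1 − λ)Σ̂ + λT with a well-conditioned target T, which need not coincide with estimators (1) to (3) of the main text. The function cn_path and all object names are hypothetical. With a scalar target the eigenvalues along the whole penalty path follow from a single spectral decomposition, whereas a non-scalar target requires one decomposition per step.

```r
## Illustrative sketch (assumed ridge form: (1 - lambda) * S + lambda * Tgt)
set.seed(333)
p <- 60
n <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
S <- cov(X)                                   # singular, since p > n
lambdas <- seq(0.01, 1, length.out = 50)      # penalty grid (assumed)

## Rotation equivariant case: scalar target alpha * I.
## One decomposition, after which each step is a cheap scalar update.
alpha <- mean(diag(S))
d <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
cn_equi <- ((1 - lambdas) * max(d) + lambdas * alpha) /
           ((1 - lambdas) * min(d) + lambdas * alpha)

## General case: one spectral decomposition for every step on the grid
cn_path <- function(S, Tgt, lambdas) {
  vapply(lambdas, function(l) {
    ev <- eigen((1 - l) * S + l * Tgt, symmetric = TRUE,
                only.values = TRUE)$values
    max(ev) / min(ev)
  }, numeric(1))
}

## Rotation non-equivariant case: non-scalar (diagonal) target
Tns <- diag(diag(S))
cn_nonequi <- cn_path(S, Tns, lambdas)

## The shortcut agrees with the brute-force path for the scalar target
cn_check <- cn_path(S, alpha * diag(p), lambdas)
print(all.equal(cn_equi, cn_check))
```

Wrapping the two computations in system.time (or in a microbenchmark call) gives a rough impression of the runtime gap on a given machine; the absolute timings will differ from those reported in Tables S2 to S4.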
4. R Codes

The R libraries needed to run the code snippets below can be found in Listing 1:

Listing 1. Packages and dependencies

####################################################
##-------------------------------------------------
## Packages and dependencies
##-------------------------------------------------
####################################################

## R libraries
library(biomaRt)
library(cgdsr)
library(KEGG.db)
library(microbenchmark)
library(psych)
library(DandEFA)
library(gridExtra)
library(plyr)
library(rags2ridges)
The code in Listing 2 will produce Figure 1 of the main text:

Listing 2. Code for Figure 1

####################################################
##-------------------------------------------------
## First example of condition number plot
##-------------------------------------------------
####################################################

## Obtain covariance matrix on p > n data
p = 100
n = 25
set.seed(333)
X = matrix(rnorm(n * p), nrow = n, ncol = p)
Cx