A Relaxed Approach to Estimating Large Portfolios and Gross Exposure

arXiv:1611.07347v1 [math.ST] 22 Nov 2016

Mehmet Caner∗        Esra Ulasan†        Laurent Callot‡        A. Özlem Önder§

November 23, 2016

Abstract

In this paper, using a novel proof technique, we show that the estimation error of the optimal portfolio variance decreases as the number of assets increases, in contrast to the existing literature, notably results derived from factor-based models. The proof is based on the nodewise regression concept. Nodewise regression is a technique that helps us develop an approximate (relaxed) inverse for the non-invertible sample covariance matrix of asset returns. The results are valid even when the number of assets, p, is larger than the sample size, n. We also consistently estimate the gross exposure of a large portfolio, in both the growing and constant exposure cases, when p > n. Consistent estimation in the growing gross exposure case is new, and shows the feasibility of such a portfolio. The main results use sub-Gaussian returns, but they are extended to the weaker assumption of asset returns with bounded moments. Simulations verify and illustrate the theoretical results. In an out-of-sample forecasting application, we show that portfolios based on our estimator outperform portfolios estimated with existing approaches in the finance literature in terms of out-of-sample variance and Sharpe ratio.

Keywords: high-dimensionality, penalized regression, precision matrix, portfolio optimization



∗ Department of Economics, The Ohio State University, email: [email protected]
† Department of Economics, Ege University, email: [email protected]
‡ Amazon Development Center, Berlin, email: [email protected]
§ Department of Economics, Ege University, email: [email protected]

1 Introduction

Estimating the covariance matrix and its inverse (the precision matrix) is a fundamental part of optimal portfolio allocation and portfolio risk assessment. With very large portfolios, two key problems are the estimation of the optimal portfolio variance and of the gross exposure. This paper shows that with an increasing number of assets, the estimation error of the optimal portfolio variance decreases, unlike in the previous literature. This is shown when the number of assets, p, is larger than the sample size, n. Consistent estimation of the gross exposure of a large portfolio is also proven, even when p > n. We show this result both for the growing and the constant exposure cases. These results are new and extend our knowledge of the behavior of large portfolios. All the results we derive depend on the notion of inversion of a large covariance matrix.

It is well-known that the sample covariance matrix, denoted Σ̂, is singular when the dimension p is larger than n; hence it is not invertible. In the statistical literature, in order to handle the curse of dimensionality, the vast majority of work has focused on inference for large covariance and precision matrices under sparsity conditions, that is, many off-diagonal components are either zero or close to zero; see e.g. the Smoothly Clipped Absolute Deviation (SCAD) method in Fan and Li (2001), the Adaptive Least Absolute Shrinkage and Selection Operator (Lasso) method in Zou (2006), the thresholding method in Bickel and Levina (2008a), generalized thresholding in Rothman et al. (2009) and adaptive thresholding in Cai and Liu (2011). Other thresholding methods include Bickel and Levina (2008b), Cai and Zhou (2012), El Karoui (2008), Cai and Yuan (2012) and Lam and Fan (2009). In particular, the ℓ₁ penalized likelihood estimator has been considered by several authors, see e.g. Rothman et al. (2008), Friedman et al. (2008), Ravikumar et al. (2011), Cai et al. (2011), d'Aspremont et al. (2008), Javanmard and Montanari (2013), O. Banerjee et al. (2008) and Fan et al. (2009).
There are also banding methods, including re-parameterization of the covariance matrix and its inverse through the Cholesky decomposition, see e.g. Wu and Pourahmadi (2003), Wong et al. (2003), Huang et al. (2006), Yuan and Lin (2007), Levina et al. (2008), Rothman et al. (2010). Further, for neighborhood selection in sparse graphical models, the prominent article of Meinshausen and Bühlmann (2006) develops the nodewise regression technique to handle inverses of large matrices. Recently, van de Geer et al. (2014) constructed asymptotically honest confidence intervals for low-dimensional components of a large-dimensional parameter vector by desparsifying the lasso estimator. They also reconsider the lasso with the nodewise regression method, first introduced by Meinshausen and Bühlmann (2006), to obtain an approximate inverse of the Gram matrix for the desparsified lasso. Moreover, Caner and Kock (2014) examined the same issue for high-dimensional models in the presence of conditional heteroskedasticity in the error terms. They derive the conservative lasso estimator and asymptotically honest confidence bands with an increasing number of parameters. They also construct an approximate inverse matrix with nodewise regression based on the conservative lasso rather than the plain lasso.

Of particular interest to us is the study of large dimensional portfolios. Empirical evidence that


is related to the poor performance of mean-variance portfolios due to insufficient sample size can be found in Michaud (1989), Green and Hollifield (1992), Brodie et al. (2009), Jagannathan and Ma (2003) and DeMiguel et al. (2009a). In the high-dimensional case, estimation accuracy can be adversely affected by the distorted eigenstructure of Σ̂, as expressed by Daniels and Kass (2001). As noted by Markowitz (1990), this implies it will not be feasible to estimate an efficient portfolio when p > n.

There are several approaches to estimating a large covariance matrix. The most convenient is to use higher frequency data, i.e. using daily returns instead of monthly returns, see e.g. Jagannathan and Ma (2003), Liu (2009). Early solutions to this problem via shrinking the eigenvalues of the covariance matrix by Stein-type estimators can be found in, e.g., Stein (1956), Efron and Morris (1976), Dey and Srinivasan (1985), Jorion (1986), Yang and Berger (1994), Dey (1987), Tsukuma and Konno (2006), Tsukuma (2016). For shrinkage approaches, see Schäfer and Strimmer (2005), Warton (2011), Kubokawa and Srivastava (2008) and the references therein. One approach has been proposed by Ledoit and Wolf (2003, 2004), who respectively use a convex combination of Σ̂ with Sharpe's (1963) Capital Asset Pricing Model (CAPM) based covariance matrix and with the identity matrix. Recently, Ledoit and Wolf (2012) propose a nonlinear shrinkage approach for the sample eigenvalues in the case p < n. Ledoit and Wolf (2015) extend this nonlinear shrinkage approach to the case p > n. Ledoit and Wolf (2004) and Ledoit and Wolf (2015) are concerned only with covariance estimation; no portfolio theory is studied there. From a similar point of view, Kourtis et al. (2012) introduce a linear combination shrinking the inverse of Σ̂ towards a target matrix in the case p < n. Bodnar et al. (2014) extend the work of Ledoit and Wolf (2004) by constructing a general linear shrinkage estimator such that the target matrix is an arbitrary symmetric positive definite matrix with uniformly bounded trace norm.

The latest approach is to impose a factor structure to reduce the high dimensionality of the covariance matrix, see e.g. Fan et al. (2011), Fan et al. (2013). Fan et al. (2008) propose a factor-based covariance matrix estimator depending on observable factors and impose zero cross-correlations in the residual covariance matrix. They also derive the rates of convergence of portfolio risk in large portfolios. Fan et al. (2015) propose the principal orthogonal complement thresholding (POET) estimator for the case where the factors are unobservable. Fan et al. (2008) and Fan et al. (2015) are very valuable in understanding the variance and exposure of large portfolios. Chen et al. (2015) propose a factor-based estimator of a high-dimensional covariance matrix when p is proportional to n or p > n. Fan et al. (2016) and Aït-Sahalia and Xiu (2016) consider estimation of the precision matrix under different norms when p > n in a factor model structure. There is no estimation of portfolio variance or gross exposure in these papers when p > n. Finally, Bodnar et al. (2016) derive the distributional properties of the generalized inverse Wishart (GIW) random matrix under singularity of the covariance matrix and normally distributed asset returns, with no asymptotics. Since the recent works by Chen et al. (2015) and Bodnar et al. (2016) consider a Markowitz portfolio optimization problem different from ours, we do not report results regarding these studies.

Some approaches directly address the portfolio weight vector, w. Jagannathan and Ma (2003) show that imposing a non-negativity (no short-sale) constraint on the portfolio weight vector turns out to have a shrinkage-like effect on the covariance matrix estimate. On the contrary, DeMiguel et al. (2009b) indicate that even imposing short-sale constraints does not reduce the estimation error, and hence they suggest a naive portfolio diversification strategy based on allocating an equal proportion of wealth to each asset. Brodie et al. (2009) reformulate Markowitz's risk minimization problem as a regression problem by adding an ℓ₁ penalty to the constrained objective function to obtain sparse portfolios. DeMiguel et al. (2009a) propose optimal portfolios with ℓ₁ and ℓ₂ constraints, allowing their strategy to nest different benchmarks such as Jagannathan and Ma (2003), Ledoit and Wolf (2003, 2004), DeMiguel et al. (2009b). Li (2015) formulates ℓ₁ norm constraints and a combination of ℓ₁ and ℓ₂ norm constraints, known as the elastic net (Zou and Hastie, 2005), on w. Fan et al. (2012) introduce ℓ₁ regularization to identify optimal large portfolio selection by imposing gross-exposure constraints. These studies have generated significant progress by shrinking the components of the asset weights up to a given threshold, which prevents overfitting in the optimal portfolios. However, they focus on shrinking the norm-constrained portfolio weight vector rather than shrinking the moments of stock returns. In other words, they suggest optimal portfolios in the presence of estimation errors, as pointed out by DeMiguel et al. (2009a). Nevertheless, more effort needs to be devoted to precise estimation of the moments in order to mitigate estimation error. To get better weights, we need to estimate the inverse of the covariance matrix of returns. Our method can do that when p > n. Also, unlike the other methods, we do not need to shrink the covariance matrix first.

Shrinking the covariance matrix first assumes that the researcher knows a priori which returns are uncorrelated with the others. In this paper, we construct an approximate inverse of the non-invertible empirical covariance matrix of asset returns by nodewise regression with ℓ₁ penalty (see Section 2) in a high-dimensional setting. Unlike the previous factor-model based literature, we derive estimation errors for the optimal portfolio variance which converge to zero in probability when 1/p → 0. This is a new result, and it overturns the previous literature, which finds that these errors grow with p and decrease only with n. Even though theoretical results in the previous literature show that estimation errors grow with p, the simulations there show errors decreasing with increasing p. In that respect, we also solve this puzzle in the simulations of the previous literature. The upper bound in the previous theoretical literature for the estimation error of the optimal variance was not sharp. Specifically, in Fan et al. (2008), the estimation error for the optimal portfolio variance converges to zero in probability when p²K²√(log n/n) → 0, where K is the number of factors, and K grows with the sample size. So the previous literature allows only p < n in estimating the optimal portfolio variance. Also, we do not assume sparsity of the covariance matrix of asset returns. We assume a mild sparsity condition on the inverse of the covariance matrix of asset returns when p > n. An analysis of gross exposure is carried out for the p > n case for the first time in the literature as well. The constant exposure case is also analyzed, and we extend our benchmark results to more general data sets.

We compare nodewise regression with the POET estimator of Fan et al. (2015) in terms of portfolio risk, portfolio weight estimation error and portfolio exposure. Our simulation results, based on randomly generated covariance matrices calibrated to US equity market data and on a known-factor based data generating process as in Fan et al. (2008), indicate that the nodewise regression approach performs well. We carry out an empirical investigation of out-of-sample portfolio performance, which provides good results. Note that factor models can be tied to existing finance theory through some assumptions forming a structural model. Our results combine only Markowitz-based portfolio formation with the new statistical literature. Our theorems clearly show that the approximation error of the portfolio variance is smaller than for the factor-model based ones.

The rest of the paper is organized as follows. Section 2 introduces nodewise regression with ℓ₁ penalty, the approximate inverse of the empirical Gram matrix and its asymptotic properties. We show convergence rates in estimating the global minimum variance portfolio and the mean-variance optimal portfolio by use of nodewise regression. In Section 3 we give a brief overview of the existing literature on covariance matrix estimators and present simulation results. An empirical study is considered in Section 4. Finally, Section 5 concludes. All proofs are given in Appendices A and B. Let ‖ν‖_∞, ‖ν‖₁, ‖ν‖₂ denote the sup, ℓ₁ and Euclidean norms of a generic vector ν, respectively.

2 The Lasso for Nodewise Regression

In this section, we employ lasso nodewise regression, developed by Meinshausen and Bühlmann (2006) and van de Geer et al. (2014) along the lines of the plain lasso of Tibshirani (1994), to construct the approximate inverse Θ̂ of Σ̂.

2.1 Method

For each j = 1, …, p, the lasso nodewise regression is defined as:

    γ̂_j := argmin_{γ ∈ R^{p−1}} ( ‖r_j − r_{−j} γ‖₂² / n + 2λ_j ‖γ‖₁ ),    (2.1)

with the n × p design matrix of asset returns r = [r₁, …, r_p]. Here r_j is an (n × 1) vector denoting the jth column of r, and r_{−j} denotes all columns of r except the jth one. γ̂_j is the (p − 1)-dimensional regression coefficient estimate, specifically γ̂_j = {γ̂_{jk}; k = 1, …, p, k ≠ j}, which will be used in estimating the relaxed inverse of the covariance matrix. λ_j is a positive tuning parameter which determines the size of the penalty on the parameters. We denote S_j := {k : γ_{jk} ≠ 0}, which will be the main building block of the jth row of the inverse covariance matrix of asset returns in the portfolio, and its cardinality is given by s_j := |S_j|. The idea is to use a reasonable approximation of an inverse of Σ̂. Let Θ̂ be such an inverse, given by nodewise regression with ℓ₁ penalty on the design r. We run the lasso p times, once for each regression problem of r_j versus r_{−j}, where the latter is the design submatrix without the jth column.

To derive Θ̂, first define

    Ĉ := ⎡ 1          −γ̂_{1,2}   ⋯   −γ̂_{1,p} ⎤
         ⎢ −γ̂_{2,1}   1          ⋯   −γ̂_{2,p} ⎥
         ⎢ ⋮           ⋮          ⋱   ⋮        ⎥
         ⎣ −γ̂_{p,1}   −γ̂_{p,2}   ⋯   1        ⎦
and write T̂² = diag(τ̂₁², …, τ̂_p²), a p × p diagonal matrix, for all j = 1, …, p. The elements of T̂² are:

    τ̂_j² := ‖r_j − r_{−j} γ̂_j‖₂²/n + λ_j ‖γ̂_j‖₁.    (2.2)

Then define the relaxed inverse Θ̂ as:

    Θ̂ := T̂^{−2} Ĉ.    (2.3)

Θ̂ is considered a relaxed (approximate) inverse of Σ̂. Note that although Σ̂ is self-adjoint, Θ̂ is not. Define the jth row of Θ̂, Θ̂_j, as a 1 × p vector, and analogously Ĉ_j. Thus Θ̂_j = Ĉ_j / τ̂_j².

The specific algorithm to compute Θ̂ is given below; we only use p regressions.

1. Estimate γ_j, and repeat for all j. Specify the (p − 1)-dimensional vector γ̂_j by applying the lasso estimator to the regression of r_j on r_{−j} as in (2.1).
2. Choose λ_j by a modified BIC in the nodewise regressions.
3. Estimate the matrix T̂² from the γ̂_j values by (2.2) with the optimally chosen λ_j: T̂² = diag(τ̂₁², …, τ̂_p²), a (p × p) diagonal matrix, for all j = 1, …, p.
4. Take the transpose of γ̂_j, forming the (p − 1) off-diagonal entries of the jth row, and multiply all elements by minus one.
5. Compute the Ĉ matrix.
6. Use (2.3) to get Θ̂.

Next, we show why Θ̂ is an approximate inverse of Σ̂. The Karush-Kuhn-Tucker conditions for the nodewise lasso (2.1) imply that

    τ̂_j² = (r_j − r_{−j} γ̂_j)′ r_j / n.    (2.4)

Dividing each side by τ̂_j² and using the definition of Θ̂_j shows that

    r_j′ r Θ̂_j′ / n = (r_j − r_{−j} γ̂_j)′ r_j / (τ̂_j² n) = 1,    (2.5)

which shows that the jth diagonal term of Θ̂Σ̂ equals exactly one. The Karush-Kuhn-Tucker conditions for the nodewise lasso (2.1) imply, for the off-diagonal terms of Θ̂Σ̂, that

    ‖r_{−j}′ r Θ̂_j′‖_∞ / n ≤ λ_j / τ̂_j².    (2.6)

Hence, combining (2.5) and (2.6) for the diagonal and off-diagonal terms of Θ̂Σ̂, we have

    ‖Σ̂ Θ̂_j′ − e_j‖_∞ ≤ λ_j / τ̂_j²,    (2.7)

where e_j is the jth unit column vector (p × 1).
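As a concrete illustration of the six-step algorithm and the KKT identities above, here is a minimal numerical sketch, our own code rather than the authors': it uses a plain coordinate-descent lasso with a single fixed λ for every regression (the modified-BIC choice of λ_j in step 2 is omitted), and all function and variable names are ours.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for min_g ||y - X g||_2^2 / n + 2 * lam * ||g||_1, as in (2.1)."""
    n, k = X.shape
    g = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0) / n                 # X_m' X_m / n for each column m
    for _ in range(n_sweeps):
        for m in range(k):
            r_partial = y - X @ g + X[:, m] * g[m]    # residual with feature m removed
            rho = X[:, m] @ r_partial / n
            g[m] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[m]  # soft-threshold
    return g

def nodewise_precision(r, lam):
    """Relaxed inverse Theta_hat = T_hat^{-2} C_hat from p nodewise lasso regressions."""
    n, p = r.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        idx = [m for m in range(p) if m != j]
        g = lasso_cd(r[:, idx], r[:, j], lam)
        C[j, idx] = -g                                 # off-diagonal entries of row j: -gamma_hat_j
        resid = r[:, j] - r[:, idx] @ g
        tau2[j] = resid @ resid / n + lam * np.abs(g).sum()   # tau_hat_j^2, eq. (2.2)
    return C / tau2[:, None]                           # Theta_hat = T_hat^{-2} C_hat, eq. (2.3)

# Demonstration: by (2.5) the diagonal of Theta_hat @ Sigma_hat is (numerically) one.
rng = np.random.default_rng(0)
r = rng.standard_normal((200, 10))
theta_hat = nodewise_precision(r, lam=0.1)
sigma_hat = r.T @ r / 200
print(np.round(np.diag(theta_hat @ sigma_hat), 4))
```

On simulated i.i.d. Gaussian returns the printed diagonal of Θ̂Σ̂ is numerically one, matching (2.5), while the off-diagonal entries are only controlled up to λ_j/τ̂_j², as in (2.6).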

2.2 Asymptotics

In this section we provide asymptotic results. First, we show that the nodewise regression estimate of the inverse covariance matrix is consistent, and then we show the consistency of the portfolio variance estimate. These results are all uniform over j = 1, …, p. To derive our theorems, we need the following assumptions.

Assumption 1. r, the n × p matrix of asset returns, has i.i.d. sub-Gaussian rows.

Assumption 2. The following condition holds:

    max_j s_j √(log p / n) = o(1).

Assumption 3. The smallest eigenvalue of Σ, Λ_min, is strictly positive and 1/Λ_min = O(1). Moreover, max_j Σ_{j,j} = O(1). Also, the minimal eigenvalue of Θ = Σ^{−1} is strictly positive.

The assumptions above are Assumptions (B1)-(B3) of van de Geer et al. (2014). Assumption 1 can be relaxed to non-i.i.d., non-sub-Gaussian cases as shown in Caner and Kock (2014), and in subsection 2.5 we generalize our results to asset returns with bounded moments. Assumption 2 allows us to have p > n. Note that when p > n, it means that max_j s_j = o(√(n/log p)). In the simple case p = an, where a > 1 is a constant, max_j s_j = o(√(n/log(an))), which grows with n but is smaller than the maximum possible number of nonzeros in a row, p − 1. If the problem is such that p < n, then it is possible that max_j s_j = p − 1 (no sparsity in the inverse of the covariance matrix), in which case Assumption 2 can be written as p √(log p / n) = o(1). Note that our Assumption 2 has an extra √(max_j s_j) factor compared to van de Geer et al. (2014). This is due to the portfolio optimization problem. Assumption 2 is a sparsity condition on the inverse of the covariance matrix of asset returns. It does not imply sparsity of the covariance matrix of asset returns itself. Fan et al. (2008) assume sparsity of the residual covariance matrix of asset returns. A specific case of Fan et al. (2015), on the other hand, has the same structure as our Assumption 2, but sparsity is again imposed on the residual covariance matrix rather than on the inverse of the covariance matrix. Assumption 3 allows the population covariance matrix to be nonsingular.

The following Lemma shows one of the main results of our paper, and can be deduced from the proof of Theorem 2.4 of van de Geer et al. (2014). We provide a proof in our Appendix for completeness. It shows that the nodewise regression estimate of the inverse of the covariance matrix is uniformly (over j) consistent.

Lemma 1. Under Assumptions 1-3, with λ_j = O(√(log p/n)) uniformly in j = 1, …, p,

    ‖Θ̂Σ̂ − I_p‖_∞ = O_p(√(log p/n)) = o_p(1).

Note that λ_j = O(√(log p/n)) is a standard assumption in van de Geer et al. (2014), and can also be found in Bühlmann and van de Geer (2011). This rate is derived from concentration inequalities. This Lemma shows that we can estimate the inverse matrix even when p > n.

2.3 Optimal Portfolio Allocation and Risk Assessment

Consider a given set of p risky assets with returns at time t denoted by r_{it}, i = 1, …, p. We denote the (p × 1) vector of returns by r_t = (r_{1t}, …, r_{pt})′. We assume that the returns are stationary with E[r_t] = μ, where μ = (μ₁, …, μ_p)′. The full rank covariance matrix of the returns is Σ ∈ R^{p×p}, where Σ = E[(r_t − μ)(r_t − μ)′]. A portfolio allocation is defined as a composition of weights, w ∈ R^p. Specifically, w = (w₁, …, w_p)′ represents the relative amount of wealth invested in each asset.

2.3.1 Global Minimum Variance Portfolio

First, we start with unconstrained optimization. The aim is to minimize the variance without an expected return constraint. Denote by w_u the w that solves the global minimum variance portfolio problem:

    w_u = argmin_w (w′Σw)  such that  w′1_p = 1.

Note that p. 370 of Fan et al. (2015) shows that the weights of the global minimum variance portfolio are:

    w_u = Σ^{−1}1_p / (1_p′ Σ^{−1} 1_p).    (2.8)

We also give the following estimate of the weights of the global minimum variance portfolio:

    ŵ_u = Θ̂1_p / (1_p′ Θ̂ 1_p).    (2.9)

Note that the global minimum variance without a constraint on the expected return is

    w_u′ Σ w_u = (1_p′ Σ^{−1} 1_p)^{−1},    (2.10)

as shown in equation (11) of Fan et al. (2008). So denote the global minimum variance portfolio by Φ_G = (1_p′ Σ^{−1} 1_p)^{−1}, and its estimate by

    Φ̂_G = (1_p′ Θ̂ 1_p)^{−1}.    (2.11)
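Given any precision-matrix estimate (for instance Θ̂ from the nodewise regressions), the estimators (2.9) and (2.11) are one line each. The sketch below uses our own naming, with a toy identity matrix purely as a sanity check.

```python
import numpy as np

def gmv_portfolio(theta_hat):
    """Plug-in global minimum variance weights (2.9) and variance estimate (2.11)."""
    ones = np.ones(theta_hat.shape[0])
    denom = ones @ theta_hat @ ones          # 1_p' Theta_hat 1_p
    w_u_hat = theta_hat @ ones / denom       # eq. (2.9)
    phi_g_hat = 1.0 / denom                  # eq. (2.11)
    return w_u_hat, phi_g_hat

# With Theta_hat = I_4 the weights are equal (1/4 each) and the variance is 1/4.
w, phi = gmv_portfolio(np.eye(4))
print(w, phi)
```

Note that, as in the text, the variance is estimated directly from (2.11) rather than by plugging the estimated weights into w′Σ̂w.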

In other words, we estimate the global minimum variance portfolio directly. This is also the approach used by Theorem 5 in Fan et al. (2008). A less straightforward estimate of the global minimum variance portfolio could have been obtained by using (2.9) directly to estimate the left side of (2.10). That route uses the estimates of the weights and unnecessarily complicates the estimator.

The next result shows estimation of the global minimum variance using the nodewise regression method for the inverse of the covariance matrix. This is one of the main results of this paper. It is remarkable that p > n is allowed in the global minimum variance portfolio. Also, the estimation error converges to zero in probability at a rate faster than 1/p. In that sense, a large number of assets helps in reducing the errors; as far as we know, this is a new finding in this literature.

Theorem 1. Under Assumptions 1-3, with λ_j = O(√(log p/n)) uniformly in j,

    |Φ̂_G − Φ_G| = o_p(1/p).

Remarks 1. Note that Theorem 5 of Fan et al. (2008) derives the result in Theorem 1 above using a factor-model based covariance inverse. Their estimation error converges to zero in probability only when n is much larger than p. Denote by K the number of factors, which in their case grows with the sample size but more slowly than p. Specifically, in Fan et al. (2008), the estimation error above converges in probability to zero when p²K²(log n/n)^{1/2} → 0. Since our rate of convergence is 1/p, our estimation error converges to zero in probability much faster than in the Fan et al. (2008) result. In Fan et al. (2008) the estimation error grows quadratically with p; in our case, it declines with p. There are two reasons for such a result. First, the nodewise regression method is used in our case, and second, we use a different proof technique, which takes better account of the rates of the portfolio variance. We allow for p > n (when max_j s_j = o(√(n/log p)) via Assumption 2), whereas the Fan et al. (2008) result only works when p < n. With a large number of factors used in Fan et al. (2008), we expect our method to fare well compared to theirs. Note that Theorem 5 of Fan et al. (2008) also has a sample-variance based global minimum variance estimator, but it fares poorly compared with their factor-based one.
2. We can analyze the following worst case scenario for our model. What if the number of factors K in Fan et al. (2008) is fixed, and also s_j = p − 1 for all j (i.e. the number of nonzero elements in each row of the inverse of the covariance matrix is p − 1) in our model? That case is not compatible with p > n for our theorem, due to Assumption 2. However, if we still impose Assumption


2, we can only allow p < n, p/n → 0. Clearly, our method can still be better in this non-sparse scenario than the Fan et al. (2008) rate if p²√(log n/n) > 1/p, which is the case when (n/log n)^{1/6} < p < n.
3. The main reason we can get better approximations than the factor-model approach of Fan et al. (2008) is our use of the nodewise regression of van de Geer et al. (2014) to get the inverse of the large covariance matrix. Even when p > n, we can estimate an approximate inverse of the covariance matrix. The nodewise-regression based inverse estimate has a far better approximation rate than the factor-model based one in Theorem 3 of Fan et al. (2008). Their rate is o_p(pK²(log n/n)^{1/2}), compared to our O_p((log p/n)^{1/2}) rate in Lemma 1. In other words, the factor-based inverse of Theorem 3 of Fan et al. (2008) does not allow p > n, whereas our technique allows it. The Fan et al. (2008) technique first uses a factor model to estimate the sample covariance and then inverts it. In our case, we estimate the inverse directly through nodewise regression. Recent papers such as Fan et al. (2016) allow the inverse of the covariance matrix to be estimated at our rate, but portfolio optimization is not analyzed there, and it is neither trivial nor obvious how to use the estimated precision matrix in portfolio theory.

2.3.2 Markowitz Mean-Variance Framework

Markowitz (1952) defines the portfolio selection problem as finding the optimal portfolio that has the least variance, i.e. portfolio risk, for a given expected return ρ. At time t, an investor determines the portfolio weights to minimize the mean-variance objective function:

    w = argmin_w (w′Σw)  subject to  w′1_p = 1  and  w′μ = ρ,

where 1_p = (1, …, 1)′. For a given portfolio w, w′μ and w′Σw equal the expected rate of return and the variance, respectively. The full investment constraint (w′1_p = 1) requires that the weights sum to 1. The target return constraint (w′μ = ρ) indicates a certain level of desired expected portfolio return. Throughout the paper we assume short-selling is allowed, hence weights in the portfolio may be negative. The well-known solution from the Lagrangian and the first order conditions of the constrained quadratic optimization is (as in equation (9) of Fan et al. (2008)):

    w∗ = [(D − ρB)/(AD − B²)] Σ^{−1}1_p + [(ρA − B)/(AD − B²)] Σ^{−1}μ,    (2.12)

where A = 1_p′ Σ^{−1} 1_p, B = 1_p′ Σ^{−1} μ and D = μ′ Σ^{−1} μ. Since Σ is positive-definite, A > 0 and D > 0. By the Cauchy-Schwarz inequality, the system has a solution if AD − B² > 0.

Now we derive our second result. This uses the optimal weights w∗ to get the variance of the optimal portfolio. We form the following estimate of the optimal weight w∗ using (2.12):

    ŵ = [(D̂ − ρB̂)/(ÂD̂ − B̂²)] Θ̂1_p + [(ρÂ − B̂)/(ÂD̂ − B̂²)] Θ̂μ̂,

where Â = 1_p′ Θ̂ 1_p, B̂ = 1_p′ Θ̂ μ̂, D̂ = μ̂′ Θ̂ μ̂, and μ̂ = n^{−1} Σ_{t=1}^{n} r_t.
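The plug-in weight estimate above can be sketched as follows (a hedged illustration with our own names; the precision estimate and mean estimate are taken as given inputs). Both constraints w′1_p = 1 and w′μ = ρ hold by construction, which gives a quick correctness check.

```python
import numpy as np

def mv_weights(theta_hat, mu_hat, rho):
    """Plug-in sample analogue of the Markowitz solution (2.12)."""
    ones = np.ones(theta_hat.shape[0])
    A = ones @ theta_hat @ ones
    B = ones @ theta_hat @ mu_hat
    D = mu_hat @ theta_hat @ mu_hat
    det = A * D - B ** 2                     # positive by Cauchy-Schwarz when mu is not constant
    return ((D - rho * B) / det) * (theta_hat @ ones) \
        + ((rho * A - B) / det) * (theta_hat @ mu_hat)

# The two constraints hold by construction: w'1_p = 1 and w'mu = rho.
mu = np.array([0.10, 0.20, 0.05])
w = mv_weights(np.eye(3), mu, rho=0.12)
print(w.sum(), w @ mu)
```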

Now define

    Ψ_OPV := (Aρ² − 2Bρ + D)/(AD − B²),

and its estimate

    Ψ̂_OPV := (Âρ² − 2B̂ρ + D̂)/(ÂD̂ − B̂²).

These will be used immediately below. Note that the variance of the optimal portfolio from the constrained optimization above is

    w∗′ Σ w∗ = (Aρ² − 2Bρ + D)/(AD − B²) = Ψ_OPV.    (2.13)

The estimate of the above optimal portfolio variance is

    ŵ′ Σ̂ ŵ = (Âρ² − 2B̂ρ + D̂)/(ÂD̂ − B̂²) = Ψ̂_OPV.

Note that we have not used an estimate of the optimal portfolio variance based on first estimating w∗ from (2.12) and the equation immediately below it. That would have created a much more complicated estimator. Instead, like Fan et al. (2008), we use the optimal portfolio variance expression and estimate the closed form solution in (2.13). The following Theorem provides the rate of the error in estimating the optimal portfolio variance. Our next result shows that, unlike Theorem 6 of Fan et al. (2008), we allow p > n in estimating the optimal portfolio variance. It is also clear from our result that the estimation error converges in probability to zero at a rate faster than 1/p, which shows that adding assets to the portfolio helps.

Theorem 2. Under Assumptions 1-3, assuming uniformly in j, λ_j = O(√(log p/n)), with p^{−2}(AD − B²) ≥ C₁ > 0, where C₁ is a positive constant, and ρ being bounded, we get

    |Ψ̂_OPV − Ψ_OPV| = o_p(1/p).
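A small sketch of the plug-in variance (our own naming, with the precision and mean estimates taken as given): with Σ = Θ = I the closed form (2.13) agrees with w∗′Σw∗ computed from the weights in (2.12), which serves as a consistency check.

```python
import numpy as np

def opv(theta, mu, rho):
    """Optimal portfolio variance via the closed form in (2.13)."""
    ones = np.ones(theta.shape[0])
    A = ones @ theta @ ones
    B = ones @ theta @ mu
    D = mu @ theta @ mu
    return (A * rho ** 2 - 2 * B * rho + D) / (A * D - B ** 2)

# Cross-check against w*' Sigma w* with Sigma = Theta = I_2.
mu = np.array([0.1, 0.2])
rho = 0.15
A, B, D = 2.0, 0.3, 0.05                  # 1'1, 1'mu, mu'mu under Sigma = I
det = A * D - B ** 2
w_star = ((D - rho * B) / det) * np.ones(2) + ((rho * A - B) / det) * mu  # eq. (2.12)
print(opv(np.eye(2), mu, rho), w_star @ w_star)   # both 0.5
```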

Remarks 1. Theorem 2 above has a better convergence rate than Theorem 6 of Fan et al. (2008). There, the estimation error converges to zero in probability at the rate p²K²(log n/n)^{1/2}. Our rate of convergence, on the other hand, is 1/p, which is much faster than in Fan et al. (2008). It is also clear that with this type of result we can have p > n, as long as max_j s_j = o(√(n/log p)), due to Assumption 2. Note again that the Fan et al. (2008) result says the estimation error increases with p², whereas our result clearly shows it decreases with p. The main reason is that we have a different proof technique, and we were able to relax some assumptions of Fan et al. (2008). Our result is also in line with our simulations, as well as those of Fan et al. (2008).
2. As in Remark 3 to Theorem 1, an interesting question is: what if the number of factors is constant? Can we still approximate the optimal portfolio variance better than Fan et al. (2008)? We allow all the nodewise regression coefficients to be nonzero, i.e. s_j = p − 1 for all j. To satisfy our Assumption 2, we then need p < n, p/n → 0. But we can still do better in this non-sparse case than Fan et al. (2008), in terms of estimation rates, as long as (n/log n)^{1/6} < p < n, as in Remark 3 of Theorem 1.


3. Unlike Fan et al. (2008), we do not assume bounded fractions for A/(AD − B²), B/(AD − B²), D/(AD − B²). Fan et al. (2008) also assume AD − B² bounded away from zero, but our Lemma A.4 shows that each of A, B, D grows at rate p, so a more amenable assumption about the data may be p^{−2}(AD − B²) ≥ C₁ > 0. What may be a primitive condition for our assumption? Let Eigmin(·), Eigmax(·) represent the minimal and maximal eigenvalues of matrices, respectively. By the definitions of A, B, D,

    AD − B² = (1_p′ Θ 1_p)(μ′Θμ) − (1_p′ Θ μ)² ≥ C₂ p² (Eigmin(Θ))² − p² Eigmax(μμ′) Eigmax(ΘΘ′),

via a simple eigenvalue inequality and μ′μ = Σ_{j=1}^{p} μ_j² ≥ C₂ > 0, where C₂ is a positive constant, which means that the individual asset returns cannot all be zero simultaneously in our portfolio. So our primitive condition is

    p^{−2}(AD − B²) ≥ C₂ (Eigmin(Θ))² − Eigmax(μμ′) Eigmax(ΘΘ′) ≥ C₁ > 0.

2.4 Estimating Gross Exposure

Note that in this subsection we estimate the gross exposure of the portfolio. The next Theorem shows how we can consistently estimate the weights of the global minimum variance portfolio. This result works even with p > n.

Theorem 3. Under Assumptions 1 and 3 and the sparsity assumption (max_j s_j)^{3/2} √(log p/n) = o(1), with λ_j = O(√(log p/n)) uniformly in j, we have

    ‖ŵ_u − w_u‖₁ = O_p((max_j s_j)^{3/2} √(log p/n)) = o_p(1).

Remarks 1. Note that we have a very mild sparsity assumption. Instead of Assumption 2, the sparsity assumption here needs an extra √(max_j s_j) factor, due to the ℓ₁-norm approximation error for Θ̂ in our proof. Note that even though p > n is allowed, max_j s_j has to be constrained for the result in Theorem 3 to hold. Next we consider what happens if this constraint is relaxed.
2. One important case is a non-sparse inverse covariance matrix. In other words, what if s_j = p − 1 for all j = 1, …, p? This is the case with all cells of the inverse covariance matrix nonzero. Then, of course, p has to be less than n (p (log p)^{1/3} = o(n^{1/3})) so that (max_j s_j)^{3/2} √(log p/n) = (p − 1)^{3/2} √(log p/n) = o(1).
3. We consider whether we have growing or finite gross exposure; in other words, we need to analyze ‖w_u‖₁. By (5.60),

    ‖w_u‖₁ = O(√(max_j s_j)),

which may grow with n. As far as we know, this is the first result on estimated gross exposure when it is growing.

But if we further assume maxj sj = O(1), meaning if the number of nonzero elements in each row of inverse variance matrix is finite then we have kwu k1 = O(1). So finite gross exposure is possible with one extra assumption. With finite gross exposure Theorem 3 result is kw ˆu − wu k1 = Op (

p logp/n) = op (1).

This means that with finite gross exposure we get a better approximation compared with rate in Theorem 3. 4. In Remark 2 above we find rate of approximation for weights in the case of constant exposure. Fan et al. (2015), by using factor models, estimates Σ by a sparse structure in rows and then ˆ −1 to weight estimation, p.372 of Fan et al. (2015) gets the estimates Σ−1 . Basically, applying Σ same rate as Remark 2 above, as long as maximum number of non vanishing elements in each row in Σ is finite (i.e. case of parameter q = 0 in Assumption 4.2 of Fan et al. (2015)). We now turn to the estimator of the gross exposure of the portfolio, kw∗ k1 . The estimator is w, ˆ where we use `1 norm as in Fan et al. (2015) among others. The next Theorem established consistency of the portfolio gross exposure estimator with p > n, it is one of our main results and a novel contribution to the literature. p 1/2 = o(1) and assuming uniTheorem 4.Under Assumptions, 1, 3 with (maxj sj )3/2 logp/n p −2 2 formly in j, λj = O( logp/n) with p (AD − B ) ≥ C1 > 0, where C1 is a positive constant, and ρ being bounded as well, we have kw ˆ − w∗ k1 = Op ((max sj )3/2 j

p logp/n) = op (1).

Remarks. 1. We want to see whether we allow for growing gross exposure. In that respect, we need to know the rate of ||w*||_1. Note that

w* = [(D − ρB)/(AD − B²)] Θ1_p + [(ρA − B)/(AD − B²)] Θµ.

See that

||w*||_1 ≤ (|D − ρB| ||Θ1_p||_1)/|AD − B²| + (|ρA − B| ||Θµ||_1)/|AD − B²|.

With (5.60)-(5.63) we have ||Θ1_p||_1 = O(p √(max_j s_j)) and ||Θµ||_1 = O(p √(max_j s_j)). Then dividing both the numerator and the denominator in the above inequality by p², and using p^{-2}(AD − B²) ≥ C_1 > 0 together with the triangle inequality, provides

||w*||_1 = O(√(max_j s_j)),

where we use Lemma A.4 (A = O(p), D = O(p), |B| = O(p)), AD/p² bounded away from zero via Assumption 3, and ρ bounded. This shows that we allow for growing gross exposure.

2. Under Assumptions 1 and 3, imposing now max_{1≤j≤p} s_j = O(1), with √(log p/n) = o(1), p^{-2}(AD − B²) ≥ C_1 > 0, ρ bounded, and max_j λ_j = O(√(log p/n)), we have finite gross exposure, as can be seen from the proof of Theorem 4, and

||ŵ − w*||_1 = O_p(√(log p/n)) = o_p(1).

So the rate of approximating the optimal portfolio improves greatly. This shows that even for very large portfolios we can estimate the weights while successfully controlling the error.

3. Here, we show that with all coefficients in γ_j possibly nonzero and growing exposure (s_j = p − 1 for all j = 1, ..., p), the rate in Theorem 4 is

||ŵ − w*||_1 = O_p(p^{3/2} √(log p/n)),

which means we need p^{3/2} √(log p/n) = o(1). This shows that p (log p)^{1/3} = o(n^{1/3}), so p should be much less than n in this case.

2.5 Asset Returns with Bounded Moments

In this section, we relax the assumption of sub-Gaussianity of the asset returns. Caner and Kock (2014) study the case of non-sub-Gaussian data for the nodewise regression concept. Our Assumptions 1-2 change and Assumption 3 is kept. The new assumptions involve bounded moments and a more restrictive sparsity assumption, so there is a trade-off.

Assumption 1*. The asset return vector r_i is iid, but max_{1≤j≤p} E|r_{ij}|^l ≤ C, with l > 4.

Assumption 2*. The sparsity condition is

(max_j s_j) p^{2/l}/n^{1/2} = o(1).

Note that Caner and Kock (2014) find that uniformly in j, λ_j = O(p^{2/l}/n^{1/2}) through their Lemma 2. In the sub-Gaussian case the tuning parameter was of order (log p)^{1/2}/n^{1/2}, so we pay a price for relaxing the sub-Gaussianity of the returns by allowing fewer assets in our portfolio. In Appendix B we show which parts of the proofs of Theorems 1-4 are affected. Note also that our tuning parameter choice affects Assumption 2*, where compared to Assumption 2 we replace (log p)^{1/2} with p^{2/l}. We now provide the counterparts to Theorems 1-4, with Assumptions 1*, 2* replacing Assumptions 1-2. Theorems 5-6 concern optimal portfolio allocation under non-sub-Gaussian returns. In Theorems 5-8, we allow p > n. One particular case with p > n, via Assumption 2*, is max_j s_j = O(1), which leads to constant exposure.


Theorem 5. Under Assumptions 1*, 2*, 3, and assuming uniformly in j, λ_j = O(p^{2/l}/√n),

|Φ̂_G − Φ_G| = o_p(1/p).

Theorem 6. Under Assumptions 1*, 2*, 3, and assuming uniformly in j, λ_j = O(p^{2/l}/√n), with p^{-2}(AD − B²) ≥ C_1 > 0, where C_1 is a positive constant, and ρ bounded, we get

|Ψ̂_OPV − Ψ_OPV| = o_p(1/p).

Now we provide counterparts to Theorems 3-4, in terms of estimating the gross exposure of the portfolios.

Theorem 7. Under Assumptions 1*, 3 and the sparsity assumption (max_j s_j)^{3/2} p^{2/l}/√n = o(1), and assuming uniformly in j, λ_j = O(p^{2/l}/√n), we have

||ŵ_u − w_u||_1 = O_p((max_j s_j)^{3/2} p^{2/l}/√n) = o_p(1).

Note that it is possible to have p > n in Theorem 7, as long as l is large or max_j s_j = O(1). Theorem 8 also allows p > n.

Theorem 8. Under Assumptions 1*, 3, with (max_j s_j)^{3/2} p^{2/l}/√n = o(1), and assuming uniformly in j, λ_j = O(p^{2/l}/√n), with p^{-2}(AD − B²) ≥ C_1 > 0, where C_1 is a positive constant, and ρ bounded as well, we have

||ŵ − w*||_1 = O_p((max_j s_j)^{3/2} p^{2/l}/√n) = o_p(1).

Note that in all of Theorems 5-8, the main difference from Theorems 1-4 is that √(log p) is replaced everywhere in the approximations with p^{2/l}, l > 4. The reason is that the rate of the tuning parameter with non-sub-Gaussian returns is max_j λ_j = O(p^{2/l}/n^{1/2}).

3 Simulations

This section begins by discussing the implementation of the nodewise estimator, as well as alternative estimators. We then present the setup of the simulation study and discuss the results.

3.1 Implementation of the nodewise estimator

The nodewise regression approach is implemented using the coordinate descent algorithm of the glmnet package (Friedman et al., 2010). The penalty parameter λ is chosen by minimizing the Bayesian information criterion of Wang et al. (2009):

BIC(λ)_j = log(σ̂²_{λ,j}) + Σ_{i≠j} 1(γ̂_{j,i,λ} ≠ 0) (log(n)/n) log(log(p)),

where σ̂²_{λ,j} is the residual variance for asset j and penalty parameter λ, and γ̂_{j,i,λ} is the ith estimated parameter for asset j with penalty parameter λ. Alternative information criteria were investigated, including those proposed by Caner et al. (2017) and Zou et al. (2007), yielding only minor differences in the results. The entries of the covariance matrix are estimated equation by equation, and a different penalty parameter is selected for each equation.
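A minimal self-contained sketch of this procedure is below. It uses a plain-numpy coordinate-descent lasso instead of glmnet, and the candidate λ grid and all variable names are our own illustrative choices; τ̂_j² is formed as the residual variance plus λ_j||γ̂_j||_1, as in the nodewise construction.

```python
# Sketch (not the paper's code): nodewise regression estimate of Θ = Σ^{-1},
# with λ_j chosen per equation by the BIC of Wang et al. (2009).
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, k = X.shape
    b = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(k):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def nodewise_precision(R, lams):
    """Build Theta-hat row by row from nodewise lasso regressions."""
    n, p = R.shape
    Theta = np.zeros((p, p))
    idx = np.arange(p)
    for j in range(p):
        y, X = R[:, j], R[:, idx != j]
        best = None
        for lam in lams:                             # BIC over the lambda grid
            g = lasso_cd(X, y, lam)
            sig2 = ((y - X @ g) ** 2).mean()
            df = int((g != 0).sum())
            bic = np.log(sig2) + df * np.log(n) / n * np.log(np.log(p))
            if best is None or bic < best[0]:
                best = (bic, g, sig2, lam)
        _, g, sig2, lam = best
        tau2 = sig2 + lam * np.abs(g).sum()          # tau_j^2
        Theta[j, j] = 1.0 / tau2
        Theta[j, idx != j] = -g / tau2
    return Theta

rng = np.random.default_rng(1)
R = rng.normal(size=(400, 5))                        # independent assets: Θ = I
Theta_hat = nodewise_precision(R - R.mean(axis=0), lams=[0.05, 0.1, 0.2])
```

With independent assets the estimate should be close to the identity, since each nodewise regression has no true predictors.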

3.2 Alternative covariance matrix estimation methods

Ledoit-Wolf Shrinkage

Ledoit and Wolf (2004) propose an estimator for high-dimensional covariance matrices that is invertible and well-conditioned. Their estimator is a linear combination of the sample covariance matrix and an identity matrix, Σ̂_LW = ρΣ̂ + (1 − ρ)I_p, and they show that it is asymptotically optimal with respect to a quadratic loss function. Extensions and generalizations of the Ledoit and Wolf (2004) approach have been proposed by Ledoit and Wolf (2015) and Bodnar et al. (2016). The properties of this covariance matrix estimator, and of its inverse, when used to construct portfolios have not been investigated in the literature for our specific problems; we nevertheless use them in simulations. We do not report the results of this estimator, since the technique turns out to be unstable in our simulations and the existing theorems of Ledoit and Wolf (2004) do not apply; the results are available from the authors on request.
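For concreteness, a minimal sketch of a shrinkage estimator of this linear form is below. The shrinkage weight ρ is fixed by hand here purely for illustration; it is not the asymptotically optimal ρ derived by Ledoit and Wolf (2004).

```python
# Sketch: linear shrinkage toward the identity, Σ̂_LW = ρ Σ̂ + (1 - ρ) I_p.
# The weight rho is an arbitrary illustrative choice, not the optimal one.
import numpy as np

def shrink_to_identity(R, rho):
    """R: n x p returns; mixes the sample covariance with the identity."""
    n, p = R.shape
    Rc = R - R.mean(axis=0)
    S = Rc.T @ Rc / n                     # sample covariance (singular if p > n)
    return rho * S + (1.0 - rho) * np.eye(p)

rng = np.random.default_rng(2)
R = rng.normal(size=(60, 100))            # p = 100 > n = 60
S_lw = shrink_to_identity(R, rho=0.5)
eigmin = np.linalg.eigvalsh(S_lw).min()   # bounded below by (1 - rho) > 0
```

Even though the sample covariance is singular when p > n, the shrunk matrix is invertible: its smallest eigenvalue is at least (1 − ρ).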

3.2.1 Multi-factor Estimator

The Arbitrage Pricing Theory is derived by Ross (1976, 1977), and multi-factor models are proposed by Chamberlain and Rothschild (1983). These studies have motivated the use of factor models for the estimation of excess return covariance matrices. The model takes the form

r = Bf + ε,   (3.1)

where r is the p × n matrix of excess returns of the assets over the risk-free interest rate and f is a K × n matrix of factors. B is a p × K matrix of unknown factor loadings and ε is the matrix of idiosyncratic error terms, uncorrelated with f. This model yields an estimator for the covariance matrix of r:

Σ_FAC = B cov(f) B' + Σ_{n,0},   (3.2)

where Σ_{n,0} is the covariance matrix of the errors ε. When the factors are observed, as in Fan et al. (2008), the matrix of loadings B can be estimated by least squares. In the case where the factors are unobserved, Fan et al. (2013, 2015) propose the POET estimator of Σ_FAC. In our simulations we compare the nodewise regression approach to the POET estimator implemented as in Fan et al. (2015), which we describe below.

Let λ̂_1 ≥ ... ≥ λ̂_p be the ordered eigenvalues of the sample covariance matrix Σ̂ and {ξ̂_j}_{j=1}^p the corresponding eigenvectors. The estimated covariance matrix Σ̂_POET is defined as

Σ̂_POET = Σ_{j=1}^{K̂} λ̂_j ξ̂_j ξ̂_j' + Ω̂,   (3.3)

Ω̂_ij = Σ_{k=K̂+1}^p λ̂_k ξ̂²_{k,i}             if i = j,
Ω̂_ij = s_ij(Σ_{k=K̂+1}^p λ̂_k ξ̂_{k,i} ξ̂_{k,j})   if i ≠ j,   (3.4)

where Ω̂ = (Ω̂_ij)_{p×p} and s_ij(·): R → R is the soft-thresholding function s_ij(z) = (z − τ_ij)_+ (Antoniadis and Fan, 2001; Rothman et al., 2009; Cai and Liu, 2011), where (.)_+ denotes the larger of zero and its argument, and z denotes the ijth cell of the standard residual sample covariance matrix of ε. The residuals are the least squares residuals from the factor model. The entry-dependent threshold parameter is

τ_ij = ζ √(Ω̂_ii Ω̂_jj) (√(log p/n) + 1/√p),   (3.5)

where ζ > 0 is a user-specified positive constant used to maintain the finite-sample positive definiteness of Ω̂. In our implementation we initialize at ζ = 0 and increase ζ by increments of 0.1 until Ω̂ is invertible and well conditioned.

The number of factors K̂ is chosen using the information criterion of Bai and Ng (2002):

K̂ = argmin_{0≤k≤M} (1/p) tr(Σ_{j=k+1}^p λ̂_j ξ̂_j ξ̂_j') + k (p + n)/(pn) log(pn/(p + n)),   (3.6)

where M is a user-defined upper bound which we set to M = 7.

An alternative factor-based estimator is proposed by Chen et al. (2015), using the procedure of Onatski (2010) to select the number of factors and applying their estimator to a different form of the Markowitz (1952) optimization problem. For this reason we do not report results for the method of Chen et al. (2015). We also ran simulations with a three-factor model, but since it is very difficult to know the model a priori we do not report these here and instead focus on POET.
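The construction in (3.3)-(3.5) can be sketched as follows. This is a simplified illustration: K is fixed by hand rather than selected via (3.6), and the search over ζ for well-conditioning is omitted.

```python
# Sketch of the POET construction (3.3)-(3.5): top-K principal components
# plus a soft-thresholded residual covariance. K fixed for illustration.
import numpy as np

def poet(R, K, zeta):
    n, p = R.shape
    Rc = R - R.mean(axis=0)
    S = Rc.T @ Rc / n
    vals, vecs = np.linalg.eigh(S)                  # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]          # make descending
    low_rank = (vecs[:, :K] * vals[:K]) @ vecs[:, :K].T
    Omega = S - low_rank                            # residual covariance
    diag = np.diag(Omega).copy()
    # Entry-dependent threshold (3.5): zeta * sqrt(Ω_ii Ω_jj)(sqrt(log p/n)+1/sqrt(p)).
    tau = zeta * np.sqrt(np.outer(diag, diag)) * (np.sqrt(np.log(p) / n) + 1.0 / np.sqrt(p))
    Omega_thr = np.sign(Omega) * np.maximum(np.abs(Omega) - tau, 0.0)
    np.fill_diagonal(Omega_thr, diag)               # diagonal kept untouched
    return low_rank + Omega_thr

rng = np.random.default_rng(3)
n, p, K = 252, 40, 3
F = rng.normal(size=(n, K))
B = rng.normal(size=(p, K))
R = F @ B.T + rng.normal(size=(n, p))               # three-factor returns
Sigma_poet = poet(R, K=3, zeta=0.5)
```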


3.3 Simulation setup

In every simulation we use n = 252 observations, corresponding to one year of daily returns, and generate p assets, with p = 25, 50, ..., 500. For each value of p we perform 1000 replications. Computations are carried out using R 3.2.2 (R Core Team, 2015) and are fully reproducible.

3.3.1 Data generating processes

We consider two data generating processes (DGP), one based on a three-factor model and the other based on randomly generated covariance matrices.

The factor-based DGP, denoted Factor, is a replication of the data generating process of Fan et al. (2008). The process is based on 3 randomly generated factors and loadings with a Gaussian distribution and known moments. The same factors are used for every value of p considered; new loadings and standard deviations of the innovations are generated for each value of p. The innovations are iid Gaussian with Gamma-distributed standard errors.

The random covariance matrix DGP uses the estimated moments of the excess returns from 466 stocks of the S&P 500 observed from March 17, 2011 to March 17, 2016 (1257 observations), using the 13-week (3-month) treasury bill to compute the risk-free returns. We use the estimated moments to generate a vector of means and a covariance matrix for the returns for each value of p as follows:

1. Generate p means for the excess returns, µ_RCV, from N(µ, σ²).

2. Generate p standard deviations σ_1, ..., σ_p for the errors by sampling from a Gamma distribution G(α, β).

3. Generate off-diagonal elements for the covariance matrix by drawing from a Gaussian distribution. Fill 10% of the upper diagonal with these random numbers (the other entries being zero) and ensure symmetry of the lower diagonal.

4. Ensure positive definiteness of the constructed covariance matrix by eigenvalue cleaning as in Callot et al. (2016) and Hautsch et al. (2012). To implement this procedure, perform a spectral decomposition of the form Σ = V'AV, where V is the matrix of eigenvectors and A the (diagonal) matrix of eigenvalues; replace the negative eigenvalues, as well as those below 10^{-6}, by the smallest eigenvalue larger than 10^{-6}, yielding Ã, and construct Σ̃ = V'ÃV.

5. Generate the matrix of excess returns by drawing from N(µ_RCV, Σ̃).
The covariance matrix Σ̃ is not sparse if eigenvalue cleaning was performed, and in all cases the inverse covariance matrix is not sparse.
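Step 4 above (eigenvalue cleaning) can be sketched as:

```python
# Sketch of the eigenvalue-cleaning step: replace eigenvalues below the floor
# (including negative ones) by the smallest eigenvalue above the floor.
import numpy as np

def eigen_clean(Sigma, floor=1e-6):
    vals, vecs = np.linalg.eigh((Sigma + Sigma.T) / 2.0)
    ok = vals > floor
    vals = np.where(ok, vals, vals[ok].min())   # assumes some eigenvalue > floor
    return (vecs * vals) @ vecs.T               # reconstruct Σ̃ = V' Ã V

rng = np.random.default_rng(4)
p = 50
M = rng.normal(size=(p, p))
Sigma_raw = (M + M.T) / 2.0                     # symmetric but indefinite
Sigma_pd = eigen_clean(Sigma_raw)
min_eig = np.linalg.eigvalsh(Sigma_pd).min()    # now strictly positive
```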


3.3.2 Portfolio weights estimation

We consider the global minimum variance portfolio as well as the mean-variance portfolios introduced by Markowitz (1952). For the Markowitz portfolios we use two return targets: a yearly target of 10% and a daily target of 0.0378%, corresponding to a 10% return when compounded over 252 days. Closed-form solutions for the portfolio allocation vectors and their variances are given in section 2.3.2 for the Markowitz portfolio and in section 2.3.1 for the global minimum variance portfolio. Monthly results were also calculated; they are very similar to the yearly results and are not reported.
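These closed-form solutions can be sketched in terms of Θ = Σ^{-1} as follows (hypothetical inputs; the weight formulas follow the expressions used in sections 2.3.1-2.3.2):

```python
# Sketch: global minimum variance weights w_u = Θ1/(1'Θ1) and Markowitz
# weights w* for a return target rho, both written with Θ = Σ^{-1}.
import numpy as np

def gmv_weights(Theta):
    ones = np.ones(Theta.shape[0])
    w = Theta @ ones
    return w / (ones @ w)

def markowitz_weights(Theta, mu, rho):
    ones = np.ones(Theta.shape[0])
    A = ones @ Theta @ ones
    B = ones @ Theta @ mu
    D = mu @ Theta @ mu
    denom = A * D - B ** 2
    return ((D - rho * B) / denom) * (Theta @ ones) \
         + ((rho * A - B) / denom) * (Theta @ mu)

rng = np.random.default_rng(5)
p = 30
G = rng.normal(size=(p, 3 * p))
Sigma = G @ G.T / (3 * p)                 # well-conditioned toy covariance
Theta = np.linalg.inv(Sigma)
mu = rng.normal(0.05, 0.02, size=p)

w_u = gmv_weights(Theta)
w_star = markowitz_weights(Theta, mu, rho=0.05)
```

By construction both weight vectors sum to one, and the Markowitz weights satisfy the return constraint µ'w* = ρ exactly.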

3.4 Simulation results

[Figure 1: Portfolio variance estimation error, mean absolute error. Panels: Markowitz yearly target, Markowitz daily target, and global minimum variance, for the Factor DGP (top row) and the random covariance matrix DGP (bottom row), over p = 25, ..., 500; estimators: POET and nodewise.]

Figure 1 reports the mean absolute error (MAE) of the portfolio variance, |ŵ'Σ̂ŵ − w*'Σw*|. For both DGPs the variance estimation error is nearly identical for the POET and nodewise estimators when considering the yearly target (leftmost column of Figure 1). When considering the daily target and the global minimum variance portfolios, a different picture emerges.


With the factor DGP, portfolios estimated using the (factor-based) POET estimator have a marginally lower variance estimation error than the portfolios based on the nodewise estimator, while with the random covariance matrix DGP, the nodewise estimator allows for the construction of portfolios with a much lower variance estimation error irrespective of the value of p. In all panels of Figure 1 the variance estimation error decreases when the number of assets increases. This verifies our theorems, in contrast with Fan et al. (2008).

Figure 2 reports the MAE of the portfolio weight estimation error, ||ŵ − w||_1. The results are similar to those found in Figure 1. When the return target is high, both estimators yield comparable portfolios with relatively large errors under both DGPs. When the target is lower (daily), or when we consider the global minimum variance portfolio, the weight estimation error of the nodewise estimator is significantly smaller than that of the POET estimator under the random covariance matrix DGP.

[Figure 2: Portfolio weights estimation error, mean absolute error. Same panel layout as Figure 1.]

Figure 3 reports the average exposure ||ŵ||_1 of the estimated portfolio weights. The upper row of Figure 3 shows that when simulating data with the factor-based DGP, the exposure of the portfolios based on the POET estimator is 5 to 10 times higher than that of the portfolios based on the nodewise estimator. This is also the case with the random covariance matrix DGP for p < 100;

[Figure 3: Portfolio exposure, mean absolute value. Same panel layout as Figure 1.]

for larger values of p, the exposure of the POET-based weights is close to, but still higher than, the exposure of the portfolios based on the nodewise estimator.

To summarize the results of this section, the nodewise regression approach outperforms the POET estimator (which is specifically designed for factor-model-based returns) under the random covariance matrix DGP, and it also performs well when the data generating process assumes a factor structure. Unreported simulations show that our method is robust to returns following a Student's t distribution, to changes in the information criterion used to select the penalty parameters, and to modifications of the random covariance matrix DGP.

4 Empirics

4.1 Performance Measures

In this section we perform an out-of-sample forecasting exercise. There are three measures: the first is the Sharpe ratio (SR from now on), the second is portfolio turnover, and we also investigate the cumulative wealth of the computed portfolios. We compare our approach against the POET estimator. We have also tried the Ledoit and Wolf (2004) based estimator and the three-factor model estimator of Fan et al. (2008). The Ledoit and Wolf (2004) and Fan et al. (2008) estimators do not provide very stable results on our measures of interest: they depend heavily on the portfolio optimization method and timeline. So we do not report them, but they are available from the authors on request. The SR is evaluated both with and without transaction costs.

We use a rolling-horizon method for out-of-sample forecasting. The sample of n observations is divided into two parts: the in-sample (training) part, (1 : n_I), and the out-of-sample (testing) part, (n_I + 1 : n). The rolling-window method works as follows: the portfolio weights are calculated in-sample for the period (1 : n_I) and denoted ŵ_{n_I}; this is multiplied by the return in period n_I + 1 to obtain the forecast portfolio return for period n_I + 1: ŵ'_{n_I} r_{n_I+1}. Then we roll the window forward by one period and form the portfolio weights for the period (2 : n_I + 1), denoted ŵ_{n_I+1}. This is again multiplied by the returns at period n_I + 2 to get the forecast for period n_I + 2: ŵ'_{n_I+1} r_{n_I+2}. We continue in this way until (and including) period n − 1, obtaining ŵ'_{n−1} r_n. So, in the case of no transaction costs, the out-of-sample average portfolio return across all out-of-sample observations is

µ̂_os = (1/(n − n_I)) Σ_{t=n_I}^{n−1} ŵ'_t r_{t+1},

and the out-of-sample variance is

σ̂²_os = (1/((n − n_I) − 1)) Σ_{t=n_I}^{n−1} (ŵ'_t r_{t+1} − µ̂_os)².
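The rolling-window scheme above can be sketched as follows; an equal-weight rule stands in for the estimated portfolio weights, purely for illustration.

```python
# Sketch of the rolling-window out-of-sample exercise: weights fitted on
# window (t-nI+1 .. t) are applied to the return at t+1, and the Sharpe
# ratio is formed from the resulting out-of-sample return series.
import numpy as np

def rolling_oos_returns(R, n_I, weight_fn):
    """R: n x p returns; weight_fn maps a window of returns to weights."""
    n = R.shape[0]
    out = []
    for t in range(n_I, n):                # forecast periods n_I+1 .. n
        w = weight_fn(R[t - n_I:t])
        out.append(w @ R[t])
    return np.array(out)

def sharpe(x):
    return x.mean() / x.std(ddof=1)

def equal_weights(window):                 # placeholder strategy for the sketch
    p = window.shape[1]
    return np.ones(p) / p

rng = np.random.default_rng(6)
R = rng.normal(0.001, 0.02, size=(120, 10))
oos = rolling_oos_returns(R, n_I=60, weight_fn=equal_weights)
sr = sharpe(oos)
```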


The Sharpe ratio is SR = µ̂_os/σ̂_os. For the transaction cost based Sharpe ratio, let c be the proportional transaction cost, chosen to be 50 basis points as in DeMiguel et al. (2009b). The excess portfolio return at time t with transaction costs is (see Ban et al. (2016))

Return_t = ŵ'_t r_{t+1} − c(1 + ŵ'_t r_{t+1}) Σ_{j=1}^p |ŵ_{t+1,j} − ŵ⁺_{t,j}|,

where ŵ⁺_{t,j} = ŵ_{t,j}(1 + R_{t+1,j})/(1 + R_{t+1,p}), R_{t+1,j} is the excess return plus the risk-free rate for the jth asset, and R_{t+1,p} is the portfolio excess return plus the risk-free rate. For this definition, see Li (2015). The SR with transaction costs is

SR_c = µ̂_{os,c}/σ̂_{os,c},

where

µ̂_{os,c} = (1/(n − n_I)) Σ_{t=n_I}^{n−1} Return_t,

and

σ̂²_{os,c} = (1/((n − n_I) − 1)) Σ_{t=n_I}^{n−1} (Return_t − µ̂_{os,c})².

The second measure that we analyze is the portfolio turnover (PT),

PT = (1/(n − n_I)) Σ_{t=n_I}^{n−1} Σ_{j=1}^p |ŵ_{t+1,j} − ŵ⁺_{t,j}|.

The last measure is wealth at period t with transaction costs, from Ban et al. (2016):

W_{t+1} = W_t (1 + ŵ'_t R_{t+1}) (1 − c Σ_{j=1}^p |ŵ_{t+1,j} − ŵ⁺_{t,j}|).

We also report wealth at period t without transaction costs, W_{t+1} = W_t(1 + ŵ'_t R_{t+1}). Note that in order to have a fair comparison with the S&P 500 returns, we use regular returns (excess returns plus the risk-free rate) in the cumulative wealth calculations.
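The turnover and wealth recursions can be sketched as below. One simplification: we use ŵ_t in place of the drift-adjusted ŵ⁺_t (i.e., the within-period weight drift is ignored), and the weight paths are randomly generated purely for illustration.

```python
# Sketch of portfolio turnover and the wealth recursion with a proportional
# transaction cost c; w_t stands in for the drift-adjusted w⁺_t.
import numpy as np

def turnover(W):
    """W: T x p matrix of successive weight vectors; average traded volume."""
    return np.abs(np.diff(W, axis=0)).sum(axis=1).mean()

def wealth_path(W, R, c=0.005, w0=1.0):
    """Compound wealth, paying cost c on each period's traded volume."""
    wealth = [w0]
    for t in range(W.shape[0] - 1):
        trade = np.abs(W[t + 1] - W[t]).sum()
        wealth.append(wealth[-1] * (1.0 + W[t] @ R[t]) * (1.0 - c * trade))
    return np.array(wealth)

rng = np.random.default_rng(7)
T, p = 72, 5
W = rng.dirichlet(np.ones(p), size=T)       # hypothetical long-only weight path
R = rng.normal(0.005, 0.03, size=(T, p))    # hypothetical monthly returns
pt = turnover(W)
path = wealth_path(W, R)                    # with transaction costs
path0 = wealth_path(W, R, c=0.0)            # without transaction costs
```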

4.2 Data

We illustrate the performance of our nodewise regression based estimator (NDW) and compare it with the POET estimator. We use an empirical dataset of p = 395 major constituents of the S&P 500 index. The full sample spans January 2000 to June 2016, at a monthly frequency. The in-sample period is January 2000 - June 2010, with n_I = 126 and p = 395, and the out-of-sample period is July 2010 - June 2016, with n − n_I = 72.¹ We then estimate expected returns and the covariance matrix based on a 6-year rolling window, and form global minimum variance and Markowitz portfolios. Portfolios are held for one month and rebalanced at the beginning of the next month. As a return constraint, we use a monthly target of 0.7974%, which is equivalent to a 10% yearly return when compounded.

4.3 Results

Table 1 reports the empirical results based on the POET and our NDW estimators. Panels A and B of Table 1 show the global minimum variance and Markowitz portfolio performances with and without transaction costs. For the global minimum variance portfolio, our NDW estimator yields higher portfolio returns and a better Sharpe ratio. Our estimator also yields lower out-of-sample portfolio variances than the POET on the aforementioned measures. We observe that the POET estimator has a lower turnover rate. The performance results are more striking for the Markowitz portfolios. For the Markowitz portfolio with transaction costs, the NDW-based Sharpe ratio is at least twice that of the POET-based one. Contrary to the global minimum variance portfolio, the NDW-based portfolio also attains better performance in terms of portfolio turnover.

                                    Return    Variance   SR        PT
Panel A: Global Minimum Portfolios
without transaction cost
  NDW                               0.01107   0.00106    0.33873   0.18633
  POET                              0.01060   0.00107    0.32270   0.17567
with transaction cost
  NDW                               0.01011   0.00108    0.30767   -
  POET                              0.00957   0.00108    0.29110   -
Panel B: Markowitz Portfolios
without transaction cost
  NDW                               0.00714   0.00153    0.18240   0.44438
  POET                              0.00519   0.00175    0.12419   0.50439
with transaction cost
  NDW                               0.00509   0.00154    0.12982   -
  POET                              0.00275   0.00181    0.06475   -

Table 1: Monthly Portfolio Performance

In Figure 4, we plot the cumulative wealth of the global minimum variance portfolios of the NDW and POET estimators relative to the benchmark over the out-of-sample forecast horizon. We use the passively managed S&P 500 index over the forecast horizon as the benchmark. In both figures, initial wealth is set to $1, which grows at the monthly return with and without transaction costs. We observe in Figure 4 that the NDW estimator is the superior procedure for the portfolios without transaction costs, dominating both the POET estimator and the S&P 500 index. With transaction costs, the NDW estimator is slightly better than the S&P 500 index; note that the S&P 500 index does not include any transaction costs here. The difference between the NDW and POET estimators remains large. One thing to remember is that our cumulative wealth exercise starts with $1; with large portfolios, the differences in nominal terms will be very large.

¹ We have computed other estimation window lengths, such as n − n_I = 60, 48, 36, 24, 12. The results show that POET does well only in the 12-month out-of-sample forecast.

[Figure 4: Cumulative wealth of POET and nodewise (NDW) relative to the benchmark (S&P 500), with initial wealth of $1, over 2011-2016. Top panel: global minimum variance without transaction costs; bottom panel: with transaction costs.]

5 Conclusion

In this paper, we analyze the variance of a large portfolio. We show that increasing the number of assets in a portfolio decreases the estimation error for the portfolio variance, unlike the previous literature. We propose an approximate inverse of the high-dimensional non-invertible covariance matrix using nodewise regression with an l1 penalty. Further, we show uniformly consistent nodewise regression estimation of the approximate inverse matrix, as well as consistent estimation of the gross exposure of the portfolio. We also investigate the results when the sub-Gaussianity assumption on the returns is relaxed. To our knowledge, such an estimation of an approximate inverse of a singular covariance matrix of returns has not been previously proposed in the finance literature.

We compare our nodewise estimator to the POET estimator in simulations and in an application. Our numerical results demonstrate that the nodewise estimator outperforms the POET estimator in terms of out-of-sample portfolio performance, achieving smaller variance and higher return and Sharpe ratio. The nodewise estimator based portfolio also accumulates more wealth over the out-of-sample forecast horizon than its competitors. Therefore, with limited observations, nodewise estimator based portfolios serve as promising investment strategies with a significant number of assets.

Appendix A

In this section we provide proofs. Here, we repeat the consistency of the inverse matrix estimation in the proof of Theorem 2.4 in van de Geer et al. (2014), with more clarifying steps for the reader.

Proof of Lemma 1. First, by (2.7),

||Θ̂Σ̂ − I_p||_∞ ≤ max_j λ_j/τ̂_j².   (5.1)

Then uniformly in j, λ_j = O(√(log p/n)). Also, by Theorem 2.4 of van de Geer et al. (2014) and the assumption max_j s_j √(log p/n) = o(1), we have |τ̂_j² − τ_j²| = o_p(1). Note that we define Θ = Σ^{-1}. Then, since τ_j² = (Θ_{j,j})^{-1} in van de Geer et al. (2014), via Assumption 3, ||Θ_j||_2 ≤ Λ_min^{-1} = O(1) uniformly in j. This last result implies min_j τ_j² > 0. Then by Assumption 3, τ_j² ≤ Σ_{j,j} = O(1) uniformly in j. Combining these last two components,

||Θ̂Σ̂ − I_p||_∞ = O_p(√(log p/n)) = o_p(1).

Q.E.D.

Before going through the main steps, we record a norm inequality used in all of the proofs. Take a generic p × p matrix M and a generic p × 1 vector x. Let M_j' denote the 1 × p jth row vector of M, and M_j its p × 1 transpose (the column version of M_j'). Then

||Mx||_1 = |M_1'x| + |M_2'x| + ... + |M_p'x|
         ≤ ||M_1||_1 ||x||_∞ + ||M_2||_1 ||x||_∞ + ... + ||M_p||_1 ||x||_∞
         = [Σ_{j=1}^p ||M_j||_1] ||x||_∞
         ≤ p max_j ||M_j||_1 ||x||_∞,   (5.2)

where we use Holder's inequality to get each inequality.

The following three lemmata are useful for the proof of Theorem 1. Before that, we need a result from Theorem 2.4 of van de Geer et al. (2014): an l1 bound on the estimation error between Θ̂_j and Θ_j. Under Assumptions 1-3, we have

||Θ̂_j − Θ_j||_1 = O_p(s_j √(log p/n)).   (5.3)

Proof of Lemma A.1. First, see that ˆ p = 10p (Θ ˆ − Θ)1p + 10p Θ1p . 10p Θ1

(5.4)

Now consider the first term on the right side of (5.4), note that for a vector kxk∞ is the absolute maximal element of that vector x, ˆ − Θ)1p | ≤ k(Θ ˆ − Θ)1p k1 k1p k∞ |10p (Θ ˆ j − Θj k1 ≤ pkΘ p = Op (p max sj logp/n) = op (p), j

(5.5)

where Holders inequality is used in the first inequality, and (5.2) is used for the second inequality

27

and the rate is from Theorem 2.4 of van de Geer et al. (2014) with (5.3), and the last equality is obtained by imposing Assumption 2. Q.E.D. Proof of Theorem 1. We consider ˆ

| Ap − Ap | 1 1 p| − | = . ˆ Aˆ A | Ap Ap |

(5.6)

First, use Assumption 3 to have, (where C0 = Eigmin(Σ−1 ) > 0, C0 is a positive constant, and it represents the minimal eigenvalue of Θ = Σ−1 ) A = 10p Σ−1 1p ≥ pC0 > 0, which shows A/p ≥ C0 > 0.

(5.7)

By Lemma A.1, we have p−1 |Aˆ − A| = op (1). Then use this last equation for the numerator in (5.6) p|

op (1) 1 1 − |≤ , ˆ (op (1) + C0 )C0 A A

(5.8)

where the denominator is bounded away from zero since A/p is bounded away from zero as shown in (5.7). Also use A/p = O(1) by Lemma A.4 and Â/p = A/p + o_p(1) by Lemma A.1 to get the denominator's rate and the result. Q.E.D.

The result below is needed for the subsequent lemmata. First, let r_t be the p × 1 vector of asset returns for all p assets at time t, and let µ be the p × 1 population mean return vector. This result can be obtained by using Lemma 14.16 of Bühlmann and van de Geer (2011), since our returns are deemed to be sub-Gaussian:

||n^{-1} Σ_{t=1}^n r_t − µ||_∞ = O_p(√(log p/n)) = o_p(1).   (5.9)

Note that we use the sample mean to predict the population mean. Before the next lemma, we define B̂ = 1_p'Θ̂μ̂ and B = 1_p'Θµ.

Lemma A.2. Under Assumptions 1-3, uniformly in j ∈ {1, ..., p},

(1/p)|B̂ − B| = o_p(1).

Proof of Lemma A.2. We can decompose B̂ by simple addition and subtraction into

B̂ = 1_p'Θ̂μ̂ = 1_p'(Θ̂ − Θ)(μ̂ − µ)   (5.10)
   + 1_p'(Θ̂ − Θ)µ   (5.11)
   + 1_p'Θ(μ̂ − µ)   (5.12)
   + 1_p'Θµ.   (5.13)

Now we analyze each of the terms above. Defining μ̂ = n^{-1} Σ_{t=1}^n r_t,

|1_p'(Θ̂ − Θ)(μ̂ − µ)| ≤ ||(Θ̂ − Θ)1_p||_1 ||μ̂ − µ||_∞
                      ≤ p [max_{1≤j≤p} ||Θ̂_j − Θ_j||_1] ||μ̂ − µ||_∞
                      = p O_p(max_j s_j √(log p/n)) O_p(√(log p/n)),   (5.14)

where we use Holder's inequality in the first inequality, the norm inequality (5.2) with M = Θ̂ − Θ, x = 1_p in the second inequality, and the rate is by (5.3) and (5.9).

Next we consider (5.11). Note that by Assumption 1, ||µ||_∞ < C < ∞, where C is a positive constant. Then

|1_p'(Θ̂ − Θ)µ| ≤ ||(Θ̂ − Θ)1_p||_1 ||µ||_∞
               ≤ C p [max_{1≤j≤p} ||Θ̂_j − Θ_j||_1]
               = C p O_p(max_j s_j √(log p/n)),   (5.15)

where we use Holder's inequality in the first inequality, the norm inequality (5.2) with M = Θ̂ − Θ, x = 1_p in the second inequality, and the rate is by (5.3).

Before the next step, we need an additional result. First, by the proof of Theorem 2.4 in van de Geer et al. (2014), or the proof of Lemma 5.3 of van de Geer et al. (2014), we have ||γ_j||_1 = O(√s_j). Note that Θ_j = C_j/τ_j², and recall that, taking j = 1 without losing any generality, C_1 = (1, −γ_1). By our Assumption 2, min_j τ_j² > 0. So

max_j ||Θ_j||_1 = O(max_j √s_j).   (5.16)

Now consider (5.12):

|1_p'Θ(μ̂ − µ)| ≤ ||Θ1_p||_1 ||μ̂ − µ||_∞
              ≤ p [max_{1≤j≤p} ||Θ_j||_1] ||μ̂ − µ||_∞
              = p O_p(max_j √s_j) O_p(√(log p/n)),   (5.17)

where we use Holder's inequality in the first inequality, the norm inequality (5.2) with M = Θ, x = 1_p in the second inequality, and the rate is from (5.16) and (5.9).

Combining (5.14), (5.15), (5.17) in (5.10)-(5.13), and noting that the largest rate comes from (5.15), we use Assumption 2, max_j s_j √(log p/n) = o(1), to obtain

p^{-1}|B̂ − B| = O_p(max_j s_j √(log p/n)) = o_p(1).   (5.18)

Q.E.D.

.Q.E.D. Next, we show the uniform consistency of another term in the estimated optimal weights. Note ˆ =µ ˆ µ. that D = µ0 Θµ, and its estimator is D ˆ0 Θˆ Lemma A.3.Under Assumptions 1-3, uniformly in j ∈ {1, · · · , p} ˆ − D| = op (1). p−1 |D

Proof of Lemma A.3. By simple addition and subtraction ˆ − D = (ˆ ˆ − Θ)(ˆ D µ − µ)0 (Θ µ − µ)

(5.19)

+ (ˆ µ − µ)0 Θ(ˆ µ − µ)

(5.20)

+ 2(ˆ µ − µ)0 Θµ

(5.21)

0

ˆ − Θ)(ˆ + 2µ (Θ µ − µ)

(5.22)

0

ˆ − Θ)µ. + µ (Θ

(5.23)

We start with (5.19). ˆ − Θ)(ˆ ˆ − Θ)(ˆ |(ˆ µ − µ)0 (Θ µ − µ)| ≤ k(Θ µ − µ)k1 kˆ µ − µk∞ ˆ j − Θj k1 ] ≤ p[kˆ µ − µk∞ ]2 [max kΘ j p = pOp (logp/n)Op (max sj logp/n) j

= Op (p max sj (logp/n)3/2 ), j

(5.24)

where Holder’s inequality is used for the first inequality above, and the inequality (5.2), with ˆ − Θ and x = µ M =Θ ˆ − µ for the second inequality above, and for the rates we use (5.3), (5.9).

30

We continue with (5.20). |(ˆ µ − µ)0 (Θ)(ˆ µ − µ)| ≤ k(Θ)(ˆ µ − µ)k1 kˆ µ − µk∞ ≤ p[kˆ µ − µk∞ ]2 [max kΘj k1 ] j √ = pOp (logp/n)Op (max sj ) j √ = Op (p max sj (logp/n)), j

(5.25)

where Holder’s inequality is used for the first inequality above, and the inequality (5.2), with M = Θ and x = µ ˆ − µ for the second inequality above, for the rates we use (5.9), (5.16). Then we consider (5.21), with using kµk∞ ≤ C, µ − µ)k1 kµk∞ |(ˆ µ − µ)0 (Θ)(µ)| ≤ k(Θ)(ˆ ≤ Cp[kˆ µ − µk∞ ][max kΘj k1 ] j p √ = pOp ( logp/n)Op (max sj ) j



= Op (p max sj (logp/n)1/2 ), j

(5.26)

where Holder’s inequality is used for the first inequality above, and the inequality (5.2), with M = Θ and x = µ ˆ − µ for the second inequality above, for the rates we use (5.9), (5.16). Then we consider (5.22). ˆ − Θ)(ˆ ˆ − Θ)(µ)k1 kˆ |(µ)0 (Θ µ − µ)| ≤ k(Θ µ − µk∞ ˆ j − Θj k1 kˆ ≤ pkµk∞ max kΘ µ − µk∞ j

ˆ j − Θj k1 ]k(ˆ ≤ Cp[max kΘ µ − µ)k∞ j p p = pOp (max sj logp/n)Op ( logp/n) j

= Op (p max sj logp/n), j

(5.27)

where Holder’s inequality is used for the first inequality above, and the inequality (5.2), with ˆ − Θ and x = µ for the second inequality above, and for the third inequality above we use M =Θ kµk∞ ≤ C, and for the rates we use (5.3), (5.9).

31

Then we consider (5.23), ˆ − Θ)(µ)| ≤ k(Θ ˆ − Θ)(µ)k1 kµk∞ |(µ)0 (Θ ˆ j − Θj k1 ≤ p[kµk∞ ]2 max kΘ j

ˆ j − Θj k1 ] ≤ Cp[max kΘ j p = pOp (max sj logp/n) j

= Op (p max sj (logp/n)1/2 ), j

(5.28)

where Holder’s inequality is used for the first inequality above, and the inequality (5.2), with ˆ − Θ and x = µ for the second inequality above, and for the third inequality above we use M =Θ kµk∞ ≤ C, and for the rate we use (5.3). Note that the last rate above in (5.28) derives our result, since it is the largest rate by Assumption 2. Combine (5.24)-(5.28) in (5.19)-(5.23) and the rate in (5.28) to have ˆ − D| = Op (max sj p−1 |D

p

j

logp/n) = op (1).

(5.29)

Q.E.D.

The following lemma establishes the orders of the terms $A$, $B$, $D$ in the optimal weights. This is useful for understanding the implications of the assumptions in Theorems 1-2. Note that both $A$ and $D$ are positive by Assumption 3, and bounded away from zero.

Lemma A.4. Under Assumptions 1 and 3, $A = O(p)$, $|B| = O(p)$, $D = O(p)$.

Proof of Lemma A.4. Note that by Assumption 3, since $\Sigma^{-1}=\Theta$ is positive definite, $A$ and $D$ are both positive. We give the proof for the term $D = \mu'\Theta\mu$; the proof for $A = 1_p'\Theta 1_p$ is the same. We have
$$D = \mu'\Theta\mu \le \mathrm{Eigmax}(\Theta)\|\mu\|_2^2 = O(p),$$
where we use the fact that each $\mu_j$ is a constant by Assumption 1, and that the maximal eigenvalue of $\Theta = \Sigma^{-1}$ is finite by Assumption 3. For the term $B$, the proof follows by first applying the Cauchy-Schwarz inequality and then using the same analysis as for $A$ and $D$. Q.E.D.

Next we need the following technical lemma, which provides the limit and the rate for the denominator in the optimal portfolio weights.
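Lemma A.4 can be illustrated with a quick numerical sketch. The identity precision matrix and the constant mean vector below are stand-in assumptions chosen only to satisfy Assumptions 1 and 3 (bounded eigenvalues of $\Theta$, bounded means); they are not the paper's estimator. The point is that $A/p$, $B/p$, $D/p$ stay constant as $p$ grows:

```python
import numpy as np

# Sketch of Lemma A.4: with eigenvalues of Theta bounded away from 0 and
# infinity (identity here) and bounded means, A, B, D all grow like O(p).
for p in (100, 200, 400):
    Theta = np.eye(p)          # stand-in precision matrix (Assumption 3)
    mu = np.full(p, 0.05)      # bounded expected returns (Assumption 1)
    ones = np.ones(p)
    A = ones @ Theta @ ones    # A = 1_p' Theta 1_p
    B = ones @ Theta @ mu      # B = 1_p' Theta mu
    D = mu @ Theta @ mu        # D = mu' Theta mu
    print(p, A / p, B / p, D / p)   # ratios stay at 1.0, 0.05, 0.0025
```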

Lemma A.5. Under Assumptions 1-3, uniformly over $j$, with $\lambda_j = O(\sqrt{\log p/n})$,
$$p^{-2}|(\hat{A}\hat{D}-\hat{B}^2)-(AD-B^2)| = o_p(1).$$

Proof of Lemma A.5. Note that by simple adding and subtracting,
$$\hat{A}\hat{D}-\hat{B}^2 = [(\hat{A}-A)+A][(\hat{D}-D)+D] - [(\hat{B}-B)+B]^2.$$
Using this and simplifying,
$$\begin{aligned}
p^{-2}|(\hat{A}\hat{D}-\hat{B}^2)-(AD-B^2)| &\le p^{-2}\{|\hat{A}-A||\hat{D}-D| + |\hat{A}-A||D| + |A||\hat{D}-D| \\
&\qquad\quad + (\hat{B}-B)^2 + 2|B||\hat{B}-B|\} \\
&= O_p\big(\max_j s_j\sqrt{\log p/n}\big) = o_p(1),
\end{aligned} \tag{5.30}$$
where we use (5.5), (5.18), (5.29), Lemma A.4, and Assumption 2: $\max_j s_j\sqrt{\log p/n}=o(1)$. Q.E.D.

Proof of Theorem 2. We first define notation to help us in the proof. Set
$$\hat{x} = \hat{A}\rho^2 - 2\hat{B}\rho + \hat{D}, \tag{5.31}$$
$$x = A\rho^2 - 2B\rho + D, \tag{5.32}$$
$$\hat{y} = \hat{A}\hat{D} - \hat{B}^2, \tag{5.33}$$
$$y = AD - B^2. \tag{5.34}$$
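The quantities just defined can be checked numerically. The sketch below uses made-up inputs (a random well-conditioned $\Sigma$, an arbitrary $\mu$ and target return $\rho$) to verify the identity behind Theorem 2: for $\Theta = \Sigma^{-1}$, the variance of the mean-variance optimal portfolio $w^* = \Delta_1\Theta 1_p + \Delta_2\Theta\mu$ equals $x/y$.

```python
import numpy as np

# Illustrative check (made-up inputs): for Theta = Sigma^{-1}, the optimal
# portfolio variance w*' Sigma w* equals x / y as defined in (5.31)-(5.34).
rng = np.random.default_rng(0)
p = 20
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p + np.eye(p)      # a well-conditioned covariance matrix
Theta = np.linalg.inv(Sigma)
mu = 0.05 + 0.01 * rng.standard_normal(p)
rho = 0.06                           # target portfolio return
ones = np.ones(p)

A = ones @ Theta @ ones
B = ones @ Theta @ mu
D = mu @ Theta @ mu
x = A * rho**2 - 2 * B * rho + D
y = A * D - B**2

# Optimal weights w* = Delta1 * Theta 1_p + Delta2 * Theta mu
w = ((D - rho * B) * (Theta @ ones) + (rho * A - B) * (Theta @ mu)) / y
print(abs(w @ Sigma @ w - x / y))    # ~ 0 up to floating-point error
```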

Then we can write the estimate of the optimal portfolio variance as
$$\hat{\Psi}_{OPV} = \frac{\hat{x}}{\hat{y}},$$
and the optimal portfolio variance is
$$\Psi_{OPV} = \frac{x}{y}.$$
So we can set up the problem, by adding and subtracting $xy$ in the numerator, as
$$\begin{aligned}
p|\hat{\Psi}_{OPV}-\Psi_{OPV}| &= p\Big|\frac{\hat{x}}{\hat{y}}-\frac{x}{y}\Big| = p\Big|\frac{\hat{x}y - xy + xy - x\hat{y}}{\hat{y}y}\Big| \\
&= \Big|\frac{p^{-3}\{(\hat{x}-x)y + x(y-\hat{y})\}}{p^{-4}\{\hat{y}y\}}\Big| \\
&\le \Bigg|\frac{\frac{|\hat{x}-x|}{p}\frac{|y|}{p^2} + \frac{|x|}{p}\frac{|y-\hat{y}|}{p^2}}{\frac{\hat{y}}{p^2}\frac{y}{p^2}}\Bigg|.
\end{aligned} \tag{5.35}$$

We consider each term in the numerator of (5.35). Via Lemmas A.1-A.3, and $\rho$ being bounded,
$$\begin{aligned}
p^{-1}|\hat{x}-x| &= p^{-1}|\hat{A}\rho^2 - 2\hat{B}\rho + \hat{D} - (A\rho^2 - 2B\rho + D)| \\
&\le p^{-1}\{|\hat{A}-A|\rho^2 + 2|\hat{B}-B|\rho + |\hat{D}-D|\} = o_p(1).
\end{aligned} \tag{5.36}$$
Now analyze the following term in the numerator:
$$p^{-2}|y| = p^{-2}|AD-B^2| \le p^{-2}AD = O(1), \tag{5.37}$$
where we use $B^2 > 0$ in the inequality, and Lemma A.4 for the rate result in the final equality of (5.37). Next, consider the following term in the numerator:
$$p^{-1}|x| = p^{-1}|A\rho^2 - 2B\rho + D| \le p^{-1}(A\rho^2 + 2|B|\rho + D) = O(1), \tag{5.38}$$
where we use $A$ and $D$ being positive, Lemma A.4, and $\rho$ being bounded. Then Lemma A.5 provides
$$p^{-2}|\hat{y}-y| = p^{-2}|\hat{A}\hat{D}-\hat{B}^2-(AD-B^2)| = o_p(1). \tag{5.39}$$
So the numerator in (5.35) satisfies
$$\frac{|\hat{x}-x|}{p}\frac{|y|}{p^2} + \frac{|x|}{p}\frac{|y-\hat{y}|}{p^2} = o_p(1), \tag{5.40}$$
where we use (5.36)-(5.39). We consider the denominator in (5.35):
$$\frac{1}{\big|\frac{\hat{y}}{p^2}\frac{y}{p^2}\big|} = \frac{1}{\big|\frac{(\hat{y}-y)+y}{p^2}\big|\frac{y}{p^2}} = \frac{1}{\big(o_p(1)+\frac{y}{p^2}\big)\frac{y}{p^2}} \le \frac{1}{(o_p(1)+C_1)C_1}, \tag{5.41}$$
where we add and subtract $y$ in the first equality, use Lemma A.5 in the second equality, and $p^{-2}y = p^{-2}(AD-B^2) \ge C_1 > 0$, with $C_1$ a positive constant by assumption. Next, combine (5.40) and (5.41) in (5.35) to obtain
$$p|\hat{\Psi}_{OPV}-\Psi_{OPV}| \le \frac{o_p(1)}{C_1^2 + o_p(C_1)} = o_p(1).$$
Q.E.D.

Proof of Theorem 3. We start with the definition of the global minimum variance portfolio

weight vector estimate
$$\hat{w}_u = \frac{\hat{\Theta}1_p}{1_p'\hat{\Theta}1_p},$$
so that we can write
$$\hat{w}_u - w_u = \frac{\hat{\Theta}1_p}{1_p'\hat{\Theta}1_p} - \frac{\Theta 1_p}{1_p'\Theta 1_p}.$$
Using the definitions $\hat{A} = 1_p'\hat{\Theta}1_p$ and $A = 1_p'\Theta 1_p$, and adding and subtracting $A\Theta 1_p$ in the numerator below,
$$\hat{w}_u - w_u = \frac{p^{-2}[(A\hat{\Theta}1_p) - (\hat{A}\Theta 1_p)]}{p^{-2}(\hat{A}A)} = \frac{p^{-2}[(A\hat{\Theta}1_p) - (A\Theta 1_p) + (A\Theta 1_p) - (\hat{A}\Theta 1_p)]}{p^{-2}(\hat{A}A)}.$$
Using the above result,
$$\|\hat{w}_u - w_u\|_1 \le \frac{\frac{A}{p}\frac{\|(\hat{\Theta}-\Theta)1_p\|_1}{p} + \frac{\|\Theta 1_p\|_1}{p}\frac{|\hat{A}-A|}{p}}{\frac{\hat{A}}{p}\frac{A}{p}}. \tag{5.42}$$
Then consider the numerator in (5.42). By (5.59) and (5.60),
$$\|(\hat{\Theta}-\Theta)1_p\|_1 = O_p\big(p\max_j s_j\sqrt{\log p/n}\big), \tag{5.43}$$
$$\|\Theta 1_p\|_1 = O\big(p\max_j\sqrt{s_j}\big), \tag{5.44}$$
and via Lemma A.4, $A = O(p)$; also, by (5.5),
$$p^{-1}|\hat{A}-A| = O_p\big(\max_j s_j\sqrt{\log p/n}\big). \tag{5.45}$$
By (5.43)-(5.45),
$$\begin{aligned}
\frac{A}{p}\frac{\|(\hat{\Theta}-\Theta)1_p\|_1}{p} + \frac{\|\Theta 1_p\|_1}{p}\frac{|\hat{A}-A|}{p}
&= O(1)\,O_p\big(\max_j s_j\sqrt{\log p/n}\big) + O\big(\max_j\sqrt{s_j}\big)\,O_p\big(\max_j s_j\sqrt{\log p/n}\big) \\
&= O_p\big((\max_j s_j)^{3/2}\sqrt{\log p/n}\big) = o_p(1),
\end{aligned} \tag{5.46}$$
where we use the sparsity assumption $(\max_j s_j)^{3/2}\sqrt{\log p/n} = o(1)$ in the last step. Then, for the denominator in (5.42), from (5.6)-(5.8) we have, for a positive constant $C_0 > 0$,
$$\frac{\hat{A}}{p}\frac{A}{p} \ge (o_p(1)+C_0)C_0. \tag{5.47}$$

Now combine (5.46) and (5.47) in (5.42) to obtain the desired result. Q.E.D.

Proof of Theorem 4. Denote $w^* = \Delta_1\Theta 1_p + \Delta_2\Theta\mu$, where
$$\Delta_1 = \frac{D-\rho B}{AD-B^2}, \qquad \Delta_2 = \frac{\rho A - B}{AD-B^2}.$$
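As a numerical sketch of what these weights deliver (illustrative inputs only: the identity precision matrix and the particular $\mu$, $\rho$ below are made up), $w^* = \Delta_1\Theta 1_p + \Delta_2\Theta\mu$ satisfies the two Markowitz constraints by construction:

```python
import numpy as np

# Illustrative check: w* = Delta1 * Theta 1_p + Delta2 * Theta mu satisfies
# the budget constraint (weights sum to 1) and the target-return constraint.
rng = np.random.default_rng(1)
p = 50
Theta = np.eye(p)                          # stand-in precision matrix
mu = 0.05 + 0.01 * rng.standard_normal(p)  # made-up expected returns
rho = 0.06                                 # target portfolio return
ones = np.ones(p)

A = ones @ Theta @ ones
B = ones @ Theta @ mu
D = mu @ Theta @ mu
denom = A * D - B**2
Delta1 = (D - rho * B) / denom
Delta2 = (rho * A - B) / denom
w = Delta1 * (Theta @ ones) + Delta2 * (Theta @ mu)

print(w.sum())   # = 1   (budget constraint, up to floating-point error)
print(w @ mu)    # = rho (target-return constraint, up to floating-point error)
```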

Next, denote $\hat{w} = \hat{\Delta}_1\hat{\Theta}1_p + \hat{\Delta}_2\hat{\Theta}\hat{\mu}$, where $\hat{\Delta}_1, \hat{\Delta}_2$ represent the estimators of $\Delta_1, \Delta_2$, obtained by replacing $A$, $B$, $D$ in the formulas for $\Delta_1, \Delta_2$ with the estimators shown in the theorems above. Next, by adding and subtracting,
$$\begin{aligned}
\hat{w}-w^* &= \hat{\Delta}_1\hat{\Theta}1_p + \hat{\Delta}_2\hat{\Theta}\hat{\mu} - \Delta_1\Theta 1_p - \Delta_2\Theta\mu \\
&= [(\hat{\Delta}_1-\Delta_1)+\Delta_1][(\hat{\Theta}-\Theta)+\Theta]1_p \\
&\quad + [(\hat{\Delta}_2-\Delta_2)+\Delta_2][(\hat{\Theta}-\Theta)+\Theta][(\hat{\mu}-\mu)+\mu] - \Delta_1\Theta 1_p - \Delta_2\Theta\mu \\
&= (\hat{\Delta}_1-\Delta_1)(\hat{\Theta}-\Theta)1_p + (\hat{\Delta}_1-\Delta_1)\Theta 1_p + \Delta_1(\hat{\Theta}-\Theta)1_p \\
&\quad + (\hat{\Delta}_2-\Delta_2)(\hat{\Theta}-\Theta)(\hat{\mu}-\mu) + (\hat{\Delta}_2-\Delta_2)\Theta(\hat{\mu}-\mu) + (\hat{\Delta}_2-\Delta_2)(\hat{\Theta}-\Theta)\mu \\
&\quad + (\hat{\Delta}_2-\Delta_2)\Theta\mu + \Delta_2(\hat{\Theta}-\Theta)(\hat{\mu}-\mu) + \Delta_2\Theta(\hat{\mu}-\mu) + \Delta_2(\hat{\Theta}-\Theta)\mu.
\end{aligned} \tag{5.48}$$

Using (5.48), and since $\hat{\Delta}_1, \Delta_1, \hat{\Delta}_2, \Delta_2$ are all scalars,
$$\begin{aligned}
\|\hat{w}-w^*\|_1 &\le |\hat{\Delta}_1-\Delta_1|\|(\hat{\Theta}-\Theta)1_p\|_1 + |\hat{\Delta}_1-\Delta_1|\|\Theta 1_p\|_1 + |\Delta_1|\|(\hat{\Theta}-\Theta)1_p\|_1 \\
&\quad + |\hat{\Delta}_2-\Delta_2|\|(\hat{\Theta}-\Theta)(\hat{\mu}-\mu)\|_1 + |\hat{\Delta}_2-\Delta_2|\|\Theta(\hat{\mu}-\mu)\|_1 \\
&\quad + |\hat{\Delta}_2-\Delta_2|\|(\hat{\Theta}-\Theta)\mu\|_1 + |\hat{\Delta}_2-\Delta_2|\|\Theta\mu\|_1 \\
&\quad + |\Delta_2|\|(\hat{\Theta}-\Theta)(\hat{\mu}-\mu)\|_1 + |\Delta_2|\|\Theta(\hat{\mu}-\mu)\|_1 + |\Delta_2|\|(\hat{\Theta}-\Theta)\mu\|_1.
\end{aligned} \tag{5.49}$$

We consider each term above. Rather than analyzing them one by one, we bound their common elements and then determine the order of each term on the right side of (5.49). Using the definitions of $\hat{y}$, $y$ in (5.33)-(5.34), and adding and subtracting $y(D-\rho B)$ in the

numerator, with $\rho$ being bounded,
$$\begin{aligned}
p|\hat{\Delta}_1-\Delta_1| &= p\Big|\frac{y(\hat{D}-\rho\hat{B}) - \hat{y}(D-\rho B)}{\hat{y}y}\Big| \\
&= p\Big|\frac{y(\hat{D}-\rho\hat{B}) - y(D-\rho B) + y(D-\rho B) - \hat{y}(D-\rho B)}{\hat{y}y}\Big| \\
&\le \frac{\frac{|y|}{p^2}\frac{|\hat{D}-D|}{p} + \frac{|y|}{p^2}|\rho|\frac{|\hat{B}-B|}{p} + \frac{|y-\hat{y}|}{p^2}\frac{|D-\rho B|}{p}}{\frac{\hat{y}}{p^2}\frac{y}{p^2}}.
\end{aligned} \tag{5.50}$$
Now we analyze each term in the numerator. By Lemma A.4,
$$\frac{|y|}{p^2} = \frac{|AD-B^2|}{p^2} \le \frac{AD}{p^2} = O(1). \tag{5.51}$$
Next, by (5.18) and (5.29),
$$\frac{|y|}{p^2}\frac{|\hat{D}-D|}{p} + \frac{|y|}{p^2}|\rho|\frac{|\hat{B}-B|}{p} = O_p\big(\max_j s_j\sqrt{\log p/n}\big). \tag{5.52}$$
Then
$$\frac{|y-\hat{y}|}{p^2}\frac{|D-\rho B|}{p} \le \left(\frac{|\hat{A}\hat{D}-\hat{B}^2-(AD-B^2)|}{p^2}\right)\left(\frac{D}{p} + |\rho|\frac{|B|}{p}\right) = O_p\big(\max_j s_j\sqrt{\log p/n}\big), \tag{5.53}$$
where we use the definitions of $y$, $\hat{y}$ in the inequality, and Lemma A.4 with (5.30) for the rate. Now combine (5.52) and (5.53) in the numerator of (5.50) to obtain
$$\frac{|y|}{p^2}\frac{|\hat{D}-D|}{p} + \frac{|y|}{p^2}|\rho|\frac{|\hat{B}-B|}{p} + \frac{|y-\hat{y}|}{p^2}\frac{|D-\rho B|}{p} = O_p\big(\max_j s_j\sqrt{\log p/n}\big). \tag{5.54}$$
Then combining (5.41) and (5.54),
$$p|\hat{\Delta}_1-\Delta_1| = O_p\big(\max_j s_j\sqrt{\log p/n}\big),$$
which simply implies
$$|\hat{\Delta}_1-\Delta_1| = O_p\Big(\frac{\max_j s_j}{p}\sqrt{\log p/n}\Big). \tag{5.55}$$
Exactly following the same steps, we derive
$$|\hat{\Delta}_2-\Delta_2| = O_p\Big(\frac{\max_j s_j}{p}\sqrt{\log p/n}\Big). \tag{5.56}$$
Next, consider
$$p|\Delta_1| = p\Big|\frac{D-\rho B}{AD-B^2}\Big| = \frac{\frac{|D-\rho B|}{p}}{\frac{AD-B^2}{p^2}} \le \frac{\frac{D}{p}+|\rho|\frac{|B|}{p}}{C_1} = O(1),$$
where we use Lemma A.4 to get $D = O(p)$ and $|B| = O(p)$, $\rho$ being bounded, and $p^{-2}(AD-B^2) \ge C_1 > 0$ as in (5.41). In the same way we obtain $p|\Delta_2| = O(1)$. These last two equations provide
$$|\Delta_1| = O(1/p), \tag{5.57}$$
$$|\Delta_2| = O(1/p). \tag{5.58}$$

Next, consider
$$\|(\hat{\Theta}-\Theta)1_p\|_1 \le p\max_j\|\hat{\Theta}_j-\Theta_j\|_1 = O_p\big(p\max_j s_j\sqrt{\log p/n}\big), \tag{5.59}$$
where we use (5.2) for the inequality and (5.3) for the rate result in (5.59). Now we analyze
$$\|\Theta 1_p\|_1 \le p\max_j\|\Theta_j\|_1 = O\big(p\max_j\sqrt{s_j}\big), \tag{5.60}$$
where the inequality is obtained by (5.2) and the rate is derived from (5.16). Next, we consider the following term:
$$\|(\hat{\Theta}-\Theta)(\hat{\mu}-\mu)\|_1 \le p\|\hat{\mu}-\mu\|_\infty\max_j\|\hat{\Theta}_j-\Theta_j\|_1 = O_p\big(p\max_j s_j\log p/n\big), \tag{5.61}$$
where we use (5.2) for the inequality and the rate is derived from (5.3) and (5.9). Then, given $\|\mu\|_\infty \le C$,
$$\|(\hat{\Theta}-\Theta)\mu\|_1 \le Cp\max_j\|\hat{\Theta}_j-\Theta_j\|_1 = O_p\big(p\max_j s_j\sqrt{\log p/n}\big), \tag{5.62}$$
where we use (5.2) for the inequality and the rate is derived from (5.3). Note that
$$\|\Theta\mu\|_1 = O\big(p\max_j\sqrt{s_j}\big), \tag{5.63}$$
by the same analysis as in (5.60). Next,
$$\|\Theta(\hat{\mu}-\mu)\|_1 \le p\|\hat{\mu}-\mu\|_\infty\max_j\|\Theta_j\|_1 = p\,O_p\big(\sqrt{\log p/n}\big)\,O\big(\max_j\sqrt{s_j}\big) = O_p\big(p\max_j\sqrt{s_j}\sqrt{\log p/n}\big), \tag{5.64}$$

where we use (5.2) for the inequality and (5.9), (5.16) for the rates. Use (5.55)-(5.64) in (5.49) to obtain
$$\begin{aligned}
\|\hat{w}-w^*\|_1 &= O_p\big((\max_j s_j)^2\log p/n\big) + O_p\big((\max_j s_j)^{3/2}(\log p/n)^{1/2}\big) + O_p\big(\max_j s_j(\log p/n)^{1/2}\big) \\
&\quad + O_p\big((\max_j s_j)^2(\log p/n)^{3/2}\big) + O_p\big((\max_j s_j)^{3/2}\log p/n\big) + O_p\big((\max_j s_j)^2\log p/n\big) \\
&\quad + O_p\big((\max_j s_j)^{3/2}(\log p/n)^{1/2}\big) + O_p\big(\max_j s_j\log p/n\big) \\
&\quad + O_p\big((\max_j s_j)^{1/2}(\log p/n)^{1/2}\big) + O_p\big(\max_j s_j(\log p/n)^{1/2}\big) \\
&= O_p\big((\max_j s_j)^{3/2}(\log p/n)^{1/2}\big) = o_p(1),
\end{aligned} \tag{5.65}$$
where we use the fact that $(\max_j s_j)^{3/2}(\log p/n)^{1/2}$ is the slowest rate of convergence among the right-hand-side terms; the last equality then follows from $(\max_j s_j)^{3/2}(\log p/n)^{1/2} = o(1)$. Q.E.D.

Appendix B. In this part of the Appendix, we analyze what happens when we relax the sub-Gaussian returns assumption. Given Assumptions 1*, 2*, 3, going over the proof of Lemma 1 with $\max_j\lambda_j = O(p^{2/l}/n^{1/2})$, we get
$$\|\hat{\Theta}\hat{\Sigma} - I_p\|_\infty = O_p(p^{2/l}/n^{1/2}) = o_p(1).$$
Next, one of the most important inputs to the proofs of Theorems 1-4 is (5.3). Lemma 2 of Caner and Kock (2014) shows that
$$\max_j\|\hat{\Theta}_j-\Theta_j\|_1 = O_p\big((\max_j s_j)\,p^{2/l}/n^{1/2}\big).$$
Note that the result above is a subcase of Lemma 2 in Caner and Kock (2014) with no restrictions ($h=1$ in their notation). After that, we need to adjust (5.9) to reflect the new return structure. Following Lemma A.4 of Caner and Kock (2014),
$$\|\hat{\mu}-\mu\|_\infty = O_p(p^{2/l}/n^{1/2}) = o_p(1).$$
The rest of the proofs of Theorems 1-4 follow from these equations as in Appendix A. So the main difference from Theorems 1-4 is that in the approximation rates we replace $\sqrt{\log p}$ with $p^{2/l}$.
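For concreteness, the nodewise-regression construction of the relaxed inverse $\hat{\Theta}$ that underlies all of the above can be sketched as follows. This is an illustrative implementation, not the authors' code: the ISTA lasso solver, the fixed penalty `lam`, and the function names are our own simplifications; the row construction $\hat{\Theta}_j = \hat{C}_j/\hat{\tau}_j^2$ with $\hat{\tau}_j^2 = \|r_j - R_{-j}\hat{\gamma}_j\|_2^2/n + \lambda_j\|\hat{\gamma}_j\|_1$ follows the definition used in the paper (after van de Geer et al. 2014). By the lasso KKT conditions, the diagonal of $\hat{\Theta}\hat{\Sigma}$ is then numerically one.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xb||_2^2 + lam*||b||_1 by ISTA (proximal gradient)."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return b

def nodewise_precision(R, lam):
    """Relaxed (approximate) inverse of the sample covariance of returns R (n x p)."""
    n, p = R.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        Xj = np.delete(R, j, axis=1)              # all columns except j
        gamma = lasso_ista(Xj, R[:, j], lam)      # nodewise lasso regression
        resid = R[:, j] - Xj @ gamma
        tau2 = resid @ resid / n + lam * np.abs(gamma).sum()
        Cj = np.insert(-gamma, j, 1.0)            # 1 in position j, -gamma elsewhere
        Theta[j] = Cj / tau2                      # j-th row of the relaxed inverse
    return Theta

rng = np.random.default_rng(0)
n, p = 400, 10
R = rng.standard_normal((n, p))
R = R - R.mean(axis=0)                            # de-mean the returns
Sigma_hat = R.T @ R / n
Theta_hat = nodewise_precision(R, lam=0.1)
# KKT conditions of the lasso imply diag(Theta_hat @ Sigma_hat) is (near) one,
# even though Sigma_hat itself need not be invertible when p > n.
print(np.diag(Theta_hat @ Sigma_hat).round(4))
```

Note that, unlike a matrix inverse, this construction never requires $\hat{\Sigma}$ to be invertible, which is why it remains usable when $p > n$.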

References

Aït-Sahalia, Y. and D. Xiu (2016): "Using principal component analysis to estimate a high dimensional factor model with high frequency data," Journal of Econometrics, forthcoming.

Antoniadis, A. and J. Fan (2001): "Regularization of Wavelet Approximations," Journal of the American Statistical Association, 96, 939–967.

Bai, J. and S. Ng (2002): "Determining the Number of Factors in Approximate Factor Models," Econometrica, 70, 191–221.

Ban, G. Y., N. El Karoui, and A. E. Lim (2016): "Machine learning and portfolio optimization," Working paper, Department of Statistics, U. California Berkeley, submitted to Management Science.

Bickel, P. J. and E. Levina (2008a): "Covariance regularization by thresholding," The Annals of Statistics, 36, 2577–2604.

——— (2008b): "Regularized estimation of large covariance matrices," The Annals of Statistics, 36, 199–227.

Bodnar, T., A. K. Gupta, and N. Parolya (2014): "On the strong convergence of the optimal linear shrinkage estimator for large dimensional covariance matrix," Journal of Multivariate Analysis, 132, 215–228.

Bodnar, T., S. Mazur, and K. Podgórski (2016): "Singular inverse Wishart distribution and its application to portfolio theory," Journal of Multivariate Analysis, 143, 314–326.

Brodie, J., I. Daubechies, C. De Mol, D. Giannone, and I. Loris (2009): "Sparse and stable Markowitz portfolios," Proceedings of the National Academy of Sciences, 106, 12267–12272.

Bühlmann, P. and S. van de Geer (2011): Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 1st ed.

Cai, T. and W. Liu (2011): "Adaptive Thresholding for Sparse Covariance Matrix Estimation," Journal of the American Statistical Association, 106, 672–684.

Cai, T., W. Liu, and X. Luo (2011): "A Constrained L1 Minimization Approach to Sparse Precision Matrix Estimation," Journal of the American Statistical Association, 106, 594–607.

Cai, T. and M. Yuan (2012): "Adaptive Covariance Matrix Estimation Through Block Thresholding," The Annals of Statistics, 40, 2014–2042.

Cai, T. and H. Zhou (2012): "Optimal Rates of Convergence for Covariance Matrix Estimation," The Annals of Statistics, 40, 2389–2420.

Callot, L. A. F., A. B. Kock, and M. C. Medeiros (2016): "Modeling and Forecasting Large Realized Covariance Matrices and Portfolio Choice," Journal of Applied Econometrics, forthcoming.

Caner, M., X. Han, and Y. Lee (2017): "Adaptive Elastic Net GMM Estimation with Many Invalid Moment Conditions: Simultaneous Model and Moment Selection," Journal of Business & Economic Statistics, forthcoming.

Caner, M. and A. Kock (2014): "Asymptotically Honest Confidence Regions for High Dimensional Parameters by the Desparsified Conservative Lasso," CREATES Research Papers, School of Economics and Management, University of Aarhus.

Chamberlain, G. and M. Rothschild (1983): "Arbitrage, Factor Structure, and Mean-Variance Analysis on Large Asset Markets," Econometrica, 51, 1281–1304.

Chen, B., S.-F. Huang, and G. Pan (2015): "High dimensional mean-variance optimization through factor analysis," Journal of Multivariate Analysis, 133, 140–159.

Daniels, M. and R. Kass (2001): "Shrinkage Estimators for Covariance Matrices," Biometrics, 57, 1173–1184.

d'Aspremont, A., O. Banerjee, and L. El Ghaoui (2008): "First-Order Methods for Sparse Covariance Selection," SIAM Journal on Matrix Analysis and Applications, 30.

DeMiguel, V., L. Garlappi, F. J. Nogales, and R. Uppal (2009a): "A Generalized Approach to Portfolio Optimization: Improving Performance by Constraining Portfolio Norms," Management Science, 55, 798–812.

DeMiguel, V., L. Garlappi, and R. Uppal (2009b): "Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy?" Review of Financial Studies, 22, 1915–1953.

Dey, D. K. (1987): "Improved estimation of a multinormal precision matrix," Statistics & Probability Letters, 6, 125–128.

Dey, D. K. and C. Srinivasan (1985): "Estimation of a Covariance Matrix under Stein's Loss," The Annals of Statistics, 13, 1581–1591.

Efron, B. and C. Morris (1976): "Multivariate Empirical Bayes and Estimation of Covariance Matrices," The Annals of Statistics, 4, 22–32.

El Karoui, N. (2008): "Operator norm consistent estimation of large-dimensional sparse covariance matrices," The Annals of Statistics, 36, 2717–2756.

Fan, J., Y. Fan, and J. Lv (2008): "High dimensional covariance matrix estimation using a factor model," Journal of Econometrics, 147, 186–197.

Fan, J., Y. Feng, and Y. Wu (2009): "Network exploration via the adaptive LASSO and SCAD penalties," The Annals of Applied Statistics, 3, 521–541.

Fan, J., A. Furger, and D. Xiu (2016): "Incorporating global industrial classification standard into portfolio allocation: A simple factor-based large covariance matrix estimator with high frequency data," Journal of Business & Economic Statistics, 34, 489–503.

Fan, J. and R. Li (2001): "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.

Fan, J., Y. Liao, and M. Mincheva (2011): "High-dimensional covariance matrix estimation in approximate factor models," The Annals of Statistics, 39, 3320–3356.

——— (2013): "Large covariance estimation by thresholding principal orthogonal complements," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75, 603–680.

Fan, J., Y. Liao, and X. Shi (2015): "Risks of large portfolios," Journal of Econometrics, 186, 367–387.

Fan, J., J. Zhang, and K. Yu (2012): "Vast Portfolio Selection With Gross-Exposure Constraints," Journal of the American Statistical Association, 107, 592–606.

Friedman, J., T. Hastie, and R. Tibshirani (2008): "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 9, 432–441.

——— (2010): "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, 33, 1–22.

Green, R. C. and B. Hollifield (1992): "When Will Mean-Variance Efficient Portfolios be Well Diversified?" The Journal of Finance, 47, 1785–1809.

Hautsch, N., L. M. Kyj, and R. C. Oomen (2012): "A blocking and regularization approach to high-dimensional realized covariance estimation," Journal of Applied Econometrics, 27, 625–645.

Huang, J. Z., N. Liu, M. Pourahmadi, and L. Liu (2006): "Covariance matrix selection and estimation via penalised normal likelihood," Biometrika, 93, 85–98.

Jagannathan, R. and T. Ma (2003): "Risk Reduction in Large Portfolios: Why Imposing the Wrong Constraints Helps," The Journal of Finance, 58, 1651–1684.

Javanmard, A. and A. Montanari (2013): "Hypothesis Testing in High-Dimensional Regression under the Gaussian Random Design Model: Asymptotic Theory," CoRR, abs/1301.4240.

Jorion, P. (1986): "Bayes-Stein Estimation for Portfolio Analysis," Journal of Financial and Quantitative Analysis, 21, 279–292.

Kourtis, A., G. Dotsis, and R. N. Markellos (2012): "Parameter uncertainty in portfolio selection: Shrinking the inverse covariance matrix," Journal of Banking & Finance, 36, 2522–2531.

Kubokawa, T. and M. S. Srivastava (2008): "Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data," Journal of Multivariate Analysis, 99, 1906–1928.

Lam, C. and J. Fan (2009): "Sparsistency and rates of convergence in large covariance matrix estimation," The Annals of Statistics, 37, 4254–4278.

Ledoit, O. and M. Wolf (2003): "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection," Journal of Empirical Finance, 10, 603–621.

——— (2004): "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis, 88, 365–411.

——— (2012): "Nonlinear shrinkage estimation of large-dimensional covariance matrices," The Annals of Statistics, 40, 1024–1060.

——— (2015): "Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions," Journal of Multivariate Analysis, 139, 360–384.

Levina, E., A. Rothman, and J. Zhu (2008): "Sparse estimation of large covariance matrices via a nested Lasso penalty," The Annals of Applied Statistics, 2, 245–263.

Li, J. (2015): "Sparse and Stable Portfolio Selection With Parameter Uncertainty," Journal of Business & Economic Statistics, 33, 381–392.

Liu, Q. (2009): "On portfolio optimization: How and when do we benefit from high-frequency data?" Journal of Applied Econometrics, 24, 560–582.

Markowitz, H. (1952): "Portfolio Selection," The Journal of Finance, 7, 77–91.

——— (1990): Mean-Variance Analysis in Portfolio Choice and Capital Markets, Basil Blackwell, 1st ed.

Meinshausen, N. and P. Bühlmann (2006): "High-dimensional graphs and variable selection with the Lasso," The Annals of Statistics, 34, 1436–1462.

Michaud, R. O. (1989): "The Markowitz Optimization Enigma: Is Optimized Optimal," Financial Analysts Journal.

Banerjee, O., L. El Ghaoui, and A. d'Aspremont (2008): "Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data," Journal of Machine Learning Research, 9, 485–516.

Onatski, A. (2010): "Determining the number of factors from empirical distribution of eigenvalues," The Review of Economics and Statistics, 92, 1004–1016.

R Core Team (2015): R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Ravikumar, P., M. J. Wainwright, G. Raskutti, and B. Yu (2011): "High-dimensional covariance estimation by minimizing L1-penalized log-determinant divergence," Electronic Journal of Statistics, 5, 935–980.

Ross, S. A. (1976): "The arbitrage theory of capital asset pricing," Journal of Economic Theory, 13, 341–360.

——— (1977): "The Capital Asset Pricing Model (CAPM), Short-Sale Restrictions and Related Issues," Journal of Finance, 32, 177–183.

Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008): "Sparse permutation invariant covariance estimation," Electronic Journal of Statistics, 2, 494–515.

Rothman, A. J., E. Levina, and J. Zhu (2009): "Generalized Thresholding of Large Covariance Matrices," Journal of the American Statistical Association, 104, 177–186.

——— (2010): "A new approach to Cholesky-based covariance regularization in high dimensions," Biometrika, 97, 539–550.

Schäfer, J. and K. Strimmer (2005): "A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics," Statistical Applications in Genetics and Molecular Biology, 4.

Sharpe, W. (1963): "A Simplified Model for Portfolio Analysis," Management Science, 9, 277–293.

Stein, C. (1956): "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, ed. by J. Neyman, University of California Press, 197–206.

Tibshirani, R. (1994): "Regression Shrinkage and Selection Via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tsukuma, H. (2016): "Estimation of a high-dimensional covariance matrix with the Stein loss," Journal of Multivariate Analysis, 148, 1–17.

Tsukuma, H. and Y. Konno (2006): "On Improved Estimation of Normal Precision Matrix and Discriminant Coefficients," Journal of Multivariate Analysis, 97, 1477–1500.

van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014): "On asymptotically optimal confidence regions and tests for high-dimensional models," The Annals of Statistics, 42, 1166–1202.

Wang, H., B. Li, and C. Leng (2009): "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 671–683.

Warton, D. (2011): "Penalized Sandwich Estimators for Analysis of High-Dimensional Data Using Generalized Estimating Equations," Biometrics, 67, 116–123.

Wong, F., C. K. Carter, and R. Kohn (2003): "Efficient estimation of covariance selection models," Biometrika, 90, 809–830.

Wu, W. B. and M. Pourahmadi (2003): "Nonparametric estimation of large covariance matrices of longitudinal data," Biometrika, 90, 831–844.

Yang, R. and J. O. Berger (1994): "Estimation of a Covariance Matrix Using the Reference Prior," The Annals of Statistics, 22, 1195–1211.

Yuan, M. and Y. Lin (2007): "Model selection and estimation in the Gaussian graphical model," Biometrika, 94, 19–35.

Zou, H. (2006): "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.

Zou, H. and T. Hastie (2005): "Regularization and variable selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.

Zou, H., T. Hastie, R. Tibshirani, et al. (2007): "On the degrees of freedom of the lasso," The Annals of Statistics, 35, 2173–2192.