deconvolution procedure to decompose individual log earnings from the panel ... For example, the standard model of individual earnings dynamics is a linear ...
Review of Economic Studies (2010) 77, 491–533 © 2009 The Review of Economic Studies Limited
0034-6527/10/00411011$02.00 doi: 10.1111/j.1467-937X.2009.00577.x
Generalized Non-Parametric Deconvolution with an Application to Earnings Dynamics ´ STEPHANE BONHOMME Centro de Estudios Monetarios y Financieros
and JEAN-MARC ROBIN Paris School of Economics, University Paris and University College London First version received January 2008; final version accepted May 2009 (Eds.) In this paper, we construct a non-parametric estimator of the distributions of latent factors in linear independent multi-factor models under the assumption that factor loadings are known. Our approach allows estimation of the distributions of up to L (L + 1) /2 factors given L measurements. The estimator uses empirical characteristic functions, like many available deconvolution estimators. We show that it is consistent, and derive asymptotic convergence rates. Monte Carlo simulations show good finite-sample performance, less so if distributions are highly skewed or leptokurtic. We finally apply the generalized deconvolution procedure to decompose individual log earnings from the panel study of income dynamics (PSID) into permanent and transitory components.
1. INTRODUCTION In this paper, we consider linear multi-factor models of the form: Y = AX, where Y = (Y1 , . . . , YL )T is a vector of L measurements, X = (X1 , . . . , XK )T is a vector of K unobserved and mutually independent latent factors, and A is a L × K matrix of parameters. The analysis is conducted assuming that the number of factors and the matrix of factor loadings A are known. The contribution of this paper is to provide a non-parametric estimator of the distribution function of X from an independent identically distributed (i.i.d.) sample {Yn , n = 1, . . . , N } for up to K = L(L + 1)/2 factors. Applications of factor models are numerous in social sciences, and economics in particular. For example, the standard model of individual earnings dynamics is a linear multi-factor model with four additive components, or factors: a deterministic function of regressors, a fixed effect, a persistent autoregressive component, and a transitory moving-average component (e.g. Hall and Mishkin, 1982; Abowd and Card, 1989). Factor variances are usually estimated based on second-order moment restrictions. Estimating the whole distribution of factors is less common but still useful. Lillard and Willis (1978) compute transition probabilities into and out of poverty (“first passage times”). More recently, economists have shown considerable interest 491
492
REVIEW OF ECONOMIC STUDIES
in estimating complete earnings models to feed life-cycle consumption models (e.g. Guvenen, 2007a,b; Kaplan and Violante, 2008). Estimating factor distributions non-parametrically in those models is thus of substantial interest.1 Horowitz and Markatou (1996) were the first to propose fully non-parametric estimators for linear models with error components, with an application to earnings panel data. They show that, unlike in the standard semiparametric deconvolution problem,2 every component of the convolution can be identified and non-parametrically estimated if repeated observations of the dependent variable are available. The basic setup that they study has two measurements (two observations per individual), one common factor and two independent errors with identical and symmetric distribution.3 Horowitz and Markatou consider several extensions of this simple framework, namely stationary AR or MA errors, or asymmetric errors.4 However, it is not easy to see how to extend the basic approach in a systematic way to analyse more complex models. In this paper, we extend their approach and develop a general method for estimating factor distributions in linear factor models with many independent factors with different, unrestricted distributions. Our estimator generalizes the one proposed in another important paper by Li and Vuong (1998). Li and Vuong consider the same basic setup as Horowitz and Markatou, with repeated measurements and independent errors. They allow measurement errors to display different, possibly asymmetric, distributions.5 Their estimator of the densities of the common factor and the independent errors builds on an identification result due to Kotlarski (1967) (also stated by Rao, 1992, p. 21). Sz´ekely and Rao (2000) generalize Kotlarski’s identification result to the general case of linear multi-factor models Y = AX with known A and unrestricted, independent unobserved factors X. They show that the L(L + 1)/2 second-order partial derivatives of the characteristic function of Y deliver a system of functional identifying restrictions allowing identification of a maximal number of K = L(L + 1)/2 factors. Our estimator is based on this system of identifying restrictions, replacing the characteristic function of the vector of measurements by an empirical analogue, and using a smoothing kernel with trimming. Our estimator requires no optimization, unlike parametric approaches. Moreover, we provide a simple method to choose the trimming parameter, inspired by the “plug-in” method proposed in Delaigle and Gijbels (2004). We provide results on the uniform convergence of the estimator. We depart from most of the deconvolution literature, and in particular from Li and Vuong (1998), by allowing the supports of factor distributions to be unbounded. The rate of convergence of our estimator depends crucially on the smoothness of factor distributions, and may be very slow. When the characteristic function of the factor of interest has fatter tails than the characteristic functions of 1. The usual approach is to adopt flexible parametric distributions for the individual effects and the innovations (e.g. Chamberlain and Hirano, 1999; Hirano, 2002; Geweke and Keane, 2000, 2007). 2. For references on the classical deconvolution problem, see Carroll and Hall (1988), Zhang (1990), Fan (1991), and Carroll, Ruppert and Stefanski (1995). 3. Susko and Nadon (2002) consider the same setup and estimator as Horowitz and Markatou with an application to gene expression. More recently, Delaigle, Hall and Meister (2008) have proposed a modified kernel estimator that may achieve, under smoothness conditions, the same asymptotic performance as if the distribution of errors were known. 4. These extensions are developed in Horowitz (1998, pp. 125–136). 5. Li and Vuong’s estimator has been used by Li et al. (2000) in the context of a structural auction model, and in Li (2002) in a non-linear errors-in-variables model. Hall and Yao (2003) proposed an estimator that is closely related to that of Li and Vuong (1998). Related methods have been used by Schennach (2004a,b) in the context of non-linear regression and non-parametric regression, respectively, when the regressors are measured with error, and by Hu and Ridder (2007) in order to deal with measurement error when marginal information is available. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
493
other factors, the rate may be logarithmic in N . These slow rates of convergence do not indicate a flaw in our approach but are instead a fundamental property of non-parametric deconvolution estimators. Indeed, logarithmic rates are the best convergence rates that can be attained in some circumstances (Carroll and Hall, 1988; Fan, 1991). Despite the slow asymptotic rates of convergence, Monte Carlo simulations are encouraging. When the true factor distributions are normal or Laplace, we find moderate biases and tight confidence bands. Interestingly, our generalized deconvolution estimator has the same finite-sample bias and variance as a standard deconvolution estimator assuming that all factor densities are known except the one to be estimated. Moreover, it achieves comparable finitesample performance to the estimators of Horowitz and Markatou (1996) and Li and Vuong (1998) in the basic set-up with two measurements. We also find that the shape of factor distributions strongly influences the performance of the estimator. In particular, it is more difficult to estimate skewed or leptokurtic factor distributions. We apply our methodology to individual earnings data from the panel study of income dynamics (PSID). We model the residuals of log earnings on individual covariates as the sum of an individual effect, a random walk, and a white noise, and estimate the distributions of innovations from first differences, instead of earnings levels as in Horowitz and Markatou (1996).6 Our results show that both shocks exhibit more kurtosis than the normal distribution. We use the model to analyse the respective roles of permanent and transitory shocks in earnings mobility, and to correlate the variance of earnings shocks to job changes. In particular, we find that frequent job changers face more permanent and more transitory earning shocks than job stayers. The outline of the paper is as follows. Section 2 presents the model and assumptions. In Section 3, we provide a simple proof of the identification result in Sz´ekely and Rao (2000), which we use in Section 4 to construct an estimator of factor densities. In Section 5, we prove the consistency of the estimator and discuss convergence speed. Section 6 displays Monte Carlo simulations. Section 7 presents an application to earnings dynamics. Section 8 concludes.
2. MODEL AND ASSUMPTIONS 2.1. Model The main features of the model are summarized in the following assumption (AT denotes the matrix transpose of matrix A):
Assumption A1. We consider the following basic set-up: 1. Xn = (Xn1 , . . . , XnK )T , n = 1, . . . , N , are N independent copies of a vector X = (X1 , . . . , XK )T of K real-valued, mutually independent, and non-degenerate random variables, with zero mean and finite variances. Vectors X1 , . . . , XN are unobserved to the econometrician and are called factors. 2. Yn = (Yn1 , . . . , YnL )T , n = 1, . . . , N , are N independent copies of a vector Y = (Y1 , . . . , YL )T with zero mean. Vectors Y1 , . . . , YN are observed and are called measurements. 3. Y = AX, where A = [ak ] is a known L × K matrix of scalar parameters and any two columns of A are linearly independent. 6. Note that Horowitz and Markatou assume independent shocks with identical symmetric distributions. As they use data from the Current Population Survey (CPS), which is a 2-year panel, possibilities for identifying complex error-component models are limited. © 2009 The Review of Economic Studies Limited
494
REVIEW OF ECONOMIC STUDIES
Remark 1. It is usual to denote as U instead of X a factor variable that appears in only one equation (an error). Remark 2. Measurements are demeaned. Obviously, factor distributions are only identified up to a location parameter. We therefore normalize the factor means to 0. Remark 3. Assuming that factors have finite variances implies that the characteristic functions of factors Xk are twice differentiable (Lukacs, 1970, Theorem 2.3.1.). This property is instrumental in the construction of the characteristic functions of factors from that of the vector of measurements. However, it is not a necessary condition for identification, as shown by Sz´ekely and Rao (2000). Remark 4. We assume that factor loadings are known to the researcher. Alternatively, one could assume that a root-N consistent estimator of A is available. The asymptotic results derived in this paper would remain unchanged, as we find convergence rates of density estimators that are slower than root-N . When K = L and A is non-singular, the distribution of X is trivially identified as that of A−1 Y. The aim of this paper is to propose a general method to estimate the distributions of factors when there are more factors than measurements (K > L). This situation arises naturally if there are L common factors and L errors, as in standard factor analysis. The empirical application that we shall later consider deals with earnings dynamics. The standard model assumes that log earnings residuals can be decomposed into a fixed effect, a persistent component, and a transitory component (e.g. Hall and Mishkin, 1982; Abowd and Card, 1989): P T wnt = fn + ynt + ynt , P ynt T ynt
=
P ynt−1
+ ε nt ,
n = 1, . . . , N,
t ≥ 2,
t = 1, . . . , T ,
(1) (2)
= ηnt ,
(3)
ηn1 = ηnT = 0,
(4)
where wnt is the residual of a regression of individual log earnings on a set of strictly exogeP nous regressors,7 fn is the fixed effect, ynt is the persistent component, usually modelled as T a random walk, and ynt is the transitory shock/measurement error, modelled as a white noise. Innovations ε nt and ηnt are mutually independent and independent over time. The model considered in Horowitz and Markatou (1996) is a simplified version of model (1)–(4), without the permanent component (εnt = 0), and with i.i.d. transitory shocks ηnt , which in addition are assumed symmetrically distributed. Unlike Horowitz and Markatou (1996), we shall not attempt to estimate the distribution of the fixed effect fn . The existence of an autoregressive component makes earnings levels depend on an initial condition that may be correlated with fn . As is usual in the literature on earnings dynamics (e.g. Abowd and Card, 1989; Meghir and Pistaferri, 2004), we instead difference out unobserved individual fixed effects. For example, setting T = 4 to simplify the 7. See Horowitz and Markatou (1996) for a formal treatment of cases where the dependent variable is the residual of a prior regression. Their analysis supposes strictly exogenous and bounded regressors, and root-N consistent estimates of the regression coefficients. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
495
presentation: ⎛ ⎞ ⎞ ε n2 0 η 1 ⎠ n2 + ⎝ε n3 ⎠ . ηn3 −1 ε n4
⎞ ⎛ ⎛ 1 wn2 − wn1 ⎝wn3 − wn2 ⎠ = ⎝−1 0 wn4 − wn3
(5)
This model satisfies our set-up with L = T − 1 = 3, K = 2T − 3 = 5, and Yn = (wn2 − wn1 , wn3 − wn2 , wn4 − wn3 )T , T Xn = ηn2 , ηn3 , εn2 , ε n3 , ε n4 , ⎛ ⎞ 1 0 1 0 0 A = ⎝−1 1 0 1 0⎠ . 0 −1 0 0 1
(6)
Most of the literature on earnings dynamics focuses on estimating the variances of permanent and transitory shocks in models similar to this one. In comparison, we here address the more difficult task of non-parametrically estimating the full distributions of these shocks.
2.2. Assumptions Given the assumptions of linearity and independence, it will be convenient to work with characteristic functions. We make the following additional assumption.
Assumption A2. The characteristic functions of factor variables X1 , . . . , XK have no real zeros. This assumption is very common in the literature on non-parametric deconvolution (see Schennach, 2004a, and references therein). The characteristic functions may have complex zeros if factors have bounded support. Real zeros arise in the case of symmetric, bounded distributions, such as the uniform. Next, let Ak denote the kth column of matrix A, for k ∈ {1, . . . , K}. Then, Y = AX =
K
Ak Xk ,
(7)
k=1
and the variance-covariance matrix of Y is thus Var (Y) = AVar (X) AT =
K
Var (Xk )Ak ATk ,
(8)
k=1
as factors are independent, hence uncorrelated. Let vech be the matrix operator that acts on symmetric matrices like the standard vec operator except that it only selects the components below or on the diagonal. For example, if B = [bij ] is 3 × 3 symmetric: vec (B) = (b11 , b12 , b13 , b12 , b22 , b23 , b13 , b23 , b33 ) , vech (B) = (b11 , b12 , b13 , b22 , b23 , b33 ) .
(9)
© 2009 The Review of Economic Studies Limited
496
REVIEW OF ECONOMIC STUDIES
As Var (Y) is symmetric, one can re-express the set of second-order restrictions in (8) as vech (Var (Y)) =
K
Var (Xk )vech Ak ATk
(10)
k=1
⎛
⎞ Var (X1 ) ⎜ ⎟ .. = Q⎝ ⎠, . Var (XK ) where
Q = vech A1 AT1 , . . . , vech AK ATK .
(11)
(12)
Matrix Q has L(L + 1)/2 rows and K columns. A generic row is [a1 am1 , . . . , aK amK ] for , m = 1, . . . , L, ≤ m. Given A, factor variances are obviously identifiable only if the following assumption holds true.
Assumption A3. Matrix Q has full column rank K ≤ L(L + 1)/2. Note that Q could have rank less than K if two columns of A were proportional, a case that is ruled out by Assumption A1. Also, it is easily seen that Assumption A3 is equivalent to assuming that: rank ([A1 ⊗ A1 , . . . , AK ⊗ AK ]) = K,
(13)
where ⊗ denotes the Kronecker product, and the matrix [A1 ⊗ A1 , . . . , AK ⊗ AK ] is sometimes referred to as the Khatri–Rao matrix product of A by itself (Khatri and Rao, 1968).
2.3. Examples Example 1. The classical measurement error model. Y1 = αX + U1 , Y2 = X + U2 ,
(14)
has Y = (Y1 , Y2 )T , X = (X, U1 , U2 )T , ⎞ ⎛ 2 α 1 0 α 1 0 A= , Q = ⎝ α 0 0⎠ . 1 0 1 1 0 1
(15)
So Q has full rank 3 unless α = 0, in which case the first and third columns of A are identical. The identification of α in model (14) was studied in Reiersol (1950). Example 2. A simple spatial model. ⎧ ⎨Y1 = X1 + ρX2 + ρX3 + U1 , Y2 = ρX1 + X2 + ρX3 + U2 , ⎩ Y3 = ρX1 + ρX2 + X3 + U3 , © 2009 The Review of Economic Studies Limited
(16)
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
497
has Y = (Y1 , Y2 , Y3 )T , ⎛
1 A = ⎝ρ ρ
ρ 1 ρ
ρ ρ 1
X = (X1 , X2 , X3 , U1 , U2 , U3 )T , ⎛ 1 ρ2 ρ2 ⎜ ρ ρ ρ2 ⎞ ⎜ 1 0 0 ⎜ ρ ρ2 ρ ⎠ 0 1 0 , Q=⎜ ⎜ρ 2 1 ρ 2 ⎜ 0 0 1 ⎝ρ 2 ρ ρ ρ2 ρ2 1
1 0 0 0 0 0
0 0 0 1 0 0
⎞ 0 0⎟ ⎟ 0⎟ ⎟. 0⎟ ⎟ 0⎠ 1
(17)
One verifies that Q has rank 6 unless ρ ∈ {−2, 0, 1}. If ρ = 0 or ρ = 1, then some columns of A are proportional, so Q does not have full column rank. If ρ = −2, Q has rank strictly less than K = 6, although any two columns of A are linearly independent. 3. IDENTIFICATION OF FACTOR DISTRIBUTIONS In this section, we derive the identifying restrictions that will be used for estimation in the next section. A sketch of the arguments below can be found in Sz´ekely and Rao (2000, Remark 6, p. 200).
3.1. Notation Let us denote the characteristic function (c.f.) of Xk as ϕ Xk (τ ) = E eiτ Xk , τ ∈ R, = eiτ x fXk (x) dx, τ ∈ R,
(18)
√ where fXk is the probability density function (p.d.f.) of Xk , and i = −1. As previously noted, Xk having finite variance, ϕ Xk is well defined and everywhere twice differentiable. Moreover, as ϕ Xk is nowhere vanishing, the cumulant generating function (c.g.f.) of Xk , i.e. the logarithm of its characteristic function, is also well defined and everywhere twice differentiable.8 We denote the c.g.f. of Xk as κ Xk (τ ) = ln ϕ Xk (τ ) , τ ∈ R,
= ln E eiτ Xk , τ ∈ R.
(19)
The density is uniquely determined by the characteristic function by the inverse Fourier transformation (for x in the support of Xk ): 1 e−iτ x ϕ Xk (τ ) dτ . (20) fXk (x) = 2π In particular, it follows from (20) that if the c.f. of Xk is identified, then its p.d.f. is also identified. 8. Note that, for identification, it suffices to assume that the set of zeros of factor c.f.’s is Lebesgue-negligible, see e.g. Carrasco and Florens (2007). © 2009 The Review of Economic Studies Limited
498
REVIEW OF ECONOMIC STUDIES
We similarly denote the multivariate c.f. and c.g.f. of Y, which are functions that map RL into the complex plane, as T ϕ Y (t) ≡ E eit Y , t ∈ RL ,
T κ Y (t) ≡ ln E eit Y , t ∈ RL . (21) We will make extensive use of the derivatives of κ Y . Let ∂ κ Y (t) denote the th 2 κ Y (t) the second-order partial derivative of κ Y (t) with partial derivative of κ Y (t), and ∂m respect to t and tm . We also denote as ∇κ Y (t) = [∂ κ Y (t)] the gradient vector, and as 2 ∇∇ T κ Y (t) = ∂m κ Y (t) the Hessian matrix.
3.2. Identifying restrictions Because of a well-known property of characteristic functions, the c.g.f. of a linear combination of independent random variables is equal to the same linear combination of their c.g.f.’s. Specifically, as factors are assumed mutually independent, for all t = (t1 , . . . , tL ) ∈ RL , κ Y (t) =
K
κ Xk tT Ak .
(22)
κ Xk tT Ak Ak .
(23)
k=1
First-differentiating equation (22) yields: ∇κ Y (t) =
K
k=1
If K > L, there are more functions κ Xk than partial derivatives ∂ κ Y . To obtain an invertible system, we differentiate once more: ∇∇ T κ Y (t) =
K
κ Xk tT Ak Ak ATk .
(24)
k=1
As ∇∇ T κ Y (t) is symmetric, we may as well rewrite this set of restrictions as K
κ Xk tT Ak vech Ak ATk vech ∇∇ T κ Y (t) = k=1
⎞ κ X1 tT A1 ⎜ ⎟ .. = Q⎝ ⎠. . T κ XK t AK ⎛
(25)
Note that, evaluated at t = 0, equation (25) yields the covariance restrictions (10). The independence assumption on factor variables, which is more restrictive than uncorrelatedness, yields many more restrictions on factor c.g.f.’s, one for each value of t ∈ RL . Equation (25) shows that, if Q has full column rank (Assumption A3) and if factors are independent and not only uncorrelated, the second derivatives of the c.g.f.’s of factor variables are identified. Namely, inverting (25), we obtain ⎛ T ⎞ κ X1 t A1 ⎜ ⎟ .. − T (26) ⎝ ⎠ = Q vech ∇∇ κ Y (t) , . T κ XK t AK −1 T where Q− = QT Q Q . © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
499
Let τ ∈ R, and let k ∈ {1, . . . , K}. System (26) offers many overidentifying restrictions for κ Xk (τ ). Indeed, there are many ways to choose t such that tT Ak = τ . A possible choice is to take t = AτTAAk . We will provide a motivation for this choice based on asymptotic arguments k k in Section 5. We shall refer to Ak as our “preferred direction of integration”. However, this direction is by no means unique. Any choice of t = θ Tτ Aθ , for θ ∈ RL \{0}, will also work. k
− L Let Q− k denote the kth row of Q . For any direction θ ∈ R \{0} and τ ∈ R, τθ − T . κ Xk (τ ) = Qk vech ∇∇ κ Y θ T Ak
(27)
The c.g.f. of Xk then follows by integrating this equation twice. Two constants of integrations are readily available: κ Xk (0) = iEXk = 0, as factors have zero means, and κ Xk (0) = 0, because a c.f. is equal to 1 at zero. Hence, τ u vθ − T dvdu. (28) Qk vech ∇∇ κ Y κ Xk (τ ) = θ T Ak 0 0 Equation (28) can be used for estimating factor characteristic functions and densities, as we explain in the next section.
3.3. Example 1: The measurement error model In the case of model (14), we have: κ Y (t1 , t2 ) = κ X (αt1 + t2 ) + κ U1 (t1 ) + κ U2 (t2 ) , and
⎛ 2 ⎞ ⎛ 2 ∂11 κ Y (t1 , t2 ) α T 2 κ Y (t1 , t2 )⎠ = ⎝ α vech ∇∇ κ Y (t) = ⎝∂12 2 1 κ Y (t1 , t2 ) ∂22
⎞ ⎞ ⎛ κ X (αt1 + t2 ) 1 0 0 0⎠ ⎝ κ U1 (t1 ) ⎠ . κ U2 (t2 ) 0 1
(29)
(30)
This yields: ⎞ ⎛ ⎛ κ X (αt1 + t2 ) 0 ⎝ κ U (t1 ) ⎠ = ⎝1 1 κ U2 (t2 ) 0
α −1 −α −α −1
⎞⎛ 2 ⎞ ∂11 κ Y (t1 , t2 ) 0 2 0⎠ ⎝∂12 κ Y (t1 , t2 )⎠ . 2 1 κ Y (t1 , t2 ) ∂22
(31)
3.3.1. The common factor. Let us set α = 1 for comparability with previous results in T the literature. Then, ATA1A = 12 , 12 and κ X (τ ) can be thus represented as: 1
1
κ X (τ ) =
τ
0
u 0
2 ∂12 κY
v v , dvdu. 2 2
Alternatively, using θ = (0, 1) as direction of integration yields: τ u 2 κ X (τ ) = ∂12 κ Y (0, v) dvdu 0 0 τ ∂1 κ Y (0, u) du. =
(32)
(33)
0
© 2009 The Review of Economic Studies Limited
500
REVIEW OF ECONOMIC STUDIES
This is the expression used in Li and Vuong (1998) and Schennach (2004a,b), which requires only one single differentiation and integration. However, it is not true in general that the double integral in (28) can be simplified into a simple integral by using an appropriate direction of integration, as Example 2 below will show. Remark that, for any α, our preferred direction of integration is A1 = (α, 1)T . So, when |α| gets larger, Y1 contributes more to κ X relative to Y2 . This makes intuitive sense, as Y1 becomes more informative about X.
3.3.2. Errors. For U1 ,
A2 AT 2 A2
= (1, 0)T , and
2 2 ∂11 κ Y (v, 0) dvdu − ∂12 0 0 τ ∂2 κ Y (u, 0) du = κ Y (τ , 0) − τ0 ∂2 κ Y (u, 0) du. = κ Y1 (τ ) −
κ U1 (τ ) =
τ
u
(34)
0
Li and Vuong use a slightly different formula: κ U1 (τ ) = κ Y1 (τ ) −
τ
∂1 κ Y (0, u) du.
(35)
0
3.4. Example 2: The spatial model We then reconsider the case of model (16). We obtain, for the first factor: κ X1 (t1 + ρt2 + ρt3 ) =
2 2 2 + ∂13 − (ρ + 1) ∂23 ∂12 κ Y (t1 , t2 , t3 ) , (1 − ρ) (ρ + 2) ρ
(36)
and, for the first error: κ U1
2 2 2 2 + (2ρ + 1) ∂23 − ρ + ρ 2 + 1 ∂12 + ∂13 ∂11 κ Y (t1 , t2 , t3 ) . (t1 ) = (ρ + 2) ρ
Set t1 = τ − ρt2 − ρt3 . Then, τ u 2 2 2 − (ρ + 1) ∂23 ∂12 + ∂13 κ X1 (τ ) = κ Y (v − ρt2 − ρt3 , t2 , t3 ) dvdu. (1 − ρ) (ρ + 2) ρ 0 0
(37)
(38)
One easily verifies that, even when t2 = t3 = 0, the double integral does not simplify to a single integral. Lastly, in this example our preferred solution is τ u 2 2 2 ∂12 + ∂13 − (ρ + 1) ∂23 (1, ρ, ρ)T κ X1 (τ ) = κY v dvdu. (39) 1 + 2ρ 2 (1 − ρ) (ρ + 2) ρ 0 0 Here also, this solution has intuitive appeal when |ρ| increases. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
501
4. ESTIMATION We here introduce our estimator of factor densities. Asymptotic theory is in the next section.
4.1. Characteristic functions Given an i.i.d. sample {Y1 , . . . , YN } of size N , with Yn = (Yn1 , . . . , YnL )T , we first estimate κ Y and its derivatives by empirical analogues, replacing mathematical expectations by arithmetic means: T κ Y (t) = ln EN eit Y , T EN Y eit Y
T = ∂ κ Y (t), (40) ∂ κ Y (t) = i EN eit Y and
T T T EN Y Ym eit Y EN Y eit Y EN Ym eit Y 2 2
+
= ∂m κ Y (t), ∂ m κ Y (t) = − EN eitT Y EN eitT Y EN eitT Y
where EN denotes the empirical expectation operator: EN g(Y) = N 1 itT Yn exp ( κ Y (t)) = e , N
1 N
N n=1
(41)
g(Yn ). For example:
t ∈ RL ,
(42)
n=1
is the empirical characteristic function of Y. Then, for any θ ∈ RL (the direction of integration), we estimate factor c.g.f.s as: τ u vθ T κ Xk (τ ) = dvdu, τ ∈ R. Q− vech ∇∇ κ Y k θ T Ak 0 0 Equivalently, the characteristic function of Xk is estimated as τ u vθ T dvdu , Q− vech ∇∇ κ ϕ Xk (τ ) = exp Y k θ T Ak 0 0
τ ∈ R.
(43)
(44)
Our estimator of ϕ Xk depends on the direction of integration θ. We suggest choose θ = Ak (our preferred choice). More generally, it is possible to average various alternative estimators over a distribution W of θ ’s: τ u vθ T dW (θ) dvdu . (45) Q− vech ∇∇ κ ϕ Xk (τ ) = exp Y k θ T Ak 0 0 In the Monte-Carlo section, we will show the performance of an estimator based on W = 2 for some σ. δ , where the θ ’s are drawn from N A , σ I θ j k L j j
4.2. Density functions The p.d.f. of Xk is obtained from its characteristic function by the inverse Fourier transformation (20). However, it is well-known that the integral in (20) does not converge when the characteristic function is replaced by its empirical analogue, using ϕ Xk instead of ϕ Xk (e.g. Horowitz, 1998, p. 104). © 2009 The Review of Economic Studies Limited
502
REVIEW OF ECONOMIC STUDIES
To ensure convergence we truncate the integral on a compact interval [−TN , TN ], where TN tends to infinity with the sample size N at a rate that will be discussed in the next section. The p.d.f. of Xk is then estimated as τ 1 ϕH e−iτ x ϕ Xk (τ )dτ , (46) fXk (x) = 2π TN where ϕ Xk (τ ) is given by (45). In equation (46), ϕ H is a function supported on [−1, 1] which is the Fourier transform of a kernel H of even order: ϕ H (u) = eiuv H (v)dv.9 The kernel H allows smoothing the estimation of the density, especially in the tails. We shall use the second-order kernel 15 144 sin(x) 5 48 cos(x) 1− 2 − 2− 2 , (47) H2 (v) = πx 4 x π x5 x which corresponds to: 3 ϕ H2 (u) = 1 − u2 · 1{u ∈ [−1, 1]}.
(48)
The second-order kernel H2 has often been used in the deconvolution literature (see, e.g. Delaigle and Gijbels, 2002, and references therein). Higher-order kernels may also be used in place of H2 , such as the infinite-order kernel H∞ (v) = sin(v)/v used in Li and Vuong (1998), which yields ϕ H∞ (u) = 1{u ∈ [−1, 1]}. Higher-order kernels reduce the bias of the density estimate at the cost of higher variance.
Numerical issues. Computing the density estimator fXk in practice requires integrating three times: twice to recover the c.g.f. of Xk , and once to perform the inverse Fourier transformation. It is well-known that usual fast integration techniques such as Romberg’s method may give very misleading results when computing the inverse Fourier transform, because the function to integrate is strongly oscillating (Delaigle and Gijbels, 2007). For this reason, we use as integration method the slow but reliable trapezoid rule, with a large number of nods. In our experiments, using 200 equidistant nods (over the interval [−TN , TN ], where the choice of TN will be discussed below) gave very good approximations. In addition, our estimator of the characteristic function of Xk given by (44) or (45) does not guarantee that ϕ Xk is less than 1, as a proper c.f. should be. In practice, we suggest setting ϕ Xk (τ ) > 1.1. ϕ Xk (τ ) = 0 whenever 5. ASYMPTOTIC THEORY In this section, we study the asymptotic properties of the estimator and show that fXk given by (46) is a uniformly consistent estimator of fXk , for all k = 1, . . . , K. All mathematical proofs are in the Appendix. The study of the asymptotic properties of our estimator is in two steps. First, we characterize the properties of the estimator of factor characteristic functions ϕ Xk (τ ). Then, we study their inverse Fourier transforms, that is, the density estimators fXk . We characterize the rates of uniform convergence of the estimators. Consistent with the literature on non-parametric deconvolution, the rates are obtained by bounding the uniform 9. A kernel of order q is a function H , not necessarily non-negative, such that v k H (v) is integrable for all k ≤ q, v k H (v)dv = 0 for all k ≤ q − 1, and v q H (v)dv = 0. See, e.g. Rao (1983, p. 40). © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
503
distance between fXk and fXk (resp. ϕ Xk and ϕ Xk ). Upper bounds are useful for providing sufficient conditions for consistency and for understanding the properties of factor distributions which improve convergence. However, we do not prove that these upper bounds are sharp, i.e. that our convergence rates are optimal. Deriving optimal rates requires computing lower bounds on the uniform distance between fXk and fXk . Fan (1991) provides optimal rates of convergence in the classical deconvolution problem.10 Finding optimal convergence rates for the multi-factor model we consider in this paper is a difficult task, which we do not address.
5.1. Characteristic functions T
T
The c.f. estimator ϕ Xk involves means of functions of measurements of the form eit Y , Y eit Y , itT Y or Y Ym e . Because ϕ Xk (τ ) is then obtained by integration over t, a uniform consisT T T tency result is needed for EN eit Y , EN Y eit Y , and EN Y Ym eit Y . The next lemma extends T T Horowitz and Markatou’s Lemma 1, which deals with EN eit Y , to empirical means of Y eit Y T or Y Ym eit Y . Li and Vuong (1998)11 assume that factor distributions have bounded support. This may be quite restrictive in some cases, as in the case of earnings data which display particularly wide ranges of values. Here we relax this assumption and allow for unbounded support. For any vector t = (t1 , . . . , tL )T ∈ RL (L ≥ 1), we use the sup norm: |t| = max |t |.
Lemma 1. Let X be a scalar random variable and let Y be a vector of L random variables. Define Z = (X, YT )T . Let F denote the c.d.f. of Z (E denotes the corresponding expectation operator) and let FN (resp. EN ) denote the empirical c.d.f. (resp. mean) corresponding to a sample ZN ≡ (Z1 , . . . , ZN ) of N i.i.d. draws from F . Assume that the moment generating T functions of X2 and |XY| exist in some neighbourhood of 0. Define ft (x, y) = xeit y , for t ∈ RL . Let TN and εN be chosen such that δ
TN = BN 2 , B, δ > 0, ln N εN = √ . N
(49)
√ Then, there exists C > 8 2 + L(1 + δ) such that sup |EN ft − Eft | < CεN
|t|≤TN
(50)
almost surely when N tends to infinity. 1 bounds the estimation error on Eft on the compact interval [−TN , TN ] by Lemma ln N O √N , provided that TN does not grow faster than some power of N . The proof requires
that the moment generating functions (m.g.f.) of X 2 and |XY| exist in some neighbourhood of the origin. This implies in particular that every moment of X 2 and |XY| is finite. This is
10. Delaigle, Hall and Meister (2008) also provide optimal rates in the measurement error model (14). 11. Horowitz and Markatou (1996) do not assume support boundedness. However, as pointed out by Hu and Ridder (2008), their reference (p. 164) to theorem 2.37 of Pollard (1984) is inexact, and support boundedness is implicitly needed. © 2009 The Review of Economic Studies Limited
504
REVIEW OF ECONOMIC STUDIES
a restrictive assumption, which could be relaxed by assuming the existence of the first J ≥ 2 moments of |X| and |XY|, at the cost of a lower rate of convergence in Lemma 1.12 We remark also that existence of the moment generating function is less restrictive than assuming that factors have bounded support. Under this assumption, Li and Vuong (1998)
N 1 2 , as support boundedness allows the use of the law of obtain a better bound O ln ln N 13 iterated logarithm as in Cs¨org˝
o (1981, Theorem
1). itT Y itT Y and EN Y Ym e , for every , m = 1, . . . , L, if the Lemma 1 applies to EN Y e following assumption holds.
Assumption A4. All variables Y2 , Y Ym , and Y Ym Yj for , m, j in {1, . . . , L}, have their m.g.f.’s existing in some neighbourhood around 0. Assuming that this assumption holds, the following uniform consistency result for the characteristic functions of factors follows.
Theorem 1. Suppose that there exists an integrable, decreasing function g : R+ → [0, 1], such that |ϕ Y (t)| ≥ g(|t|) for |t| large enough. Then there exists some positive constant C such that: −3 θ ϕ Xk (τ ) − ϕ Xk (τ ) ≤ C g TN T dW (θ ) TN2 εN a.s. (51) sup θ Ak |τ |≤TN where εN and TN are as in Lemma 1, and with the additional restriction that the right-hand side in (51) is o(1) for consistency. Theorem 1 provides a heuristic argument for our preferred choice for the directions of integration θ = Ak . As norms are equivalent in finite dimensional spaces and as g is decreasing, −3 one can replace the sup norm of θ TθA by its Euclidian norm in g TN θ TθA . This quantity k
k
is minimum for θ = Ak , which minimizes the Euclidian norm of θ TθA .14 So, the fastest rate is k attained when W assigns all mass to θ = Ak . In Section 6, we will also provide finite sample evidence that supports this choice of direction of integration.15 In the rest of this section, for expositional simplicity, we consider the special case where W assigns all mass to one single θ, and redefine g such that (51) becomes sup ϕ Xk (τ ) − ϕ Xk (τ ) ≤ C
|τ |≤TN
with the additional restriction that
TN2 ε N g(TN )3
TN2 εN g(TN )3
(52)
a.s.
is o(1) for consistency.
12. When only the existence of the first J ≥ 2 moments of |X| and |XY| is assumed, one can show that the − 1 1− 1 +γ J estimation error on Eft on the interval [−TN , TN ] is bounded by O N 2 , for any γ > 0. 13. Li and Vuong’s argument (1998, p. 146), that boundedness is not a strong assumption as it can be achieved by transforming Y suitably and assuming that the transformed Y follows the linear factor model, is a bit contrived. Economic theory usually strongly conditions the form of the model and statistical theory has little say in that construction process. 14. Indeed, Cauchy–Schwarz inequality implies that (θ T Ak )2 ≤ (θ T θ )(ATk Ak ), so
θTθ (θ T Ak )2
≥
AT Ak k . (ATk Ak )2
15. Note that, though intuitive, our choice of direction of integration relies on a heuristic argument, as the constant C in (51) may depend on the direction of integration. Moreover, a drawback of our preferred direction is that it is not invariant to a linear transformation of the model (as BY = BAX) unless the transformation is a rotation (B orthogonal). Hence, our preferred direction of integration depends on the model representation, which is arbitrary. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
505
Theorem 1 shows that the rate of uniform convergence for ϕ Xk depends on the tail of the characteristic function of the vector of measurements, as characterized by function g, which measures the smoothness of the distribution of Y.16 Furthermore, the estimation error increases with TN because of the integration step. More estimation errors add up if one has to integrate an estimate of κ Xk over a wider range. Hence, the rate of convergence decreases when TN grows, both because of a wider range of integration and of division by small values of the characteristic function. Lastly, it is instructive to compare the convergence rate of Theorem 1 to the convergence rates obtained in simpler models by other authors. The general form of our estimator proceeds from the second-order differentiation of κ Y (t) and a subsequent double integration.17 If no differentiation were required (as in Horowitz and Markatou, 1996; or Delaigle et al., 2008) or if first-order differentiation sufficed (as in Li and Vuong, 1998), then both g(TN ) and TN would be raised to a smaller power. This is why these authors obtain faster convergence rates of the estimators of factor characteristic functions. Note, however, that previous work on deconvolution focuses on much simpler models than the general multi-factor models that we consider. In the present case, differentiation and integration are likely to be necessary steps in the construction of the estimator.
Special cases. In a seminal contribution to the standard deconvolution theory with only one unknown factor distribution, Fan (1991) distinguished two classes of distributions: ordinary smooth, for which the c.f. converges to zero at a polynomial rate (e.g. Laplace or Gamma), and supersmooth distributions, for which the c.f. converges to zero at an exponential rate (e.g. normal). Here we illustrate the previous results in the case of ordinary smooth and supersmooth factor distributions. Let us first consider the case where all factor distributions are ordinary smooth: g(t) = t −β , t > 0, β > 1.
(53)
√ δ Taking TN = N 2 and ε N = ln N/ N , as in Lemma 1, we obtain TN2 ε N = g(TN )3
ln N N
1− 2
, 1+ 23 β δ
(54)
1 , the estimator of factor c.f. converges uniformly on [−TN , TN ], the and for any δ < 2+3β estimation error being bounded by (54). Hence, a smoother distribution of Y (a larger value of β) requires more trimming (a lower δ). Let us then consider the case where factor distributions are supersmooth, so:
g(t) = t −β 0 e−βt
1/β 1
, t > 0, β, β 1 > 0.
(55)
T2ε
Then, because of the restriction that g(TN N)3 = o(1) in Theorem 1, TN cannot increase at N polynomial rate any longer. One has to restrict TN to a logarithmic function of N in order to 16. Technically, the estimation error on ϕ Xk decreases with g(TN ) because the estimator is based on the differentiation of κ Y (t) = ln ϕ Y (t). A term 1/ |ϕ Y (t)| thus appears, which is bounded by 1/g(|t|), hence by 1/g(TN ) if |t| ≤ TN and |t| large enough. 17. Hence, TN is squared in (52) because of the double integration, and g(TN ) is cubed because of the secondorder differentiation (+2) of the logarithm (+1) of the characteristic function of Y. © 2009 The Review of Economic Studies Limited
506
REVIEW OF ECONOMIC STUDIES
ensure uniform convergence of ϕ Xk on [−TN , TN ]. For example, taking TN = (δ ln N )β 1 , the rate becomes: 1+(2+3β 0 )β 1 TN2 εN (2+3β 0 )β 1 (ln N ) = δ , 1 gY (TN )3 N 2 −3βδ 1 6β .
and the estimator is consistent if δ < more trimming (lower δ).
(56)
Again, a smoother distribution (β larger) requires
5.2. Density functions The following theorem gives conditions under which fXk converges uniformly to fXk when the sample size tends to infinity.
Theorem 2. Suppose g : R+ → [0, 1] "K that there exists an integrable, decreasingTfunction ! g (|τ |) for τ = (τ 1 , . . . , τ K ) and for |τ | = max (|τ k |) such that |ϕ X (τ )| = k=1 ϕ Xk (τ k ) ≥ ! large enough. Suppose also that there exist K integrable functions hk : R+ → [0, 1] such that hk (|τ |) ≥ ϕ Xk (τ ) for all τ . Lastly, let H be a kernel of even order q ≥ 2 with Fourier transform satisfying ϕ H (t) = 0 for |t| > 1. Then there exists two positive constants C1 and C2 such that: sup fXk (x) − fXk (x) x
≤ C1
TN3 C2 εN + q 3 g(TN ) TN
TN
−TN
v q hk (|v|)dv + 2
+∞
hk (v)dv,
a.s.
(57)
TN
where g(t) = ! g (L |A| t), |A| = maxi,j (|aij |), and εN and TN are as in Lemma 1, with the additional restriction that
TN3 εN g(TN )3
= o(1) for consistency.
The estimation error on fXk , in sup norm, has two components. The first component directly results from the estimation error on the characteristic function ϕ Xk .18 The second component results from the fact that the inverse Fourier transform yielding fXk depends on the smoothing kernel ϕ H . For example, with ϕ H (u) = ϕ H∞ (u) = 1{u ∈ [−1, 1]}, then 1 fXk (x) = (58) e−ivx ϕ Xk (v)dv, 2π TN 1 e−ivx ϕ Xk (v)dv, (59) fXk (x) = 2π −TN and thus 1 fXk (x) − fXk (x) = 2π
TN
−TN
1 ϕ Xk (v) − ϕ Xk (v) dv − e−ivx π
∞
TN
e−ivx Re ϕ Xk (v) dv. (60)
∞
One thus has to ensure that TN e−ivx Re ϕ Xk (v) dv tends to zero when N tends to infinity. Interestingly, this term is decreasing in TN , and it is smaller the thinner the tails of ϕ Xk . 18. In particular, TN is cubed in (57) instead of squared because of the integral in the inverse Fourier transform. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
507
The presence of the second component on the right-hand side of (57) has two consequences. First, although we emphasized in the previous section that more trimming (a lower value of TN ) is necessary to ensure a faster convergence of the estimator of ϕ Xk on [−TN , TN ], less trimming (a larger TN ) is required to reduce the second component of (57). This tension between two conflicting objectives arises in standard non-parametric deconvolution, and explains why asymptotic convergence rates are often slow. Second, although the presence of smoother factor distributions makes deconvolution more difficult, a smoother distribution of Xk also makes the second component in (57) smaller, improving the convergence rate of the density estimator.19
Special cases. We illustrate the results in the case of ordinary smooth and supersmooth factor distributions. To minimize the complexity of the discussion, we focus on the kernel ϕ H (u) = ϕ H∞ (u) = 1{u ∈ [−1, 1]}. Remark that ϕ H∞ is an infinite-order kernel (q = ∞), so T that the term T1q −TNN v q hk (|v|)dv is zero in (57). N Let us start with ordinary smooth factors. As Xk is ordinary smooth, one can choose: hk (t) = t −α , α > 1.
(61)
δ
Let TN = N 2 , δ > 0. Then,
+∞
TN
δ
N − 2 (α−1) hk (v)dv = . α−1
(62)
So, using (54), we have (C being a positive constant): ln N 1 sup fXk (x) − fXk (x) ≤ C + δ 1 3 x N 2 (α−1) N 2 − 2 (1+β)δ
(63)
a.s.
We remark that the first term on the right-hand side of (63) increases when δ increases, as a larger TN reduces the convergence rate of the characteristic function. However, the second term on the ∞right-hand side of (63) decreases when δ increases, as less trimming decreases the quantity | TN e−ivx Re [ϕ Xk (v)]dv|. Hence, intuitively, there should exist an optimal degree of trimming. We remark also that the convergence rate of fXk is polynomial in N . For example, by choosing δ arbitrarily close to 1/(2 + 3β + α), we see that the estimation error on fXk satisfies: sup fXk (x) − fXk (x) ≤ C x
ln N α−1
(64)
a.s.
N 4+6β+2α
So the rate is faster when α increases (smoother Xk ), as long as β (the smoothness of Y) stays constant. Moreover, with β ≥ α, the right-hand side of (64) is smallest when α = β → +∞, ln N equal to O 1 . In comparison, Fan (1991, p. 1265) and Li and Vuong (1998) obtain faster N8
rates of convergence.20 Recall that the rates that we obtain are not sharp, so that the speed 19. This also happens in the standard deconvolution problem: Y = X + U , U known, where the performance of the estimator of the density of X improves when the smoothness of U decreases, and when the smoothness of X increases. 1
1
20. Fan (1991) and Li and Vuong (1998) bound the estimation error by N − 2 and N − 6 , respectively. Li and Vuong’s estimator requires one first-order differentiation and one integration. This case yields T3 N ε N g(TN )3
in (57). This explains the difference between
1 N− 6
and
T2 N ε N g(TN )2
instead of
1 N− 8 .
© 2009 The Review of Economic Studies Limited
508
REVIEW OF ECONOMIC STUDIES
of convergence of our estimator could actually be faster than the rate of convergence to zero 1 of ln N/N 8 . Nevertheless, this result suggests that the asymptotic rate of convergence of our estimator is slow, relative to other estimators used in the deconvolution literature. Slower rates are the price to pay for allowing for many unobservable components. Interestingly, when factor Xk is ordinary smooth but Y is supersmooth, we obtain much slower rates of convergence. This case may arise if one of the factor variables (different from Xk ) has a supersmooth distribution.21 Let TN = (δ ln N )β 1 . Then: +∞ (δ ln N )−β 1 (α−1) hk (v)dv = , (65) α−1 TN which yields, using (56): # sup fXk (x) − fXk (x) ≤ C
(ln N )1+3(1+β 0 )β 1
x
1
N 2 −3βδ
+
1 (δ ln N )β 1 (α−1)
$ a.s.
(66)
This rate is logarithmic in N , because of the presence of the second term on the right-hand side of (66). Logarithmic rates of convergence are also obtained in the classical deconvolution problem with one unknown factor, when the distribution of the factor of interest is ordinary smooth while the distribution of the error is supersmooth (Carroll and Hall, 1988; Fan, 1991). Before ending this discussion, note that additional convergence rates can be derived in the cases where Xk follows a supersmooth distribution, although the expressions are more involved.
5.3. Practical choice of the trimming parameter TN We use a method recently developed in deconvolution kernel density estimation to choose the trimming parameter TN . In the context of the deconvolution problem with known error distribution, Delaigle and Gijbels (2002, 2004) propose to base the choice of the bandwidth on an approximation of the mean integrated squared error (MISE) of the kernel density estimator. Comparing different approaches, they find that a “plug-in” method works well in many simulation designs. We adapt Delaigle and Gijbels’ method to the case of a multi-factor model Y = AX as follows. For k ∈ {1, . . . , K}, let θ be a direction of integration. Then
θ T Am θ TY θ T AX = = X + Xm . k θ T Ak θ T Ak θ T Ak m=k
(67)
θ T Am We treat the distribution of m=k θ T Ak Xm in (67) as if it were known. In this case, the problem of estimating the density of factor Xk boils down to a deconvolution problem with known error distribution, and the method of Delaigle and Gijbels (2004) can be applied. A detailed presentation of the “plug-in” method and of our bandwidth selection procedure is given in Appendix D. 21. For example, in the simple deconvolution model Y = X + U with X ordinary smooth and U supersmooth, ϕ Y (t) = ϕ X (t)ϕ U (t) has thin tails because of the presence of ϕ U , hence Y is supersmooth. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
509
6. MONTE CARLO SIMULATIONS In this section, we study the finite-sample behaviour of our density estimator. GAUSS codes implementing the density estimators are available online, together with the dataset used in the next section.22
6.1. Measurement error model We start with the estimation of the density of X in the measurement error model (14) with α = 1. Namely: Y1 = X + U1 (68) Y2 = X + U2 , where U1 , U2 and X are mutually independent, and have mean zero and variance 1. We first consider the case of normal errors U1 and U2 , and various choices of distribution for X. In Figure 1 we report the outcomes of 100 simulations of samples of size N = 1000. In the first column we estimate the density of X using our method, assuming that all three distributions are unknown, and with our preferred choice of direction of integration 12 , 12 . In the second column we estimate the density of X from: U1 + U2 Y1 + Y2 =X+ , (69) 2 2 2 assuming that U1 +U has known c.f. ϕ U1 +U2 (u) = exp − 14 u2 . We use kernel deconvolution for 2 2 estimation, with the second-order kernel H2 for smoothing, and choose the trimming parameter TN using the “plug-in” method of Delaigle and Gijbels (2004). Lastly, in the third column we show the Gaussian kernel density estimator of X for comparison, using Silverman’s rule of thumb for choosing the bandwidth. Obviously, this last estimator cannot be computed with real data as X is unobserved. On each graph, the thin solid line represents the population density of X, and the thick solid line is the pointwise median of simulations. The dashed lines delimit the 10–90% pointwise confidence bands. Both non-parametric deconvolution methods estimate normal factor distributions well. However, the density at the mode is significantly biased–the true value being systematically outside the confidence band. They both display very similar biases, and the same confidence bands, only moderately wider than when X is observed without error. This suggests that repeated measurements can be very effective at providing information on the distributions of unknown latent variables. Also, the informal choice of bandwidth that we use appears to give very good results, as good as for the deconvolution problem with known error distribution for which it was initially designed. For non-Gaussian factor distributions, we observe that the deconvolution estimators have some difficulty to capture skewness and kurtosis. The Gamma (5, 1) and Gamma (2, 1) distributions have skewness 0.9 and 1.4, and kurtosis 4.2 and 6, respectively. We see that the bias is larger in the second case. To further study the impact of factor kurtosis on estimation, we consider for X a two-components normal mixture that has excess kurtosis equal to 100, 1 3 406 that is: X ∼ 400 403 N(0, 2 ) + 403 N(0, 6 ). The bias is also larger than in the case where X is normal, although the estimator does a good job at capturing the peak of the density. Lastly, we 22. Available at: http://www.cemfi.es/∼bonhomme/ © 2009 The Review of Economic Studies Limited
510
REVIEW OF ECONOMIC STUDIES Normal
0.6
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−3
−1
1
3
5
−3
−1
1
3
5
Gamma(5,1)
0.0 −5
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
−3
−1
1
3
5
0.0 −5
−3
−1
1
−3
−1
1
3
5
−3
−1
1
3
5
−3
−1
1
3
5
−3
−1
1
3
5
−1
1
3
5
0.6
3
5
0.0 −5
Gamma(2,1) 0.6
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−3
−1
1
3
5
−3
−1
1
3
5
0.0 −5
Normal mixture, unimodal 0.7
0.7
0.7
0.6
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−3
−1
1
3
5
−3
−1
1
3
5
0.0 −5
Normal mixture, bimodal 0.6
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−3
−1
1
3
jU1, jU2 unknown
5
−3
−1
1
jU1, jU2 known
3
5
0.0 −5
−3
X observed
Figure 1 Monte Carlo estimates of fX in the measurement error model with normal errors Note: Density of X in model (68). Thin line = true; thick = median of 100 simulations; dashed = 10–90% confidence 1 3 406 1 bands. “Normal mixture, unimodal” is 400 403 N(0, 2 ) + 403 N(0, 6 ), “Normal mixture, bimodal” is 2 N(−2, 1) + 1 2 N(2, 1). N = 1000. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
511
TABLE 1 MISE of various density estimators in the measurement error model fX
fU1
Standard normal errors (1) Normal Laplace Gamma (2,1) Gamma (5,1) Log-normal Normal mixture, unimodal Normal mixture, bimodal
Normal Laplace Gamma (2,1) Gamma (5,1) Log-normal Normal mixture, unimodal Normal mixture, bimodal
0.0011 0.0030 0.0036 0.0012 0.099 0.0076 0.0024
(2)
(3)
0.0052 0.0050 0.025 0.026 0.026 0.024 0.011 0.010 0.29 0.24 0.020 0.022 0.041 0.049 Standard Laplace errors
(4) 0.0069 0.0045 0.0044 0.0062 0.0032 0.0040 0.0097
(1)
(2)
(3)
(4)
0.0012 0.0027 0.0030 0.0012 0.061 0.0074 0.0027
0.0047 0.018 0.018 0.0069 0.22 0.016 0.029
0.0042 0.020 0.018 0.0070 0.16 0.016 0.038
0.034 0.020 0.019 0.027 0.011 0.021 0.035
Note: See the note to Figure 1. (1) refers to the case where X is observed, (2) to the deconvolution estimator with known error distributions, and (3) to our generalized deconvolution estimator of fX . (4) refers to our estimator of fU1 . N = 1000, 100 simulations.
generate a bimodal distribution as a two-component mixture of normals with different means: X ∼ 12 N(−2, 1) + 12 N(2, 1). The estimator fails to capture the bimodality.23 In Table 1 we report the MISE of various estimators fX of fX , given by: MI SE = E
2 fX (x) − fX (x) dx .
(70)
In addition to the case of standard normal errors, we also report the results for Laplacedistributed errors. X follows one of the five distributions of Figure 1, and may also be Laplace or log-normally distributed. The estimators we consider are: a kernel density estimate of the density when the factor is observed (column 1), the deconvolution estimator with known error distributions (column 2), and our generalized deconvolution estimator (column 3). Our estimator of fU1 is reported in column (4). Table 1 confirms that the performance of our estimator is comparable to that of the deconvolution estimator with known error distributions. When errors are distributed as Laplace random variables, theory suggests that the deconvolution problem should be less difficult, and the MISE is indeed slightly lower than in the case of normal errors. Still, the differences between the cases of ordinary smooth and supersmooth errors do not seem very large. 23. It is worth noting that in these various designs, we experimented increasing the sample size to N = 10,000, and still obtained a sizeable bias (although reduced compared to the case N = 1000). © 2009 The Review of Economic Studies Limited
512
REVIEW OF ECONOMIC STUDIES TABLE 2 Comparison with other estimators in the measurement error model MISE of fX (1) Normal Laplace Gamma (2,1) Gamma (5,1) Log-normal Normal mixture, unimodal Normal mixture, bimodal
Normal Laplace Gamma (2,1) Gamma (5,1) Log-normal Normal mixture, unimodal Normal mixture, bimodal
0.0090 0.035 0.031 0.016 0.24 0.030 0.062
(2) 0.0079 0.037 0.035 0.013 0.23 0.024 0.063 MISE of
(3) 0.0049 0.026 0.024 0.011 0.35 0.019 0.048 fU
(4) 0.0056 0.026 0.028 0.0087 0.25 0.020 0.051
1
(1)
(2)
(3)
0.0062 0.0042 0.0043 0.0056 0.0039 0.0048 0.0084
0.0096 0.0056 0.0051 0.0069 0.0027 0.0058 0.017
0.0066 0.0046 0.0046 0.0054 0.0026 0.0047 0.0099
(4) 0.0072 0.0051 0.0057 0.0064 0.0041 0.0055 0.0084
Note: See the note to Figure 1. (1) refers to the estimator in Horowitz and Markatou (1996), (2) to the estimator of Li and Vuong (1998), (3) to our generalized deconvolution estimator, and (4) to a modification of our estimator which averages ten different directions of integration, see the text. Errors are standard normal. N = 1000, 100 simulations.
6.2. Comparison with other estimators Here we still consider the measurement error model (68), using the same setup as above, and compare our estimator to the ones of Horowitz and Markatou (1996, HM hereafter) and Li and Vuong (1998, LV hereafter). For all estimators we use the smoothing kernel H2 and select the trimming parameter TN using the “plug-in” method of Delaigle and Gijbels (2004). The first two columns of Table 2 display the MISE of the HM and LV estimators. Column (3) gives the MISE of our estimator using our preferred direction of integration 12 , 12 (LV use (0, 1)). Lastly, column (4) shows the MISE of an estimator that averages over the directions of integration, see (45), using ten draws of a bivariate normal distribution with mean 12 , 12 and standard deviation 0.10. We show the MISE of the estimators of the density of X (the factor) and of that of U1 (the first error). Our estimator performs very well compared to HM and LV. The MISE of fX is consistently lower for our estimator, except when X is lognormally distributed, in which case all estimators do badly. For the error U1 , the MISE of our estimator is comparable to that of HM, and generally lower than that of LV. Also, the estimator that averages over various directions of integration performs similar to or slightly worse than the one using our preferred direction of integration.24 In Section 5 we argued that choosing the direction of integration that minimizes the norm of θ TθA yields faster convergence rates. We now provide some direct Monte Carlo evidence in k support of this claim. 24. We also tried to average directions of integration around the origin, and obtained much larger MISE. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
(a) 0
(b) 0
−1
−1
−2
−2
−3
0
1
2
3
−3
0
1 2 2
(c) 0
(d) 0
−1
−1
−2
−2
0
1
2
3
Re kX t ,q = 0, 1
3
Re kY t , t
Re kY 0, t
−3
2
513
4
5
−3
0
1
2 3 Re kX t ,q = 1 , 1
4
5
2 2
Figure 2 Monte Carlo simulations for the estimated characteristic functions in the measurement error model Note: Estimates of κ Y and κ X is the measurement error model (68). τ 2 is plotted on the x-axis, c.g.f.’s on the y-axis. Thin line = true; thick = median of 100 simulations; dashed = 10–90% confidence bands. N = 1000.
Figure 2 presents Monte Carlo simulations for estimates of the c.g.f.’s of Y and X in the measurement error model (14). The setup is as before (100 simulations). In panels (a) and (b), we plot empirical estimates of Re κ Y (0, τ ) and Re κ Y ( τ2 , τ2 ), for τ ∈ R+ . We set the scale on 2 the x-axis equal to τ 2 . The c.f. of the standard normal distribution being e−t /2 , the true value of the c.g.f. is then a straight line with slope −1 in panel (a), and −3/4 in panel (b). The c.g.f. of Y is well estimated over a wide range but the precision is worse at higher frequencies (see also Diggle and Hall, 1993). This explains why Re κ Y ( τ2 , τ2 ) is better estimated, for a given τ , than Re κ Y (0, τ ). In panel (c), we report estimates of the c.g.f. of X obtained by Li and Vuong’s method; that is, integrating along the direction (0, 1). Panel (d) shows the results of our preferred method, integrating along the direction 12 , 12 . The c.g.f. of X is better estimated (more precisely and with less bias) by the second method.
6.3. Spatial model We then consider the spatial model with L = 3 and K = 6: ⎧ ⎨Y1 = 2X1 + X2 + X3 + U1 Y2 = X1 + 2X2 + X3 + U2 ⎩ Y3 = X1 + X2 + 2X3 + U3 ,
(71)
© 2009 The Review of Economic Studies Limited
514
REVIEW OF ECONOMIC STUDIES
where Xk , k = 1, . . . , 3, and Uk , k = 1, . . . , 3, are mutually independent. This corresponds to model (16) with ρ = 1/2. All factor densities belong to the same parametric family. We only let their variances differ: the variances of X1 , X2 and X3 are equal to 1, while U1 , U2 and U3 have either variance 1 (first column in the figure), 4 (second column), or 16 (third column). The sample size is N = 1000, and the number of simulations and the conventions used in graphical display are the same as for the measurement error model. Figure 3 presents the results. We see that when errors U1 , U2 and U3 have moderate variance (1 or 4) the density of X1 is well estimated. The results are comparable to the ones obtained in Figure 1, with a slightly larger bias. When error variances increase to 16, the density of X1 becomes badly estimated. For other distributions, namely Gamma or mixture of normals, we generally obtain worse results than for the measurement error model (68). This is confirmed by Table 3, which shows the MISE of the density estimates of X1 and U1 . We remark that, when σ 2 increases, the performance of fX1 worsens while that of fU1 improves. Estimating six factor densities using three measurements is of course more difficult than estimating three factor densities using two measurements. Yet, in the case of moderate error variances, the shapes of the densities are reasonably well reproduced. This suggests that nonparametric deconvolution techniques can be successfully applied to difficult problems, where the number of factors one is trying to extract is large relative to the number of available measurements.
7. APPLICATION TO EARNINGS DYNAMICS In this section, we apply our methodology to estimate the distributions of permanent and transitory shocks in a simple model of earnings dynamics.
7.1. The data We use PSID data, between 1978 and 1987. Let ynt denote the logarithm of annual earnings, and let xnt be a vector of regressors, namely: education dummies, a quadratic polynomial in age, a race dummy, geographic indicators, and year dummies. We compute the residuals of the ordinary least squares (OLS) regression of ynt = ynt − ynt−1 on xnt = xnt − xnt−1 , and denote them as wnt . In the sequel we shall refer to wnt as wage growth residuals, while keeping in mind that they reflect changes in wage rates and hours worked. We select employed male workers who have non-missing observations of wnt for the whole period, and for whom wage growth does not exceed 150% in absolute value. We obtain a balanced panel of 624 individuals, with nine observations of wage growth per individual. Descriptive statistics are presented in the first column of Table 4. Wage growth residuals wnt are the measurements that we use in this application. We shall also consider moving sums of wage growth residuals, defined as s wnt = wnt − wn,t−s =
s
wn,t−k+1 , for s = 1, 2, . . .
(72)
k=1
Table 5 shows the marginal moments of these variables, as well as their first three autocorrelation coefficients. Focusing on the first row, we see that the variance of s wnt increases with s. This indicates that wage differences between two points in time are more dispersed the longer the lag. Another feature of Table 5 is the high kurtosis of wage growth residuals. This evidence of non-normality is consistent with previous findings on US data. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC Normal
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
−3
−1
1
3
0.0 5 −5
−3
−1
0.6
1
3
5
0.0 −5
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−1
1
3
5
−3
−1
1
3
5
−3
−1
1
3
5
Gamma(5,1)
0.6
−3
515
0.6
−3
−1
1
3
5
0.0 −5
Gamma(2,1) 0.6
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 −5
−3
−3
−1
−1
1
1
3
5
5
0.0 −5
−3
−1
1
3
5
3
Normal mixture, unimodal 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 5 −5 −3 −1 1 3 5
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 −5
−3
−1
1
3
5
−3
−1
1 = 16
3
5
−3
−1
1
3
Normal mixture, bimodal
0.6
0.6
0.5
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
0.0 −5
0.0 −5
−3
−1 2
1 =1
3
5
0.6
−3
−1 2
1 =4
3
5
0.0 −5
2
Figure 3 Monte Carlo estimates of fX1 in model (16) Note: Density of X1 in model (71). Xk , k = 1, 2, 3, are drawn from the same distribution with mean zero and variance 1. Uk , k = 1, 2, 3, are drawn from the same distribution as X1 , X2 , X3 with mean zero and variance σ 2 . “Normal 1 3 406 1 1 mixture, unimodal” is 400 403 N(0, 2 ) + 403 N(0, 6 ), “Normal mixture, bimodal” is 2 N(−2, 1) + 2 N(2, 1). N = 1000, 100 simulations. © 2009 The Review of Economic Studies Limited
516
REVIEW OF ECONOMIC STUDIES TABLE 3 MISE of the generalized deconvolution estimator in the spatial model MISE of fX1 Normal Laplace Gamma(2,1) Gamma(5,1) Log-normal Normal mixture, unimodal Normal mixture, bimodal
σ2 = 1 0.013 0.039 0.036 0.018 0.35 0.030 0.074
σ2 = 4 0.022 0.059 0.053 0.028 0.30 0.044 0.077
MISE of fU1 σ 2 = 16 0.047 0.098 0.092 0.058 0.40 0.10 0.098
σ2 = 1 0.040 0.081 0.072 0.050 0.32 0.078 0.098
σ2 = 4 0.0096 0.038 0.039 0.019 0.15 0.028 0.068
σ 2 = 16 0.0049 0.028 0.027 0.0094 0.12 0.022 0.055
Note: All factors follow the same distribution up to scale: X1 to X3 have unitary variance, while U1 to U3 have variance σ 2 . N = 1000, 100 simulations. TABLE 4 Means of variables Job changes Annual earnings (/1000) Age High-school dropout High-school graduate Hours Married White North east North central South SMSA Number
All
None
One/two
36.4 37.4 0.21 0.54 2194 0.85 0.70 0.15 0.26 0.43 0.59 624
35.3 39.6 0.22 0.59 2191 0.84 0.63 0.15 0.30 0.44 0.60 150
36.7 37.3 0.23 0.51 2199 0.84 0.69 0.13 0.24 0.51 0.55 234
Three/more 37.0 36.1 0.16 0.54 2191 0.85 0.75 0.17 0.27 0.36 0.61 240
Note: Balanced subsample of 624 individuals extracted from the PSID, 1978–1987. “None” = no job change; “One/two” = one or two job changes; “Three/more” = three or more job changes.
TABLE 5 Moments of wage growth residuals Wage growth
t/t + 1
t/t + 2
t/t + 3
t/t + T
Variance Skewness Kurtosis Autocorrelation 1 Autocorrelation 2 Autocorrelation 3
0.055 –0.077 10.3 –0.33 –0.06 –0.02
0.073 0.062 11.2 0.21 –0.34 –0.06
0.086 –0.073 8.0 0.35 0.08 –0.34
0.137 0.457 4.8 – – –
Note: Balanced subsample of 624 individuals extracted from the PSID, 1978–1987. Wage growth residuals are the OLS residuals of first-differenced log earnings on regressors. Wage growth between t and t + s is obtained as the sum of s consecutive wage growth residuals.
© 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
517
7.2. Model and estimation We consider the model outlined in Section 2: P T wnt = ynt + ynt ,
= ε nt + ηnt − ηn,t−1 ,
i = 1, . . . , N,
t = 2, . . . , T ,
(73)
P P P follows a random walk: ynt = ynt−1 + ε nt , where εnt and ηnt are white noise where ynt innovations with variances σ 2ε and σ 2η . As ηn1 and ε n2 are not separately identified, we normalize P as the permanent component ηn1 to zero. Likewise, we set ηnT to zero. We shall refer to ynt and to ηnt as the transitory component. Permanent-transitory decompositions are very popular in the earnings dynamics literature, see among others Hall and Mishkin (1982) and Abowd andCard (1989). There is a growing concern that the distributions of wage shocks might be non-normal (e.g. Geweke and Keane, 2000). To assess this issue, Horowitz and Markatou (1996) estimate a model of earnings levels with an individual fixed effect and a transitory i.i.d. shock. There is no permanent shock in their model. Their estimation procedure is fully non-parametric. However, one particular implication of their model is that wnt , 2 wnt , . . . are identically distributed. This is clearly at odds with the evidence presented in Table 5. The introduction of a permanent component easily permits the capture of the increase in Var (s wnt ) when s increases.25 Turning to estimation, the earnings dynamics model (73) is a linear factor model with L = T − 1 equations and K = 2T − 3 factors (see Section 2). The estimator given by (46) thus yields consistent estimates of the densities of all shocks. We estimate these densities using the second-order kernel H2 and Delaigle and Gijbels’ (2004) method to select the trimming parameter TN . We then average all estimated densities.26 This has the advantage that, if the stationarity restrictions do not hold, one still estimates consistently the average of shock densities over the period. This is appealing in our context, as there is ample evidence that the US economy was subject to large changes in the wage distribution at the beginning of the 1980s. Likewise, different variances can be estimated for all shocks, and averaged ex post, yielding estimates σ 2ε and σ 2η . We use equally weighted minimum distance to estimate those variances.
7.3. Results The estimated average variance of permanent shocks is σ 2ε = 0.0208, and the estimated average 2 variance of transitory shocks is σ η = 0.0185, with standard errors of 0.0029 and 0.0017, respectively.27 According to these estimates, permanent shocks account for 36% of the total variance of wage growth residuals. Figure 4 presents the estimated densities obtained by averaging density estimates over the period. The permanent and transitory components are shown in panels (a) and (b), respectively. In each panel, the thick solid line represents the density of the shock, standardized to have unit variance, and the thin solid line represents the standard normal density, which we draw for comparison. The dashed lines delimit the bootstrapped 10–90% confidence band.28 25. Notice that model (73) implies that: Var (s wnt ) − Var (wnt ) = (s − 1) σ 2ε . The marginal distributions of wnt and 2 wnt thus contain all the necessary information to identify σ 2ε and σ 2η . 26. We verified that averaging c.g.f.’s instead yielded very similar results. 27. Standard errors were computed by 1000 iterations of individual block bootstrap. 28. We remark that, as we do not derive the asymptotic distribution of the non-parametric estimator, the validity of the bootstrap in our context is difficult to verify. © 2009 The Review of Economic Studies Limited
518
REVIEW OF ECONOMIC STUDIES (a)
(b)
1.5
1.0
1.5
density
density
1.0
0.5
0.0
0.5
–5
–3
–1 1 permanent shock
3
5
0.0 –5
–3
–1 1 transitory shock
3
5
Figure 4 Non-parametric estimates of the densities of standardized permanent and transitory shocks Note: Density estimates of εnt and ηnt , both standardized to have unit variance. Density estimate (thick); 10–90% confidence bands of 100 bootstrap simulations (dashed); standard normal density (thin).
Figure 4 shows that none of the two distributions is Gaussian. Both permanent and transitory shocks appear strongly leptokurtic. In particular, they have high modes and fatter tails than the normal. Moreover, the transitory part seems to have higher kurtosis than the permanent component.29 Lastly, both densities are approximately symmetric. As noted above, we estimate the densities of permanent and transitory shocks by averaging the period-specific density estimates. Figure 5 shows the density estimates of the nonstandardized permanent and transitory shocks in every period. In particular, we see that the density of permanent shocks tends to be more peaked in the second half of the period, suggesting an increase in kurtosis, although the density shapes are not well estimated enough to be conclusive.
7.4. Fit Figure 6 compares the predicted densities of s wnt , s = 1, 2, 3, using the model and the estimated densities of permanent and transitory shocks, with kernel density estimates. In panels (a1) to (c1), the thin line is a kernel estimator of the density s wnt (s = 1, 2, 3 ). The thick line is the predicted density. The dashed line shows the density that is predicted under the assumption that shocks are normally distributed. The predicted densities of s wnt , s = 1, 2, 3, were calculated analytically by convolution of the estimated densities of ε nt and ηnt . Figure 6 shows that our specification reproduces two features apparent in Table 5: the high kurtosis of wage growth residuals, and the decreasing kurtosis when the time lag increases. Note that the high mode of the density is remarkably well captured by our non-parametric method, even in the case of 3 wnt . In contrast, the normal specification gives a rather poor fit. We then present in Table 6 the moments of wage growth residuals, as in the data and as predicted under normality and non-parametrically. We see that variances are severely underestimated, reflecting a rather bad estimation of the density in the tails. Moreover, the estimated kurtosis is 5.6, which is significantly non-normal but very different from the kurtosis of the distribution to be fitted (10.3). Overall, our method captures the shapes of the densities 29. We checked that varying the trimming parameter TN around the value that we obtained using Delaigle and Gijbels’ (2004) method had little effect on the estimate fε , but a stronger effect on fη , tail oscillations increasing with TN . © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
519
Permanent shocks 10
10
10
8
8
8
6
6
6
4
4
4
2
2
2
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
1979*
1980
1981
10
10
10
8
8
8
6
6
6
4
4
4
2
2
2
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
1982
1983
1984
10
10
10
8
8
8
6
6
6
4
4
4
2
2
2
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
1986
1987*
1985
Transitory shocks 10
10
10
10
8
8
8
8
6
6
6
6
4
4
4
4
2
2
2
2
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
1979
1980
1981
1982
10
10
10
10
8
8
8
8
6
6
6
6
4
4
4
4
2
2
2
2
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
0 −0.6 −0.4 −0.2 −0.0 0.2 0.4 0.6
1983
1984
1985
1986
Figure 5 Density estimates of the non-standardized permanent and transitory shocks in every period Note: Density estimates of εnt and ηnt in every period. In the first and last periods, the “permanent” shock includes the permanent and transitory components (indicated by ∗ ). © 2009 The Review of Economic Studies Limited
520
REVIEW OF ECONOMIC STUDIES
3.5
3.0
3.0
2.5
2.5
density
(a2) 4.0
3.5
density
(a1) 4.0
2.0
2.0
1.5
1.5
1.0
1.0
0.5
0.5
0.0 −1.5
−1.0
−0.5
0.0
0.5
1.0
0.0 −1.5
1.5
wage growth, t/t + 1
2.5
2.0
2.0 density
2.5
density
(b2) 3.0
1.5
1.0
0.5
0.5 −0.5
0.0
0.5
1.0
0.0 −1.5
1.5
wage growth, t/t + 2
2.5
2.0
2.0 density
2.5
density
(c2) 3.0
1.5
1.0
0.5
0.5 −0.5
0.0
0.5
1.0
1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
1.5
1.0
−1.0
0.5
wage growth t/t + 2, normal mixture
(c1) 3.0
0.0 −1.5
0.0
1.5
1.0
−1.0
−0.5
wage growth t/t + 1, normal mixture
(b1) 3.0
0.0 −1.5
−1.0
1.0
wage growth, t/t + 3
1.5
0.0 −1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
wage growth t/t + 3, normal mixture
Figure 6 Fit of the model, densities of wage growth residuals Note: Graphs (a1), (b1) and (c1) show the fit of wage growth residuals calculated over 1, 2 and 3 years, respectively, using the generalized deconvolution estimator. Graphs (a2), (b2) and (c2): densities are estimated by maximum likelihood, where shocks follow two-component mixtures of zero mean normals. Predicted density (thick); kernel density estimate (thin); normal (dashed).
of wage growth variables very well, but fails at fitting the tails, which leads to underestimating higher moments. To fit the moments better, we use the non-parametric estimates fε and fη as a guide to find a convenient parametric form for factor densities. Figure 4 suggests that a mixture of two © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
521
TABLE 6 Fit of the model, moments of wage growth residuals Wage growth
t/t + 1
t/t + 2
t/t + 3
Data Variance Skewness Kurtosis
0.055 –0.08 10.3
0.073 0.086 0.06 –0.07 11.2 8.0 Predicted, non-parametric
Variance Skewness Kurtosis
0.037 –0.02 5.6
0.053 –0.02 4.6 Predicted, normal
0.069 –0.02 4.2
Variance Skewness Kurtosis
0.057 0 3
0.076 0 3 Predicted, normal mixture
0.096 0 3
Variance Skewness Kurtosis
0.058 0 6.3
0.072 0 5.3
0.086 0 4.8
Note: See the note to Figure 6. Moments are predicted using the predicted densities shown in Figure 6, by computing the integrals numerically.
normals centred at zero may work well in practice. We thus estimate model (73) under this parametric specification for both ε nt and ηnt . Parameters are estimated by maximum likelihood, using the EM algorithm of Dempster, Laird and Rubin (1977). Panels (a2) to (c2) in Figure 6 show the fit of the model. The shape of the densities is very well reproduced. Moreover, the last three rows of Table 6 show that the normal mixture specification yields much better estimates of the variance and kurtosis of wage growth residuals. Notice that the normal mixture model was already used by Geweke and Keane (2000) to model earnings dynamics. Our results strongly support this modelling choice.
7.5. Wage mobility We then use the model to weight the respective influence of permanent and transitory shocks in wage mobility. To this end, we compute the conditional expectations of the permanent and s−1 transitory components of s wnt , s = 1, 2, 3: E and E(η ε | w nt−r s nt nt − ηnt−s |s wnt ). r=0 To do so, we first compute the conditional distribution of permanent and transitory shocks using Bayes’ rule. For example, the conditional density of the permanent shock given wage observations is given by: fε (ε) fη (η)fη (w − ε + η)dη fε (ε)f (w|ε) = , (74) f (ε|w) = ε)f (w|! ε)d! ε ε) fη (η)fη (w −! ε + η)dηd! ε fε (! fε (! where fε is the p.d.f. of ε and fη is the p.d.f. of η. We proceed similarly for transitory shocks ηnt − ηnt−1 . Figure 7 plots these conditional expectations. We verify that the volatility of earnings is more likely to have a permanent origin if s is large. In panel (a), we see for example that a log wage growth of ±100% has a transitory origin for more than ±60% and a permanent origin © 2009 The Review of Economic Studies Limited
522
REVIEW OF ECONOMIC STUDIES
(a)
(b)
0.0
−0.5
−0.5
0.0
0.5
1.0
1.0
% permanet/ transitory
0.5
−1.0 −1.0
(c) 1.0
% permanet/ transitory
% permanet/ transitory
1.0
0.5
0.0
−0.5
−1.0 −1.0
% wage growth, t/t+1
−0.5
0.0
0.5
% wage growth, t/t+2
1.0
0.5
0.0
−0.5
−1.0 −1.0
−0.5
0.0
0.5
1.0
% wage growth, t/t+3
Figure 7 Conditional expectations of shocks given wage growth residuals Note: See the note to Figure 6. (a): conditional expectation of εnt (thick) and ηnt − ηn,t−1 (thin) given wnt ; (b): εnt + εn,t−1 (thick) and ηnt − ηn,t−2 (thin) given 2 wnt ; (c): εnt + εn,t−1 + εn,t−2 (thick) and ηnt − ηn,t−3 (thin) given 3 wnt .
for less that ±30%. In panel (c), we see that a change 3 wnt of ±100% is almost twice more likely to be permanent than transitory.
7.6. Job changes Finally, we address the issue of the link between the degree of permanence of wage shocks and job-to-job mobility. It is notoriously difficult to identify job changes precisely in the PSID (see Brown and Light, 1992), so we tend to think of this exercise as tentative. We adopt the simplest criteria to identify job changes, setting the job change dummy equal to 1 if tenure is less than 12 months.30 We then classify individuals into job stayers (no job change during the period), infrequent job changers (one or two job changes), and frequent job changers (more than three job changes). The last three columns of Table 4 in the Appendix give descriptive statistics for these three groups of individuals. Then we compute the densities of permanent and transitory shocks given wage growth residuals, separately for each category of job changers by averaging within each group the conditional densities that we have already calculated. Table 7 presents the variances of permanent and transitory shocks for each mobility group. Focusing on the first three rows, we see that wage volatility, as measured by the variance, is higher for frequent job changers. Moreover, these individuals are more likely to experience both permanent and transitory wage changes. The transitory variance is about 15% higher for infrequent job movers than for job stayers (0.023 versus 0.020), and about 2.3 times higher for frequent job movers (0.046). At the same time, the permanent variance is about 15% higher for infrequent job movers than for job stayers (0.016 versus 0.014), and about 60% higher for frequent job movers (0.022). As permanent shocks accumulate over time while transitory shocks do not, the difference in wage growth volatility increases with the length of time over which wage growth is computed. For example, the variance of wage growth over 10 years is 0.16 (= 0.020 + 10∗0.014) for an individual who stayed with the same employer over the whole period, while it is about 0.27 (= 0.046 + 10∗0.022) for an individual who has changed job three times or more. 30. Note that there were two “tenure” variables before 1987 in the PSID: time in position and time with employer. We take the former as our definition of tenure. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
523
TABLE 7 Variances of the shocks by categories of job changers Job changes
None
Total Permanent Transitory
0.034 0.014 0.020
Total Permanent Transitory
0.041 0.025 0.016
Total Permanent Transitory
0.054 0.037 0.017
One/two
Three/more
Wage growth, t/t + 1 0.039 0.016 0.023 wage growth, t/t + 2 0.053 0.032 0.021 wage growth, t/t + 3 0.063 0.044 0.019
0.068 0.022 0.046 0.089 0.053 0.036 0.108 0.076 0.032
Note: See the note to Figure 6. “None” = no job change in the observation period; “One/two” = one or two job changes; “Three/more” = three or more job changes. Variances of wage growth residuals (“Total”) and the variances of the permanent and transitory parts, conditional on having experienced a given number of job changes.
These results give some basis to the interpretation of permanent shocks to log earnings as resulting for a large part from job changes.31 Nevertheless, identifying permanent shocks with job changes is likely to be wrong for two reasons. First, part of the shocks faced by job stayers is permanent. Indeed, the share of permanent variance in total variance is higher for job stayers (40%) than for frequent job changers (30%). This finding suggests that there might be other permanent wage movements, caused for example by within-job promotions. Second, job changers also face more transitory shocks. Describing precisely these effects requires modelling job change decisions together with wage profiles.
8. CONCLUSION This paper provides a generalization of the non-parametric estimator of Li and Vuong (1998) to linear independent factor models, allowing for any number of measurements, L, and at most K = L(L+1) latent factors. On the theoretical side, the main lessons of the standard deconvolu2 tion literature carry over to the more general context that we consider in this paper. In particular, asymptotic convergence rates are slow, and it is more difficult to estimate the distribution of one factor if the characteristic functions of the other factors have thinner tails. Our Monte Carlo results yield interesting insights. The finite-sample performance of our estimator seems rather good, remarkably similar to the performance of the kernel deconvolution estimator, which assumes that the distributions of all factors but one are known and at least as good as alternative estimators proposed by Horowitz and Markatou (1996) and Li and Vuong (1998) in the measurement error model. Moreover, the performance critically depends 31. Note that we do not identify the part of the wage growth variance that comes from differences in hours worked from the one coming from differences in wage rates. Nor are we able to tell whether job or individual-specific components are mostly responsible for the results. © 2009 The Review of Economic Studies Limited
524
REVIEW OF ECONOMIC STUDIES
on the shape of the distributions to be estimated, as we find that it is easier to estimate distributions with little skewness or excess kurtosis.32 In any case, identifying the distributions of more factors than measurements should be viewed as considerably more difficult than the classical non-parametric deconvolution problem. Given the difficulty of the problem at hand, we view the results of our simulations and the application as a confirmation that the generalized non-parametric deconvolution approach that we propose can be successfully applied to a wide range of distributions. The empirical application shows that the permanent and transitory components of individual earnings dynamics are clearly non-normal. Predicting transitory and permanent shocks for the individuals in the sample, we see that frequent job changers face more permanent and transitory earnings shocks than job stayers. These results have important consequences for welfare analysis. For example, savings and insurance could be very different if the risk of large deviations is much higher than is usually assumed with normal shocks. Of course, the model of earnings dynamics that we have considered is very limited. One might want to add non-i.i.d. transitory shocks and yet allow for measurement error (as in Abowd and Card, 1989). We experimented with an MA(1) transitory shock without much success. It seems very difficult to non-parametrically identify the MA(1) component from the PSID data. Thus, maybe the sample is not appropriate, or a single non-normal MA(0) transitory shock/measurement error is enough to describe the PSID data. Another interesting issue is the assumption of independence between factors that we maintain throughout this analysis. Meghir and Pistaferri (2004) shows evidence of autoregressive conditional heteroskedasticity in permanent and transitory components. It is not straightforward at all to extend the study of the non-parametric identification and estimation of factor densities in conditionally heteroskedastic factor models like: ynt = Aε nt ,
ε knt = σ (εnt−1 )ηknt ,
k = 1, . . . , K,
(75)
T where ηnt = (η1nt , . . . , ηK nt ) is a K × 1 vector of i.i.d. random variables. But this is a very interesting problem for future research.
APPENDIX A. PROOF OF LEMMA 1 (i) First, we remark that
EN ft − Eft = EN Re (ft ) − ERe (ft ) + i EN Im (ft ) − EIm (ft )
(A1)
sup |EN ft − Eft | ≤ sup |EN Re (ft ) − ERe (ft )| + sup |EN Im (ft ) − EIm (ft )| .
(A2)
and, for any T > 0,
|t|≤T
|t|≤T
|t|≤T
It will thus suffice to show that the proposition is true for the family of functions Re (ft )(x, y) = x cos(tT y), t ∈ RL , for it to be true for functions Im (ft ) and ft . So, without loss of generality, we prove the result for real functions ft (x, y) = x cos(tT y), using the same notation for ft and its real part. 32. In Bonhomme and Robin (2009), we show that skewness and peakedness are required for the matrix of factor loadings to be identified from higher-order moments. There is thus a tension between obtaining a precise estimate of factor loadings and a precise estimate of the distribution of factors in models where second-order information is not sufficient to ensure the identification of the factor loadings. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
525
(ii) For any T , let G = {ft (x, y), |t| ≤ T }. The first step of the proof is to find the L1 -covering number of G.33 For any couple (t1 , t2 ), x cos(tT1 y) − x cos(tT2 y) ≤ x(tT1 y − tT2 y)
|xy (t1 − t2 )| ≤
≤
(A3)
|xy | · |t1 − t2 |
≤ L |xy| · |t1 − t2 | . Discretize (−T , T )L into Let gj (x, y) = x
2T LEN |XY|
cos(tTj y).
ε
L − 1 points tj by cutting [−T , T ] into equidistant segments of length
Then, for all t ∈ [−T , T ] , there exists j such that L
EN X cos(tT Y) − X cos(tTj Y) ≤ LEN |XY| · t − tj
(A4)
≤ ε. It follows that the L1 -covering number of G satisfies
N1 (ε, PN , G) ≤ C
T EN |XY| ε
ε LEN |XY| .
(A5) L ,
where PN is the probability measure obtained by independent sampling from F , and C = (2L)L . Note that although the covering number is indeed inversely proportional to a power of ε, Theorem 2.37 of Pollard (1984, p. 34) cannot be applied for three reasons. First, the upper bound to the covering number depends on the sample PN via EN |XY|. Second, the functions in G are not bounded because the support of X is unbounded. Third, eventually, one will index T on N to make it go to infinity with N. A specific proof therefore has to be tailored to adjust Pollard’s proof of Theorem 2.37 to our set-up. (iii) Equations (30) and (31) in Pollard (1984, p. 31) imply that, for a given sample ZN , % & ' Nε2 T EN |XY| L (A6) EN X 2 , exp − Pr sup |EN ft − Eft | > ε ZN ≤ 8C ε 128 |t|≤T 2
as EN gj2 ≤ EN X 2 for all j , and provided that Var EN ft ≤ ε8 . Now,
NVar EN ft = Var X cos tT Y 2 2 − EX cos tT Y = E X 2 cos tT Y ≤ EX 2 ≡ M1 < ∞. So inequality (A6) is true for N ≥ Then, for all k > 0: % Pr
8M1 . ε2
sup |EN ft − Eft | > ε
|t|≤T
(A7)
&
&
% = Pr
sup |EN ft − Eft | > ε EN X < k, EN |XY| < k 2
|t|≤T
(
× Pr EN X 2 < k, EN |XY| < k %
)
sup |EN ft − Eft | > ε EN X ≥ k or EN |XY| ≥ k
+ Pr (
|t|≤T
× Pr EN X 2 ≥ k or EN |XY| ≥ k % ≤ Pr
& 2
) &
sup |EN ft − Eft | > ε EN X 2 < k, EN |XY| < k
|t|≤T
) ( + Pr EN X 2 ≥ k or EN |XY| ≥ k ,
(A8)
where the last inequality results from bounding two of the four probabilities above by 1. 33. Let Q be a probability measure on S and F be a class of functions in L1 (Q). For ε > 0, the covering number N*1 (ε, Q,*F) is the smallest integer m such that there exists functions g1 , . . . , gm in L1 (Q) such that minj EQ *f − gj * ≤ ε for all f ∈ F (Pollard, 1984, p. 25). © 2009 The Review of Economic Studies Limited
526
REVIEW OF ECONOMIC STUDIES To obtain a final inequality, use a general Chernoff bound: ) ( ) ( Pr EN X 2 ≥ k or EN |XY| ≥ k = Pr exp EN X 2 ≥ ek or exp (EN |XY|) ≥ ek
E exp EN X 2 + E exp (EN |XY|) . ≤ ek
(A9)
Now, # $ N
1 2 E exp EN X 2 = E exp Xn N =
N +
n=1
E exp
n=1
Xn2 N
(A10)
N 2 = EeX /N N 1 = MX2 , N
denoting as MX2 the moment generating function of X 2 . By assumption, MX2 (t) exists for all t in a neighbourhood around 0. Hence all moments of X 2 are finite, and MX2 (1/N) = 1 + EX 2 /N + O(1/N 2 ),
(A11)
lim MX2 (1/N)N = exp EX 2 .
(A12)
so that N →∞
One can thus bound E exp EN X 2 by 2 exp EX 2 for N large enough. The same argument applies to E[exp (EN |XY|)]. Therefore, % & exp EX 2 + exp (E|XY|) Tk L Nε2 exp − , (A13) Pr sup |EN ft − Eft | > ε ≤ 8C +2 ε 128k ek |t|≤T 8M
for any N large enough that satisfies N ≥ 21 . ε (iv) Lastly, index ε, T , and k by N, and suppose that εN tends to 0 and that TN and kN tend to infinity in such a way that
1 < ∞, e kN N
(A14)
and
%
exp L ln
N
TN k N εN
−
Nε2N 128kN
& < ∞.
(A15)
A standard application of the Borel–Cantelli lemma then implies that sup |EN ft − Eft | < εN ,
|t|≤TN
a.s.
(A16)
Details about the Borel–Cantelli argument for almost sure convergence can be found in Pollard (2002, pp. 34–35). N ε2 T k Condition (A15) is satisfied if exp L ln Nε N − 128kN decreases faster than 1/N. Let N
N
ln N ε N = C √ , C > 0, N kN = (1 + α) ln N, © 2009 The Review of Economic Studies Limited
(A17)
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
527
with α > 0 to satisfy condition (A14). Hence Nε2N C2 = ln N. 128kN 128(1 + α)
(A18)
Let also δ
TN = BN 2 , B, δ > 0.
(A19)
Then, L ln
TN k N εN
−
Nε2N L C2 B(1 + α) + (1 + δ) − ln N = L ln 128kN A 2 128(1 + α)
(A20)
decreases faster than − ln N if C 2 > 64 [L (1 + δ) + 2] (1 + α).
(A21)
√ Whatever δ > 0, one can thus choose any C > 8 2 + L(1 + δ). This proves Lemma 1. Remark. Compared to the proof of the law of the iterated logarithm (also referred to as the “log log law”), it is T k the additional term L ln Nε N that makes all the difference. This term arises from the necessity to cover the set of N functions G. If X and Y are bounded, then one can proceed differently, and adapt the proof of Theorem 1 in Cs¨org˝o (1981) that uses the law of the iterated logarithm.
APPENDIX B. PROOF OF THEOREM 1 In this proof and the next, all convergence statements are implicitly understood to hold almost surely.34 Here, we aim at bounding sup|τ |≤TN ϕ Xk (τ ) − ϕ Xk (τ ) , where ϕ Xk (τ ) = exp
τ
0
0
u
Q− k vech
∇∇ T κY
vθ θ T Ak
dW (θ ) dvdu
(B1)
for some distribution W . This will easily follow from bounding, for any , m = 1, . . . , L, Cm (θ ) = sup
|τ |≤TN
0
τ
0
u
∂ 2 κY ∂t ∂tm
vθ θ Ak
dvdu −
T
τ
0
0
u
∂ 2κ Y ∂t ∂tm
vθ θ Ak T
dvdu .
(B2)
T
T T (i) Fix t ∈ RL . Denote ϕ(t) ≡ ϕ Y (t) = E eit Y , ψ (t) = E Y eit Y and ξ m (t) = E Y Ym eit Y , for any , m = 1, . . . , L. Then, Lemma 1 implies that, for all functions f in {ϕ, {ψ } , {ξ m },m }: sup f(t) − f (t) < CεN ,
|t|≤TN
(B3)
with δ
TN = BN 2 , B, δ > 0, ln N εN = √ , N
(B4)
√ for some C > 8 2 + L(1 + δ). (ii) There exists c such that |ϕ(t)| ≥ g(|t|) when |t| > c. As g is decreasing, then for all c < |t| ≤ TN , |ϕ(t)| ≥ g(|t|) ≥ g(TN ).
(B5)
34. So, ZN = O(aN ) means that there exists C > 0 such that ZN ≤ CaN almost surely when N tends to infinity. © 2009 The Review of Economic Studies Limited
528
REVIEW OF ECONOMIC STUDIES
Hence,
, inf |ϕ(t)| ≥ min g(TN ), inf |ϕ(t)| .
|t|≤TN
|t|≤c
(B6)
Notice that, because of Assumption A2 and the continuity of ϕ :35 inf |ϕ(t)| > 0.
(B7)
|t|≤c
) ( So, as lim|t|→∞ g(|t|) = 0, it follows that min g(TN ), inf|t|≤c |ϕ(t)| = g(TN ) for TN large enough. Consequently, for N large enough, sup
|t|≤TN
CεN ϕ (t) − ϕ(t) < = o (1) . ϕ(t) g(TN )
The last equality follows from the fact that T 2 εN N g(TN )3
T 2 εN N g(TN )3
≥
εN g(TN )
(B8)
for N large enough, and that, by assumption,
= o(1).
(iii) We have
T E Y eit Y ∂κ Y (t) ψ (t) =i =i T , ∂t ϕ(t) E eit Y
(B9)
and (t) (t) (t) (t) ψ ψ (t) ψ ψ ψ ψ (t) − = − + − ϕ (t) ϕ(t) ϕ (t) ϕ(t) ϕ(t) ϕ(t) =−
(t) ψ ϕ(t)
ϕ (t)−ϕ(t) ϕ(t) ϕ (t)−ϕ(t) + ϕ(t)
1
+
1 ψ (t) − ψ (t) . ϕ(t)
(B10)
(t) as follows: One can bound ψ (t) ≤ sup ψ (t) − ψ (t) + sup ψ (t) sup ψ
|t|≤TN
|t|≤TN
|t|≤TN
(t) − ψ (t) + E |Y | = O(1), ≤ sup ψ
(B11)
(t) ψ (t) ψ O(εN ) − = o (1) . = ϕ (t) ϕ(t) g(TN )2
(B12)
|t|≤TN
as E |Y | < ∞. It follows that sup
|t|≤TN
The same argument applies to show that sup
|t|≤TN
ξ m (t) ξ (t) O(ε N ) = o (1) − m = ϕ (t) ϕ(t) g(TN )2
(B13)
for all , m, if E |Y Ym | < ∞. (iv) It is easy to extend these results to second derivatives of c.g.f.’s: ∂ 2κ Y (t) ∂t ∂tm T T T E Y eit Y E Ym eit Y E Y Ym eit Y + =− T T T E eit Y E eit Y E eit Y
ζ m (t) ≡
=−
ξ m (t) ψ (t) ψ m (t) + . ϕ(t) ϕ(t) ϕ(t)
35. We remark that |ϕ Y (t)| = |ϕ X (AT t)| > 0 for all t by Assumption A2. © 2009 The Review of Economic Studies Limited
(B14)
BONHOMME & ROBIN Let ζ m (t) = −
ξ m (t) ϕ (t)
+
(t) ψ m (t) ψ ϕ (t) ϕ (t) .
GENERALIZED NON-PARAMETRIC
529
Then,
ξ (t) ξ (t) ζ m (t) − ζ m (t) = − m − m ϕ (t) ϕ(t) ψ (t) ψ (t) ψ m (t) ψ m (t) ψ (t) ψ (t) + − + − m ϕ (t) ϕ(t) ϕ(t) ϕ (t) ϕ(t) ϕ(t) ψ (t) ψ m (t) ψ (t) ψ m (t) + − − . ϕ (t) ϕ(t) ϕ (t) ϕ(t)
(B15)
Since E |Y | ψ (t) ≤ ϕ(t) g(TN )
sup
|t|≤TN
(B16)
for all , it follows that O(εN ) O(εN ) ζ m (t) − ζ m (t) = sup + + g(TN )2 g(TN )3 |t|≤TN
O(εN ) g(TN )2
2 =
O(εN ) g(TN )3
(B17)
for N large enough such that g(TN ) < 1. (v) Fix a direction of integration θ ∈ RL \{0}, and τ ∈ R. Then: Cm (θ ) =
τ
u
sup
ζ m
vθ
dvdu −
τ
u
ζ m
vθ
θ T Ak θ T Ak 0 0 $ # τ2 vθ vθ ζ m sup − ζ m ≤ sup T T 2 |v|≤TN θ Ak θ Ak τ ∈[−TN ,TN ] 2 ≤ TN sup ζ m (t) − ζ m (t) τ ∈[−TN ,TN ]
0
dvdu
0
|t|≤TN Tθ θ Ak
=
g TN
TN2
3 O(εN ).
(B18)
θ θ T Ak
Hence, for any distribution W :36
Cm (θ ) dW (θ) =
g TN
θ θ T Ak
−3
dW (θ) TN2 O(ε N ).
(B19)
(vi) It easily follows from the previous step that: sup τ ∈[−TN ,TN ]
κ Xk (τ ) − κ Xk (τ ) =
g TN
θ θ T Ak
−3
dW (θ ) TN2 O(εN ) = o(1).
(B20)
κ Xk (τ ) − κ Xk (τ ) < 1 for N large enough. Therefore, for N large enough In particular, supτ ∈[−TN ,TN ] sup τ ∈[−TN ,TN ]
ϕ Xk (τ ) − ϕ Xk (τ ) = ≤
sup τ ∈[−TN ,TN ]
sup τ ∈[−TN ,TN ]
exp κ Xk (τ ) − exp κ Xk (τ ) , κ Xk (τ ) − κ Xk (τ ) ,
(B21)
36. Technically, we need some support conditions on W that ensure that the statement O(εN ) above is uniform in θ. © 2009 The Review of Economic Studies Limited
530
REVIEW OF ECONOMIC STUDIES
from which it follows that sup τ ∈[−TN ,TN ]
g TN
ϕ Xk (τ ) − ϕ Xk (τ ) =
θ θ T Ak
−3
dW (θ ) TN2 O(εN ).
(B22)
This ends the proof of Theorem 1.
APPENDIX C. PROOF OF THEOREM 2 For all x in the support of Xk :
v ϕ Xk (v) − ϕ Xk (v) dv e−ivx ϕH TN v 1 − 1 e−ivx ϕ Xk (v)dv, + ϕH 2π TN
1 fXk (x) − fXk (x) = 2π
where ϕ H is the c.f. of a smoothing kernel which is equal to zero outside [−1, 1]. So, for N large enough: v v 1 TN ϕ Xk (v) − ϕ Xk (v) dv + ϕH ϕH − 1 hk (|v|)dv fXk (x) − fXk (x) ≤ 2π −TN TN TN v TN 1 − 1 hk (|v|)dv, sup ≤ ϕ Xk (τ ) − ϕ Xk (τ ) + ϕH π |τ |≤TN 2π TN
(C1)
(C2)
where we have used that |ϕ H | ≤ 1 (as ϕ H is a c.f.), and that |ϕ Xk (v)| ≤ hk (|v|) for all |v|. Note that K T T + |ϕ Y (t)| = E eit Y = E eit AX = ϕ Xk tT Ak ≥ ! g (L |A| |t|), g ( tT A ) ≥ !
where |A| = maxi,j
(C3)
k=1
aij . In the last inequality we have used that ! g is decreasing, and that ⎛ ⎞ L
tT A = max ⎝ aij tj ⎠ ≤ L |A| |t| . i
(C4)
j =1
Define g(t) = ! g (L |A| t). Function g inherits ! g ’s properties: it maps R+ onto [0, 1], it is decreasing, and it is integrable, so that in particular g (|t|) → 0 when |t| → ∞. We can thus apply Theorem 1 and obtain: ϕ Xk (τ ) − ϕ Xk (τ ) = sup
|τ |≤TN
TN2 O(εN ) g(TN )3
(C5)
with εN and TN as in Lemma 1. If H is a higherorder kernel of order q ≥ 2, then there exists a function m such that ϕ H (v) = 1 + m(v)v q for all v ∈ [−1, 1], and ϕ H (v) = 0 for v ∈ / [−1, 1], where m is continuous on [−1, 1]. So the last term on the right-hand side of (C2) is: T +∞ N v v q v − 1 hk (|v|)dv = hk (|v|)dv + 2 hk (|v|)dv ϕH m TN TN TN TN −TN $ # +∞ T N 1 v q hk (|v|)dv + 2 hk (|v|)dv, (C6) ≤ sup |m(v)| · q TN −TN v∈[−1,1] TN where supv∈[−1,1] |m(v)| < ∞ since m is continuous on [−1, 1]. This ends the proof of Theorem 2.
APPENDIX D. “PLUG-IN” BANDWIDTH SELECTION We here present the “plug-in” method of Delaigle and Gijbels (2004) to choose the bandwidth in deconvolution kernel density estimation. We focus on second-order kernels in the presentation. © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
531
Known error distribution. To present the method, let us consider the deconvolution problem with known error distribution Y = X + U , where fU , or equivalently ϕ U , is known. Based on a random sample Y1 , . . . , YN , the deconvolution kernel density estimator of fX is given by: v ϕ (v) 1 e−ivx Y ϕH dv, (D1) fX (x) = 2π TN ϕ U (v) where ϕ Y (v) = EN eivY is the empirical characteristic function of Y . Let the MISE of fX be: 2 MISE (TN ) = E fX (x) − fX (x) dx .
(D2)
The choice of TN relies on the following approximation of the MISE: 2 μ2H,2 R fX v 1 −2 ϕ U (v) dv + ϕH . MISE (TN ) ≈ 2π N TN 4TN4 where
(D3)
μH,2 = R fX =
v 2 H (v)dv,
2 fX (x) dx.
(D4)
For example, μH2 ,2 = 6. The plug-in method estimates R fX by the following algorithm. 1. Estimate R fX as if X was normally distributed: fX = R
8! . 9 √ (X) 2 π Var
(D5)
29 4!
2. Minimize the following quantity with respect to T : v f μH,2 R 1 X − + v6 ϕH T2 2π N T
2
ϕ U (v)
−2
dv.
f . This step yields T. This quantity can be interpreted as the squared asymptotic bias of R X 3. Compute: 2 ϕ Y (v) 2 v fX = 1 R dv. v6 ϕ H 2π ϕ U (v) T
(D6)
(D7)
f . 4. Iterate one more time steps 2 and 3. This yields R X Finally, once R fX has been estimated, TN is obtained as the minimizer of the approximated MISE given by the right-hand side of (D3). of
Unknown error distribution. In practice, we replace ϕ U (v) in the above expressions by an estimate of the c.f. θ T Am Xm , as explained in 5.3. Because of (67), a consistent estimate of that c.f. is given by m=k T θ Ak
ϕ U (v) =
ϕY
vθ θ T Ak
ϕ Xk (v)
,
(D8)
ϕ Xk is the estimate of the c.f. of Xk given by (45). where ϕ Y is the empirical c.f. of Y, and Acknowledgements. We thank the editor and two anonymous referees for useful comments that helped improve the paper. We thank Laura Hospido for providing the data. We also thank Manuel Arellano, Jean-Pierre Florens and Eric Renault for comments and suggestions. The usual disclaimer applies. St´ephane Bonhomme gratefully acknowledges the financial support from the Spanish Ministry of Science and Innovation through the Consolider-Ingenio 2010 Project “Consolidating Economics”, and from the Spanish MEC, Grant SEJ2005-08880. Jean-Marc Robin gratefully acknowledges the financial support from the Economic and Social Research Council for the ESRC Centre for Microdata Methods and Practice, “Cemmap” (grant reference RES-589-28-0001). © 2009 The Review of Economic Studies Limited
532
REVIEW OF ECONOMIC STUDIES
REFERENCES ABOWD, J. and CARD, D. (1989), “On the Covariance Structure of Earnings and Hours Changes”, Econometrica, 57, 411–445. BONHOMME, S. and ROBIN, J.-M. (2009), “Consistent Noisy Independent Component Analysis”, Journal of Econometrics, 149 (1), 12–25. BROWN, J. and LIGHT, A. (1992), “Interpreting Panel Data on Job Tenure”, Journal of Labor Economics, 10, 219–257. CARRASCO, M. and FLORENS, J. P. (2007), “Spectral Method for Deconvolving a Density” (Mimeo). CARROLL, R. J. and HALL, P. (1988), “Optimal Rates of Convergence for Deconvoluting a Density”, Journal of the American Statistical Association, 83, 1184–1186. CARROLL, R. J., RUPPERT, D. and STEFANSKI, L. A. (1995), Measurement Error in Nonlinear Models (New York: Chapman & Hall). CHAMBERLAIN, G. and HIRANO, K. (1999), “Predictive Distributions Based on Longitudinal Earnings Data”, ´ Annales d’Economie et de Statistiques, 55–56, 211–242. ¨ ˝ S. (1981), “Limit Behaviour of the Empirical Characteristic Function”, Annals of Probability, 9 (1), 130– CSORG O, 144. DELAIGLE, A. and GIJBELS, I. (2002), “Estimation of Integrated Squared Density Derivatives from a Contaminated Sample”, Journal of the Royal Statistical Society, Series B, 64, 869–886. DELAIGLE, A. and GIJBELS, I. (2004), “Comparison of Data-Driven Bandwidth Selection Procedures in Deconvolution Kernel Density Estimation”, Computational Statistics & Data Analysis, 45, 249–267. DELAIGLE, A. and GIJBELS, I. (2007), “Frequent Problems in Calculating Integrals and Optimizing Objective Functions: A Case Study in Density Deconvolution”, Statistics and Computing, 17, 349–355. DELAIGLE, A., HALL, P. and MEISTER, A. (2008), “On Deconvolution with Repeated Measurements”, Annals of Statistics, 36 (2), 665–685. DEMPSTER, A. P., LAIRD, N. M. and RUBIN, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, 39 (1), 1–38. DIGGLE, P. J. and HALL, P. (1993), “A Fourier Approach to Nonparametric Deconvolution of a Density Estimate”, Journal of the Royal Statistical Society, Series B, 55, 523–531. FAN, J. Q. (1991), “On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems”, Annals of Statistics, 19, 1257–1272. GEWEKE, J. and KEANE, M. (2000), “An Empirical Analysis of Earnings Dynamics Among Men in the PSID: 1968–1989”, Journal of Econometrics, 96, 293–356. GEWEKE, J. and KEANE, M. (2007), “Smoothly Mixing Regressions”, Journal of Econometrics, 138, 252–290. GUVENEN, F. (2007a), “Learning Your Earning: Are Labor Income Shocks Really Very Persistent?”, American Economic Review, 97, 687–712. GUVENEN, F. (2007b), “An Empirical Investigation of Labor Income Processes” (Mimeo). HALL, R. and MISHKIN, F. (1982), “The Sensitivity of Consumption to Transitory Income: Estimates from Panel Data on Households”, Econometrica, 50, 461–481. HALL, P. and YAO, Q. (2003), “Inference in Components of Variance Models with Low Replications”, Annals of Statistics, 31, 414–441. HIRANO, K. (2002), “Semiparametric Bayesian Inference in Autoregressive Panel Data Models”, Econometrica, 70, 780–799. HOROWITZ, J. L. (1998), Semiparametric Methods in Econometrics (New York: Springer-Verlag). HOROWITZ, J. L. and MARKATOU, M. (1996), “Semiparametric Estimation of Regression Models for Panel Data”, Review of Economic Studies, 63, 145–168. HU, Y. and RIDDER, G. (2007), “Estimation of Nonlinear Models with Mismeasured Regressors Using Marginal Information” (Mimeo, available at: /http://www.econ.jhu.edu/People/Hu/EIV-marg-2007.pdf). HU, Y. and RIDDER, G. (2008), “On Deconvolution as a First Stage Nonparametric Estimator” (Mimeo). KAPLAN, G. and VIOLANTE, G. (2008), “How Much Insurance in Bewley Models?” (Mimeo). KHATRI, C. G. and RAO, C. R. (1968), “Solutions to Some Functional Equations and their Applications to Characterization of Probability Distributions”, Sankhy¨a, 30, 167–180. KOTLARSKI, I. (1967), “On Characterizing the Gamma and Normal Distribution”, Pacific Journal of Mathematics, 20, 69–76. LI, T. (2002), “Robust and Consistent Estimation of Nonlinear Errors-in-Variables Models”, Journal of Econometrics, 110, 1–26. LI, T., PERRIGNE, I. and VUONG, Q. (2000), “Conditionally Independent Private Information in OSC Wildcat Auctions”, Journal of Econometrics, 98, 129–161. LI, T. and VUONG, Q. (1998), “Nonparametric Estimation of the Measurement Error Model Using Multiple Indicators”, Journal of Multivariate Analysis, 65, 139–165. LILLARD, L. and WILLIS, R. (1978), “Dynamic Aspects of Earnings Mobility”, Econometrica, 46, 985–1012. LUKACS, E. (1970), Characteristic Functions, 2nd edn (London: Griffin). MEGHIR, C. and PISTAFERRI, L. (2004), “Income Variance Dynamics and Heterogeneity”, Econometrica, 72, 1–32. POLLARD, D. (1984), Convergence of Stochastic Processes (New York: Springer). POLLARD, D. (2002), A User’s Guide to Measure Theoretical Probability (Cambridge: Cambridge University Press). © 2009 The Review of Economic Studies Limited
BONHOMME & ROBIN
GENERALIZED NON-PARAMETRIC
533
RAO, P. (1983), Nonparametric Functional Estimation (New York: Academic Press). RAO, P. (1992), Identifiability in Stochastic Models (New York: Academic Press). REIERSOL, O. (1950), “Identifiability of a Linear Relation Between Variables Which are Subject to Error”, Econometrica, 18 (4), 375–389. SCHENNACH, S. (2004a), “Estimation of Nonlinear Models with Measurement Error”, Econometrica, 72, 33–75. SCHENNACH, S. (2004b), “Nonparametric Estimation in the Presence of Measurement Error”, Econometric Theory, 20, 1046–94. SUSKO, E. and NADON, R. (2002), “Estimation of a Residual Distribution with Small Numbers of Repeated Measurements”, Canadian Journal of Statistics, 30, 383–400. ´ SZEKELY, G. J. and RAO, C. R. (2000), “Identifiability of Distributions of Independent Random Variables by Linear Combinations and Moments”, Sankhy¨a, 62, 193–202. ZHANG, C. (1990), “Fourier Methods for Estimating Mixing Densities and Distributions”, Annals of Statistics, 18, 806–831.
© 2009 The Review of Economic Studies Limited