Inference in Semiparametric Dynamic Models for Binary Longitudinal Data

Siddhartha Chib and Ivan Jeliazkov

This article deals with the analysis of a hierarchical semiparametric model for dynamic binary longitudinal responses. The main complicating components of the model are an unknown covariate function and serial correlation in the errors. Existing estimation methods for models with these features are O(N³), where N is the total number of observations in the sample. Therefore, nonparametric estimation is largely infeasible when the sample size is large, as is typical in the longitudinal setting. Here we propose a new O(N) Markov chain Monte Carlo based algorithm for estimation of the nonparametric function when the errors are correlated, thus contributing to the growing literature on semiparametric and nonparametric mixed-effects models for binary data. In addition, we address the problem of model choice to enable the formal comparison of our semiparametric model with competing parametric and semiparametric specifications. The performance of the methods is illustrated with detailed studies involving simulated and real data.

KEY WORDS: Average covariate effect; Bayes factor; Bayesian model comparison; Clustered data; Correlated binary data; Labor force participation; Longitudinal data; Marginal likelihood; Markov chain Monte Carlo; Markov process priors; Nonparametric estimation; Partially linear model.
1. INTRODUCTION

This article discusses techniques for analyzing semiparametric models for dynamic binary longitudinal data. A hierarchical Bayesian approach is adopted to combine the main regression components: nonparametric functional form, dynamic dependence through lags of the response variable and serial correlation in the errors, and multidimensional heterogeneity. In this context, we pursue several objectives. First, we propose new computationally efficient estimation techniques to carry out the analysis. Computational efficiency is key, because longitudinal (panel) data pose special computational challenges compared with cross-sectional or time series data, so that "brute force" estimation becomes infeasible in many settings. Second, we address the problem of model choice by computing marginal likelihoods and Bayes factors to determine the posterior probabilities of competing models. This allows for the formal comparison of semiparametric versus parametric models, and addresses the problems of variable selection and lag determination. Third, we propose a simulation-based approach to calculate the average covariate effects, which provides interpretability of the estimates despite the nonlinearity and intertemporal dependence in the model. We examine the methods in a simulation study and then apply them in the analysis of women's labor force participation. To illustrate the model, let yit be the binary response of interest, where the indices i and t (i = 1, . . . , n, t = 1, . . . , Ti) refer to units (e.g., individuals, firms, countries) and time.
Siddhartha Chib is the Harry C. Hartkopf Professor of Econometrics and Statistics, John M. Olin School of Business, Washington University, St. Louis, MO 63130 (E-mail: [email protected]). Ivan Jeliazkov is Assistant Professor, Department of Economics, University of California, Irvine, CA 92697 (E-mail: [email protected]). The authors thank the editor, associate editor, and two referees, along with Edward Greenberg, Dale Poirier, and Justin Tobias, for helpful comments.

We consider a dynamic partially linear binary choice model where yit depends parametrically on the covariate vectors x̃it and wit (containing two disjoint sets of covariates) and nonparametrically on the covariate sit in the form

yit = 1{x̃′it δ + w′it βi + g(sit) + φ1 yi,t−1 + · · · + φJ yi,t−J + εit > 0},  (1)

where 1{·} is an indicator function, δ and βi are vectors of common (fixed) and unit-specific (random) effects, φ1, . . . , φJ are
lag coefficients, g(·) is an unknown function, and εit is a serially correlated error term. We assume that g(·) is a smooth but otherwise unrestricted function that is estimated nonparametrically. For this reason, x˜ it does not contain an intercept or the covariate sit , although those may be included in wit ; identification issues are discussed in Section 2.2. The model in (1) aims to exploit the panel structure to distinguish among three important sources of intertemporal dependence in the observations ( yi1 , . . . , yiTi ). One source is due to the lags yi,t−1 , . . . , yi,t−J , which capture the notion of “state dependence,” where the probability of response may depend on past occurrences because of altered preferences, trade-offs, or constraints. A second source of dependence is the presence of serial correlation in the errors (εi1 , . . . , εiTi ). Finally, the observations ( yi1 , . . . , yiTi ) can also be correlated because of heterogeneity; these differences are captured through the individual effects β i . Addressing the differences among units is also essential in guarding against the emergence of “spurious state dependence,” because temporal pseudodependence could occur due to the fact that history may simply serve as a proxy for these unobserved differences (Heckman 1981). The presence of the nonparametric function in the binary response model in (1) raises a number of challenges for estimation, because of the intractability of the likelihood function. Many of these problems have been largely overcome in the Bayesian context by Wood and Kohn (1998) and Shively, Kohn, and Wood (1999), based on the framework of Albert and Chib (1993). One open problem, however, is the analysis of semiparametric models with serially correlated errors. This issue has been studied by Diggle and Hutchinson (1989), Altman (1990), Smith, Wong, and Kohn (1998), Wang (1998), and Opsomer, Wang, and Yang (2001). 
These studies conclude that serial correlation, if ignored, poses fundamental problems that can have substantial adverse consequences for estimation of the nonparametric function. The ability to estimate models with serial correlation is limited in practice, however, because of the computational intensity of existing algorithms. In these cases the estimation algorithms are O(N³), where N = Σ_{i=1}^n Ti is the total number of observations in the sample. This feature renders nonparametric estimation infeasible in panel data settings where sample sizes are generally large. Here we propose a new O(N) algorithm for estimating the nonparametric function when the errors are correlated, thus contributing to the growing literature on static semiparametric and nonparametric mixed-effects models for binary data (see, e.g., Lin and Zhang 1999; Lin and Carroll 2001; Karcher and Wang 2001). As far as we know, in the literature there are no O(N) estimators for (binary or continuous) panel data with correlated errors. A second open problem is the question of model comparison in the semiparametric setting. Because the marginal likelihood, which is used to compare two models on the basis of their Bayes factor, is a large-dimensional integral, previous work (e.g., DiMatteo, Genovese, and Kass 2001; Wood, Kohn, Shively, and Jiang 2002; Hansen and Kooperberg 2002) has relied on measures such as the Akaike information criterion (AIC) and Bayes information criterion (BIC). However, computation of these criteria is infeasible in our context. We extend that literature by describing an approach, based on Chib (1995), for calculating marginal likelihoods and Bayes factors. We use this approach in Section 7 to compare several competing parametric and semiparametric models.

The article is organized as follows. Section 2 completes the statistical model, and Section 3 presents the Markov chain Monte Carlo (MCMC) fitting method. Section 4 shows how the average effects of the covariates on the probability of response are calculated, and Section 5 is concerned with model comparison. Section 6 presents a detailed simulation study of the performance of the estimation method. Section 7 studies the intertemporal labor force participation of a panel of married women. Finally, Section 8 presents concluding remarks.

© 2006 American Statistical Association, Journal of the American Statistical Association, June 2006, Vol. 101, No. 474, Theory and Methods. DOI 10.1198/016214505000000871
2. HIERARCHICAL MODELING AND PRIORS

This section presents the hierarchical structure used in modeling the regression components in (1). For simplicity, and to keep the discussion focused on our main topics (semiparametric estimation with correlated errors and Bayesian model comparison), we present the model in detail only under the assumptions of this section. A number of modifications, extensions, and generalizations are possible, and we mention some of them as we proceed.

2.1 The Smoothness Prior on g(·)

Suppose that the N observations in the covariate vector s, whose effect is modeled nonparametrically, determine the m × 1 design point vector v, m ≤ N, with entries equal to the unique ordered values of s, with v1 < · · · < vm and with g = (g(v1), . . . , g(vm))′ = (g1, . . . , gm)′ as the corresponding function evaluations. The idea is to model the function evaluations as a stochastic process that controls the degree of local variation between neighboring states in g to strike a balance between a good fit and a smooth regression function (Whittaker 1923). Following Fahrmeir and Lang (2001), we model the function evaluations as resulting from the realization of a second-order Markov process, with the specification aimed at penalizing rough functions g(·). A range of similar smoothness priors, with some discussion and comparisons, can be found in
the literature on nonparametric modeling (for some specific examples, see Wahba 1978; Shiller 1984; Silverman 1985; Besag, Green, Higdon, and Mengersen 1995; Koop and Poirier 2004). Defining ht = vt − vt−1, the second-order random-walk specification is given by

gt = (1 + ht/ht−1)gt−1 − (ht/ht−1)gt−2 + ut,  ut ∼ N(0, τ²ht),  (2)

where τ² is a smoothness parameter. Small values of τ² produce smoother functions, whereas larger values allow the function to be more flexible and interpolate the data more closely. The weight ht adjusts the variance to account for possibly irregular spacing between consecutive points in the design point vector. Other possibilities are conceivable for the weights (see, e.g., Shiller 1984; Besag et al. 1995; Fahrmeir and Lang 2001); the one given here implies that the variance grows linearly with the distance ht, a property satisfied by random walks. This linearity is appealing because it implies that conditional on gt−1 and gt−2, the variance of gt+k, k ≥ 0, will depend only on the distance vt+k − vt−1, but not on the number of points k that lie in between. We now deviate from the prior of Fahrmeir and Lang (2001), and, in fact, from much of the literature on nonparametric functional modeling, by working with a version of the prior in (2) that is proper. The prior in (2) is improper because it incorporates information only about deviations from linearity, but says nothing about the linearity itself. To rectify this problem, we complete the specification of the smoothness prior by providing a distribution for the initial states of the random-walk process,

(g1, g2)′ | τ² ∼ N((g10, g20)′, τ²G0),  (3)

where G0 is a 2 × 2 symmetric positive definite matrix. The prior on the initial conditions (3) induces a prior on linear functions of v that is equivalent to the usual priors placed on the intercept and slope parameters in univariate linear regression.
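The prior in (2)-(3) is straightforward to simulate, which is a useful check when eliciting τ² and G0. The sketch below is our own illustrative code, not from the article; it draws one path of the second-order random walk on irregularly spaced design points and shows how a small τ² forces near-linearity while a large τ² allows a wiggly path.

```python
import numpy as np

def simulate_g(v, tau2, g_init, G0, rng):
    """Draw one path of the second-order random-walk prior (2)-(3)
    on ordered design points v_1 < ... < v_m."""
    m = len(v)
    g = np.empty(m)
    # initial states (3): (g1, g2)' ~ N((g10, g20)', tau2 * G0)
    g[:2] = rng.multivariate_normal(g_init, tau2 * np.asarray(G0))
    for t in range(2, m):
        h_t, h_prev = v[t] - v[t - 1], v[t - 1] - v[t - 2]
        mean = (1.0 + h_t / h_prev) * g[t - 1] - (h_t / h_prev) * g[t - 2]
        g[t] = mean + rng.normal(0.0, np.sqrt(tau2 * h_t))  # u_t ~ N(0, tau2*h_t)
    return g

rng = np.random.default_rng(0)
v = np.sort(rng.uniform(0.0, 1.0, 50))                       # irregular spacing
g_smooth = simulate_g(v, 1e-6, [0.0, 0.0], np.eye(2), rng)   # small tau2: near-linear
g_rough = simulate_g(v, 1.0, [0.0, 0.0], np.eye(2), rng)     # large tau2: flexible
```

Conditional on the first two states, the path's mean is the straight line through g1 and g2, in line with the discussion of (3).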
This can be seen more precisely by iterating (2) in expectation (to eliminate ut, which is the source of the nonlinearity), starting with the initial states as specified in (3). Thus, conditional on g1 and g2, the mean of the Markov process in (2) is a straight line that goes through g1 and g2. As a consequence, the intercept and slope of that line will have a distribution directly related to the distribution in (3) in a one-to-one mapping. This is useful in setting the prior parameters g10, g20, and G0 based on the same information that would be used in a corresponding linear model. The directed Markovian structure in (2) and (3) implies a joint density for the elements of g, which can be obtained by rewriting the Markov process in a random field form. After defining

H =
[ 1                                                ]
[ 0          1                                     ]
[ h3/h2   −(1 + h3/h2)    1                        ]
[             ⋱               ⋱           ⋱        ]
[             hm/hm−1   −(1 + hm/hm−1)    1        ]
and

Σ = diag(G0, h3, . . . , hm),

where the off-diagonal 0's in H and Σ have been suppressed in the notation, the prior on g, which is equivalent to the second-order Markov process prior in (2) and (3), becomes

g | τ² ∼ N(g0, τ²K⁻¹),  (4)
where g0 = H⁻¹g̃, with g̃ = (g10, g20, 0, . . . , 0)′, and the penalty matrix K is given by K = H′Σ⁻¹H. Equivalently, g0 can be derived by taking recursive expectations of (2) starting with the mean in (3), and as argued earlier, the points in g0 will form a straight line. Several points deserve emphasis. First, a key feature of the prior in (4), due to the information in (3), is that it is proper. In contrast, priors in the literature are generally specified with a reduced-rank penalty matrix K and thus are improper. Because improper priors preclude the possibility of formal finite-sample model comparison using marginal likelihoods and Bayes factors (O'Hagan 1994, chap. 7; Kass and Raftery 1995), our prior removes an important impediment to Bayesian model selection. Second, it is important to note that the m × m penalty matrix K is banded, which has considerable practical value because manipulations involving banded matrices take O(m) operations, rather than O(m³) for inversions or O(m²) for multiplication by a vector (and m may be as large as the total number of data points N). Third, Markov process priors are conceptually simple and adaptable to different orders, so as to meet problem-specific tasks (Besag et al. 1995; Fahrmeir and Lang 2001). For example, a first-order prior gt = gt−1 + ut penalizes abrupt jumps between successive states of the Markov process, whereas higher-order priors embody more subtle notions of "smoothness" related to the rates of change in the function. Because the prior on g is defined conditional on the hyperparameter τ², in the next level of the hierarchy we specify the prior distribution τ² ∼ IG(ν0/2, δ0/2). In setting the parameters ν0 and δ0, it is helpful to use the well-known mapping between the mean and variance of the inverse gamma distribution and the parameters ν0 and δ0 (e.g., Gelman, Carlin, Stern, and Rubin 1995, app. A).
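To make the banded structure of K concrete, the following sketch (our own illustrative code, using dense matrices for clarity rather than the O(m) banded routines the article advocates) assembles H, Σ, and K = H′Σ⁻¹H for equally spaced design points and confirms that K is symmetric, positive definite, and pentadiagonal.

```python
import numpy as np

def penalty_matrix(v, G0):
    """Assemble H, Sigma = diag(G0, h_3, ..., h_m), and K = H' Sigma^{-1} H
    implied by the prior (2)-(3). Dense matrices are used for clarity; in
    practice the bandedness of K allows O(m) computation."""
    m = len(v)
    H = np.eye(m)
    Sigma = np.zeros((m, m))
    Sigma[:2, :2] = G0
    for t in range(2, m):
        h_t, h_prev = v[t] - v[t - 1], v[t - 1] - v[t - 2]
        H[t, t - 2] = h_t / h_prev
        H[t, t - 1] = -(1.0 + h_t / h_prev)
        Sigma[t, t] = h_t
    K = H.T @ np.linalg.inv(Sigma) @ H
    return H, Sigma, K

v = np.linspace(0.0, 1.0, 8)
H, Sigma, K = penalty_matrix(v, G0=10.0 * np.eye(2))
offdiag = np.triu(np.abs(K), k=3)   # K is pentadiagonal: this is all zeros
```

With a proper G0 block in Σ, the resulting K is full rank, in contrast to the reduced-rank penalty matrices noted in the text.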
The choice of these parameters will affect the estimated g depending on the other sources of variance in the model. Some intuition can be gained by considering the sampling algorithm that we present in the next section, where in the sampling of g, the inverse of τ² weighs the components of the smoothness prior K and g0 and competes with the inverse of the error variance, which weighs the function g that maximizes the fit in the likelihood.

2.2 Priors on the Linear Effects and Dynamic Parameters

Turning attention to the parametric facets of the model, and in anticipation of the subsequent estimation of the model by the approach of Albert and Chib (1993), we begin by rewriting the model in (1) in terms of the latent variables {zit} as

zit = x̃′it δ + w′it βi + g(sit) + φ1 1{zi,t−1 > 0} + · · · + φJ 1{zi,t−J > 0} + εit,
where yit = 1{zit > 0}. [In the presample (t = −J + 1, . . . , 0), the latent data {zit } are not modeled and for our purposes can simply be taken to equal the presample {yit }.] Suppose that εit is a serially correlated error term that follows a mean-0 stationary pth-order autoregressive [AR( p)] process parameterized in terms of ρ = (ρ1 , . . . , ρp ) , that is, εit = ρ1 εi,t−1 + · · · + ρp εi,t−p + vit ,
(5)
where vit are independent N (0, 1). We can express the process in (5) in terms of a polynomial in the backshift operator L as ρ(L)εit = vit , where ρ(L) = 1 − ρ1 L − · · · − ρp Lp and stationarity is maintained by requiring that all roots of ρ(L) lie outside the unit circle. Estimation with this and several other correlation structures is discussed in Section 3. Stacking the data for each cluster, let yi ≡ ( yi1 , . . . , yiTi ) denote the Ti observations in the ith cluster, and similarly define the lag vectors
yi,−j ≡ (yi,1−j, . . . , yi,Ti−j)′ = (1{zi,1−j > 0}, . . . , 1{zi,Ti−j > 0})′,  j = 1, . . . , J.

Then, for the observations in the ith cluster, we have that

zi = X̃i δ + Wi βi + gi + Li φ + εi,  (6)
where zi = (zi1, . . . , ziTi)′, X̃i = (x̃i1, . . . , x̃iTi)′, Wi = (wi1, . . . , wiTi)′, gi = (g(si1), . . . , g(siTi))′, si = (si1, . . . , siTi)′, Li = (yi,−1, . . . , yi,−J), φ = (φ1, . . . , φJ)′, and the errors εi = (εi1, . . . , εiTi)′ follow the distribution εi ∼ N(0, Ωi). The covariance matrix Ωi is the Ti × Ti Toeplitz matrix implied by the AR process, the construction of which we will discuss later in this article. But first consider the modeling of the unobserved effects. In the spirit of Mundlak (1978), we assume that the distribution of the q-vector βi is Gaussian with mean value depending on the initial observations yi0 ≡ (yi,−J+1, . . . , yi0)′ and the covariates for subject i. In particular, we let

βi | yi0, X̃i, Wi, si, γ, D ∼ N(Ai γ, D),  i = 1, . . . , n,  (7)

or equivalently,

βi = Ai γ + bi,  bi ∼ N(0, D),  i = 1, . . . , n,  (8)
where the matrix Ai can be defined quite flexibly, given the specifics of the problem at hand. In the simplest case in which Wi does not include an intercept and βi is independent of the covariates, a parsimonious way to model the dependence of βi on yi0 is to let Ai be a q × 2q matrix given by Ai = I ⊗ (1, ȳi0), where ȳi0 is the mean of the entries in yi0. [More generally, but less parsimoniously, Ai can also be given by I ⊗ (1, yi0).] Moreover, Ai also may contain within-cluster means (or entire covariate sequences) of a subset of covariates, namely those suspected of being correlated with the random effects for each cluster. If r̄ij (j = 1, . . . , q) denotes the vector of such covariate means (or the entire sequence of covariates), then Ai may be written as the block-diagonal matrix

Ai =
[ 1  ȳi0  r̄′i1                   ]
[               ⋱                ]
[                  1  ȳi0  r̄′iq  ]
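For the parsimonious choice Ai = I ⊗ (1, ȳi0), the construction is a one-line Kronecker product. A minimal sketch (our own illustration; the values of q, ȳi0, and γ are hypothetical):

```python
import numpy as np

# Parsimonious A_i = I_q kron (1, ybar_i0), mapping gamma to E(beta_i).
q = 3
ybar_i0 = 0.5                                         # mean of presample y_i0
A_i = np.kron(np.eye(q), np.array([[1.0, ybar_i0]]))  # q x 2q matrix
gamma = np.arange(2 * q, dtype=float)                 # hypothetical gamma
prior_mean = A_i @ gamma                              # E(beta_i) = A_i gamma
```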
We later adjust this specification for the case when a random intercept is present. An example of the matrix Ai is illustrated in the application of Section 7. Using (8), equation (6) can be equivalently expressed as

zi = Xi β + gi + Wi bi + εi,  (9)

with Xi = (X̃i, Wi Ai, Li) and β = (δ′, γ′, φ′)′.
The presence of the individual effects, bi, induces correlation among the observations in a cluster, but the clusters are modeled as independent. Therefore, even though the columns of Wi can form a subset of the columns of Xi (as is usual in parametric longitudinal models) and Wi can contain the covariate s whose effect is modeled through g(s), the parameters of the model are likelihood-identified through the presence of intercluster and intracluster sources of variation. However, there is an important caveat for the semiparametric model considered here. We note that neither si nor an intercept should be present in the matrix Xi, even though they can be in Wi, in order to resolve the identification problem that would otherwise emerge under a general, unrestricted g(s). Hence, in more general specifications containing a random intercept in the model, one must adjust Ai for identification purposes. If the random intercept is the ith column of Wi, then the column of Ai that is an ith unit vector should be dropped, so that WiAi does not include an intercept. Similarly, if si is the jth column of Wi, then the column of Ai that is a jth unit vector should be dropped, so that the product WiAi does not contain si. It should also be noted that the presence of an unrestricted g(·) does not prevent the inclusion of temporally invariant covariates (e.g., gender, race, various dummies) in either X̃i or Wi, as long as these vary among clusters. One should be aware, however, that their simultaneous inclusion into Ai to model correlation with a random intercept leaves the likelihood unidentified (because WiAi will cause Xi to contain two or more identical columns across all i). The hierarchical structure of the model is completed by the introduction of (semiconjugate) prior densities for the model parameters β, ρ, and D.
Gaussian priors are used to summarize the prior information about the k-vector β, whereas a Wishart prior is used for the q × q matrix D⁻¹, namely β ∼ N(β0, B0) and D⁻¹ ∼ W(r0, R0). The prior on ρ is specified as ρ ∼ N(ρ0, P0)I_Sρ, where I_Sρ is an indicator of the set Sρ containing the ρ that satisfy stationarity. We clarify that, in contrast with the serially correlated errors, stationarity is not an issue for the state-dependence coefficients φ (which are part of β), because these multiply the binary lags and thus serve simply as intercept shifts. The foregoing semiconjugate priors are useful for computational reasons, but the analysis of models with other general priors can be conducted by weighted resampling of the MCMC draws obtained from a model using the priors mentioned here.

3. ESTIMATION

This section presents a new estimation method for longitudinal models with unobserved heterogeneity and serial correlation in the errors. As far as we know, other O(N) methods for estimating this model are not yet available. A main difficulty is that even a single evaluation of the likelihood function requires
evaluation of a multiple integral, and hence penalized likelihood estimation is impractical. Moreover, because m, the number of unique values of s, can be as large as the sample size N, any maximization over the values g can be extremely difficult to apply. Of course, any such maximization needs to be conditioned on a smoothness parameter, but determining that parameter is problematic by current methods such as cross-validation, which are designed for continuous data, but not for the case of longitudinal binary data. Our estimation algorithm takes advantage of the approach of Albert and Chib (1993) to simplify simulation of the posterior distribution by MCMC methods. In the longitudinal data setup, the latent data representation of the model with AR(p) serial correlation was given in (9), where the errors εi = (εi1, . . . , εiTi)′ follow the distribution εi ∼ N(0, Ωi) and Ωi is the Ti × Ti Toeplitz matrix implied by the AR process. For the general AR(p) case, the matrix Ωi can be determined as follows. Let ϕj = E(εitεi,t−j) be the jth autocovariance (satisfying ϕj = ϕ−j). It can be shown (cf. Hamilton 1994, sec. 3.4) that the autocovariances follow the same pth-order difference equation as the process itself, that is, ϕj = ρ1ϕj−1 + · · · + ρpϕj−p. The first p values (ϕ0, ϕ1, . . . , ϕp−1) are given by the first p elements of the first column of the p² × p² matrix [I − F ⊗ F]⁻¹, where ⊗ denotes the Kronecker product and F is the p × p matrix

F ≡ [ ρ′              ]
    [ Ip−1  0(p−1)×1  ]

Using the sequence of autocovariances ϕj obtained in this way, the matrix Ωi can be constructed using Ωi[j, k] = ϕj−k. For example, in the AR(1) case, we have Ωi[j, k] = ρ^|j−k|/(1 − ρ²). After marginalizing bi using the distribution of the random effects, the latent zi can be expressed as

zi = Xi β + gi + ui,
(10)
where the error vector is normal with variance matrix Vi = Ωi + Wi D W′i. This implies that the contribution of the ith cluster to the likelihood function (conditioned on gi),

Pr(yi | β, gi, D, ρ) = ∫_Bi1 · · · ∫_BiTi N(zi | Xi β + gi, Vi) dzi,  (11)
where Bit is the interval (0, ∞) if yit = 1 or the interval (−∞, 0] if yit = 0, is, in general, difficult to calculate. This is true even in the case of uncorrelated errors when Vi = I + Wi D W′i, making it impractical to obtain smoothness parameter estimates by cross-validation and to subsequently use penalized likelihood estimation to obtain estimates of g in generalized mixed-effects models. Previous studies by Diggle and Hutchinson (1989), Altman (1990), and Smith et al. (1998) have drawn attention to the fact that when the errors are treated as independent when they are not, the correlation in the errors can adversely affect the nonparametric estimate of the regression function. For example, when the covariate s is in temporal order, the unknown function g(s) can be confounded with the autocorrelated error process, because both are stochastic processes in time. If the serial correlation in the errors is ignored, then the estimate of g(s) can become too rough as it attempts to mimic the errors. The undersmoothing can be visible even for mild serial correlation.
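The autocovariance construction of Ωi described above can be sketched as follows (our own illustrative code; the helper name and the unit innovation variance are assumptions for the sketch):

```python
import numpy as np
from scipy.linalg import toeplitz

def ar_covariance(rho, T):
    """Build the T x T Toeplitz covariance Omega_i of a stationary AR(p)
    process with unit innovation variance: the first p autocovariances come
    from the first column of [I - F kron F]^{-1}, the rest from the
    difference equation phi_j = rho_1 phi_{j-1} + ... + rho_p phi_{j-p}."""
    rho = np.atleast_1d(np.asarray(rho, dtype=float))
    p = len(rho)
    F = np.zeros((p, p))              # companion matrix F from the text
    F[0, :] = rho
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    vecV = np.linalg.solve(np.eye(p * p) - np.kron(F, F), np.eye(p * p)[:, 0])
    acov = list(vecV[:p])             # (phi_0, ..., phi_{p-1})
    while len(acov) < T:
        acov.append(sum(rho[k] * acov[-1 - k] for k in range(p)))
    return toeplitz(acov[:T])

Omega = ar_covariance([0.5], 4)       # AR(1): Omega[j,k] = 0.5^|j-k| / (1 - 0.25)
```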
Smith et al. (1998) pointed out that even if the independent variable is not time, modeling the autocorrelation in the errors gives more efficient nonparametric estimates, because it reduces the effective error variance similarly to the case of parametric regression. To describe the approach, we stack the observations in (9) for all n clusters and write

z = Xβ + Qg + Wb + ε,  ε ∼ N(0, Ω),  (12)

after defining the vectors z = (z′1, . . . , z′n)′ and b = (b′1, . . . , b′n)′ and the block matrices X = (X′1, . . . , X′n)′, W = diag(W1, . . . , Wn), and Ω = diag(Ω1, . . . , Ωn).
In the foregoing, Q is an incidence matrix of dimension N × m, with entries Qij = 1 if si = vj and 0 otherwise. In other words, the ith row of Q contains a 1 in the position where the observation on s for that row matches the design point from the vector v, with all remaining elements 0's, so that s = Qv. The fact that the errors are not orthogonal (and Ω is not diagonal) requires only minor adjustments to the sampling of β, z, D, τ², and b; standard Bayes updates for models with serial correlation can be applied to obtain and simulate the posterior densities (given in Algorithm 1 later in this section). But the sampling of g is problematic. To understand the difficulty, note that with uncorrelated errors (i.e., when Ω = I), we have the following full-conditional distribution: g | y, β, b, τ², z ∼ N(ĝ, G). In this case, G = (K/τ² + Q′Q)⁻¹ and ĝ = G(Kg0/τ² + Q′(z − Xβ − Wb)). Remark 1 presents a computationally efficient method for sampling this density.

Remark 1. In sampling g, one should note that Q′Q is a diagonal matrix with a jth diagonal entry equal to the number of values in s corresponding to the design point vj. Because K and Q′Q are banded, G⁻¹ is banded as well. Thus sampling of g need not include an inversion to obtain G and ĝ. The mean ĝ can be found instead by solving G⁻¹ĝ = (Kg0/τ² + Q′(z − Xβ − Wb)), which is done in O(m) operations by back substitution. Also, let P′P = G⁻¹, where P is the Cholesky decomposition of G⁻¹ and is also banded. To efficiently obtain a random draw from N(ĝ, G), sample u ∼ N(0, I), and solve Px = u for x by back substitution. It follows that x ∼ N(0, G). Adding the mean ĝ to x yields a draw g ∼ N(ĝ, G).

Unfortunately, after accounting for the autocorrelated errors, the full-conditional distribution g | y, β, b, τ², z, Ω ∼ N(ĝ, G) will involve the matrix G⁻¹ = (K/τ² + Q′Ω⁻¹Q), which is no longer banded (even though Ω⁻¹ is banded). Hence the computational shortcuts discussed in Remark 1 are inapplicable.
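The banded computations of Remark 1 can be sketched with standard banded-matrix routines. This is our own illustration for the uncorrelated-errors case, with a small hypothetical tridiagonal G⁻¹ standing in for K/τ² + Q′Q:

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded, solve_banded

def sample_g_banded(Ginv_banded, rhs, rng):
    """Draw g ~ N(ghat, G) when G^{-1} is banded, as in Remark 1: solve
    G^{-1} ghat = rhs, then P x = u by back substitution, where
    P'P = G^{-1} is a banded Cholesky factorization (O(m) per draw).

    Ginv_banded : G^{-1} in scipy's upper banded storage
    rhs         : here standing in for K g0/tau^2 + Q'(z - X beta - W b)
    """
    U = cholesky_banded(Ginv_banded)          # G^{-1} = U'U with U banded
    ghat = cho_solve_banded((U, False), rhs)  # mean: solve G^{-1} ghat = rhs
    u = rng.standard_normal(len(rhs))
    x = solve_banded((0, U.shape[0] - 1), U, u)  # U x = u  =>  x ~ N(0, G)
    return ghat + x

# hypothetical tridiagonal G^{-1}: 4 on the diagonal, 1 off it
m = 6
ab = np.zeros((2, m))
ab[0, 1:] = 1.0    # superdiagonal
ab[1, :] = 4.0     # main diagonal
rng = np.random.default_rng(1)
draw = sample_g_banded(ab, np.ones(m), rng)
```

Because every step operates on banded storage, the cost per draw grows linearly in m rather than cubically, which is the point of Remark 1.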
Intuitively, bandedness fails because serial correlation introduces dependence between observations that are neighbors on the basis of the ordering of the covariate s, whereas the function evaluations g depend on neighbors determined according to the ordering in v, the vector of unique and ordered values of s (with s = Qv).
Diggle and Hutchinson (1989) and Altman (1990) considered a special case that can still result in O(N) estimation. In this case, attention was restricted to univariate models for nonclustered data where the independent variable is time. Then, because the elements in s are already unique and ordered, we have v = s or, in other words, Q = I. This implies that G⁻¹ = (K/τ² + Q′Ω⁻¹Q) = (K/τ² + Ω⁻¹) is banded, and estimation can be done in O(N) operations as outlined in Remark 1. Unfortunately, in panel data settings Q is unlikely to be an identity matrix even when s is time, because repeating values in s will tend to emerge across clusters. Even when all values in s are unique, Q will be some permutation (not necessarily an identity) matrix, because the ordering of s need not be consistent with the cluster structure. The general case where s is allowed to be any covariate (not necessarily time) was considered by Smith et al. (1998), but their algorithm, which works with the nonbanded matrix G⁻¹, is O(N³). Thus the applicability of that method is limited to small datasets, and it is infeasible in panel data settings, where the sample size N can run into the thousands. Finally, we note that the method of orthogonalizing the errors by working with the transformed data ρ(L)zit, ρ(L)xit, ρ(L)g(sit), and ρ(L)wit (cf. Harvey 1981, chap. 6; Chib 1993) works well in parametric models but is not a solution here, because it is equivalent to premultiplying the matrices X, W, and Q by the Cholesky decomposition of Ω⁻¹, which still leaves G⁻¹ nonbanded. Here we propose a different approach to orthogonalizing the errors that exploits the longitudinal nature of the data. In particular, the idea is to decompose the errors into a correlated part and an orthogonal part, and to deal with the correlated part in much the same way in which we deal with the random effects. Once the correlated part is given, the nonparametric estimation of g can proceed as efficiently as before.
To illustrate, decompose the matrix Ωi = Ri + κI, where Ri is a symmetric positive definite matrix and κI is a diagonal matrix with κ > 0. Furthermore, let Ci be the Cholesky factor of Ri such that CiC′i = Ri. Then Ωi = CiC′i + κI, and the model can be rewritten as

zi = Xi β + Wi bi + gi + εi = Xi β + Wi bi + gi + Ci ui + ξi,
(13)
where ui ∼ N(0, I) and ξi ∼ N(0, κI) are mutually independent. Stacking the observations in (13) for all n clusters, in analogy with (12), we have

z = Xβ + Qg + Wb + Cu + ξ,  ξ ∼ N(0, κI),  (14)

where u = (u′1, . . . , u′n)′ and C = diag(C1, . . . , Cn).
Because the covariance matrix of ξ is diagonal, conditional on Cu, we have obtained an orthogonalization of the serially correlated errors that can be used to sample g efficiently. It now remains to prove that a decomposition of Ωi into the sum of a symmetric positive definite matrix Ri and a (positive definite) diagonal matrix κI exists, and to show how it can be found. The details are formalized here and rely on results given by Ortega (1987, pp. 31–32).
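A numerical sketch of this decomposition (our own illustration, using a hypothetical AR(1) Toeplitz Ωi and the κ = min eigenvalue/2 choice suggested later in the text):

```python
import numpy as np
from scipy.linalg import toeplitz, cholesky

# Decompose Omega_i = C_i C_i' + kappa*I for an AR(1) Toeplitz covariance
# (rho and T are hypothetical values for illustration).
rho, T = 0.8, 5
Omega = toeplitz(rho ** np.arange(T)) / (1.0 - rho ** 2)  # Omega[j,k] = rho^|j-k|/(1-rho^2)
lam = np.linalg.eigvalsh(Omega)
kappa = lam.min() / 2.0            # any 0 < kappa < min eigenvalue works
R = Omega - kappa * np.eye(T)      # symmetric positive definite by Lemma 1
C = cholesky(R, lower=True)        # C C' = R, so Omega = C C' + kappa*I
```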
Theorem 1. Let the matrix A have eigenvalues {λi}, and let p(Z) ≡ κ0I + κ1Z + κ2Z² + · · · + κmZᵐ be some (scalar- or matrix-valued) polynomial of degree m. Then {p(λi)} are eigenvalues of the matrix p(A), and A and p(A) have the same eigenvectors.
The proof follows from first principles by writing the matrices in the polynomial in terms of their spectral decomposition and collecting terms. The result that we originally sought to show can now be derived using a special case of the polynomial in the theorem, that is, p(Z) = Z − κI.
Lemma 1. A Ti × Ti symmetric positive definite matrix Ωi can always be written as the sum of a symmetric positive definite matrix Ri and a positive definite diagonal matrix κI.

Proof. Let p(Ωi) = Ωi − κI, which is the definition of Ri. Then, if the eigenvalues of Ωi are {λij}, j = 1, . . . , Ti, which are strictly positive and real because Ωi is symmetric and positive definite, it follows from Theorem 1 that the eigenvalues of Ri will be {λij − κ}, j = 1, . . . , Ti. Then, choosing κ such that min{λij} > κ > 0 guarantees that Lemma 1 holds.

Because the previous decomposition is not unique, various values of κ will correspond to the same model and the same dynamics. Therefore, in practice the choice of κ will be based on convenience and numerical stability. One simple choice that has performed well in our simulations is to set κ = min{λij}/2. Also note that Ωi depends only on ρ and not on the data, so the decomposition need not be performed for every i. Based on the preceding discussion, we present the estimation algorithm. For the sampling of the vector of AR coefficients ρ = (ρ1, . . . , ρp)′, it is useful to define the following quantities. Let eit = zit − x′itβ − w′itbi − g(sit), ei = (ei,p+1, . . . , ei,Ti)′, e = (e′1, . . . , e′n)′, and let E denote the (N − np) × p matrix with rows containing p lags of eit (i = 1, . . . , n, t ≥ p + 1), that is, (ei,t−1, . . . , ei,t−p). Finally, let the initial p values of eit in each cluster be given by ei1 = (ei1, . . . , eip)′, and let Ωp be the p × p stationary covariance matrix of the AR(p) error process, which is a function of ρ and is constructed identically to {Ωi}. Simulation of ρ is by the Metropolis–Hastings (M–H) algorithm (Hastings 1970; Tierney 1994; Chib and Greenberg 1995).

Algorithm 1: Model with state dependence and AR(p) serial correlation.

1.
1. Sample {zi}|y, D, g, β, and ρ marginal of {bi} by drawing, for i ≤ n, t ≤ Ti,
   zit ∼ TN_(0,∞)(μit, vit) if yit = 1 and zit ∼ TN_(−∞,0](μit, vit) if yit = 0,
   where TN_(a,b)(μit, vit) is a normal distribution truncated to the interval (a, b) with mean μit = E(zit|zi\t, β, gi, Vi) and variance vit = var(zit|zi\t, β, gi, Vi), with Vi = Ωi + Wi D Wi′ and Ωi determined by ρ as discussed earlier.
2. Sample β, {bi}, {ui}|y, D, {zi}, g, and ρ in one block by drawing the following:
   (a) β|y, D, {zi}, g, ρ ∼ N(β̂, B), where β̂ = B(B0^{-1}β0 + Σ_{i=1}^{n} Xi′Vi^{-1}(zi − gi)) and B = (B0^{-1} + Σ_{i=1}^{n} Xi′Vi^{-1}Xi)^{-1};
   (b) bi|y, D, {zi}, β, g, ρ ∼ N(b̂i, Bi), where b̂i = Bi Wi′Ωi^{-1}(zi − Xiβ − gi) and Bi = (D^{-1} + Wi′Ωi^{-1}Wi)^{-1}, for i = 1, …, n;
   (c) ui|y, {zi}, {bi}, β, g, ρ ∼ N(ûi, Ui), where ûi = Ui Ci′(zi − Xiβ − gi − Wibi)/κ and Ui = (I + Ci′Ci/κ)^{-1}, for i = 1, …, n.
3. Sample D^{-1}|{bi} ∼ W{r0 + n, (R0^{-1} + Σ_{i=1}^{n} bi bi′)^{-1}}.
4. Sample g|y, β, {bi}, τ², {zi}, {ui} ∼ N(ĝ, G), where G = (K/τ² + Q′Q/κ)^{-1} and ĝ = G(Kg0/τ² + Q′(z − Xβ − Wb − Cu)/κ). Because G^{-1} is banded, estimation can proceed efficiently as discussed in Remark 1.
5. Sample τ²|g ∼ IG((ν0 + m)/2, (δ0 + (g − g0)′K(g − g0))/2).
6. Sample ρ|y, g, β, {bi}, {zi} ∝ ℓ(ρ) × N(ρ̂, P)I_{Sρ}, where ρ̂ = P(P0^{-1}ρ0 + E′e), P = (P0^{-1} + E′E)^{-1}, and ℓ(ρ) = |Ωp|^{-n/2} exp(−(1/2) Σ_{i=1}^{n} ei1′Ωp^{-1}ei1).
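Step 1 reduces to a sequence of univariate truncated normal draws. A minimal sketch of one such draw by the inverse-CDF method (the mean and variance arguments here are placeholders, not the conditional moments μit and vit computed in the algorithm):

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()
rng = np.random.default_rng(2)

def draw_truncated_normal(mu, v, y):
    """Draw z ~ N(mu, v) restricted to (0, inf) if y = 1 and to (-inf, 0] if y = 0."""
    sd = np.sqrt(v)
    p0 = nd.cdf(-mu / sd)                 # mass of N(mu, v) below the cutoff 0
    u = p0 + rng.uniform() * (1.0 - p0) if y == 1 else rng.uniform() * p0
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf well defined
    return mu + sd * nd.inv_cdf(u)

z1 = draw_truncated_normal(0.3, 1.0, 1)   # latent draw consistent with y = 1
z0 = draw_truncated_normal(0.3, 1.0, 0)   # latent draw consistent with y = 0
assert z1 >= 0.0 and z0 <= 0.0
```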
In the M–H step of Algorithm 1, a proposal draw ρ′ is generated from the density N(ρ̂, P)I_{Sρ} and is subsequently accepted as the next sample value with probability min{ℓ(ρ′)/ℓ(ρ), 1} (see Chib and Greenberg 1994). If the candidate value ρ′ is rejected, then the current value ρ is repeated as the next value of the MCMC sample. We note several aspects of this algorithm. First, because β, {bi}, and {ui} are correlated by construction, they are sampled in one block to speed up the mixing of the chain. This is done by using (10) to sample β from a conditional density that does not depend on {bi} and {ui}, followed by drawing {bi} from a conditional density that depends on β but not on {ui}, and finally drawing {ui} from its full-conditional density (Chib and Carlin 1999). Second, the MCMC approach to estimating τ² in this hierarchical model is an alternative to cross-validation that can be applied to both continuous and discrete data and that fully accounts for parameter uncertainty, unlike plug-in approaches, which ignore the variability due to estimating the smoothing parameter. An important alternative to these methods is the maximum integrated likelihood approach to determining τ² (Kohn, Ansley, and Tharm 1991), but this is not feasible in this binary data setting because of the N-dimensional integration that would be required. Several extensions are possible. The approach for dealing with dependent errors can accommodate other correlation structures, such as exponentially correlated error sequences Ωi[t, s] = exp{−α|t − s|^r} for scalars α and r (e.g., Diggle and Hutchinson 1989). The method can also handle estimation of non-Toeplitz Ωi using the algorithms of Chib and Greenberg (1998).
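For intuition, the M–H update can be sketched in the AR(1) special case, where Ωp reduces to the scalar 1/(1 − ρ²); the proposal moments and the residuals below are placeholders rather than the quantities computed in Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_ell(rho, e1):
    """log l(rho) for AR(1): Omega_p = 1/(1 - rho^2) is scalar, so
    log l = (n/2) log(1 - rho^2) - (1 - rho^2) * sum(e1^2) / 2."""
    n = e1.size
    return 0.5 * n * np.log(1.0 - rho**2) - 0.5 * (1.0 - rho**2) * np.sum(e1**2)

def mh_step(rho, e1, rho_hat, P):
    """One M-H update: propose from N(rho_hat, P) restricted to the stationarity
    region (-1, 1); accept with probability min{l(rho') / l(rho), 1}."""
    rho_prop = rng.normal(rho_hat, np.sqrt(P))
    if abs(rho_prop) >= 1.0:          # outside the stationarity region: reject
        return rho
    log_alpha = min(0.0, log_ell(rho_prop, e1) - log_ell(rho, e1))
    return rho_prop if rng.uniform() < np.exp(log_alpha) else rho

e1 = rng.normal(size=500)             # placeholder initial-condition residuals e_i1
rho = 0.0
for _ in range(200):
    rho = mh_step(rho, e1, rho_hat=0.2, P=0.01)
assert -1.0 < rho < 1.0               # the chain stays in the stationarity region
```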
Yet other extensions can be pursued in the latent variable step; for example, the methods can be adapted to models for polychotomous data and models with t-links (Albert and Chib 1993) or to mixture-of-normals link functions, including an approximation to the logit link (Wood and Kohn 1998). Of course, Algorithm 1 subsumes an algorithm for the estimation of a simpler, fully parametric version of the model. Applications to continuous-data settings are also immediate. 4. AVERAGE COVARIATE EFFECTS We now turn to the question of finding the effect of a change in a given covariate xj . This is important for understanding the model and for determining the impact of an intervention on one or more of the covariates. However, a change in a covariate affects both the contemporaneous response and its future values. This effect also depends on all other covariates and model parameters. The impact is quite complex and nonlinear because of
Chib and Jeliazkov: Inference in Semiparametric Dynamic Models
the nonlinear link function, the state dependence, the serial correlation, the unknown function, and the random effects. Hence the effect is calculated marginalized over the remaining covariates and the parameters. Suppose that the model for a new individual i is given by zit = xit′β + wit′bi + g(sit) + εit, where the definitions of xit and β are as in Sections 2 and 3, and we are interested in the effect of a particular x, say x1, on contemporaneous and future yit. Splitting xit and β accordingly, we rewrite the foregoing model as zit = x1it′β1 + x2it′β2 + wit′bi + g(sit) + εit. The average covariate effect can then be analyzed from a predictive perspective applied to this new individual i. Given the specific context, we may consider various scenarios of interest; examples of economic policy interventions include increasing or decreasing income by some percentage and imposing an additional year of education. These interventions may affect the covariate values in a single period (e.g., a one-time tax break) or in multiple periods (e.g., a permanent tax reduction or an increase in the minimum mandatory level of education). For specificity, suppose that one thinks of setting {x1it}_{t=1}^{Ti} to the values {x1it†}_{t=1}^{Ti}. (Again, only a subset of these values could be affected by the intervention, whereas the others can remain unchanged.) For a predictive horizon of t = 1, 2, …, Ti (where Ti is the smallest of the cluster sizes in the observed data), one is now interested in the distribution of yi1, yi2, …, yiTi marginalized over {x2it}, bi, and θ = (β, φ, g, D, τ², ρ) given the current data y = (y1, …, yn). A practical procedure is to marginalize out the covariates as a Monte Carlo average using their empirical distribution, while also integrating out θ with respect to the posterior distribution π(θ|y). Of course, bi is independent of y and hence can be integrated out of the joint distribution of {zi1, …
, ziTi} analytically using the distribution N(0, D), without recourse to Monte Carlo. Therefore, the goal is to obtain a sample of draws from the distribution

[zi1, …, ziTi | y, {x1it†}] = ∫ [zi1, …, ziTi | y, {x1it†}, {x2it}, {wit}, {sit}, θ] × π({x2it}, {wit}, {sit}) π(θ|y) d{x2it} d{wit} d{sit} dθ.

A sample from this predictive distribution can be obtained by the method of composition applied in the following way. Randomly draw an individual and extract the sequence of covariate values. Sample a value of θ from the posterior density, and sample {zi1, …, ziTi} jointly from [zi1, …, ziTi | y, {x1it†}, {x2it}, {wit}, {sit}, θ], constructing the {yit} in the usual way. Repeat this for other individuals and other draws from the posterior distribution to obtain the predictive probability mass function of (yi1, …, yiTi). Repeat this analysis for a different {x1it}, say {x1it‡}. The difference in the computed pointwise probabilities then gives the effect of x1 as the values {x1it†} are changed to {x1it‡}. This approach can be similarly applied to other elements of xit and wit. Quite significantly, it can be applied in determining
the effect of the nonparametric component g(s), because the error bands that are usually reported in the estimation of g(·) are pointwise, not joint, and do not provide sufficient information on which to make probabilistic statements about the functional shape (Hastie and Tibshirani 1990, p. 62), differences such as g(s†) − g(s‡), and the effect of s on the probability of response. In addition, we can condition on certain variables (gender, race, and specific initial conditions) that might determine a particular subsample of interest, in which case our procedures are applied only to observations in the subsample. An application of the techniques is considered in Section 7, where Figure 8 shows the estimated average covariate effects for husband's income and two child variables and Figure 9 shows these effects for two specific subsamples.

5. MODEL COMPARISON

A central issue in the analysis of statistical data is model formulation, because the appropriate specification is rarely known and is subject to uncertainty. Among other considerations, the uncertainty may be due to the problem of variable selection (i.e., the specific covariates and lags to be included in the model), the functional specification (a parametric model vs. a semiparametric model), or the distributional assumptions. In general, given the data y, interest centers on a collection of models {M1, …, ML} representing competing hypotheses about y. Each model Ml is characterized by a model-specific parameter vector θl and sampling density f(y|Ml, θl). Bayesian model selection proceeds by comparing the models in {Ml} through their posterior odds ratio, which for any two models Mi and Mj is written as

Pr(Mi|y)/Pr(Mj|y) = [Pr(Mi)/Pr(Mj)] × [m(y|Mi)/m(y|Mj)],   (15)

where m(y|Ml) = ∫ f(y|Ml, θl)πl(θl|Ml) dθl is the marginal likelihood of Ml. The first fraction on the right side of (15) is known as the prior odds; the second is known as the Bayes factor.
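On the log scale, (15) is a one-line computation; a schematic with hypothetical marginal likelihood values:

```python
import math

def log_posterior_odds(log_m_i, log_m_j, prior_i=0.5, prior_j=0.5):
    """Equation (15) on the log scale: log prior odds + log Bayes factor."""
    return math.log(prior_i / prior_j) + (log_m_i - log_m_j)

# Hypothetical log marginal likelihoods for two competing models.
lpo = log_posterior_odds(-1483.7, -1486.8)

# With equal prior odds, the posterior odds equal the Bayes factor.
assert abs(lpo - 3.1) < 1e-6
```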
To date, model comparisons in the semiparametric context have been based on such criteria as the AIC and BIC (e.g., Shively et al. 1999; Wood et al. 2002; DiMatteo et al. 2001; Hansen and Kooperberg 2002), because direct evaluation of the integral defining m(y|Ml ) is generally infeasible. But the AIC and BIC cannot be computed for the model in this article because of the difficulty (discussed in Sec. 3) of maximizing the likelihood in the semiparametric binary panel setting, whereas both the AIC and the BIC require the maximized value of the likelihood as input. Here we revisit the question of calculating the marginal likelihood of the semiparametric model and show that it can be managed through careful application of existing methods. In particular, Chib (1995) provided a method based on the recognition that the marginal likelihood can be expressed as m(y|Ml ) = f (y|Ml , θ l )π(θ l |Ml )/π(θ l |y, Ml ). This identity holds for any point θ l , so that calculation of the marginal likelihood is reduced to finding an estimate of the posterior ordinate π(θ ∗l |y, Ml ) at a single point θ ∗l . In what follows, we suppress the model index for notational convenience. Suppose that the parameter vector θ is split into B conveniently specified components or blocks (usually done on the basis of the natural groupings that emerge in constructing the MCMC sampler), so that
692
Journal of the American Statistical Association, June 2006
θ = (θ1, …, θB). Let ψ*_i = (θ*_1, …, θ*_i) denote the blocks up to i, fixed at their values in θ*, and let ψ^{i+1} = (θ_{i+1}, …, θ_B) denote the blocks beyond i. Then, by the law of total probability, we have

π(θ*_1, …, θ*_B|y) = ∏_{i=1}^{B} π(θ*_i|y, θ*_1, …, θ*_{i−1}) = ∏_{i=1}^{B} π(θ*_i|y, ψ*_{i−1}).
When the full-conditional densities are known, each ordinate π(θ*_i|y, ψ*_{i−1}) can be estimated by Rao–Blackwellization as

π(θ*_i|y, ψ*_{i−1}) ≈ J^{−1} Σ_{j=1}^{J} π(θ*_i|y, ψ*_{i−1}, ψ^{i+1,(j)}),

where the draws ψ^{i,(j)} ∼ π(ψ^i|y, ψ*_{i−1}), j = 1, …, J, come from a reduced run for 1 < i < B in which sampling is only over ψ^i, with the blocks ψ*_{i−1} held fixed. The ordinate π(θ*_1|y) for the first block of parameters θ1 is estimated with draws θ ∼ π(θ|y) from the main MCMC run, whereas the ordinate π(θ*_B|y, ψ*_{B−1}) is available directly. When one or more of the full-conditional densities are not of a standard form and sampling requires the M–H algorithm, Chib and Jeliazkov (2001) used the local reversibility of the M–H chain to show that

π(θ*_i|y, ψ*_{i−1}) = E1{α(θi, θ*_i|y, ψ*_{i−1}, ψ^{i+1}) q(θi, θ*_i|y, ψ*_{i−1}, ψ^{i+1})} / E2{α(θ*_i, θi|y, ψ*_{i−1}, ψ^{i+1})},   (16)

where E1 is the expectation with respect to the conditional posterior π(ψ^i|y, ψ*_{i−1}) and E2 is that with respect to the conditional product measure π(ψ^{i+1}|y, ψ*_i) q(θ*_i, θi|y, ψ*_{i−1}, ψ^{i+1}). In the preceding, q(θ, θ′|y) denotes the candidate-generating density of the M–H chain for moving from the current value θ to a proposed value θ′, and α(θ, θ′|y, ψ*_{i−1}, ψ^{i+1}) denotes the M–H probability of moving from θ to θ′. Each of these expectations can be computed from the output of appropriately chosen reduced runs.

Three new and important considerations emerge when applying these methods to the model considered in this article. Our implementation, using Algorithm 1 to simulate the blocks {zi}, β, {bi}, {ui}, D, g, τ², and ρ, is based on the following posterior decomposition (marginalized over the latent data {zi}, {bi}, {ui}):

π(D*, τ²*|y) π(β*|y, D*, τ²*) × π(ρ*|y, D*, τ²*, β*) π(g*|y, D*, τ²*, β*, ρ*).

The first consideration is that the ordinate of g is estimated last, because this tends to improve the efficiency of the ordinate estimation in the Rao–Blackwellization step, where

π(g*|y, D*, τ²*, β*, ρ*) ≈ J^{−1} Σ_{j=1}^{J} π(g*|y, D*, τ²*, β*, ρ*, {zi}^{(j)}, {bi}^{(j)}, {ui}^{(j)}),

so that marginalization is only with respect to the conditional distribution of the latent data π({zi}, {bi}, {ui}|y, D*, τ²*, β*, ρ*), with all parameter blocks in the conditioning set fixed. Because the simulation algorithm for g is O(m), this particular choice comes at a manageable computational cost (from having to simulate g in all of the preceding reduced runs), whereas the statistical efficiency benefits may be substantial when m is large. Second, the ordinate π(D*, τ²*|y) can be estimated jointly because, conditional on {bi} and g, the full-conditional densities of D and τ² are independent; this observation saves a reduced run. Third, in Algorithm 1 the proposal density q(ρ, ρ′|y, ·) = q(ρ′|y, ·) is a truncated normal density with an unknown normalizing constant, except in the AR(1) case. Specifically, over the region of stationarity Sρ, we have q(ρ|y, ·) = h(ρ|y, ·)/∫_{Sρ} h(ρ|y, ·) dρ, where h(ρ|y, ·) is an unrestricted normal density for ρ. We avoid the need to compute this unknown constant of integration by noting that when the reversibility condition used by Chib and Jeliazkov (2001) to obtain (16) is written in terms of q(ρ|y, ·), the unknown normalizing constant (being the same on both sides) cancels out, so that, on integration, we have

π(ρ*|y, ψ*_{i−1}) = E1{α(ρ, ρ*|y, ψ*_{i−1}, ψ^{i+1}) h(ρ*|y, ψ*_{i−1}, ψ^{i+1})} / E2{α(ρ*, ρ|y, ψ*_{i−1}, ψ^{i+1})},
where ψ*_{i−1} = (D*, τ²*, β*), ψ^{i+1} = (g, {zi}, {bi}, {ui}), E1 is the expectation with respect to π(ρ, ψ^{i+1}|y, ψ*_{i−1}), and E2 is the expectation with respect to π(ψ^{i+1}|y, ψ*_i) h(ρ|y, ψ*_{i−1}, ψ^{i+1}). It is also useful to note that because h(ρ′|y, ·) does not depend on the current value of ρ in the sampler, the denominator quantity can be estimated with draws available from the same run in which the numerator is estimated. Implementing these methods requires the likelihood ordinate f(y|D*, β*, g*, ρ*), which we obtain by the method of Geweke, Hajivassiliou, and Keane (GHK) (Geweke 1991; Börsch-Supan and Hajivassiliou 1993; Keane 1994), using 10,000 Monte Carlo iterations.

6. SIMULATION STUDY

We carried out a simulation study to examine the performance of the techniques proposed in Algorithm 1. For the estimates of the nonparametric function, we report mean squared errors (MSEs) for several designs. The posterior mean estimates E{g(v)|y} are found from MCMC runs of length 5,000 after burn-ins of 1,000 cycles. For the parametric components of the model, we report the autocorrelations and inefficiency factors under alternative specifications and sample sizes. We find that the overall performance of the MCMC algorithm improves with larger sample sizes (increasing either the number of clusters n or the cluster sizes {Ti}), and that the random effects are simulated better when the increase in sample size comes from larger {Ti}. Data are simulated from the model in (1) and (7), without serial correlation, using one, two, and three state-dependence lags, a single fixed-effect covariate X̃, and one or two individual-effect covariates W (including a random intercept) that are correlated with (the average of) the initial conditions. X̃ and W contain independent standard normal random variables, and we use δ = 1, γ = 1, φ = .5 × 1, and D = .2 × I. We generate
panels with 250, 500, and 1,000 clusters and with 10 time periods, using only the last 7 for estimation (Ti = 7, i = 1, …, n), because our largest models contain 3 lags. We consider three functional forms for g(s), presented in Figure 1, which capture a range of possible specifications in the literature. For the simulations, each function is evaluated on a regular grid of m = 51 points. We gauge the performance of the method in fitting these functions using MSE = (1/m) Σ_{j=1}^{m} {ĝ(vj) − g(vj)}². The average MSE, together with the standard errors based on 20 data samples, is reported in Table 1 for alternative designs and sample sizes. In all cases, we have used comparably mild priors β ∼ N(0, 10I), (g1, g2)′|τ² ∼ N(0, τ² × 100 × I), D^{-1} ∼ W(q + 4, 1.67 × Iq), and τ² ∼ IG(3, .1); the priors on the variance parameters imply that E(D) = .2 × I, SD(diag(D)) = .283 × 1, E(τ²) = .05, SD(τ²) = .05, and Eτ²(var((g1, g2)′)) = 5 × I. (We discuss the role of the smoothness prior in more detail later.) Table 1 also shows that as the sample size grows, the functions are estimated more precisely, as expected. Also, in line with conventional wisdom, the general trend is that fitting models with fewer parameters results in lower MSE estimates. We clarify that under this setup, increasing the number of lags J affects the simulation study in two ways: first, the dimension of the parameter space increases, and second, the proportion of 1's among the responses changes (because all elements of φ are positive). It is well known that the degree of asymmetry in the proportion of the responses affects the estimation precision. The proportion of 1's is between .62 and .67 across the three functional specifications for our one-lag models, between .67 and .72 for the two-lag models, and between .71 and .76 for the three-lag models. As Table 1 shows, however, the method recovers the true functions well despite this asymmetry.
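The fit criterion is straightforward to compute; a sketch using the third test function from Figure 1, with a perturbed copy standing in for a posterior mean estimate ĝ:

```python
import numpy as np

def g3(s):
    """Third test function: g(s) = -.8 + s + exp{-30(s - .5)^2}, s in [0, 1]."""
    return -0.8 + s + np.exp(-30.0 * (s - 0.5) ** 2)

v = np.linspace(0.0, 1.0, 51)             # regular grid of m = 51 points
g_true = g3(v)
g_hat = g_true + 0.05 * np.sin(10.0 * v)  # stand-in for a posterior mean estimate

mse = np.mean((g_hat - g_true) ** 2)      # MSE = (1/m) sum_j {ghat(v_j) - g(v_j)}^2
assert 0.0 < mse <= 0.05 ** 2             # pointwise error is at most 0.05
```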
The computational cost per 1,000 MCMC draws is approximately 16 seconds in the simplest case (n = 250, J = 1, q = 1). Adding more fixed effects (e.g., lags) does not increase the costs by more than 1/10 of a second. However, adding an additional random effect increases the computational cost to
Figure 1. The Three Functions Used in Generating Data for the Simulation Study: g(s) = sin(2πs), s ∈ [.6, 1.4]; g(s) = −1 + s + 1.6s² + sin(5s), s ∈ [0, 1.1]; g(s) = −.8 + s + exp{−30(s − .5)²}, s ∈ [0, 1].
Table 1. Average Mean Squared Errors (with estimated standard errors in parentheses)

                                          Average MSE (×10^-2)
Clusters     Lags     Random effects     g1            g2            g3
n = 250      J = 1    q = 1              1.60 (.24)    1.40 (.14)    1.77 (.32)
                      q = 2              1.92 (.26)    1.85 (.23)    1.41 (.19)
             J = 2    q = 1              2.21 (.43)    1.61 (.21)    1.78 (.37)
                      q = 2              2.72 (.48)    2.82 (.40)    2.62 (.48)
             J = 3    q = 1              2.89 (.99)    1.98 (.54)    2.93 (.41)
                      q = 2              3.42 (.67)    2.12 (.60)    3.02 (.75)
n = 500      J = 1    q = 1               .68 (.08)     .98 (.15)     .88 (.14)
                      q = 2              1.16 (.13)    1.04 (.21)     .98 (.09)
             J = 2    q = 1               .96 (.15)    1.43 (.17)     .91 (.23)
                      q = 2              1.69 (.28)    1.05 (.18)    1.57 (.25)
             J = 3    q = 1              2.57 (.41)    1.82 (.31)    1.65 (.38)
                      q = 2              2.49 (.58)    1.81 (.26)    1.76 (.22)
n = 1,000    J = 1    q = 1               .64 (.12)     .63 (.11)     .60 (.10)
                      q = 2               .56 (.07)     .58 (.08)     .68 (.10)
             J = 2    q = 1               .83 (.13)     .52 (.09)     .72 (.11)
                      q = 2               .71 (.11)     .69 (.17)     .57 (.08)
             J = 3    q = 1              1.27 (.29)     .60 (.09)     .65 (.11)
                      q = 2               .94 (.19)    1.19 (.28)     .84 (.12)

NOTE: The three functions used in generating data for the simulation study were g1(s) = sin(2πs), s ∈ [.6, 1.4]; g2(s) = −1 + s + 1.6s² + sin(5s), s ∈ [0, 1.1]; g3(s) = −.8 + s + exp{−30(s − .5)²}, s ∈ [0, 1].
58 seconds per 1,000 draws, because of the expense of simulating a larger random-effects vector in each cluster. Finally, the computational times increased linearly in relation to the sample size n. Figure 2 presents three particular function fits for the case (n = 500, J = 2, q = 1). We now comment on the influence of the prior on our function estimates. The model as a whole has three important variance structures, and the relative informativeness of the priors for each of these structures should be viewed in the context of the other two structures, not in isolation. As already discussed, the variance of the errors and the variance τ 2 of the Markov process prior determine the trade-off between a good fit and a smooth function g(·). Similarly, the variance of the errors and
Figure 2. Simulation Study. Three examples of estimated functions ( —–), true functions ( -·-·-), and 95% pointwise confidence bands ( - - - -).
Figure 3. Effect of τ 2 on the Estimates of g: n = 250 [(a) and (c)] and n = 1,000 [(b) and (d)]. A larger τ 2 is used in (a) and (b) than in (c) and (d). The dashed lines represent pointwise confidence bands.
the variance of the random effects D determine a balance between intracluster and intercluster variation. Whereas in large samples the effects of the assumed priors on the parameter estimates are small (vanishing asymptotically), informative priors
do matter in small samples. Figure 3 illustrates this point by using two exaggeratedly different informative priors on τ². In one case the prior on τ² is such that E(τ²) = .5 and SD(τ²) = .1; in the second case E(τ²) = .001 and SD(τ²) = .001. Figure 3 illustrates that the first prior leads to a function that becomes more wiggly as it curves to interpolate the data more closely, whereas the second prior leads to oversmoothing. When the sample size is increased from n = 250 to n = 1,000, the difference in the function estimates becomes smaller. Next we consider the performance of the MCMC algorithm in fitting the parametric part of the model. For example, the case where n = 500 (with Ti = 7) is illustrated in Figure 4, which shows histograms and kernel-smoothed marginal posterior densities of the parameters, together with the corresponding autocorrelations from the sampled output. The linear effects, together with τ², appear to be estimated well, and the output is characterized by low autocorrelations. Although D is estimated well, its higher autocorrelation indicates slower mixing than that of the remaining parameters, so that longer MCMC runs may be needed to accurately describe the marginal posterior density of D. The slower mixing occurs because D is a parameter at the second level of the modeling hierarchy and depends on the data only indirectly through {bi}. Because the {bi} are not well identified in smaller clusters, when only a few observations are available to identify the cluster-specific effects, and because learning about D occurs from the intercluster variation of {bi}, D also suffers from weak identification when cluster sizes are small. To measure the efficiency of the MCMC parameter sampling scheme, we use the measures 1 + 2 Σ_{l=1}^{L} ρk(l), where ρk(l) is the sample
Figure 4. Posterior Samples and Autocorrelations for the Parameters of a Semiparametric Model With One Fixed Effect, One Random Effect, and One Lag (Ti = 7, i = 1, . . . , n).
Table 2. Examples of Estimated Inefficiency Factors (autocorrelation times) for the Parameters of a Model With One Lag, One Random Effect, and One Fixed Effect for n = 500

               Inefficiency factor
Parameter    Ti = 7     Ti = 12     Ti = 17
δ            13.68      10.18       12.55
γ             9.62       6.16        7.25
φ            12.11       8.31        9.03
D            24.00      13.28        9.41
τ²           10.65       5.26        8.74
autocorrelation at lag l for the kth parameter in the sampling, with the summation truncated at a value L at which the correlations taper off. The latter quantity, called the inefficiency factor, may be interpreted as the ratio of the numerical variance of the posterior mean from the MCMC chain to the variance of the posterior mean from hypothetical independent draws. Table 2 gives the inefficiency factors corresponding to the parameters for the same model as before, but now with different cluster sizes (7, 12, and 17 observations per cluster). In this setup the larger cluster sizes serve to identify {bi} better, allowing more precise capture of intercluster variation. Table 2 shows that the inefficiency factor for D drops considerably (whereas the other inefficiency factors stay within a similar range). The improved sampling of D becomes evident from comparing Figures 4 and 5, with the latter summarizing the MCMC output when Ti = 17. Similar to the cases given in Table 2, Table 3 presents results for the inefficiency factors for a model with two lags and two random effects. Because now not one, but two, random effects are estimated from the limited observations in each cluster, the
elements of D are sampled with somewhat higher inefficiency factors. Here again, however, Table 3 shows that as the cluster sizes increase, resulting in better identification of {bi }, the inefficiency factors for the elements of the heterogeneity matrix D drop noticeably. In the rest of this section, we report results from experiments involving a model with AR(1) correlated errors. Table 4 presents the inefficiency factors for three cluster sizes and two values of ρ. The parameters ρ and D are sampled well in all cases, but it is interesting to see that when ρ is positive, both ρ and D have higher inefficiency factors than otherwise. This is because both are estimated from the covariance of the errors, and decomposing that matrix into an equicorrelated part (with positive elements implied by the random intercept) and a Toeplitz part [implied by the AR(1) part, which also has positive elements when ρ > 0] is difficult in small samples. As the cluster sizes increase, D is identified better, so both ρ and D are estimated better. This does not appear to be a problem when ρ < 0, because then the two correlation structures implied by ρ and D are quite different. For the samplers and values of ρ considered here, the M–H acceptance rate in the sampling of ρ is in the range of (.87, .98). These results show how serial correlation in the errors affects the performance of the sampler, but it is also of interest to note that misspecifying the correlation structure has definite impact on the remaining components of the model. As mentioned earlier, several studies, including those of Diggle and Hutchinson (1989), Altman (1990), and Smith et al. (1998), have pointed out that serial correlation, if incorrectly ignored, can have substantial adverse consequences for the estimation of the nonparametric function. What is interesting in the context of panel data
Figure 5. Posterior Samples and Autocorrelations for the Parameters of a Semiparametric Model With One Fixed Effect, One Random Effect, and One Lag (Ti = 17, i = 1, . . . , n).
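The inefficiency factors reported in Tables 2 and 3 can be computed directly from the sampled output; a sketch (the truncation point L and the two illustrative chains are arbitrary choices, not the paper's output):

```python
import numpy as np

def inefficiency_factor(chain, L=100):
    """1 + 2 * sum_{l=1}^{L} rho_k(l), where rho_k(l) is the lag-l sample
    autocorrelation of the MCMC draws for parameter k."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # sample autocovariances
    rho = acov[1:L + 1] / acov[0]                       # autocorrelations, lags 1..L
    return 1.0 + 2.0 * rho.sum()

rng = np.random.default_rng(5)
iid = rng.normal(size=5000)        # independent draws: factor near 1
ar = np.zeros(5000)                # slowly mixing AR(1) chain with coefficient .9
for t in range(1, ar.size):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

assert inefficiency_factor(iid) < inefficiency_factor(ar)
```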
Table 3. Examples of Estimated Inefficiency Factors for the Parameters of a Model With Two Lags, Two Random Effects, and One Fixed Effect

                                    Inefficiency factor
             n = 250                       n = 500                       n = 1,000
Parameter    Ti = 7   Ti = 12   Ti = 17    Ti = 7   Ti = 12   Ti = 17    Ti = 7   Ti = 12   Ti = 17
δ            21.16    18.09     18.88      19.89    16.07     14.72      20.79    22.46     20.65
γ1           12.19    12.20     11.51      14.73    13.00     12.82      12.16    13.57     12.34
γ2           11.40     8.48      9.61      15.64     8.84      7.33      14.64    10.07      9.56
γ3            9.06    10.36      8.41      11.75    10.48      9.19      12.59    12.90      8.44
φ1            9.29     9.12      8.00       7.01    10.96     11.07      14.36    10.23     10.81
φ2            8.39     7.60     10.14       7.69     7.52      8.48       9.15    11.14     10.01
D11          27.01    22.46     18.87      31.56    23.88     18.72      34.10    25.95     18.08
D12          29.73    23.16     18.16      29.88    25.81     14.91      26.31    20.25     18.38
D22          26.24    26.37     17.63      34.72    27.87     18.09      34.56    24.67     21.80
τ²           10.58    12.45     10.17      14.74     9.47      7.80       8.02     6.05      6.02
is that ignoring the serial correlation can lead to biases in the estimates of the heterogeneity matrix D. In line with the discussion in the foregoing paragraph, we have found that ignoring the serial correlation distorts the estimates of D, especially with smaller cluster sizes or positive and high serial correlation. Ignoring the serial correlation in the errors also produces estimates of the lag coefficients φ that differ widely from the true values used in generating the data (and in some settings have the opposite sign). One should also keep in mind that differences can occur because of the usual identification restriction in binary data models, namely that the error variance is fixed. For example, if the errors follow the AR(1) process εit = ρεi,t−1 + vit with var(vit) = 1, then the unconditional variance of εit is inflated to 1/(1 − ρ²). This is one of the points raised by Smith et al. (1998) in the context of a continuous-data model, where this larger error variance was shown to produce less efficient nonparametric function estimates. In the context of a binary data problem, ignoring serial correlation and setting var(εit) = 1 then corresponds to implicitly rescaling the regression parameters by a factor of (1 − ρ²)^{1/2}. We take all of these results together as a strong signal not to ignore the modeling of the three causes of intertemporal dependence discussed in Section 1: state dependence due to lags of the dependent variable, serial correlation in the errors, and heterogeneity among clusters.

Table 4. Examples of Estimated Inefficiency Factors for the Parameters of the Model With One Lag, One Random Effect, One Fixed Effect, and AR(1) Serial Correlation for n = 500

             Inefficiency factor (ρ = −.5)       Inefficiency factor (ρ = .5)
Parameter    Ti = 7    Ti = 12    Ti = 17        Ti = 7    Ti = 12    Ti = 17
δ            19.06     17.43      20.19          20.75     14.06      17.72
γ             9.34      7.30       7.86          11.42     11.30      12.05
φ            23.05     25.63      18.24          17.00     20.24      20.11
D            21.48     16.21      12.75          39.47     28.02      15.19
τ²            9.68      7.08       6.01          11.84      7.97       7.97
ρ            20.50     21.35      20.28          32.40     27.12      26.45

In terms of computational intensity, our method was quite efficient. For example, in one of our smaller datasets (n = 250, q = 1, J = 1, and Ti = 7, i = 1, …, n), Algorithm 1 took approximately 22 seconds to produce 1,000 MCMC draws, varying by less than .2 second as we increased the dimension of g from m = 50 to m = 200. For comparison, we also tried a
“brute-force” algorithm that did not exploit banded matrices or our approach for dealing with correlated errors. In this case the computational cost of 1,000 MCMC draws was 35 seconds for m = 50, 103 seconds for m = 100, and 405 seconds for m = 200. We suspect that the brute-force algorithm would become largely infeasible in even higher-dimensional problems because of its computational intensity and storage requirements. To summarize, the results suggest that the MCMC algorithm performs well and that the estimation method recovers the parameters and functions used to generate the data. The performance of the method in recovering the nonparametric function g(·) and the model parameters improves with the sample size, as the model becomes better identified. Most noticeably, the sampling of D benefits strongly from larger cluster sizes.

7. INTERTEMPORAL LABOR FORCE PARTICIPATION OF MARRIED WOMEN

In this section we consider an application to the annual labor force participation decisions of 1,545 married women in the age range 17–66. The dataset, obtained from the Panel Study of Income Dynamics (PSID), is based on the work of Hyslop (1999) and contains a panel of women's working status indicators (1, working during the year; 0, not working) over a 7-year period (1979–1985), together with a set of seven covariates given in Table 5. The sample consists of continuously married couples in which the husband is a labor force participant (reporting both positive earnings and hours worked) in each of the sample years. Similar data have been analyzed by Chib and Greenberg (1998), Avery, Hansen, and Hotz (1983), and Hyslop (1999) using other models and estimation techniques. A key feature of our modeling is that the effect of age on the conditional probability of working is specified nonparametrically. There are compelling reasons for doing this.
Nonlinearities arise because of changes in tastes, trade-offs, and age-related health over a woman’s life cycle; because age is indicative of the expected timing of events (e.g., graduation from school or college, planning for children); and because age may be revealing of social values and education type (a cohort effect) or of experience as a homemaker and in the labor market (a productivity effect). To account for model uncertainty, we evaluated several competing models that differ in their state dependence, serial correlation, and heterogeneity using the following baseline
Table 5. Variables in the Women’s Labor Force Participation Application

Variable   Explanation                                               Mean      SD
WORK       Wife’s labor force status (1, working; 0, not working)   .7097     .4539
INT        An intercept term (a column of 1’s)                      —         —
AGE        The woman’s age in years                                 36.0262   9.7737
RACE       1 if black, 0 otherwise                                  .1974     .3981
EDU        Attained education (in years) at time of survey          12.4858   2.1105
CH2        Number of children age 0–2 in that year                  .2655     .4981
CH5        Number of children age 3–5 in that year                  .3120     .5329
INC        Total annual labor income of head of household           31.7931   22.6417

NOTE: INC (in thousands of dollars) is measured as nominal earnings adjusted by the Consumer Price Index (base year 1987).
specification:

y_it = 1{ x̃′_it δ + w′_it β_i + g(s_it) + φ_1 y_{i,t−1} + · · · + φ_J y_{i,t−J} + ε_it > 0 },

β_i = A_i γ + b_i,    b_i ∼ N_3(0, D),

where y_it = WORK_it, x̃_it = (RACE_i, EDU_it, ln(INC_it))′, s_it = AGE_it, w_it = (1, CH2_it, CH5_it)′, the errors ε_it are potentially serially correlated, and the mean of β_i is allowed to depend on the initial conditions ȳ_i0 and the husband’s earnings through the block-diagonal matrix

A_i = blockdiag( ȳ_i0, (1, ȳ_i0, ln(INC_i)), (1, ȳ_i0, ln(INC_i)) ).

Such issues as variable selection, lag determination, and correlation between the unobserved effects and covariates are handled as model selection problems by computing the marginal likelihoods of competing models. The semiparametric models are also compared against two parametric alternatives. The more important competing models are presented in Table 6. Models M1–M4 have differing dynamics. Table 6 shows that allowing for AR(1) errors improves the performance of the one-lag state dependence model; serial correlation is present in the errors of that model (ρ is estimated as −.288, with a posterior standard deviation of .048). However, the data in this application strongly favor a model in which state dependence is incorporated through two lags of the dependent variable. Single-lag models (M1 and M2) have marginal likelihoods much lower than those of the corresponding two-lag versions (M3 and M4). We also see that inclusion of the second lag removes the serial correlation in the errors and results in a marginal likelihood for M3 that is the highest among the models considered in Table 6. For model M4, the estimated value of ρ is −.047, with a 95% credibility interval of (−.224, .113). Models with higher-order dynamic dependence were considered, but received less support than M3. Regarding the heterogeneity in the model, we see from Table 6 that the single unobserved effect model (M5) is decisively less supported than M3. We now discuss models M6 and M7 in the context of the estimate of the nonparametric function g(AGE) from model M3. Figure 6 shows that the estimated effect of age departs notably from linearity. To examine whether a parametric model can adequately fit the data, we consider two parametric models (M6 and M7), in which g(AGE) is restricted to be linear or quadratic. [Because g(·) is no longer general, A_i need not be restricted for identification purposes.] The comparisons are shown in Figure 7. The estimates suggest that the linear model can be deceiving, because it produced a negative coefficient estimate for age of −.0105, with 95% credibility region (−.018, −.003). The quadratic model comes closer to the semiparametric fit, but still leaves some excess nonlinearity undetected. Both parametric models miss the large increase in the probability of working in the early twenties and the nonlinearity around age 30. The marginal likelihood of the parametric
Table 6. Log Marginal Likelihoods for Alternative Models in the Women’s Labor Force Participation Application (estimated from MCMC runs of length 15,000)

Model   Fixed effect                         Random effect   Nonzero elements in A_i                          ln(marginal likelihood)

Model with 1-lag state dependence and independent errors:
M1      x̃it, yi,t−1                          wit             (ȳi0; 1, ȳi0, ln(INCi); 1, ȳi0, ln(INCi))        −2,610.7

Model with 1-lag state dependence and AR(1) errors:
M2      x̃it, yi,t−1                          wit             A_i as in M1                                     −2,595.5

Model with 2-lag state dependence and independent errors:
M3      x̃it, yi,t−1, yi,t−2                  wit             A_i as in M1                                     −2,563.8

Model with 2-lag state dependence and AR(1) errors:
M4      x̃it, yi,t−1, yi,t−2                  wit             A_i as in M1                                     −2,579.6

Model with random intercept only:
M5      x̃it, yi,t−1, yi,t−2                  1               (ȳi0)                                            −2,580.1

Parametric versions of M3:
M6      x̃it, yi,t−1, yi,t−2, AGEit           wit             (1, ȳi0; 1, ȳi0, ln(INCi); 1, ȳi0, ln(INCi))     −2,581.2
M7      x̃it, yi,t−1, yi,t−2, AGEit, AGE²it   wit             A_i as in M6                                     −2,574.6
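The log marginal likelihoods in Table 6 convert directly into pairwise Bayes factors and, under equal prior model probabilities, into posterior model probabilities. A small sketch (the values are copied from Table 6; the computation itself is standard and not specific to this paper):

```python
import numpy as np

# ln(marginal likelihood) for M1-M7, copied from Table 6
lnm = np.array([-2610.7, -2595.5, -2563.8, -2579.6,
                -2580.1, -2581.2, -2574.6])

# Log Bayes factor of M3 (index 2) against each competing model
log_bf = lnm[2] - lnm

# Posterior model probabilities under equal prior odds; subtracting the
# maximum before exponentiating avoids numerical underflow
post = np.exp(lnm - lnm.max())
post /= post.sum()
```

With these values, M3 absorbs essentially all of the posterior mass; its nearest competitor, the quadratic parametric model M7, trails by a log Bayes factor of 10.8.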
Journal of the American Statistical Association, June 2006
Figure 6. The Effect of Age on the Probability of Working: Nonparametric Estimate E{g(AGE)|y} and Pointwise Confidence Bands (upper; lower).
Figure 7. Comparison of the Linear, Quadratic, and Nonparametric Estimates of g(AGE), With Pointwise Confidence Bands.
models, especially that of the linear model, is lower, confirming the conjecture from Figure 7 that these models do not fully account for the nonlinear effect of age. Table 7 contains the parameter estimates of the best-fitting model (M3). We see that, conditional on the covariates, black women, better-educated women, and women whose husbands have low earnings are more likely to work. The results also indicate strong state dependence operating through two lags of a woman’s employment status. After controlling for state dependence and the remaining covariates, we see that the presence of children has a larger effect on the probability of working when the husband’s earnings increase. Finally, the positive correlation between the random intercept and the initial conditions is consistent with the notion that the initial observations are indicative of a woman’s tastes and human capital. The inefficiency factors in Table 7 (defined in Sec. 6) indicate good overall performance of the MCMC sampler. But because of the large number of random
effects and small cluster sizes, the elements of D are sampled less efficiently than the other parameters. Interpretation of the estimates beyond the broad direction of impact, however, is complicated by the nonlinearity in the model and the interactions between the covariates. For example, the income and child variables enter the model in such a way as to make it difficult to disentangle and evaluate their effects. For this reason, Figure 8 presents the average effects for certain changes in the child and income covariates. More specifically, the figure presents the average effects of three hypothetical scenarios: first, an additional birth in period 1 (i.e., an additional child age 0–2 in periods 1–3, who grows to become a child age 3–5 in periods 4 and 5); second, an additional child age 3–5 in periods 1–3; and third, a doubling of the husband’s earnings. Figure 8 shows that there is a negative overall effect of preschool children on labor supply, which is noticeably stronger for children age 0–2 than
Table 7. Parameter Estimates for Model M3, Along With 95% Confidence Intervals and Inefficiency Factors From 15,000 MCMC Iterations

Parameter   Covariate             Mean     SD      Median   Lower    Upper    Ineff
δ           RACE                  .170     .080    .169     .014     .329     7.012
δ           EDU                   .087     .015    .086     .057     .117     23.189
δ           ln(INC)               −.190    .048    −.189    −.286    −.098    16.484
γ           ȳi0                   1.371    .173    1.365    1.047    1.724    27.802
γ           CH2                   .142     .312    .144     −.479    .747     5.414
γ           (CH2)(ȳi0)            −.245    .161    −.248    −.556    .077     19.356
γ           (CH2)(ln(INCi))       −.135    .093    −.135    −.318    .046     6.230
γ           CH5                   .868     .273    .867     .339     1.416    8.358
γ           (CH5)(ȳi0)            −.351    .127    −.350    −.606    −.103    14.530
γ           (CH5)(ln(INCi))       −.221    .081    −.221    −.380    −.063    8.139
φ           yi,t−1                1.213    .071    1.213    1.072    1.348    15.863
φ           yi,t−2                .445     .071    .445     .308     .581     11.470
vech(D)                           .540     .129    .528     .319     .828     38.481
vech(D)                           −.043    .096    −.043    −.243    .133     45.999
vech(D)                           .137     .071    .119     .046     .319     45.617
vech(D)                           −.151    .085    −.138    −.347    −.019    43.454
vech(D)                           .017     .049    .011     −.066    .136     45.551
vech(D)                           .158     .086    .135     .047     .366     46.355
τ²                                .017     .006    .016     .009     .030     5.473
Figure 8. The Average Effect of (a) Doubling a Husband’s Permanent Income, (b) Having an Additional Child Age 0–2 in Periods 1–3 and Age 3–5 in Periods 4 and 5, and (c) an Additional Child Age 3–5 in Periods 1–3.
for children age 3–5 (cf. Hyslop 1999). The figure also shows that although a husband’s earnings affect a woman’s probability of employment in a theoretically predictable direction, increases in earnings must be quite large to induce any economically significant reduction in participation. In some situations, we are interested in the effect of a given policy intervention on different segments of the population. To explore this issue, we can compute and compare the average covariate effects for various subsamples, as discussed in Section 4. Figure 9 shows the average covariate effects for two subsamples defined by their initial conditions: women who worked in both presample periods and women who worked in the first, but not the second, presample period (the other two categories are not shown). The figure reveals considerable differences in the impact of the covariates on labor force choices in these two groups and raises some interesting questions for future research.

8. CONCLUDING REMARKS

This article has considered the Bayesian analysis of hierarchical semiparametric models for binary panel data with state dependence, serially correlated errors, and multidimensional
heterogeneity correlated with the covariates and initial conditions. New, computationally efficient MCMC algorithms have been developed for simulating the posterior distribution, estimating the marginal likelihood, and evaluating the average covariate effects. The techniques rely on the framework of Albert and Chib (1993) and a proper Markov process smoothness prior on the unknown function. A simulation study has shown that the methods perform well. An application involving a dynamic semiparametric model of women’s labor force participation illustrated that the model and the estimation methods are practical and can uncover interesting features in the data. Formal Bayesian model choice methods allowed us to distinguish among the effects of state dependence, serial correlation, and heterogeneity, and to compare parametric and semiparametric models. One benefit of the model considered herein is that it can be inserted as a component in a larger hierarchical model (e.g., a treatment model or a model with incidental truncation). The general method is also applicable to panels of continuous and censored data. We intend to explore the effectiveness of such approaches in future work.

[Received November 2004. Revised June 2005.]
Figure 9. Average Covariate Effects for Two Subsamples: Women Who Worked in Both Presample Periods (first row) and Women Who Worked in the First, but Not the Second, Presample Period (second row).
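The average covariate effects shown in Figures 8 and 9 are obtained by predictive simulation. The following simplified sketch conveys the basic computation: average the difference in predicted probabilities under two covariate settings over units and over posterior draws. It is a static, one-period illustration with made-up covariates and posterior draws, and it omits the dynamic feedback through lagged responses that the procedure of Section 4 propagates:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Made-up sample and posterior draws (stand-ins, not the paper's data)
n, M = 500, 200
X = rng.standard_normal((n, 3))                 # RACE, EDU, ln(INC) stand-ins
ch2 = rng.poisson(0.3, n).astype(float)         # children age 0-2
draws = {
    "delta": rng.normal([0.17, 0.09, -0.19], 0.05, (M, 3)),
    "beta_ch2": rng.normal(-0.30, 0.05, M),     # hypothetical CH2 coefficient
}

# Average effect of one additional child age 0-2: difference in predicted
# participation probabilities, averaged over units and posterior draws
eff = np.empty(M)
for m in range(M):
    lin = X @ draws["delta"][m]
    p1 = norm.cdf(lin + draws["beta_ch2"][m] * (ch2 + 1.0))
    p0 = norm.cdf(lin + draws["beta_ch2"][m] * ch2)
    eff[m] = np.mean(p1 - p0)

avg_effect = eff.mean()                          # posterior mean of the average effect
```

Subsample-specific effects, as in Figure 9, follow by restricting the average over units to the subsample of interest (e.g., women with a given presample work history) before averaging over the posterior draws.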
REFERENCES

Albert, J., and Chib, S. (1993), “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669–679.
Altman, N. S. (1990), “Kernel Smoothing of Data With Correlated Errors,” Journal of the American Statistical Association, 85, 749–759.
Avery, R., Hansen, L., and Hotz, V. (1983), “Multiperiod Probit Models and Orthogonality Condition Estimation,” International Economic Review, 24, 21–35.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), “Bayesian Computation and Stochastic Systems” (with discussion), Statistical Science, 10, 3–66.
Börsch-Supan, A., and Hajivassiliou, V. (1993), “Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models,” Journal of Econometrics, 58, 347–368.
Chib, S. (1993), “Bayes Estimation of Regressions With Autoregressive Errors: A Gibbs Sampling Approach,” Journal of Econometrics, 58, 275–294.
Chib, S. (1995), “Marginal Likelihood From the Gibbs Output,” Journal of the American Statistical Association, 90, 1313–1321.
Chib, S., and Carlin, B. (1999), “On MCMC Sampling in Hierarchical Longitudinal Models,” Statistics and Computing, 9, 17–26.
Chib, S., and Greenberg, E. (1994), “Bayes Inference in Regression Models With ARMA(p, q) Errors,” Journal of Econometrics, 64, 183–206.
Chib, S., and Greenberg, E. (1995), “Understanding the Metropolis–Hastings Algorithm,” The American Statistician, 49, 327–335.
Chib, S., and Greenberg, E. (1998), “Analysis of Multivariate Probit Models,” Biometrika, 85, 347–361.
Chib, S., and Jeliazkov, I. (2001), “Marginal Likelihood From the Metropolis–Hastings Output,” Journal of the American Statistical Association, 96, 270–281.
Diggle, P., and Hutchinson, M. (1989), “On Spline Smoothing With Autocorrelated Errors,” Australian Journal of Statistics, 31, 166–182.
DiMatteo, I., Genovese, C. R., and Kass, R. E. (2001), “Bayesian Curve-Fitting With Free-Knot Splines,” Biometrika, 88, 1055–1071.
Fahrmeir, L., and Lang, S. (2001), “Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors,” Journal of the Royal Statistical Society, Ser. C, 50, 201–220.
Gelman, A., Carlin, B., Stern, H., and Rubin, D. (1995), Bayesian Data Analysis, New York: Chapman & Hall.
Geweke, J. (1991), “Efficient Simulation From the Multivariate Normal and Student-t Distributions Subject to Linear Constraints,” in Computing Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface, ed. E. M. Keramidas, Fairfax, VA: Interface Foundation of North America, pp. 571–578.
Hamilton, J. D. (1994), Time Series Analysis, Princeton, NJ: Princeton University Press.
Hansen, M. H., and Kooperberg, C. (2002), “Spline Adaptation in Extended Linear Models” (with discussion), Statistical Science, 17, 2–51.
Harvey, A. C. (1981), The Econometric Analysis of Time Series, Oxford, U.K.: Phillip Allen.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman & Hall.
Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57, 97–109.
Heckman, J. (1981), “Heterogeneity and State Dependence,” in Studies in Labor Markets, ed. S. Rosen, Chicago: University of Chicago Press, pp. 91–131.
Hyslop, D. (1999), “State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women,” Econometrica, 67, 1255–1294.
Karcher, P., and Wang, Y. (2001), “Generalized Nonparametric Mixed Effects Models,” Journal of Computational and Graphical Statistics, 10, 641–655.
Kass, R., and Raftery, A. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795.
Keane, M. (1994), “A Computationally Practical Simulation Estimator for Panel Data,” Econometrica, 62, 95–116.
Kohn, R., Ansley, C. F., and Tharm, D. (1991), “The Performance of Cross-Validation and Maximum Likelihood Estimators of Spline Smoothing Parameters,” Journal of the American Statistical Association, 86, 1042–1050.
Koop, G., and Poirier, D. J. (2004), “Bayesian Variants of Some Classical Semiparametric Regression Techniques,” Journal of Econometrics, 123, 259–282.
Lin, X., and Carroll, R. (2001), “Semiparametric Regression for Clustered Data Using Generalized Estimating Equations,” Journal of the American Statistical Association, 96, 1045–1056.
Lin, X., and Zhang, D. (1999), “Inference in Generalized Additive Mixed Models Using Smoothing Splines,” Journal of the Royal Statistical Society, Ser. B, 61, 381–400.
Mundlak, Y. (1978), “On the Pooling of Time Series and Cross-Section Data,” Econometrica, 46, 69–85.
O’Hagan, A. (1994), Kendall’s Advanced Theory of Statistics: Bayesian Inference, Vol. 2B, London: Edward Arnold.
Opsomer, J. D., Wang, Y., and Yang, Y. (2001), “Nonparametric Regression With Correlated Errors,” Statistical Science, 16, 134–153.
Ortega, J. (1987), Matrix Theory, New York: Plenum Press.
Shiller, R. (1984), “Smoothness Priors and Nonlinear Regression,” Journal of the American Statistical Association, 79, 609–615.
Shively, T. S., Kohn, R., and Wood, S. (1999), “Variable Selection and Function Estimation in Additive Nonparametric Regression Using a Data-Based Prior” (with discussion), Journal of the American Statistical Association, 94, 777–806.
Silverman, B. (1985), “Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting” (with discussion), Journal of the Royal Statistical Society, Ser. B, 47, 1–52.
Smith, M., Wong, C.-M., and Kohn, R. (1998), “Additive Nonparametric Regression With Autocorrelated Errors,” Journal of the Royal Statistical Society, Ser. B, 311–331.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions” (with discussion), The Annals of Statistics, 22, 1701–1762.
Wahba, G. (1978), “Improper Priors, Spline Smoothing and the Problem of Guarding Against Model Errors in Regression,” Journal of the Royal Statistical Society, Ser. B, 40, 364–372.
Wang, Y. (1998), “Smoothing Spline Models With Correlated Random Errors,” Journal of the American Statistical Association, 93, 341–348.
Whittaker, E. (1923), “On a New Method of Graduation,” Proceedings of the Edinburgh Mathematical Society, 41, 63–75.
Wood, S., and Kohn, R. (1998), “A Bayesian Approach to Robust Binary Nonparametric Regression,” Journal of the American Statistical Association, 93, 203–213.
Wood, S., Kohn, R., Shively, T., and Jiang, W. (2002), “Model Selection in Spline Nonparametric Regression,” Journal of the Royal Statistical Society, Ser. B, 64, 119–139.