Applied Regression Models for Longitudinal Data

Kosuke Imai
Princeton University
Fall 2016, POL 573 Quantitative Analysis III

Kosuke Imai (Princeton) · Longitudinal Data · POL573 (Fall 2016)

Readings

- Hayashi, Econometrics, Chapter 5
- "Dirty Pool" papers referenced in the slides
- Wooldridge, Econometric Analysis of Cross Section and Panel Data, Chapter 10 and relevant sections of Part IV
- Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Part 2A, Cambridge University Press


What Are Longitudinal Data?

- Repeated observations for each unit
- Also called panel data or cross-section time-series data
- Assume balanced data:

  Y_i = (y_i1, y_i2, ..., y_iT)^⊤   (T × 1)

  X_i = the T × K matrix whose t-th row is x_it^⊤ = (x_it1, x_it2, ..., x_itK)

- "Stacked" form:

  Y = (Y_1^⊤, Y_2^⊤, ..., Y_N^⊤)^⊤   (NT × 1)

  X = (X_1^⊤, X_2^⊤, ..., X_N^⊤)^⊤   (NT × K)
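As a concrete sketch, the unit-level T × 1 and T × K blocks stack into the NT × 1 and NT × K arrays above; the sizes below are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 3, 4, 2                      # illustrative sizes for a balanced panel

# unit-level blocks: Y_i is T x 1, X_i is T x K
Y_blocks = [rng.normal(size=(T, 1)) for _ in range(N)]
X_blocks = [rng.normal(size=(T, K)) for _ in range(N)]

# "stacked" form: units stacked on top of each other
Y = np.vstack(Y_blocks)                # NT x 1
X = np.vstack(X_blocks)                # NT x K
print(Y.shape, X.shape)                # (12, 1) (12, 2)
```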

Varying Intercept Models

- Basic setup: y_it = α_i + β^⊤ x_it + ε_it
- Motivation: unobserved (time-invariant) heterogeneity,
  y_it = α + β^⊤ x_it + δ^⊤ u_i + ε_it, where α_i = α + δ^⊤ u_i
- (Strict) exogeneity given X_i and u_i: E(ε_it | X_i, u_i) = 0
- Constant slopes
- Static model: no lagged y on the right-hand side


Fixed-Effects Model

- "Fixed" effects mean that the α_i are model parameters to be estimated
- (N + K) parameters to be estimated: inefficient if T is small
- Homoskedasticity and independence across time (and units): V(ε_i | X) = σ² I_T
- Stacked vector representation: Y = Zγ + ε, where D = I_N ⊗ 1_T is the NT × N matrix of unit dummies (one column per unit), Z = [D X], and γ = (α_1, α_2, ..., α_N, β^⊤)^⊤

Estimation of the Fixed Effects Model

- The least-squares estimator (also the MLE): γ̂_FE = (Z^⊤Z)^{-1} Z^⊤ Y
- Sampling distribution: γ̂_FE | Z ~ N(γ, σ² (Z^⊤Z)^{-1})
- The dimension of (Z^⊤Z) is large when N is large
- Computation based on within-group variation: let Ỹ stack the elements y_it − ȳ_i and X̃ stack the rows (x_it − x̄_i)^⊤
- Then, β̂_W = (X̃^⊤X̃)^{-1} X̃^⊤ Ỹ and β̂_W | X ~ N(β, σ² (X̃^⊤X̃)^{-1})
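A minimal numpy sketch of this within-group computation on simulated data (scalar covariate; the true β = 2 and the data-generating values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 100, 5, 2.0

alpha = rng.normal(size=N)                     # unit intercepts
x = rng.normal(size=(N, T)) + alpha[:, None]   # x correlated with alpha
y = alpha[:, None] + beta * x + rng.normal(size=(N, T))

# demean within each unit, then run OLS on the demeaned data
x_w = (x - x.mean(axis=1, keepdims=True)).ravel()
y_w = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_w = (x_w @ y_w) / (x_w @ x_w)
print(beta_w)   # close to the true beta = 2.0
```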


Fixed Effects Estimator as Within-group Estimator

- Within-group variation = residuals from the regression of Y on D
- Recall the geometry of least squares (see Multiple Regression slides 39 and 43)
- Projection of Y onto S^⊥(D): M_D = I_NT − D(D^⊤D)^{-1}D^⊤. Thus, Ỹ = M_D Y and X̃ = M_D X
- This is a partitioned regression! Then, β̂_FE = β̂_W (see Question 1 of POL572 Problem Set 4)
- Also, ε̂ = Y − Zγ̂_FE = Ỹ − X̃β̂_W
- The lower-right block of (Z^⊤Z)^{-1} equals (X̃^⊤X̃)^{-1}
- Thus, β̂_W | X ~ N(β, σ² (X̃^⊤X̃)^{-1})


Kronecker Product

- Definition: A_{n×m} ⊗ B_{k×l} is the nk × ml block matrix whose (i, j) block is a_ij B
- Some rules (assuming conformability):
  (A ⊗ B)^⊤ = A^⊤ ⊗ B^⊤
  A ⊗ (B + C) = A ⊗ B + A ⊗ C
  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
  (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
  (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}
  |A ⊗ B| = |A|^m |B|^n, where A and B are n × n and m × m matrices, respectively
- Fixed effects calculation made easy: D = I_N ⊗ 1_T, which implies
  D^⊤D = I_N ⊗ (1_T^⊤ 1_T) = T I_N
  P_D = D(D^⊤D)^{-1}D^⊤ = (1/T) I_N ⊗ (1_T 1_T^⊤)
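These identities are easy to check numerically; a small numpy sketch verifying the mixed-product rule and the P_D formula (matrix sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 2)), rng.normal(size=(3, 3))
C, D2 = rng.normal(size=(2, 2)), rng.normal(size=(3, 3))

# mixed-product rule: (A kron B)(C kron D) = (AC) kron (BD)
lhs = np.kron(A, B) @ np.kron(C, D2)
rhs = np.kron(A @ C, B @ D2)
print(np.allclose(lhs, rhs))

# fixed-effects dummies: D = I_N kron 1_T, so P_D = (1/T) I_N kron (1_T 1_T')
N, T = 4, 3
Dm = np.kron(np.eye(N), np.ones((T, 1)))
P_D = Dm @ np.linalg.inv(Dm.T @ Dm) @ Dm.T
print(np.allclose(P_D, np.kron(np.eye(N), np.ones((T, T)) / T)))
```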


Serial Correlation and Heteroskedasticity

- Even after conditioning on x_it and u_i, y_it may still be serially correlated: Corr(ε_it, ε_it′) ≠ 0 for t ≠ t′
- Independence across units is assumed
- We are still assuming we got the conditional mean specification correct, and so we just need to "fix" the standard errors
- Robust standard errors (see Multiple Regression slide 24):
  V(β̂ | X) = (X̃^⊤X̃)^{-1} {X̃^⊤ E(ε̃ ε̃^⊤ | X) X̃} (X̃^⊤X̃)^{-1}
  ("bread", "meat", "bread"), where the "meat" is estimated by Σ_{i=1}^N X̃_i^⊤ ε̂_i ε̂_i^⊤ X̃_i
- Asymptotically consistent under any form of heteroskedasticity and serial correlation
- Can also use feasible GLS
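A sketch of this cluster-robust ("sandwich") variance for the within estimator with a scalar covariate; the simulated error process is an assumption chosen only to induce within-unit correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 200, 4, 1.0

# simulate with (crudely) serially correlated errors within each unit
x = rng.normal(size=(N, T))
e = rng.normal(size=(N, T))
e = e + 0.7 * np.roll(e, 1, axis=1)
y = beta * x + e

# within estimator on demeaned data
xt = x - x.mean(1, keepdims=True)
yt = y - y.mean(1, keepdims=True)
b = (xt * yt).sum() / (xt ** 2).sum()
resid = yt - b * xt

# sandwich: bread = (X'X)^{-1}, meat = sum_i X_i' e_i e_i' X_i (scalar case)
bread = 1.0 / (xt ** 2).sum()
meat = ((xt * resid).sum(axis=1) ** 2).sum()
var_cluster = bread * meat * bread
print(b, np.sqrt(var_cluster))
```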


Panel Corrected Standard Errors

- A popular procedure proposed by Beck and Katz (1995)
- Basic idea: take into account contemporaneous (or spatial) correlation when calculating standard errors
- Autocorrelation is assumed to be nonexistent: E(ε̃_it ε̃_it′ | X) = E(ε̃_it ε̃_i′t′ | X) = 0 for i ≠ i′ and t ≠ t′
- Inclusion of a lagged dependent variable (more on this later)
- Spatial correlation is assumed to be time-invariant:
  ρ̂_ii′ = Ê(ε̃_it ε̃_i′t | X) = (1/T) Σ_{t′=1}^T ε̂_it′ ε̂_i′t′   for i ≠ i′
- These robust standard errors can be applied to any linear model (with or without fixed effects)


A Variety of Exogeneity Assumptions

- Contemporaneous exogeneity: E(ε_it | x_it, α_i) = 0
  Not sufficient for identification of β under the fixed effects model
- Strict exogeneity: E(ε_it | X_i, α_i) = 0
  Sufficient for identification of β under the fixed effects model; the error does not correlate with x at any other time
- Sequential exogeneity: E(ε_it | X_it, α_i) = 0, where X_it = {x_i1, x_i2, ..., x_it}
  Past errors can correlate with future x
- Dynamic sequential exogeneity: E(ε_it | X_it, Y_i,t−1) = 0, where Y_it = {y_i0, y_i1, ..., y_it}
  Analogous to sequential ignorability
- Sequential ignorability: {y_it(1), y_it(0)} ⊥⊥ x_it | X_i,t−1, Y_i,t−1
  Sequential randomization, marginal structural models


Fixed Effects and Lagged Dependent Variable

- One of the simplest fixed effects models that incorporates dynamics is the following AR(1) model:
  y_it = α_i + ρ y_i,t−1 + β^⊤ x_it + ε_it, where |ρ| < 1
- Strict exogeneity does not hold: E(ε_it | X_iT, Y_i,T−1, α_i) ≠ 0
- Bias (Nickell) as N goes to ∞ with fixed T:
  ρ̂ − ρ = A_T^{-1} (1/(NT)) Ỹ_{−1}^⊤ M_X̃ ε̃, where A_T = (1/(NT)) Ỹ_{−1}^⊤ M_X̃ Ỹ_{−1}, and
  ρ̂ − ρ  →p  −A_T^{-1} · {σ²/(T(1 − ρ))} · {1 − (1 − ρ^T)/(T(1 − ρ))}
- For the other coefficients,
  β̂ − β = (ρ − ρ̂)(X̃^⊤X̃)^{-1}X̃^⊤Ỹ_{−1} + (X̃^⊤X̃)^{-1}X̃^⊤ε̃  →p  −plim(ρ̂ − ρ) · δ
  where (X̃^⊤X̃)^{-1}X̃^⊤Ỹ_{−1} →p δ
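The downward Nickell bias is easy to see by simulation; a sketch with no covariates, so the within estimator of ρ alone (true ρ = 0.5 and T = 5 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho = 500, 5, 0.5

# simulate an AR(1) panel with unit-specific intercepts (no covariates)
alpha = rng.normal(size=N)
y = np.zeros((N, T + 1))
y[:, 0] = alpha / (1 - rho)          # start near each unit's long-run mean
for t in range(1, T + 1):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)

# within (fixed effects) estimator of rho: demean y_t and y_{t-1} by unit
y_t, y_lag = y[:, 1:], y[:, :-1]
y_t = y_t - y_t.mean(axis=1, keepdims=True)
y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
rho_hat = (y_lag * y_t).sum() / (y_lag ** 2).sum()
print(rho_hat)  # noticeably below the true rho = 0.5
```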


First Differencing for Identification

- The model: y_it = α_i + ρ y_i,t−1 + β^⊤ x_it + ε_it
- Assumption: dynamic sequential exogeneity, E(ε_it | X_it, Y_i,t−1) = 0, which implies E(ε_it) = E(ε_it ε_it′) = 0 for t ≠ t′
- First differencing: Δy_it = ρ Δy_i,t−1 + β^⊤ Δx_it + Δε_it, where Δy_it = y_it − y_i,t−1, etc.
- Instrumental variables:
  Anderson and Hsiao: y_i,t−2 or Δy_i,t−2
  Arellano and Bond: use all available instruments with GMM
    t = 3: E(Δε_i3 y_i1) = 0
    t = 4: E(Δε_i4 y_i2) = E(Δε_i4 y_i1) = 0
    t = 5: E(Δε_i5 y_i3) = E(Δε_i5 y_i2) = E(Δε_i5 y_i1) = 0
  and so on; a total of (T − 2)(T − 1)/2 instruments
- Instruments derived from the model rather than from qualitative information
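The instrument count above follows by enumeration: at period t, the levels y_i1, ..., y_i,t−2 are valid instruments for the differenced equation, giving t − 2 instruments per period. A quick sketch:

```python
# count Arellano-Bond instruments: at each t = 3, ..., T, the levels
# y_{i1}, ..., y_{i,t-2} serve as instruments, i.e. t - 2 of them
def ab_instrument_count(T):
    return sum(t - 2 for t in range(3, T + 1))

for T in range(3, 10):
    assert ab_instrument_count(T) == (T - 2) * (T - 1) // 2
print(ab_instrument_count(5))  # 6: one condition at t=3, two at t=4, three at t=5
```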


Random Effects Model

- Dimension reduction is desirable when N is large relative to T
- "Random" intercepts: a prior distribution on α_i,
  α_i ~ N(α, ω²)   (i.i.d.)
- Additional assumptions:
  1. A family of distributions for α_i
  2. Independence between α_i and X
- Reduced form when ε_it ~ N(0, σ²) i.i.d.:
  Y_i | X ~ N(α 1_T + X_i β, σ² Σ_τ)   (independent across i), where Σ_τ = I_T + τ 1_T 1_T^⊤ and τ = ω²/σ²
- Or, using the stacked form,
  Y | X ~ N(Zγ, σ² Ω_τ), where Ω_τ = I_N ⊗ Σ_τ, Z = [1 X], and γ = (α, β)


Maximum Likelihood Estimation of the RE Model

- Likelihood function:
  (2π)^{-NT/2} |σ² Ω_τ|^{-1/2} exp{ −(1/(2σ²)) (Y − Zγ)^⊤ Ω_τ^{-1} (Y − Zγ) }
- The same trick as in linear regression:
  (Y − Zγ)^⊤ Ω_τ^{-1} (Y − Zγ) = (Y − Zγ̂)^⊤ Ω_τ^{-1} (Y − Zγ̂) + (γ − γ̂)^⊤ Z^⊤ Ω_τ^{-1} Z (γ − γ̂)
  where γ̂ = (Z^⊤ Ω_τ^{-1} Z)^{-1} Z^⊤ Ω_τ^{-1} Y
- Log-likelihood function:
  −(NT/2) log(2πσ²) − (N/2) log|Σ_τ| − (1/(2σ²)){ NT σ̂² + (γ − γ̂)^⊤ Z^⊤ Ω_τ^{-1} Z (γ − γ̂) }
  where σ̂² = (Y − Zγ̂)^⊤ Ω_τ^{-1} (Y − Zγ̂)/(NT)
- Given any value of τ, (γ̂, σ̂²) is the MLE of (γ, σ²)


Maximum Likelihood Estimation of τ

- Useful identities:
  Σ_τ^{-1} = I_T − {τ/(1 + τT)} 1_T 1_T^⊤ = M_{D_i} + {g(τ)/T} 1_T 1_T^⊤
  Ω_τ^{-1} = I_N ⊗ Σ_τ^{-1} = M_D + {g(τ)/T} I_N ⊗ (1_T 1_T^⊤)
  where M_D = I_N ⊗ M_{D_i} and g(τ) = 1/(1 + τT)
- Thus,
  Z^⊤ Ω_τ^{-1} Z = Z̃^⊤ Z̃ + {g(τ)/T} Z^⊤ {I_N ⊗ (1_T 1_T^⊤)} Z
  where the second term equals
  {g(τ)/T} Σ_{i=1}^N Z_i^⊤ 1_T 1_T^⊤ Z_i = g(τ) T Z̄^⊤ Z̄
  and Z̄ is the stacked matrix based on Z̄_i = (1/T) Z_i^⊤ 1_T

- Now the MLE of γ and σ² can be written as
  γ̂_τ = (Z̃^⊤Z̃ + g(τ)T Z̄^⊤Z̄)^{-1} (Z̃^⊤Ỹ + g(τ)T Z̄^⊤Ȳ)
  σ̂_τ² = {1/(NT)} (ε̃^⊤ε̃ + g(τ)T ε̄^⊤ε̄)
  where the within residual is ε̃ = Ỹ − Z̃γ̂_τ and the between residual is ε̄ = Ȳ − Z̄γ̂_τ
- Maximize the "concentrated" log-likelihood with respect to τ:
  τ̂ = argmax_τ −(N/2){ T log σ̂_τ² + log(1 + τT) }, where |Σ_τ| = 1 + τT
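The concentrated log-likelihood above can be maximized by a simple grid search; a numpy sketch on simulated data (the true values α = 1, β = 2, τ = 1 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 4
alpha, beta, omega2, sigma2 = 1.0, 2.0, 1.0, 1.0   # true values; tau = omega2/sigma2 = 1

x = rng.normal(size=(N, T))
a_i = rng.normal(alpha, np.sqrt(omega2), size=N)    # random intercepts
y = a_i[:, None] + beta * x + rng.normal(0, np.sqrt(sigma2), size=(N, T))

def concentrated(tau):
    """Profile out (gamma, sigma^2) at a given tau; return the concentrated log-likelihood."""
    g = 1.0 / (1.0 + tau * T)
    # within pieces (the demeaned intercept column is zero) and between pieces of Z = [1 x]
    Zt = np.stack([np.zeros_like(x), x - x.mean(1, keepdims=True)], axis=-1).reshape(-1, 2)
    yt = (y - y.mean(1, keepdims=True)).ravel()
    Zb = np.column_stack([np.ones(N), x.mean(1)])
    yb = y.mean(1)
    gamma = np.linalg.solve(Zt.T @ Zt + g * T * Zb.T @ Zb, Zt.T @ yt + g * T * Zb.T @ yb)
    et, eb = yt - Zt @ gamma, yb - Zb @ gamma
    s2 = (et @ et + g * T * eb @ eb) / (N * T)
    return -0.5 * N * (T * np.log(s2) + np.log(1 + tau * T)), gamma, s2

tau_hat = max(np.linspace(0.01, 5, 200), key=lambda t: concentrated(t)[0])
_, gamma_hat, s2_hat = concentrated(tau_hat)
print(tau_hat, gamma_hat, s2_hat)   # tau_hat near 1, gamma_hat near (1, 2)
```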


Within-group and Between-group Interpretation

- Recall the within-group estimator: β̂_W = (X̃^⊤X̃)^{-1} X̃^⊤ Ỹ
- Between-group estimator: β̂_B = (Ẍ^⊤Ẍ)^{-1} Ẍ^⊤ Ÿ, where the NT × 1 vector Ÿ stacks 1_T(ȳ_i − ȳ) and the NT × K matrix Ẍ stacks 1_T(x̄_i − x̄)^⊤
- Random effects estimator as a weighted average:
  α̂_RE = ȳ − β̂_RE^⊤ x̄
  β̂_RE = (X̃^⊤X̃ + g(τ)Ẍ^⊤Ẍ)^{-1} (X̃^⊤X̃ β̂_W + g(τ)Ẍ^⊤Ẍ β̂_B)
- g(τ) → 0 when T → ∞ or τ → ∞
- Asymptotically, β̂_RE ~ N(β, σ² (X̃^⊤X̃ + g(τ)Ẍ^⊤Ẍ)^{-1})
- Under the random effects model,
  V(β̂_RE | X) ≈ σ² (X̃^⊤X̃ + g(τ)Ẍ^⊤Ẍ)^{-1} ≤ V(β̂_W | X) ≈ σ² (X̃^⊤X̃)^{-1}

g(τ ) → 0 when T → ∞ or τ → ∞ e>X e + g(τ )X ¨ > X) ¨ −1 ) Asymptotically, βˆRE ∼ N (β, σ 2 (X Under random effects model e>X e + g(τ )X e > X) e −1 ¨ > X) ¨ −1 ≤ V(βˆW | X) ≈ σ 2 (X V(βˆRE | X) ≈ σ 2 (X Kosuke Imai (Princeton)

Longitudinal Data

POL573 (Fall 2016)

18 / 48

Shrinkage and BLUP

- Estimation of the varying intercepts under the random effects model
- Empirical Bayes approach:
  Bayes: likelihood + subjective prior
  Empirical Bayes: likelihood + "objective" prior (estimated from the data)
- The posterior of α_i:
  N( {τT/(1 + τT)} (ȳ_i − β^⊤x̄_i) + {1/(1 + τT)} α,  σ²τ/(1 + τT) )
- Plug in α̂_RE, β̂_RE, τ̂, σ̂² to obtain α̂_i and its confidence interval
- α̂_i,RE is the BLUP (Best Linear Unbiased Predictor) without the distributional assumption for α_i
- Shrinkage (partial pooling): a weighted average of the within-group mean and the overall mean, where the weight is a function of T and τ
- Borrowing strength: the key idea behind multilevel/hierarchical models and variable selection (ridge regression, LASSO, etc.)
- Bias-variance tradeoff
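A sketch of this empirical Bayes shrinkage of group means toward the overall mean (no covariates, so β drops out; the simulated values of τ and σ² are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, sigma2, tau = 50, 4, 1.0, 0.5    # tau = omega^2 / sigma^2

alpha_i = rng.normal(0.0, np.sqrt(tau * sigma2), size=N)
y = alpha_i[:, None] + rng.normal(0, np.sqrt(sigma2), size=(N, T))

ybar_i = y.mean(axis=1)                # within-group means
alpha_hat = y.mean()                   # overall mean (stands in for alpha_hat_RE)

# posterior mean: shrink each group mean toward the overall mean
w = tau * T / (1 + tau * T)
alpha_i_re = w * ybar_i + (1 - w) * alpha_hat

# shrinkage estimates are less spread out than the raw group means
print(alpha_i_re.std() < ybar_i.std())  # True
```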


Fixed or Random Intercepts?

- Dilemma: random effects impose additional assumptions but can be more efficient if those assumptions are correct
- Hausman specification test:
  1. Test statistic
     H = (β̂_W − β̂_RE)^⊤ V(β̂_W − β̂_RE | X)^{-1} (β̂_W − β̂_RE)
       = (β̂_W − β̂_RE)^⊤ {V(β̂_W | X) − V(β̂_RE | X)}^{-1} (β̂_W − β̂_RE)
     where Hausman shows that asymptotically (β̂_W − β̂_RE) ⊥⊥ β̂_RE
  2. Null hypothesis: the random effects model
  3. Asymptotic reference distribution: H ~ χ²_K
- Warning: the alternative hypothesis is that the random effects model is wrong but the fixed effects model is correct; in practice, both models could be wrong!


Hausman Test Proof

Lemma: Consider two consistent and asymptotically normal estimators of β, call them β̂_0 and β̂_1, and suppose β̂_0 attains the Cramér-Rao lower bound. Then, Cov(β̂_0, q̂) converges asymptotically to zero, where q̂ = β̂_0 − β̂_1.

1. Define a new estimator β̂_2 = β̂_0 + rAq̂, where r is a scalar and A is an arbitrary matrix to be chosen later
2. Show that β̂_2 →p β with asymptotic variance
   V(β̂_2) = V(β̂_0) + 2rA Cov(q̂, β̂_0) + r² A V(q̂) A^⊤
3. Consider F(r) = V(β̂_2) − V(β̂_0) ≥ 0
4. Choose A = −Cov(q̂, β̂_0)^⊤ and show that F(r) < 0 for a small value of r, which yields a contradiction unless Cov(q̂, β̂_0) = 0
5. Finally, V(β̂_1) = V(β̂_0 − q̂) = V(β̂_0) + V(q̂)


An Example: Democratic Peace Debate

- International Organization special issue: Green et al., Oneal & Russett, Beck & Katz, King
- Dyadic analysis
- Effect of Democracy on bilateral trade (given here) and conflict (see later slide)
- Hausman test for pooled analysis vs. fixed effects

[Table 2: alternative regression analyses of bilateral trade, pooled vs. fixed effects, with GDP, population, distance, alliance, democracy, and lagged bilateral trade as regressors. Distance is dropped in the fixed effects column (no within-group variation). Pooled: N = 93,924, adjusted R² = 0.36; fixed effects: NT = 93,924, N = 3,079, T ≈ 20, adjusted R² = 0.63. Estimates obtained using the areg and xtreg procedures; GDP, population, distance, and bilateral trade are in natural logs.]

A Generalization of the Random Effects Model

- The random effects model assumes α_i ⊥⊥ X_i
- "Correlated" random effects: α_i | X_i ~ N(α + ξ^⊤x̄_i, ω²)   (independent across i)
- Then, β^⊤x_it + ξ^⊤x̄_i = β^⊤(x_it − x̄_i) + (β + ξ)^⊤x̄_i implies
  Z_i = the matrix with rows (1, (x_it − x̄_i)^⊤, x̄_i^⊤) and γ = (α, β, λ), where λ = β + ξ
- This leads to the surprising result:
  β̂_CRE = β̂_W   and   λ̂_CRE = β̂_B
- Empirical Bayes estimate of α_i:
  {τ̂T/(1 + τ̂T)} α̂_i,FE + {1/(1 + τ̂T)} { ȳ − β̂_B^⊤x̄ + (β̂_B − β̂_W)^⊤x̄_i }


Models with Varying Slopes and Intercepts

- The random effects model can be made richer and more flexible
- Linear mixed effects models:
  Y_i | X, Z, ζ ~ N(X_i β + Z_i ζ_i, Σ_i)   (independent across i)
  ζ_i | X, Z ~ N(0, Ω)   (i.i.d.)
  where Z is typically a subset of X
- Estimating ζ_i without partial pooling is unrealistic when the dimension of Z_i is large
- Useful if intercepts/slopes differ across units and are of interest
- Multilevel/hierarchical models are extensions of this basic model (see Gelman and Hill (2007) for many interesting examples)
- Reduced form: Y_i | X, Z ~ N(X_i β, Λ_i), where Λ_i = Z_i Ω Z_i^⊤ + Σ_i


Estimation of Linear Mixed Effects Models

- With known variance, the GLS estimator is the MLE:
  β̂ = (Σ_{i=1}^N X_i^⊤ Λ_i^{-1} X_i)^{-1} Σ_{i=1}^N X_i^⊤ Λ_i^{-1} Y_i
- Empirical Bayes estimate for ζ_i:
  ζ̂_i = (Z_i^⊤ Σ_i^{-1} Z_i + Ω^{-1})^{-1} Z_i^⊤ Σ_i^{-1} (Y_i − X_i β̂) = Ω Z_i^⊤ Λ_i^{-1} (Y_i − X_i β̂)
- For known variance, ζ̂_i attains the smallest MSE
- Restricted maximum likelihood (REML) to estimate the variance components, where β is integrated out of the likelihood over an improper prior: lmer() in the lme4 package
- Possible to obtain the MLE of all parameters at once via the EM algorithm, treating ζ_i as missing data
- Fully Bayesian approach via Gibbs sampling
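The two expressions for ζ̂_i above are linked by a standard matrix identity (the same one that reappears in the Kalman filter derivation later); a quick numerical check with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, q = 6, 2                      # periods and random-effect dimension (illustrative)
Zi = rng.normal(size=(T, q))
Omega = np.eye(q) * 0.5
Sigma = np.eye(T) * 1.3
resid = rng.normal(size=T)       # stands in for Y_i - X_i @ beta_hat

# form 1: posterior-mean (ridge-like) expression
Si = np.linalg.inv(Sigma)
zeta1 = np.linalg.solve(Zi.T @ Si @ Zi + np.linalg.inv(Omega), Zi.T @ Si @ resid)

# form 2: marginal-covariance expression with Lambda_i = Z Omega Z' + Sigma
Lam = Zi @ Omega @ Zi.T + Sigma
zeta2 = Omega @ Zi.T @ np.linalg.solve(Lam, resid)

print(np.allclose(zeta1, zeta2))  # the two BLUP formulas coincide
```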


Generalized Linear Mixed Effects Models (GLMM)

- Extension of linear mixed effects models:
  1. Linear predictor: η_it = β^⊤x_it + ζ_i^⊤z_it
  2. Link function: g(µ_it) = η_it, where µ_it = E(y_it | X, Z, ζ_i)
  3. Random components:
     y_it | X, Z, ζ_i ~ f(y | η_it), an exponential-family distribution (independent across observations)
     ζ_i ~ N(0, Ω)   (i.i.d.)
- All diagnostics etc. for GLM can be applied
- The likelihood function:
  Π_{i=1}^N ∫ { Π_{t=1}^T f(y_it | η_it) } h(ζ_i) dζ_i
- In most cases, no analytical solution to the integral exists


Estimation of GLMM

- Decomposition: y_it = g^{-1}(η_it) + ε_it
- Taylor expansion around the current estimates (β^(t), ζ^(t)):
  Y_i^(t) ≈ X_i β + Z_i ζ_i + ε_i^(t), where Y_i^(t) = (V̂_i^(t))^{-1}(Y_i − µ̂_i) + X_i β̂^(t) + Z_i ζ̂_i^(t)
  and V̂_i is the diagonal matrix whose diagonal elements correspond to the variance function b″(µ̂_it)
- Iterated weighted least squares, where each iteration solves the optimization problem of a linear mixed effects model
- The resulting estimator can be justified as the penalized quasi-likelihood estimator
- The approximation can be poor
- Alternative approximations: Gaussian quadrature (glmer()), MCEM, and MCMC


An Example: Modeling Latent Social Networks

- Hoff and Ward (Political Analysis, 2004): modeling bilateral trade using cross-section dyadic data
- Model (export from i to j):
  y_ij = β^⊤x_ij + a_i + b_j + γ_ij + z_i^⊤z_j
  where the error term ε_ij = a_i + b_j + γ_ij + z_i^⊤z_j combines: a_i, the sender effect; b_j, the receiver effect; γ_ij, the dyadic effect; and z_i, a vector in the latent network space
- Random effects specification:
  1. (a_i, b_i) ~ N(0, Σ)
  2. γ_ij = γ_ji ~ N(0, Φ)
  3. z_i ~ N(0, σ²I)


Generalized Estimating Equations

- If you are not interested in the varying intercepts/slopes themselves, you need not estimate ζ_i as in GLMM
- Model E(Y_i | X) = g^{-1}(X_i β) rather than E(Y_i | X, Z, ζ_i) (where Z is typically a subset of X, though it does not have to be)
- Advantage: no need to assume a particular covariance of Y_i:
  V_i = V(Y_i | X) = φ A_i(β)^{1/2} R(α) A_i(β)^{1/2}
  where R is the "working" correlation matrix, A_i is a diagonal matrix of variances, and φ is a dispersion parameter
- Generalized estimating equation (GEE):
  U(β) = Σ_{i=1}^N (∂µ_i/∂β)^⊤ V_i^{-1} (Y_i − µ_i) = 0
- A review article: Zorn (AJPS, 2001)


Properties of GEE

- Close connection to the method of moments
- Asymptotic properties hold even when R(α) is misspecified (consistency and asymptotic normality with robust standard errors):
  √N(β̂_N − β_0) →D N( 0, E(D_i^⊤V_i^{-1}D_i)^{-1} E(D_i^⊤V_i^{-1} V(Y_i | X) V_i^{-1}D_i) E(D_i^⊤V_i^{-1}D_i)^{-1} )
  where D_i = ∂µ_i/∂β
- Advantages of GEE:
  1. Only need to get the mean right! (most efficient when you also get the variance structure correct)
  2. Unlike the independence model with robust standard errors, you get consistency as well as asymptotic normality
  3. Possible efficiency gain by accounting for correlation in the data
- GEE does assume exogeneity: you need to get the mean correct


Working Correlation Matrix and Estimation of GEE

- Popular choices:
  1. Independence: R_i(t, t′) = 0
  2. Exchangeability: R_i(t, t′) = α
  3. AR(1): R_i(t, t′) = α^{|t−t′|}
  4. Stationary m-dependence: R_i(t, t′) = α_{t,t′} if |t − t′| ≤ m, and 0 otherwise
  5. Unstructured: R_i(t, t′) = α_{t,t′}
- Iterative estimation procedure:
  1. Use β^(t) to update D_i and A_i
  2. Estimate V(Y_i | X)^(t) = Σ_{i=1}^N (Y_i − µ^(t))(Y_i − µ^(t))^⊤ / N
  3. Compute the Pearson residuals ε̂P_it and estimate φ^(t) = Σ_{i=1}^N Σ_{t=1}^T (ε̂P_it)²/(NT) and R(α̂^(t)), which give V_i^(t)
  4. Update β using one iteration of the Fisher scoring algorithm:
     β^(t+1) = β^(t) + { Σ_{i=1}^N (D_i^(t))^⊤ (V_i^(t))^{-1} D_i^(t) }^{-1} { Σ_{i=1}^N (D_i^(t))^⊤ (V_i^(t))^{-1} (Y_i − µ^(t)) }
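The working correlation choices are simple to construct; a minimal sketch of the AR(1) case (function name is illustrative):

```python
import numpy as np

def ar1_working_corr(T, alpha):
    """AR(1) working correlation: R[t, s] = alpha ** |t - s|."""
    idx = np.arange(T)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

R = ar1_working_corr(4, 0.5)
print(R[0])  # [1.0, 0.5, 0.25, 0.125]
```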


Fixed Effects in GLM

- Model: E(y_it | X) = g^{-1}(β^⊤x_it + α_i)
- Incidental parameter problem (Neyman and Scott): the number of parameters goes to infinity as N tends to infinity (for fixed T)
- Asymptotic properties of the MLE are no longer guaranteed to hold; especially problematic for small T and large N
- In the linear case, everything is fine because the sampling distribution of β̂ does not depend on α_i; in the nonlinear case, this is not generally true
- Canonical example: fixed effects logistic regression with T = 2 and x_it = a time dummy
  Model: Pr(y_it = 1 | X) = exp(α_i + βx_it) / {1 + exp(α_i + βx_it)}
  α̂_i = ∞ if (y_i1, y_i2) = (1, 1), and α̂_i = −∞ if (y_i1, y_i2) = (0, 0)
  β̂ →p 2β


Conditional Likelihood Inference

- Maximize the conditional likelihood given a sufficient statistic for α_i
- S(Y) is said to be a sufficient statistic for θ if the conditional distribution of Y given S(Y) does not depend on θ
- Properties (Andersen): asymptotically consistent and normal, less efficient than the MLE
- A simple example:
  y_it = β^⊤x_it + α_i + ε_it, where ε_it ~ N(0, σ²) i.i.d.
- A sufficient statistic for α_i is Σ_{t=1}^T y_it
- Conditional likelihood function:
  Π_{i=1}^N f( y_i1, ..., y_iT | x_it, Σ_{t=1}^T y_it ) ∝ exp[ −{1/(2σ²)} Σ_{i=1}^N Σ_{t=1}^T { (y_it − ȳ_i) − β^⊤(x_it − x̄_i) }² ]


Logistic Regression with Fixed Effects

- A sufficient statistic for α_i is again Σ_{t=1}^T y_it
- A special case, T = 2: define
  w_i = 0 if (y_i1, y_i2) = (1, 0), and w_i = 1 if (y_i1, y_i2) = (0, 1)
  Then, Pr(w_i = 1 | y_i1 + y_i2 = 1) = exp(β^⊤(x_i2 − x_i1)) / {1 + exp(β^⊤(x_i2 − x_i1))}
- Conditional likelihood:
  Π_{i=1}^N logit^{-1}(β^⊤(x_i2 − x_i1))^{w_i} {1 − logit^{-1}(β^⊤(x_i2 − x_i1))}^{1−w_i}
- The general case:
  Π_{i=1}^N exp(β^⊤ Σ_{t=1}^T x_it y_it) / Σ_{d ∈ B_i} exp(β^⊤ Σ_{t=1}^T x_it d_t)
  where B_i = { d = (d_1, ..., d_T) : d_t = 0 or 1, and Σ_{t=1}^T d_t = Σ_{t=1}^T y_it }
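A sketch of the T = 2 conditional likelihood in practice: simulate, keep the discordant units, and fit the implied logistic regression of w_i on x_i2 − x_i1 by Newton's method (the true β = 1 and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, beta = 20000, 1.0

# T = 2 logit with unit-specific intercepts and a scalar covariate
alpha = rng.normal(size=N)
x = rng.normal(size=(N, 2))
p = 1 / (1 + np.exp(-(alpha[:, None] + beta * x)))
y = (rng.uniform(size=(N, 2)) < p).astype(int)

# keep discordant units (y_i1 + y_i2 = 1); the others carry no information about beta
keep = y.sum(1) == 1
w = y[keep, 1]                    # w_i = 1 iff (y_i1, y_i2) = (0, 1)
dx = x[keep, 1] - x[keep, 0]

# conditional MLE: logistic regression of w on dx with no intercept (Newton steps)
b = 0.0
for _ in range(25):
    m = 1 / (1 + np.exp(-b * dx))
    grad = np.sum((w - m) * dx)
    hess = -np.sum(m * (1 - m) * dx ** 2)
    b -= grad / hess
print(b)  # close to the true beta = 1.0
```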

Estimation and Conditional Likelihood

- Equivalent to the partial likelihood function of the stratified Cox model in discrete time:
  a single discrete time period (rather than continuous time);
  all observations with y_it = 1 are treated as failures at time 1;
  all observations with y_it = 0 are treated as censored at time 1;
  units correspond to strata
- In R, use clogit() or the following syntax:
  coxph(Surv(time = rep(1, N*T), status = y) ~ x + strata(units), method = "exact")
- The exact calculation can be difficult when the number of time periods is large; use an approximation


Limitations of the Conditional Likelihood Approach

- Loss of information: e.g., observations with Σ_{t=1}^T y_it = 0 or T drop out
- Sufficient statistics are not easily found:
  it works for logit, multinomial logit, Weibull, etc., but it does not work for probit, etc.
- Cannot estimate all the quantities of interest:
  only β can be estimated; predicted probabilities, risk differences, etc. cannot be estimated; risk ratios and odds ratios can be estimated
- An alternative approach: correlated random effects
  Recall that in the linear case the two approaches give the same estimate of β; in the nonlinear case, this does not hold, but the random effects can still be correlated with the covariates
  All quantities of interest can be estimated; a special case of GLMM


Back to Dirty Pool

- Democratic Peace: effect of Democracy on militarized disputes
- Hausman test for pooled analysis vs. fixed effects
- Conditional likelihood: peaceful dyads are dropped
- What is the effect size?
- Oneal and Russett estimate positive effects using the fixed effects model for the data from 1885 (instead of 1952)
- Heterogeneous effects?

[Table 3: alternative logistic regression analyses of militarized disputes (1951-92), pooled vs. fixed effects, with contiguity, capability ratio (log), growth, alliance, bilateral trade/GDP, lagged dispute, and democracy as regressors. Pooled: N = 93,755, log likelihood −3,688.06, χ² = 1,186.43 with 6 degrees of freedom.]

Dynamic Linear Models

- Model with time-varying coefficients:
  Y_t | X_t, Z_t ~ N(X_t β + Z_t γ_t, σ² I_N)   (independent across t)
  where the time-varying coefficients γ_t have a random-walk prior,
  γ_t ~ N(γ_{t−1}, Σ) for t = 2, ..., T, and γ_1 ~ N(µ_0, Ω_0)
- Can add second-level covariates: γ_t ~ N(V_t γ_{t−1}, Σ)
- A special case of state-space models and Markov models


Dependence through the Simple Markov Structure

[Graph: hidden states γ_1 → γ_2 → γ_3 evolving over time, with observations Y_1, Y_2, Y_3 each depending only on the contemporaneous γ_t.]


Example: Ideal Point Models

- Ideal point model (Clinton, Jackman & Rivers, 2004; aka item response theory):
  y*_ij = α_j + β_j x_i + ε_ij
  where y*_ij ≥ 0 if y_ij = 1 ("yea") and y*_ij < 0 if y_ij = 0 ("nay")
  α_j: item difficulty; β_j: item discrimination; x_i: ideological position, i.e., the ideal point
  ε_ij: spatial voting model error, typically assumed ε_ij ~ N(0, 1) i.i.d.
- Dynamic ideal point model (Martin and Quinn, 2002):
  y*_ijt = α_jt + β_jt x_it + ε_ijt
  x_it = x_i,t−1 + η_it, where η_it ~ N(0, ω_i²) i.i.d.
- What is the key identification assumption for the dynamic model?


Estimated Ideal Points for Supreme Court Justices

[Fig. 1: posterior density summary of the ideal points of selected justices for the terms in which they served, from the dynamic ideal point model.]


EM Algorithm for DLM

- Complete-data likelihood function:
  p(Y, γ | X, Z; β, σ², Σ, µ_0, Ω_0) = p(γ_1; µ_0, Ω_0) Π_{t=2}^T p(γ_t | γ_{t−1}; Σ) Π_{t=1}^T p(Y_t | X_t, Z_t, γ_t; β, σ²)
  where Y_t ~ N(X_t β + Z_t γ_t, σ² I_N)   (independent across t)
- EM updates:
  1. µ_0 = E(γ_1) and Ω_0 = E(γ_1 γ_1^⊤) − E(γ_1)E(γ_1)^⊤
  2. Σ = {1/(T − 1)} Σ_{t=2}^T { E(γ_t γ_t^⊤) − E(γ_t γ_{t−1}^⊤) − E(γ_{t−1} γ_t^⊤) + E(γ_{t−1} γ_{t−1}^⊤) }
  3. β = (Σ_{t=1}^T X_t^⊤X_t)^{-1} Σ_{t=1}^T X_t^⊤ { Y_t − Z_t E(γ_t) }
  4. σ² = {1/(NT)} Σ_{t=1}^T Σ_{i=1}^N { ỹ_it² − 2 ỹ_it Z_it^⊤ E(γ_t) + Z_it^⊤ E(γ_t γ_t^⊤) Z_it }, where ỹ_it = y_it − X_it^⊤β


Computation for the E-step

- Must compute E(γ_t), E(γ_t γ_t^⊤), and E(γ_t γ_{t−1}^⊤) conditional on all the data
- Define:
  α(γ_t) = p(γ_t | Y_1, ..., Y_t)
  δ(γ_t) = p(Y_{t+1}, ..., Y_T | γ_t)
- The posterior is given by p(γ_t | Y_1, ..., Y_T) = c_t α(γ_t) δ(γ_t), where c_t = 1/p(Y_{t+1}, ..., Y_T | Y_1, ..., Y_t) is a normalizing constant
- The joint distribution of (Y_1, ..., Y_T) and (γ_1, ..., γ_T) is Gaussian, so α(γ_t), δ(γ_t), and α(γ_t)δ(γ_t) are all Gaussian
- Forward-backward algorithm through Kalman filtering


Forward Recursion

α(γ_t) = ∫ p(γ_t, γ_{t−1} | Y_1, ..., Y_t) dγ_{t−1}
       = ∫ { p(γ_t, γ_{t−1}, Y_t | Y_1, ..., Y_{t−1}) / p(Y_t | Y_1, ..., Y_{t−1}) } dγ_{t−1}
       ∝ ∫ p(Y_t, γ_t | γ_{t−1}, Y_1, ..., Y_{t−1}) p(γ_{t−1} | Y_1, ..., Y_{t−1}) dγ_{t−1}
       ∝ p(Y_t | γ_t) ∫ p(γ_t | γ_{t−1}) α(γ_{t−1}) dγ_{t−1}

Thus, α(γ_t) = N(µ_t, Ω_t) is obtained as

φ(γ_t; µ_t, Ω_t) ∝ φ(Y_t; X_t β + Z_t γ_t, σ²I) ∫ φ(γ_t; γ_{t−1}, Σ) φ(γ_{t−1}; µ_{t−1}, Ω_{t−1}) dγ_{t−1}
                ∝ φ(Y_t; X_t β + Z_t γ_t, σ²I) φ(γ_t; µ_{t−1}, Q_{t−1})

where Q_{t−1} = Ω_{t−1} + Σ.


Finally, Bayes' rule gives

µ_t = Ω_t (Q_{t−1}^{-1} µ_{t−1} + σ^{-2} Z_t^⊤ Ỹ_t)
Ω_t = (Q_{t−1}^{-1} + σ^{-2} Z_t^⊤ Z_t)^{-1}

where Ỹ_t = Y_t − X_t β. We can further simplify using the Woodbury formula,

(A + BD^{-1}C)^{-1} = A^{-1} − A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}

which implies

Ω_t = Q_{t−1} − Q_{t−1} Z_t^⊤ (σ²I + Z_t Q_{t−1} Z_t^⊤)^{-1} Z_t Q_{t−1} = (I − R_t Z_t) Q_{t−1}

where R_t = Q_{t−1} Z_t^⊤ (σ²I + Z_t Q_{t−1} Z_t^⊤)^{-1}. Finally, note another rule:

(A^{-1} + B^⊤ C^{-1} B)^{-1} B^⊤ C^{-1} = A B^⊤ (B A B^⊤ + C)^{-1}

Then, we have

σ^{-2} Ω_t Z_t^⊤ = Q_{t−1} Z_t^⊤ (Z_t Q_{t−1} Z_t^⊤ + σ²I)^{-1} = R_t
µ_t = µ_{t−1} + R_t (Ỹ_t − Z_t µ_{t−1})
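A numpy sketch of one step of this forward (Kalman filter) recursion in the gain form; the dimensions are illustrative, and y_tilde stands in for Ỹ_t = Y_t − X_t β:

```python
import numpy as np

# One forward (Kalman filter) step for the DLM, in the "gain" form.
def forward_step(mu, Omega, Sigma, Z, sigma2, y_tilde):
    Q = Omega + Sigma                               # Q_{t-1} = Omega_{t-1} + Sigma
    S = Z @ Q @ Z.T + sigma2 * np.eye(len(y_tilde)) # innovation covariance
    R = Q @ Z.T @ np.linalg.inv(S)                  # gain R_t
    mu_new = mu + R @ (y_tilde - Z @ mu)            # mu_t
    Omega_new = (np.eye(len(mu)) - R @ Z) @ Q       # Omega_t = (I - R_t Z_t) Q_{t-1}
    return mu_new, Omega_new

rng = np.random.default_rng(0)
mu, Omega, Sigma = np.array([0.3, -0.2]), np.eye(2), 0.5 * np.eye(2)
Z, sigma2, y_tilde = rng.normal(size=(3, 2)), 1.0, rng.normal(size=3)
mu_new, Omega_new = forward_step(mu, Omega, Sigma, Z, sigma2, y_tilde)

# the gain form agrees with the information form derived above
Q = Omega + Sigma
Omega_info = np.linalg.inv(np.linalg.inv(Q) + Z.T @ Z / sigma2)
print(np.allclose(Omega_new, Omega_info))  # True
```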


Backward Recursion

p(γ_t | Y_1, ..., Y_T) = ∫ p(γ_t, γ_{t+1} | Y_1, ..., Y_T) dγ_{t+1}
                       = ∫ p(γ_t | γ_{t+1}, Y_1, ..., Y_t) p(γ_{t+1} | Y_1, ..., Y_T) dγ_{t+1}

From the forward recursion, we have

(γ_{t+1}, γ_t)^⊤ | Y_1, ..., Y_t ~ N( (µ_t, µ_t)^⊤, [ Γ_t + ω_x²,  Γ_t ;  Γ_t,  Γ_t ] )

Then, the conditional distribution is given by

γ_t | γ_{t+1}, Y_1, ..., Y_t ~ N( µ_t + Γ_t(Γ_t + ω_x²)^{-1}(γ_{t+1} − µ_t),  Γ_t − Γ_t(Γ_t + ω_x²)^{-1}Γ_t )


Let's assume p(γ_{t+1} | Y_1, ..., Y_T) = N(m_{t+1}, S_{t+1}). Then, we have

m_t = E(γ_t | Y_1, ..., Y_T)
    = E{ E(γ_t | γ_{t+1}, Y_1, ..., Y_t) | Y_1, ..., Y_T }
    = µ_t + Γ_t(Γ_t + ω_x²)^{-1}(m_{t+1} − µ_t)

Finally,

S_t = V(γ_t | Y_1, ..., Y_T)
    = V{ E(γ_t | γ_{t+1}, Y_1, ..., Y_t) | Y_1, ..., Y_T } + E{ V(γ_t | γ_{t+1}, Y_1, ..., Y_t) | Y_1, ..., Y_T }
    = V{ µ_t + Γ_t(Γ_t + ω_x²)^{-1}(γ_{t+1} − µ_t) | Y_1, ..., Y_T } + Γ_t − Γ_t(Γ_t + ω_x²)^{-1}Γ_t
    = Γ_t + Γ_t(Γ_t + ω_x²)^{-1}{ S_{t+1} − (Γ_t + ω_x²) }(Γ_t + ω_x²)^{-1}Γ_t


Concluding Remarks

- Longitudinal data: opportunities to model within-unit and across-time variation as well as across-unit variation
- Old debates: fixed or random intercepts
- GLMM: a generalization of random effects models
- Model slopes as well as intercepts: the importance of substantive theory
- Structural modeling and causal inference: for causal inference, getting the conditional mean right is essential
- Static models ignore the systematic dependence on past outcomes: all dependence is in the error term
