Lecture 16: Generalized Additive Models

Regression III: Advanced Methods Bill Jacoby Michigan State University

http://polisci.msu.edu/jacoby/icpsr/regress3

Goals of the Lecture
• Introduce Additive Models
  – Explain how they extend from simple nonparametric regression (i.e., local polynomial regression)
  – Discuss estimation using backfitting
  – Explain how to interpret their results
• Conclude with some examples of Additive Models applied to real social science data


Limitations of the Multiple Nonparametric Models
• Recall that the general nonparametric model (both the lowess smooth and the smoothing spline) takes the following form:

$Y_i = f(X_{i1}, X_{i2}, \ldots, X_{ik}) + \varepsilon_i$

• As we see here, the multiple nonparametric model allows all possible interactions between the independent variables in their effects on Y—we specify a jointly conditional functional form
• This model is ideal under the following circumstances:
  1. There are no more than two predictors
  2. The pattern of nonlinearity is complicated and thus cannot be easily modelled with a simple transformation or polynomial regression
  3. The sample size is sufficiently large

Limitations of the Multiple Nonparametric Models (2)
• The general nonparametric model becomes impossible to interpret and unstable as we add more explanatory variables:
  1. For example, in the lowess case, as the number of variables increases, the window span must become wider in order to ensure that each local regression has enough cases. This process can create significant bias (the curve becomes too smooth)
  2. It is impossible to interpret general nonparametric regression when there are more than two variables—there are no coefficients, and we cannot graph effects in more than three dimensions
• These limitations lead us to additive models


Additive Regression Models
• Additive regression models essentially apply local regression to low-dimensional projections of the data
• The nonparametric additive regression model is

$Y_i = \alpha + f_1(X_{i1}) + f_2(X_{i2}) + \cdots + f_k(X_{ik}) + \varepsilon_i$

• The fj are arbitrary functions estimated from the data; the errors ε are assumed to have constant variance and a mean of 0
• Additive models create an estimate of the regression surface by combining a collection of one-dimensional functions
• The estimated functions fj are the analogues of the coefficients in linear regression


Additive Regression Models (2)
• The assumption that the contribution of each covariate is additive is analogous to the assumption in linear regression that each component enters the model separately
• Recall that the linear regression model is

$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$

where the βj represent linear effects
• For the additive model we model Y as an additive combination of arbitrary functions of the X's:

$Y_i = \alpha + f_1(X_{i1}) + f_2(X_{i2}) + \cdots + f_k(X_{ik}) + \varepsilon_i$

• The fj represent arbitrary functions that can be estimated by lowess or smoothing splines

Additive Regression Models (3)
• Now comes the question: how do we find these arbitrary functions?
• If the X's were completely independent—which will not be the case—we could simply estimate each functional form using a nonparametric regression of Y on each of the X's separately
  – Similarly, in linear regression, when the X's are completely uncorrelated the partial regression slopes are identical to the marginal regression slopes
• Since the X's are related, however, we need to proceed in another way, in effect removing the effects of the other predictors—which are unknown before we begin
• We use a procedure called backfitting to find each curve, controlling for the effects of the others

Estimation and Backfitting
• Suppose that we had a two-predictor additive model:

$Y_i = \alpha + f_1(X_{i1}) + f_2(X_{i2}) + \varepsilon_i$

• If we unrealistically knew the partial-regression function f2 but not f1, we could rearrange the equation in order to solve for f1:

$Y_i - f_2(X_{i2}) = \alpha + f_1(X_{i1}) + \varepsilon_i$

• In other words, smoothing Yi − f2(Xi2) against Xi1 produces an estimate of α + f1(Xi1)
• Simply put, knowing one function allows us to find the other—in the real world, however, we don't know either, so we must proceed initially with estimates

Estimation and Backfitting (2)
1. We start by expressing the variables in mean-deviation form so that the partial regressions sum to zero, thus eliminating the individual intercepts
2. We then take preliminary estimates of each function from a least-squares regression of Y on the X's:

$\hat f_1^{(0)}(X_{i1}) = \hat\beta_1 X_{i1}, \qquad \hat f_2^{(0)}(X_{i2}) = \hat\beta_2 X_{i2}$

3. These estimates are then used as step (0) in an iterative estimation process
4. We then find the partial residuals for X1, which remove from Y its linear relationship to X2 but retain the relationship between Y and X1

Estimation and Backfitting (3)
The partial residuals for X1 are then

$E_i^{(1)} = Y_i - \hat f_2(X_{i2})$

5. The same procedure in step 4 is done for X2
6. Next we smooth these partial residuals against their respective X's, providing a new estimate of each fj:

$\hat f_j = \mathbf{S}_j E^{(j)}$

where Sj is the (n × n) smoother transformation matrix for Xj that depends only on the configuration of Xij for the jth predictor

Estimation and Backfitting (4)
• This process of finding new estimates of the functions by smoothing the partial residuals is repeated until the partial functions converge
  – That is, when the estimates of the smooth functions stabilize from one iteration to the next, we stop
• When this process is done, we obtain estimates of sj(Xij) for every value of Xj
• More importantly, we will have reduced a multiple regression to a series of two-dimensional partial-regression problems, making interpretation easy:
  – Since each partial regression is only two-dimensional, the functional forms can be plotted in two-dimensional plots showing the partial effect of each Xj on Y
  – In other words, perspective plots are no longer necessary unless we include an interaction between two smooth terms
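To make the algorithm concrete, here is a minimal sketch of backfitting for two predictors, using loess as the smoother. The function name backfit2, the variable names, and the convergence tolerance are illustrative choices, not part of the lecture.

backfit2 <- function(y, x1, x2, tol = 1e-6, max.iter = 100) {
  y.c <- y - mean(y)                      # step 1: mean-deviation form
  ols <- lm(y.c ~ x1 + x2)                # step 2: preliminary linear estimates
  f1 <- coef(ols)["x1"] * (x1 - mean(x1))
  f2 <- coef(ols)["x2"] * (x2 - mean(x2))
  for (i in seq_len(max.iter)) {
    f1.old <- f1
    f2.old <- f2
    r1 <- y.c - f2                        # step 4: partial residuals for X1
    f1 <- fitted(loess(r1 ~ x1))          # step 6: smooth them against X1
    f1 <- f1 - mean(f1)                   # keep each partial function centered
    r2 <- y.c - f1                        # step 5: partial residuals for X2
    f2 <- fitted(loess(r2 ~ x2))
    f2 <- f2 - mean(f2)
    if (max(abs(f1 - f1.old)) + max(abs(f2 - f2.old)) < tol) break  # convergence check
  }
  list(alpha = mean(y), f1 = f1, f2 = f2)
}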

Interpreting the Effects
• A plot of Xj versus sj(Xj) shows the relationship between Xj and Y holding constant the other variables in the model
• Since Y is expressed in mean-deviation form, the smooth term sj(Xj) is also centered, and thus each plot shows how Y changes relative to its mean with changes in Xj
• Interpreting the scale of the graphs then becomes easy:
  – The value of 0 on the Y-axis is the mean of Y
  – Distances above 0 are added to the mean, and distances below 0 are subtracted, when determining the fitted value. For example, if the mean is 45, and for a particular X-value (say X = 15) the curve is at sj(Xj) = 4, the fitted value of Y controlling for all other explanatory variables is 45 + 4 = 49
  – If there are several nonparametric relationships, we can add together the effects from the separate graphs for any particular observation to find its fitted value of Y

Additive Regression Models in R: Example: Canadian prestige data
• Here we use the Canadian Prestige data to fit an additive model with prestige regressed on income and education
• In R we use the gam function (for generalized additive models) found in the mgcv package
  – The gam function in mgcv fits only smoothing splines (local polynomial regression can be done in S-PLUS)
  – The formula takes the same form as in the glm function, except now we have the option of mixing parametric terms and smoothed estimates
  – Smooths will be fit to any variable specified with the s(variable) argument
• The simple R-script is as follows:
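The script itself is not reproduced in these notes; a minimal sketch of the fit it describes, assuming the Prestige data frame from the carData package (shipped with the car package in older versions of R) and the illustrative object name mod.gam, is:

library(mgcv)
library(carData)   # Canadian Prestige data (data frame Prestige)

# fit smoothing-spline terms to income and education
mod.gam <- gam(prestige ~ s(income) + s(education), data = Prestige)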


Additive Regression Models in R: Example: Canadian prestige data (2)
• The summary function returns tests for each smooth, the degrees of freedom for each smooth, and an adjusted R-square for the model. The deviance can be obtained from the deviance(model) command
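Continuing the sketch above (the object name mod.gam is illustrative):

summary(mod.gam)    # tests and estimated df for each smooth, adjusted R-square
deviance(mod.gam)   # residual deviance of the fitted model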


Additive Regression Models in R: Example: Canadian prestige data (3)
• Again, as with other nonparametric models, we have no slope parameters to investigate (we do have an intercept, however)
• A plot of the regression surface is necessary


Additive Regression Models in R: Example: Canadian prestige data (4)

[Figure: perspective plot of the fitted additive-model surface, with Prestige plotted against Income and Education]

Additive Model:
• We can see the nonlinear relationship of both education and income with prestige, but there is no interaction between them—i.e., the slope for income is the same at every value of education
• We can compare this model to the general nonparametric regression model

Additive Regression Models in R: Example: Canadian prestige data (5)

General Nonparametric Model:

[Figure: perspective plot of the general nonparametric regression surface, with Prestige plotted against Income and Education]

• This model is quite similar to the additive model, but there are some nuances—particularly in the midrange of income—that are not picked up by the additive model because it does not allow the X's to interact

Additive Regression Models in R: Example: Canadian prestige data (6)
• Perspective plots can also be made automatically using the persp.gam function. These graphs include a 95% confidence region

[Figure: perspective plot of the additive-model surface for prestige against income and education, with upper and lower confidence surfaces; red/green are +/- 2 se]
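The persp.gam call shown on the slide is not reproduced here; in current versions of mgcv a comparable plot can be sketched with vis.gam, where se = 2 adds the red/green surfaces at plus and minus two standard errors (mod.gam is the illustrative object name from the earlier fit):

vis.gam(mod.gam, view = c("income", "education"),
        se = 2, theta = 30, phi = 30)   # theta/phi set the viewing angle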

Additive Regression Models in R: Example: Canadian prestige data (7)
• Since the slices of the additive regression surface in the direction of one predictor (holding the other constant) are parallel, we can graph each partial-regression function separately
• This is the benefit of the additive model—we can graph as many plots as there are variables, allowing us to easily visualize the relationships
• In other words, a multidimensional regression has been reduced to a series of two-dimensional partial-regression plots
• To get these in R:
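The original script is not shown; a minimal sketch using mgcv's plot method for gam objects (again with the illustrative object name mod.gam) is:

plot(mod.gam, pages = 1)   # one panel per smooth term, with +/- 2 se bands by default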


Additive Regression Models in R: Example: Canadian prestige data (8)

[Figure: partial-regression plots from the additive model: s(income, 3.12) plotted against income and s(education, 3.18) plotted against education]

Additive Regression Models in R: Example: Canadian prestige data (9)

[Figure: the partial-regression plots again: s(income, 3.12) against income and s(education, 3.18) against education]

R-script for previous slide


Residual Sum of Squares
• As was the case for smoothing splines and lowess smooths, statistical inference and hypothesis testing are based on the residual sum of squares (or deviance in the case of generalized additive models) and the degrees of freedom
• The RSS for an additive model is easily defined in the usual manner:

$\mathrm{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

• The approximate degrees of freedom, however, need to be adjusted from the regular nonparametric case because we are no longer specifying a jointly conditional functional form
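For a Gaussian additive model the deviance reported by mgcv equals this RSS; a quick check on the earlier (illustrative) fit:

sum(residuals(mod.gam)^2)   # residual sum of squares
deviance(mod.gam)           # the same value, reported as the deviance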


Degrees of Freedom
• Recall that for nonparametric regression, the approximate degrees of freedom are equal to the trace of the smoother matrix (the matrix that projects Y onto Y-hat)
• We extend this to the additive model: 1 is subtracted from each df, reflecting the constraint that each partial-regression function sums to zero (the individual intercepts have been removed):

$df_j = \mathrm{trace}(\mathbf{S}_j) - 1$

• Parametric terms entered in the model each occupy a single degree of freedom, as in the linear regression case
• The individual degrees of freedom are then combined into a single measure:

$df_{\mathrm{mod}} = 1 + \sum_j df_j$

1 is added to the final degrees of freedom to account for the overall constant in the model
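In mgcv the per-term effective degrees of freedom appear in summary(); the total model edf (including the intercept) can be checked on the illustrative fit with:

sum(mod.gam$edf)   # total effective degrees of freedom of the model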

Testing for Linearity
• I can compare the linear model of prestige regressed on income and education with the additive model by carrying out an analysis of deviance
• I begin by fitting the linear model using the gam function
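A sketch of this step, with mod.lin as an illustrative object name:

mod.lin <- gam(prestige ~ income + education, data = Prestige)   # purely parametric (linear) fit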

• Next I want the residual degrees of freedom from the additive model
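Continuing the sketch, the residual degrees of freedom are stored in the fitted gam objects:

mod.gam$df.residual   # residual df of the additive model
mod.lin$df.residual   # residual df of the linear model (for the comparison below)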


Testing for Linearity (2)
• Now I simply calculate the difference in deviance between the two models relative to the difference in degrees of freedom (difference in df = 7.3 − 2 = 5.3)
• This gives a Chi-square test for linearity
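A sketch of this calculation by hand; one common form scales the deviance difference by the error-variance estimate from the additive model (stored by mgcv as sig2), which yields an approximate Chi-square statistic:

dev.diff <- deviance(mod.lin) - deviance(mod.gam)        # drop in deviance
df.diff  <- mod.lin$df.residual - mod.gam$df.residual    # difference in df
pchisq(dev.diff / mod.gam$sig2, df = df.diff, lower.tail = FALSE)   # p-value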

• The difference between the models is highly statistically significant—the additive model describes the relationship of prestige to education and income much better


Testing for Linearity (3)
• An anova function written by John Fox (see the R-script for this class) makes the analysis of deviance simpler to implement:
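Fox's helper function is not reproduced in these notes; for a comparable analysis of deviance, a sketch using the anova method that mgcv itself provides for gam objects is:

anova(mod.lin, mod.gam, test = "Chisq")   # compare the linear and additive fits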

• As we see here, the results are identical to those found on the previous slide