Poisson Regression Overview - atlas

Poisson Distribution Function

Distribution   Functional Form                                Mean   Standard Deviation
Poisson        P(Y=k) = e^(-µ) µ^k / k!,  k = 0, 1, 2, ...    µ      sqrt(µ)

Poisson Regression Overview

Poisson regression is often used to analyze count data. It can be used to model the number of occurrences of an event of interest, or the rate of occurrence of an event of interest, as a function of some independent variables. For example, the rate of insurance claims, number of doctor visits, incidence of diseases, crime incidence, number of days a child is absent from school, and colony counts for bacteria can all be modeled using Poisson regression.

In Poisson regression it is assumed that the dependent variable Y, the number of occurrences of an event, has a Poisson distribution given the independent variables X1, X2, ..., Xm:

P(Y=k | x1, x2, ..., xm) = e^(-µ) µ^k / k!,   k = 0, 1, 2, ...,

where the log of the mean µ is assumed to be a linear function of the independent variables. That is,

log(µ) = intercept + b1*X1 + b2*X2 + ... + bm*Xm,

which implies that µ is the exponential of a linear function of the independent variables:

µ = exp(intercept + b1*X1 + b2*X2 + ... + bm*Xm).

In many situations the rate or incidence of an event needs to be modeled instead of the number of occurrences. For example, suppose that we know the number of occurrences of a certain disease by county, and we want to find out whether the frequency of occurrence depends on certain demographic variables and health policy programs that are also recorded by county. Since more subjects at risk result in more occurrences of the disease, we need to adjust for the number of subjects at risk in each county. For such data, we can write a Poisson regression model in the following form:

log(µ) = log(N) + intercept + b1*X1 + b2*X2 + ... + bm*Xm,

where N is the total number of subjects at risk in each county. The logarithm of N is used as an offset, that is, a regression variable with a constant coefficient of 1 for each observation. The log of the incidence, log(µ / N), is thus modeled as a linear function of the independent variables.

The maximum likelihood method is used to estimate the parameters of Poisson regression models.
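The formulas above can be checked numerically. The following is a minimal Python sketch, not part of the SAS analysis, that evaluates the Poisson probability, the mean implied by the log link, and the effect of an offset log(N). The function names, the coefficient values, and the at-risk count of 1000 are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) = e^(-mu) * mu^k / k! for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def expected_count(intercept, coefs, xs, at_risk=None):
    """Mean mu = exp(intercept + sum(b_j * x_j)), plus log(N) as an offset if given."""
    eta = intercept + sum(b * x for b, x in zip(coefs, xs))
    if at_risk is not None:
        eta += math.log(at_risk)  # offset: its coefficient is fixed at 1
    return math.exp(eta)


# Hypothetical coefficients and covariate values (assumptions, not estimates
# from the data set discussed in this document).
intercept, coefs = 0.5, [0.2, -0.1]
xs = [1.0, 2.0]

mu = expected_count(intercept, coefs, xs)                # count model
mu_offset = expected_count(intercept, coefs, xs, 1000)   # rate model, N = 1000

print("mean count:", mu)
print("P(Y = 0), P(Y = 1), P(Y = 2):", [poisson_pmf(k, mu) for k in range(3)])
print("mean count with offset:", mu_offset)
print("implied rate mu / N:", mu_offset / 1000)
```

With the offset included, dividing the mean by N recovers the same quantity as the model without an offset, which is why log(µ / N) is linear in the independent variables.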
In SAS, the GENMOD procedure can fit Poisson regression models.

Fitting Poisson Regression in SAS

We will use PROC GENMOD to fit Poisson regression models in two examples. The first example models the number of occurrences of an event; the second models the incidence.

Modeling number of occurrences of an event

The data used in this example were generated from a Poisson distribution using the SAS function RANPOI. The data set contains four variables (n_events, c1, x2, x3) and 240 subjects. For each subject, the number of occurrences of an event is saved in the variable n_events. Variables c1 (nominal, three categories) and x2 and x3 (ordinal, four categories each) will be used as independent variables in the model. We can think of this analysis as, for example, modeling the number of books read by a 5th-grade student (n_events) as a function of the student’s grade in math (x2), grade in language arts (x3), and reading incentive program (c1, with one category for each of three programs). The table below shows 10 observations from the data set.

c1   x2   x3   n_events
1    2    1    9
1    3    3    13
1    3    4    22
1    4    4    12
2    2    3    12
2    2    4    13
2    3    2    17
2    3    3    11
2    4    4    12
3    3    4    22

The following SAS program runs the Poisson regression analysis (in the program the data set is called one).

    proc genmod data=one;
      class c1;
      /* Poisson model with log link; type3 requests tests of each effect */
      model n_events = c1 x2 x3 / dist = poisson link = log type3;
      /* pairwise comparisons of the three levels of c1 */
      contrast 'c1, 1 vs 2' c1 1 -1 0;
      contrast 'c1, 1 vs 3' c1 1 0 -1;
      contrast 'c1, 2 vs 3' c1 0 1 -1;
      /* estimated mean number of events at x2=4, x3=4 for each level of c1 */
      estimate 'c1=1, x2=4, x3=4' int 1 c1 1 0 0 x2 4 x3 4 / exp;
      estimate 'c1=2, x2=4, x3=4' int 1 c1 0 1 0 x2 4 x3 4 / exp;
      estimate 'c1=3, x2=4, x3=4' int 1 c1 0 0 1 x2 4 x3 4 / exp;
    run;

Explanation of the statements and options used in the program

class c1;
Specifies c1 as a classification variable. The remaining variables will be treated as continuous variables in the analysis.

dist = poisson link = log
Specifies the Poisson distribution for the model, that is, declares n_events to have a Poisson distribution, and requests modeling the log of the mean of n_events as a linear function of the independent variables.

type3
Requests Type 3 tests (overall tests for the effects in the model). By default, likelihood ratio tests are printed. If Wald tests are desired, change type3 to type3 wald.

contrast 'c1, 1 vs 2' c1 1 -1 0;
Compares level 1 of the class variable c1 with level 2. Contrast specification is the same as in PROC GLM.

estimate 'c1=1, x2=4, x3=4' int 1 c1 1 0 0 x2 4 x3 4 / exp;
Estimates the mean number of events for subjects in category 1 of variable c1, with values 4 for x2 and 4 for x3. If the exp option is not used, only the log of the mean number of events is estimated.

Interpretation of results

The following table contains information on assessment of fit.

Criteria For Assessing Goodness Of Fit

Criterion            DF    Value       Value/DF
Deviance             235   241.3872    1.0272
Scaled Deviance      235   241.3872    1.0272
Pearson Chi-Square   235   234.5383    0.9980
Scaled Pearson X2    235   234.5383    0.9980
Log Likelihood             4149.0588
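As a quick arithmetic check on the table above, the Value/DF column is each statistic divided by its degrees of freedom. The Python sketch below reproduces those ratios and also computes a rough upper-tail chi-square probability for each statistic. The tail probability uses the Wilson-Hilferty normal approximation, which is an assumption made here to stay within the Python standard library; it is not something PROC GENMOD prints. The function names are likewise illustrative.

```python
from statistics import NormalDist


def dispersion_ratio(value, df):
    """Statistic divided by its degrees of freedom (the Value/DF column)."""
    return value / df


def chi_square_upper_tail(x, df):
    """Approximate P(chi-square with df degrees of freedom > x).

    Wilson-Hilferty approximation: (X/df)^(1/3) is roughly normal with
    mean 1 - 2/(9*df) and variance 2/(9*df).
    """
    mean = 1 - 2 / (9 * df)
    sd = (2 / (9 * df)) ** 0.5
    z = ((x / df) ** (1 / 3) - mean) / sd
    return 1 - NormalDist().cdf(z)


# Values copied from the goodness-of-fit table above.
fit = {
    "Deviance": (241.3872, 235),
    "Pearson Chi-Square": (234.5383, 235),
}

for name, (value, df) in fit.items():
    ratio = dispersion_ratio(value, df)
    p_approx = chi_square_upper_tail(value, df)
    print(f"{name}: Value/DF = {ratio:.4f}, approximate upper-tail p = {p_approx:.3f}")
```

A ratio near 1 together with a large tail probability is the pattern one would expect when the Poisson model fits adequately.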

The Deviance and Pearson Chi-Square have approximately chi-square distributions with the number of degrees of freedom printed in the column titled DF. Both indicate adequate fit. Scaled Deviance and Scaled Pearson X2 are the Deviance and Pearson Chi-Square, respectively, divided by the dispersion parameter. Here they are the same as the unscaled values because, for the Poisson distribution, the dispersion parameter is 1.

The Deviance and Pearson Chi-Square divided by the degrees of freedom are often used to detect overdispersion or underdispersion. For the Poisson distribution the mean and the variance are equal, which implies that the deviance and the Pearson statistic divided by the degrees of freedom should be approximately 1. Values greater than 1 indicate overdispersion, that is, the true variance is larger than the mean; values smaller than 1 indicate underdispersion, that is, the true variance is smaller than the mean. Evidence of underdispersion or overdispersion indicates inadequate fit of the Poisson model. Corrective measures include using the deviance or Pearson Chi-Square divided by the degrees of freedom as an estimate of the dispersion parameter instead of setting it to 1 (options DSCALE and PSCALE in the MODEL statement) or, in the case of overdispersion, running a negative binomial regression instead of the Poisson regression (option DIST=NB instead of DIST=POISSON in the MODEL statement). For our example, the ratios are close to 1, so we can conclude that the fit of the Poisson model is adequate.

The following table presents the significance tests of the model parameters.

Analysis Of Parameter Estimates

Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept   1    1.4096     0.0749           1.2628   1.5563              354.51