Modeling Count Data with Excess Zeroes: An Empirical Application to Traffic Accidents

Hoong Chor Chin Associate Professor Dept of Civil Engineering & Centre for Transportation Research National University of Singapore 10 Kent Ridge Crescent, SINGAPORE 119260 Tel: (65)-874-2550 Fax: (65)-779-1635 E-mail: [email protected] HOME ADDRESS 21 Faber Drive SINGAPORE 129351

Mohammed Abdul Quddus Assistant Research Officer Center for Transport Studies Department of Civil and Environmental Engineering Imperial College of Science, Technology and Medicine LONDON, SW7 2 BU Tel: (44)-207-594-6153 Fax: (44)-207-594-6102 E-mail: [email protected] HOME ADDRESS 156 Kingsley Road Hounslow, Middlesex TW3 4AD

KEY WORDS: TRAFFIC ACCIDENTS, ZERO-INFLATED MODELS, POISSON REGRESSION, NEGATIVE BINOMIAL REGRESSION

AUTHOR BIOGRAPHIES

Chin Hoong Chor is an Associate Professor in the Department of Civil Engineering at National University of Singapore. Professor Chin Hoong Chor holds a BEng and MEng in Civil Engineering from the National University of Singapore and a PhD in Transportation Engineering from the University of Southampton. He is the Vice Chairman of the Chartered Institute of Logistics and Transport and has published papers on traffic safety assessment, transportation system modeling and congestion management.

Mohammed Abdul Quddus holds a Master of Engineering (Civil) from the National University of Singapore. His main research interests are traffic safety, modelling of count data, GIS in traffic accident analysis and ITS. He is currently working as a research officer in the Department of Civil and Environmental Engineering at Imperial College, London.


ABSTRACT

There are many studies in social sciences, such as traffic accident analysis, where the event counts may be characterized by a large number of zero observations. In this case, the traditional application of the Poisson or negative binomial model to describe the events may not be suitable. A better alternative is to model the process using a zero-altered probability function. In this paper, a proposed model which takes into account both the zero count state (where no event is observed over a given time period) and the non-zero count state (where event counts follow some known distribution) is used to describe the traffic accident phenomenon. The probability of the zero count state ($p$) and the mean number of event counts ($\mu$) in the non-zero count state may depend on the covariates. Sometimes $p$ and $\mu$ are unrelated while at other times $p$ may assume a simple function of $\mu$. A procedure for establishing the best model and evaluating its suitability is also proposed. In proposing the model, different types of traffic accidents at signalized intersections in Singapore were investigated. The results demonstrate that the zero-altered probability process is an appropriate technique for modeling specific types of accidents where the data contain many zero counts.


INTRODUCTION

A number of studies have investigated the suitability of the Poisson regression model to predict accident frequencies at intersections or roadways (Maycock and Hall 1984; Jovanis and Chang 1986; Joshua and Garber 1990; Jones, Janssen and Mannering 1991; Miaou and Lum 1993). These studies have highlighted the fact that because accident occurrences are necessarily discrete, often sporadic and more likely random events, it is better to use Poisson regression models than multiple linear regression models.

One important constraint in the Poisson regression model is that the mean of the distribution must be equal to the variance. If this assumption is not valid, then the standard errors, usually estimated by the maximum likelihood (ML) method, will be biased and the test statistics derived from the model will be incorrect. Indeed, in a number of studies (e.g., Miaou 1994; Shankar, Mannering and Barfield 1995; Vogt and Bared 1998) the accident data were found to be significantly overdispersed, i.e., the variance was much greater than the mean, suggesting that the likelihood of accident occurrence might be incorrectly estimated.

In overcoming the problem of overdispersion, several researchers (Lawless 1987; Miaou 1994; Shankar et al. 1995; Poch and Mannering 1996; Barron 1998) employed the negative binomial (NB) distribution instead of the Poisson. By relaxing the condition of the mean being equal to the variance, the NB regression model was found to be more suitable in describing discrete and nonnegative events. To establish the NB regression model, a stochastic component was introduced into the relationship between traffic accidents and the covariates.

However, Shankar, Milton and Mannering (1997) found that the distribution of annual accident frequencies may be qualitatively different from the simple Poisson and parent NB distributions if there exist excess zero counts in the data. To better reflect the situation, a dual-state system may be assumed. One of the states is the zero-accident state, where the intersections can be regarded as virtually safe¹, while the other is the non-zero accident state, where the accident frequencies are assumed to follow some known distribution such as the Poisson or NB. The traditional Poisson and NB models apply only to intersections in the non-zero accident state, i.e., unsafe intersections. Because of the presence of the zero-accident state, the estimated traditional model will be inherently biased, as there will be an over-representation of zero-accident observations in the data. These excess zero counts may be mistakenly regarded as evidence of overdispersion in the data set whereas, in fact, the overdispersion may be merely due to the way the model is specified.

To handle count data with excess zeroes, Lambert (1992), in a study on defects in manufacturing, proposed a technique called zero-inflated Poisson (ZIP) regression. Lambert pointed out that the probability of the perfect state (i.e., zero-defect state) and the mean of the imperfect state (i.e., non-zero defect state) depend on the covariates. A number of other studies (Mullahy 1986; King 1989; Land, McCall and Nagin 1996; Long 1997; Zorn 1998) have concluded that ZIP regression is a practical way to model count data with excess zeroes, especially because it is straightforward to fit and not difficult to interpret.

This paper proposes an evaluation framework to determine the suitability of applying different count models (i.e., the classical Poisson and NB models as well as their zero-modified counterparts) in accident studies where the count data may exhibit evidence of excess zeroes. In selecting the appropriate model, the Vuong statistic and the overdispersion parameter criteria were used. Accident data from four-legged signalized intersections in the Southwestern part of Singapore were used in this study. Furthermore, to illustrate the choice between single-state and dual-state models, data for different types of accidents, which gave different proportions of zero counts, were examined in turn.

METHODOLOGY

Before describing the evaluation framework, it is necessary first to describe the evolution and emergence of the methodology for modeling count data with excess zeroes, with special emphasis on the zero-altered probability process in traffic accident analyses.

THE POISSON REGRESSION MODEL

Accidents, for example, at an intersection, may be considered to occur both randomly and independently in time. Consequently, the Poisson distribution may be a reasonable description of the accident process. This distribution has only one adjustable parameter, i.e., the mean event count, which must be positive. This requirement may not be satisfied in the case of an additive model². A more common formulation that overcomes this limitation is the log-linear relationship between the expected number of events and the covariates, i.e.,

\[
\mu_i = E(Y_i \mid X_i) = \exp(\beta X_i), \qquad Y_i = 0, 1, 2, \ldots \tag{1}
\]

in which $\mu_i$ is the expected number of events for observation $i$, $Y_i$ is the number of events (in our case, accidents) occurring in an observation unit $i$ (in our case, a major or minor road at an intersection) over a given time period (in our case, one year), $X_i$ ($1 \times k$) is a vector of covariates that describes the characteristics of observation unit $i$ in a given time period and $\beta$ ($k \times 1$) is a vector of coefficients representing the effects of the covariates. This specification ensures that $\mu_i$ is positive for all $i$. Therefore, the probability density function of observing $Y_i$ accidents can be expressed as

\[
\Pr(Y_i \mid X_i) = \frac{\exp(-\mu_i)\,\mu_i^{Y_i}}{Y_i!} \tag{2}
\]

in which $\mu_i$ is a deterministic function of $X_i$, such that $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \mu_i$. The joint distribution of all random variables $Y_i$ is the likelihood function

\[
L(\beta) = \prod_{i=1}^{n} \frac{\exp(-\mu_i)\,\mu_i^{Y_i}}{Y_i!} \tag{3}
\]

where n is the total number of observations. The corresponding log-likelihood function is

\[
l(\beta) = \sum_{i=1}^{n} \left[ Y_i \ln(\mu_i) - \mu_i - \ln(Y_i!) \right] \tag{4}
\]

where the coefficient vector $\beta$ can be estimated using any maximum likelihood (ML) algorithm (Greene 1995).
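As an illustration only, the log-likelihood in equation (4) can be evaluated directly for a candidate coefficient vector. The sketch below uses plain Python; the function name `poisson_loglik` and the toy data are hypothetical and are not taken from the study. An ML routine would maximize this quantity over the coefficients.

```python
import math

def poisson_loglik(beta, X, y):
    """Equation (4): sum over i of Y_i*ln(mu_i) - mu_i - ln(Y_i!)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Log-linear link of equation (1): ln(mu_i) = beta . x_i
        log_mu = sum(b * x for b, x in zip(beta, x_i))
        # math.lgamma(y + 1) equals ln(y!)
        total += y_i * log_mu - math.exp(log_mu) - math.lgamma(y_i + 1)
    return total

# Hypothetical toy data: an intercept and one covariate per observation.
X = [[1.0, 0.2], [1.0, 1.5], [1.0, 0.7]]
y = [0, 3, 1]
ll = poisson_loglik([0.1, 0.5], X, y)
```

With all coefficients set to zero every mean equals one, which gives a convenient hand check of the function.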

On the other hand, the Poisson regression model may not be a reasonable description of the accident process because it has some potential inconsistencies. A critical assumption of a Poisson process is that events are independent. This means that when an event occurs, it does not affect the probabilities of other events occurring in the future. Violation of this assumption implies that the Poisson regression will not be fully efficient, and the estimated standard errors will be biased and inconsistent. Another problem associated with Poisson regression is the implicit assumption that there is no unobserved heterogeneity in the data. As there is no error term in the relationship between $\mu_i$ and $X_i$ in equation (1), the rate is completely determined by the observed covariates, i.e., no allowance is provided for random errors or unobserved variables.

NEGATIVE BINOMIAL REGRESSION MODEL


If there is time variation in the mean accident rate or if heterogeneity is present in the observations, overdispersion may occur. Without identifying its source, the presence of overdispersion in data can be adjusted by introducing a stochastic component in the model so that equation (1) can be rewritten as

\[
\tilde{\mu}_i = \exp(\beta X_i)\,\delta_i = \mu_i \delta_i \tag{5}
\]

where $\delta_i$ is a non-negative random effect that represents the unobserved variation across observations and is assumed to be uncorrelated with $X_i$. Denoting the probability density function of $\delta_i$ by $g(\delta_i)$, marginalizing over $\delta_i$ yields

\[
\Pr(Y_i \mid \mu_i) = \int \frac{\exp(-\tilde{\mu}_i)\,\tilde{\mu}_i^{Y_i}}{Y_i!}\, g(\delta_i)\, d\delta_i \tag{6}
\]

The solution to this integral depends on the form of $g(\delta_i)$. By assuming that $g(\delta_i)$ is a gamma distribution such that $E(\delta_i) = 1$ and $\mathrm{Var}(\delta_i) = \alpha$, it has been shown by Cameron and Trivedi (1986) that the solution to equation (6) will take the form of

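\[
\Pr(Y_i \mid \mu_i, \alpha) = \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \frac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{Y_i} \tag{7}
\]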

in which $\alpha$ is the dispersion parameter to be estimated. For this specification, the mean and variance will respectively be

\[
E(Y_i \mid X_i) = \mu_i \tag{8}
\]

and

\[
\mathrm{Var}(Y_i \mid X_i) = \mu_i + \alpha \mu_i^{2} \tag{9}
\]

The choice between the Poisson and negative binomial (NB) models can largely be determined by the statistical significance of the estimated coefficient $\alpha$. If $\alpha$ is not significantly different from zero (as measured by the t-statistic), the NB model simply reduces to a Poisson regression model with $\mathrm{Var}(Y_i \mid X_i) = E(Y_i \mid X_i)$. If $\alpha$ is significantly different from zero, then NB is the correct approach. Furthermore, since $\mu_i > 0$ for all values of $i$, the variance of the NB distribution will always exceed its mean (for $\alpha > 0$), so the NB model can take into account any overdispersion in the data. ML algorithms may be used to estimate the coefficients $\beta$ and the parameter $\alpha$. The joint distribution of $Y_i$ for the NB will be

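\[
L(\beta, \alpha) = \prod_{i=1}^{n} \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \frac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{Y_i} \tag{10}
\]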

from which the log-likelihood function is

\[
l(\beta, \alpha) = \sum_{i=1}^{n} \left[ \ln \Gamma(Y_i + \alpha^{-1}) - \ln \Gamma(\alpha^{-1}) - \ln \Gamma(Y_i + 1) + Y_i \ln(\alpha \mu_i) - (Y_i + \alpha^{-1}) \ln(1 + \alpha \mu_i) \right] \tag{11}
\]
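The NB log-likelihood of equation (11) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the estimation code used in the study: the function name `nb_loglik` is hypothetical, the means are passed in directly instead of being computed from covariates, and the dispersion parameter is assumed to be strictly positive.

```python
import math

def nb_loglik(mu, y, alpha):
    """Equation (11) for given means mu_i, counts Y_i and dispersion alpha > 0."""
    inv = 1.0 / alpha  # alpha^{-1}
    total = 0.0
    for m, k in zip(mu, y):
        total += (math.lgamma(k + inv) - math.lgamma(inv) - math.lgamma(k + 1)
                  + k * math.log(alpha * m)
                  - (k + inv) * math.log1p(alpha * m))
    return total

# Hypothetical check: with mu = 2 and alpha = 0.5, Pr(Y = 0) = (1/2)^2.
log_p0 = nb_loglik([2.0], [0], 0.5)
```

Exponentiating the single-observation value over all counts recovers the NB probabilities of equation (7), which should sum to one.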

ZERO-INFLATED COUNT MODELS

When accident count data have an over-representation of zero-accident observations, e.g., in the case of a specific accident type where the number of accidents may be low, the distribution of accident frequencies including zero counts may not follow the traditional Poisson and NB distributions. In this case, the data can be fitted with zero-inflated models by assuming a dual-state process involving a zero-accident state with probability $p_i$ and a non-zero accident state with probability $1 - p_i$, where $p_i$ is an unknown parameter to be estimated. The first state is for those intersections that always have zero counts while the second is for other intersections whose accident frequencies follow some distribution, such as the Poisson or NB. In this dual-state system, it is difficult to judge whether an intersection observed with a zero count for a particular year is in the first or second state. Therefore, the overall probability of a zero count is a combination of the probabilities of zeroes from each state, weighted by the probability of being in that state, i.e.,

\[
\Pr[Y_i = 0 \mid X_i] = p_i + (1 - p_i)\,R_i(0) \tag{12}
\]

where $R_i(0)$ is a Poisson or NB probability with zero accidents (i.e., $Y_i = 0$) that occurs by chance in the second state. On the other hand, the probability of positive counts is given by

\[
\Pr[Y_i > 0 \mid X_i] = (1 - p_i)\,R_i(Y_i) \tag{13}
\]

where Ri (Yi ) is the Poisson or NB probability with positive counts (Yi>0). Hence, by combining equations (12) and (13), the zero-inflated Poisson (ZIP) regression model can be expressed as

\[
\Pr(Y_i \mid X_i) =
\begin{cases}
p_i + (1 - p_i)\exp(-\mu_i), & Y_i = 0 \\[4pt]
(1 - p_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{Y_i}}{Y_i!}, & Y_i > 0
\end{cases}
\tag{14}
\]

By introducing an indicator variable $l_i$, in which $l_i = 1$ when $Y_i = 0$ and $l_i = 0$ otherwise, the maximization of the log-likelihood function for the ZIP model can be simplified by considering the probability density function

\[
\Pr(Y_i \mid X_i) = l_i\,p_i + (1 - p_i) \left[ \frac{\exp(-\mu_i)\,\mu_i^{Y_i}}{Y_i!} \right] \tag{15}
\]
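The indicator form in equation (15) translates almost directly into code. The following is a minimal sketch in plain Python, assuming the zero-state probability and the Poisson mean are already known; the name `zip_pmf` is illustrative only.

```python
import math

def zip_pmf(y, p, mu):
    """Equation (15): l_i * p_i + (1 - p_i) * Poisson(y; mu_i)."""
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    indicator = 1.0 if y == 0 else 0.0  # l_i = 1 only for a zero count
    return indicator * p + (1.0 - p) * poisson

# Hypothetical values: 30% of units in the zero state, mean of 2 otherwise.
prob_zero = zip_pmf(0, 0.3, 2.0)
```

For a zero count both terms contribute, which reproduces the first case of equation (14); for positive counts only the second term remains.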

Following the same procedure, the zero-inflated negative binomial (ZINB) regression model will give the probability density function

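\[
\Pr(Y_i \mid X_i) = l_i\,p_i + (1 - p_i)\, \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \frac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{Y_i} \tag{16}
\]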

Lambert (1992) proposed that $p_i$ be formulated as a logistic distribution such that

\[
\mathrm{logit}(p_i) = \ln\!\left( \frac{p_i}{1 - p_i} \right) = \gamma A_i \tag{17}
\]

where $A_i$ ($1 \times k$) is the covariate vector that determines the likelihood of the zero-accident state and $\gamma$ ($k \times 1$) is an estimated parameter vector. The mean $\mu_i$ in the non-zero accident state satisfies a log-linear relationship with the covariates such that

\[
\ln(\mu_i) = \beta X_i \tag{18}
\]

The covariates $X_i$ that affect $\mu_i$ of the accident state may or may not be the same as the covariates $A_i$ that affect $p_i$ of the zero-accident state. In the case of dissimilar covariates, i.e., $X_i \neq A_i$, $p_i$ and $\mu_i$ of the ZIP or ZINB model can be expressed as

\[
\mu_i = \exp(\beta X_i) \tag{19}
\]

\[
p_i = \frac{\exp(\gamma A_i)}{1 + \exp(\gamma A_i)} \tag{20}
\]

in which pi is allowed to vary independently of the variables that affect the non-zero accident state.


In the case of similar covariates, i.e., $X_i = A_i$, affecting both $p_i$ and $\mu_i$, the number of parameters can be reduced by treating $p_i$ as a function of $\mu_i$. Lambert (1992) also proposed a natural parameterization

\[
\ln(\mu_i) = \beta X_i \quad \text{and} \quad \mathrm{logit}(p_i) = -\tau \beta X_i \tag{21}
\]

where $\tau$ is an unknown, real-valued shape parameter. To distinguish this case from the case of non-similar covariates, Lambert (1992) denoted this ZIP model as ZIP($\tau$) and the ZINB model as ZINB($\tau$). In the ZIP($\tau$) or ZINB($\tau$) model, $p_i$ of the zero-accident state is a simple multiplicative function of the variables that explain the non-zero accident counts, i.e.,

\[
p_i = \frac{1}{1 + \exp(\tau \beta X_i)} \tag{22}
\]
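The step from equation (21) to equation (22) is short and may be worth writing out. In the derivation below, $\tau$ denotes the shape parameter and $\mu_i = \exp(\beta X_i)$; the final form in terms of $\mu_i^{\tau}$ is added here for illustration and follows by substitution.

\[
\begin{aligned}
\mathrm{logit}(p_i) = -\tau \beta X_i
&\;\Longrightarrow\; \frac{p_i}{1 - p_i} = \exp(-\tau \beta X_i) \\
&\;\Longrightarrow\; p_i = \frac{\exp(-\tau \beta X_i)}{1 + \exp(-\tau \beta X_i)} = \frac{1}{1 + \exp(\tau \beta X_i)} = \frac{1}{1 + \mu_i^{\tau}}
\end{aligned}
\]

For $\tau > 0$, a larger mean in the non-zero accident state therefore implies a smaller probability of the zero-accident state, and the reverse holds for $\tau < 0$.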

Regardless of whether the covariates are similar or dissimilar, the log-likelihood function of the ZIP or ZINB model will simply be

\[
l(\beta, \gamma) = \sum_{i=1}^{n} \ln\left[ \Pr(Y_i \mid X_i) \right] \tag{23}
\]

The model parameters $\beta$ and $\gamma$ may be estimated by using the BHHH (Berndt, Hall B., Hall R. and Hausman 1974) algorithm proposed by Greene (1995).
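To make the structure of equation (23) concrete for the ZIP case with dissimilar covariates, the sketch below evaluates the log-likelihood for given coefficient vectors, combining equations (15), (19) and (20). It is an assumption-laden illustration: the names `dot` and `zip_loglik` are hypothetical, the toy data are invented, and no optimizer (BHHH or otherwise) is included.

```python
import math

def dot(coefs, row):
    """Inner product of a coefficient vector and a covariate row."""
    return sum(c * v for c, v in zip(coefs, row))

def zip_loglik(beta, gamma, X, A, y):
    """Equation (23) for the ZIP model: sum of ln Pr(Y_i | X_i)."""
    total = 0.0
    for x_i, a_i, y_i in zip(X, A, y):
        log_mu = dot(beta, x_i)                        # equation (19)
        p = 1.0 / (1.0 + math.exp(-dot(gamma, a_i)))   # equation (20)
        poisson = math.exp(-math.exp(log_mu) + y_i * log_mu
                           - math.lgamma(y_i + 1))
        # Equation (15): the zero state contributes only when Y_i = 0
        prob = (1.0 - p) * poisson + (p if y_i == 0 else 0.0)
        total += math.log(prob)
    return total

# Hypothetical toy data: intercept-only covariates for both states.
X = [[1.0], [1.0]]
A = [[1.0], [1.0]]
y = [0, 2]
ll_half = zip_loglik([0.0], [0.0], X, A, y)   # p_i = 0.5, mu_i = 1
```

When the logit intercept is made very negative, the zero-state probability vanishes and the value approaches the ordinary Poisson log-likelihood, which offers a simple sanity check.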


King (1989) and Zorn (1998) noted that the extent of overdispersion in an NB model would vary as a function of the covariates, i.e., $\alpha_i = \exp(\theta Z_i)$, where $Z_i$ ($1 \times k$) is a set of covariates and $\theta$ is the corresponding coefficient vector. If this is the case, then it becomes very difficult to distinguish, on purely statistical grounds, this parameterized negative binomial model from a ZIP model with

\[
p_i = \exp(\gamma A_i).
\]

It is worth noting that an alternative to the above method of modeling excess zero counts is the hurdle model suggested by Mullahy (1986), Pohlmeier and Ulrich (1995) and Gurmu and Trivedi (1996). The basic idea is that a binomial probability governs the binary outcome of whether a count variate has a zero or a positive realization. If the realization is positive, the “hurdle is crossed”, and the conditional distribution of the positives is governed by a truncated-at-zero count data model.
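For comparison with equation (14), one common way of writing a hurdle model is shown below. The notation is an assumption made here for illustration: $\pi_i$ denotes the binomial probability of a zero realization and $f(\cdot)$ the probability function of the untruncated parent count model (e.g., Poisson or NB).

\[
\Pr(Y_i = y) =
\begin{cases}
\pi_i, & y = 0 \\[4pt]
(1 - \pi_i)\,\dfrac{f(y)}{1 - f(0)}, & y = 1, 2, \ldots
\end{cases}
\]

Unlike the zero-inflated specification, all zeroes here arise from the binary process, and the count process contributes only positive outcomes.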

SELECTING THE RELEVANT COUNT MODEL

The mean in both the ZIP and ZINB models is given by

\[
E(Y_i \mid X_i, A_i) = \mu_i - \mu_i p_i \tag{25}
\]

The variance of the ZIP model is

\[
\mathrm{Var}(Y_i \mid X_i, A_i) = \mu_i (1 - p_i)(1 + \mu_i p_i) \tag{26}
\]


Notice that if $p_i = 0$, the ZIP specification results in the standard Poisson, but otherwise, the variance exceeds the mean. Hence, the ZIP induces overdispersion, giving a variance-to-mean ratio of

\[
\frac{\mathrm{Var}(Y_i \mid X_i, A_i)}{E(Y_i \mid X_i, A_i)} = 1 + \left[ \frac{p_i}{1 - p_i} \right] E(Y_i) \tag{27}
\]
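Equations (25) to (27) can be checked numerically by summing the ZIP probabilities directly. The sketch below is illustrative only: the helper `zip_moments` and the parameter values are hypothetical, and the infinite sum is truncated at a count large enough for the remaining tail to be negligible for small means.

```python
import math

def zip_moments(p, mu, y_max=200):
    """Mean and variance of a ZIP variable by direct summation of the pmf."""
    mean = 0.0
    second = 0.0
    for y in range(y_max + 1):
        poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
        prob = (1.0 - p) * poisson + (p if y == 0 else 0.0)
        mean += y * prob
        second += y * y * prob
    return mean, second - mean ** 2

# Hypothetical values: p_i = 0.3 and mu_i = 2.
mean, var = zip_moments(0.3, 2.0)
ratio_formula = 1.0 + (0.3 / (1.0 - 0.3)) * mean   # right-hand side of (27)
```

With these values, equation (25) gives a mean of 1.4 and equation (26) a variance of 2.24, so the variance-to-mean ratio of 1.6 exceeds the Poisson value of one.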

This shows how the dual-state nature of the model generates overdispersion. Similarly, Zorn (1998) discussed how a dual-state data-generating regime would induce overdispersion in counts. This implies that there exists a problem in distinguishing the ZIP model from an underlying NB specification. Since the variance-to-mean ratio in the NB model is

\[
\frac{\mathrm{Var}(Y_i \mid X_i)}{E(Y_i \mid X_i)} = 1 + \alpha E(Y_i) \tag{28}
\]

the term $p_i / (1 - p_i)$ in equation (27) is analogous to the overdispersion parameter $\alpha$ in equation (28), where $p_i = 0$ corresponds to $\alpha = 0$. Notice that the extent of overdispersion increases with the probability of a zero-accident state. The variance of the ZINB model may be expressed as

\[
\mathrm{Var}(Y_i \mid X_i, A_i) = \mu_i (1 - p_i)\left[ 1 + \mu_i (\alpha + p_i) \right] \tag{29}
\]

and if pi = 0, the standard NB regression model will result, but if pi > 0, the dispersion will be greater than that of the standard NB model.

The foregoing shows that both the zero-inflated models and the NB model induce overdispersion. In the former, overdispersion arises from excess zero-accident observations whereas in the latter, overdispersion arises from unobserved heterogeneity or positive within-period temporal dependence (or possibly both). Therefore, it may not be a straightforward process to identify the sources of overdispersion. Moreover, the zero-inflated count models and the basic models (i.e., Poisson and NB) are not nested (Greene 1995). Hence, even though the t-test for $\alpha$ may be used to distinguish between the Poisson and NB models or between the ZIP and the ZINB models, there remains the problem of distinguishing the zero-inflated count models from the basic models.

A test based on the t-statistic for non-nested models was proposed by Vuong (1989) to determine the appropriateness of zero-inflated count data models. To define this test, he considered two functions: (1) $\widehat{\Pr}_1(Y_i \mid X_i)$, which is the predicted probability of observing $Y_i$ based on the zero-inflated count data model; and (2) $\widehat{\Pr}_2(Y_i \mid X_i)$, which is the predicted probability for the standard Poisson model or parent NB model.

Suppose that

\[
m_i = \ln\!\left[ \frac{\widehat{\Pr}_1(Y_i \mid X_i)}{\widehat{\Pr}_2(Y_i \mid X_i)} \right] \tag{30}
\]

whose mean and standard deviation are $\bar{m}$ and $S_m$ respectively; the Vuong statistic can then be derived as

\[
V = \frac{\bar{m}\,\sqrt{n}}{S_m} \tag{31}
\]

Vuong (1989) also showed that $V$ asymptotically follows a standard normal distribution, so that $V > 1.96$ distinctly favors the zero-inflated count model, $V < -1.96$ distinctly favors the basic (Poisson or NB) model, and otherwise neither model is preferred. The selection of the relevant model is therefore established by evaluating two criteria, one involving the t-statistic for $\alpha$ and another, the Vuong statistic $V$ for detecting the source of overdispersion, as shown in Table 1. The complementary condition for selecting the ZIP($\tau$) or the ZINB($\tau$) model is that the shape parameter $\tau$ is necessarily statistically significant. Lambert (1992) suggested that the difference derived from the log-likelihood values of the ZIP/ZINB and ZIP($\tau$)/ZINB($\tau$) models follows approximately a $\chi^2$ distribution. In selecting the appropriate model, this may be used to test the null hypothesis that the ZIP($\tau$)/ZINB($\tau$) model is correct. However, Long (1997) did not consider the $\tau$ version of the zero-inflated model as he found it difficult to imagine how the parameters of the binary process in a social science application can be a simple multiple of the parameters in the Poisson process.
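The computation in equations (30) and (31) can be sketched as follows, given the predicted probabilities of the observed counts under the two competing models. The function name `vuong_statistic` and the example probabilities are hypothetical. The text does not state which divisor is used for $S_m$, so the sample standard deviation with an $n - 1$ divisor is assumed here.

```python
import math

def vuong_statistic(prob_zero_inflated, prob_basic):
    """Equations (30)-(31): V = sqrt(n) * mean(m) / sd(m)."""
    m = [math.log(p1 / p2) for p1, p2 in zip(prob_zero_inflated, prob_basic)]
    n = len(m)
    m_bar = sum(m) / n
    # Assumption: S_m is the sample standard deviation (n - 1 divisor).
    s_m = math.sqrt(sum((v - m_bar) ** 2 for v in m) / (n - 1))
    return math.sqrt(n) * m_bar / s_m

# Hypothetical predicted probabilities for three observations.
p_zi = [0.5, 0.4, 0.3]
p_basic = [0.25, 0.2, 0.3]
v = vuong_statistic(p_zi, p_basic)
```

Swapping the two arguments flips the sign of the statistic, which matches the symmetric decision rule around the thresholds of 1.96 and -1.96.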

Table 1 about here


SIGNIFICANT VARIABLES AND MODEL GOODNESS-OF-FIT

Several procedures (Hocking and Leslie 1967; Mallows 1973; Sommer and Huggins 1994) have been developed to select the best subset of independent variables, but one that is becoming increasingly popular is the use of Akaike's Information Criterion (AIC) developed by Akaike (1973). This procedure identifies significant independent variables and determines the best model without the need to stipulate a level of significance. The details regarding the AIC and model goodness-of-fit can be found in Appendix A.

SPECIFICATION TEST FOR CORRELATION IN TIME-SERIES DATA

Regardless of whether the basic model or the zero-inflated model is adopted and whether the Poisson or NB specification is used, the accident counts are assumed to be independent. However, the gamma-distributed error term in the NB model may be correlated between observations. This violates the assumption of independence in the error term and will result in erroneous conclusions regarding the coefficient estimates. This is a serious concern for the case of repeated observations, and in our case, time-series data from each intersection. To address this problem, the likelihood ratio test proposed by Ben-Akiva and Lerman (1985) may be used to assess the amount of correlation in the data. Details of this test are given in Appendix B.


ILLUSTRATIVE EXAMPLE

To illustrate the process of establishing a suitable statistical model for accident analysis, accident frequencies at four-legged signalized intersections in the Southwestern part of Singapore were examined in relation to intersection geometric, traffic and regulatory characteristics. A total of 52 intersections were used in the study, and accident data from the years 1992 to 1999 were extracted from the National Road Accident Database. Two observations per intersection per year were obtained, one on the major road and the other on the minor road, making a total of 832 observations. To show how the different models can be used, three types of accidents were examined, i.e., all accidents, pedestrian accidents and motorcycle accidents. The data set examined accounted for 2879 cases of all accidents, of which about 4% were fatal, 15% involved serious injuries and the rest involved minor injuries. There were only 262 pedestrian accidents, of which 5% resulted in fatalities and 14% in serious injuries. Of the 498 cases of motorcycle accidents, 8% resulted in fatalities and 23% resulted in serious injuries.

A total of 32 explanatory variables representing geometric, traffic and regulatory controls at the intersection were explored in the model. These included total traffic volume, right-turn volume, approach road width, the presence of uncontrolled left-turn lane, exclusive right-turn lane and median width as well as signal control type, signal cycle time and the number of signal phases per cycle.


Based on the accident types, three separate frequency distributions were obtained, and the proposed count data models are evaluated and discussed in the following sections.

RESULTS

The results for each of the three accident frequency models are reported in Tables 2 to 4 respectively. Since accidents at a particular intersection are likely to be correlated from one year to the next, there is the possibility of unit-specific temporal dependence in the data. Therefore, to compute the test statistics, robust standard errors suggested by White (1980) were used instead of the conventional standard errors. This corrects the standard errors when the errors are not identically distributed, with no need to assume any specific form of heteroskedasticity. Nevertheless, the conventional standard errors are also reported along with the robust standard errors in Tables 2 to 4. The interpretation of the results is reported in the following sections.

ALL ACCIDENTS

Of the 832 observations only 48 observations involved no accidents. On the other hand, 75 observations involved more than 10 accidents per year (see Figure 1).

Figure 1 about here


Figure 1 shows that the accident distribution yielded a mean of 4.39 and a mode of 3 accidents per year. The data set exhibited overdispersion with a t-statistic of 8.71 (p-value
