arXiv:1511.00634v1 [stat.OT] 2 Nov 2015

A Simple and Adaptive Dispersion Regression Model for Count Data

Hadeel S. Kalktawi, Veronica Vinciotti and Keming Yu
Department of Mathematics, Brunel University London, Uxbridge, U.K., [email protected]


Summary. Regression for count data is widely performed with a small number of models, such as Poisson, negative binomial and zero-inflated regression models. A challenge often faced by practitioners is the selection of the right model to take into account dispersion and excessive zeros, both of which typically occur in count data sets. There is a strong demand for a unified model that can automatically adapt to the underlying dispersion and that can be easily implemented in practice. While hazard rate adaptation is important in modelling continuous lifetime data, dispersion adaptation is essential in count data modelling. In this paper, a discrete Weibull based regression model is shown to be able to adapt to different types of dispersion in a simple way. This model has the ability to capture equi-dispersion (variance equal to mean), over-dispersion (variance greater than mean), under-dispersion (variance less than mean), and even a combination of them. Additionally, this simple model is capable of modelling highly skewed count data with excessive zeros, without the need for introducing zero-inflated and hurdle components. Maximum likelihood can be used for efficient parameter estimation. Both simulated examples and real data analyses illustrate the model and its adaptation to modelling over-dispersion, under-dispersion and excessive zeros, in comparison to classical regression approaches for count data.

Keywords: Discrete Weibull; Count data; Dispersion; Generalised linear models

1. Introduction
Count data, which refers to the number of times an item or an event occurs within a fixed period of time, commonly arises in many fields. Examples include the number of heart attacks or the number of hospitalisation days in medical studies, the number of students absent during a period of time in education studies, or the number of times parents perpetrate domestic violence against their child in social science investigations. There is now a great deal of interest in the literature in investigating the relationship between a count response variable and other variables; for example, how the education level of parents can affect the incidence of domestic violence against their children. Methods to address these questions fall in the general area of regression analysis of count data (see Cameron and Trivedi (2013) and Hilbe (2014b) among others).

Classic regression models for count data belong to the family of generalised linear models (Nelder and Wedderburn, 1972), such as Poisson regression, which models the conditional mean of the count as a linear regression on a set of covariates through the log link function. Although Poisson regression is fundamental to the regression analysis of count data, it is often of limited use for real data, due to its equi-dispersion assumption. Real data usually exhibit over-dispersion or the opposite case of under-dispersion. Thus, accounting for over-dispersion and under-dispersion when modelling count data is essential, and failing to cope with these features of the data can lead to biased parameter estimates, and thus false conclusions and decisions.

The Negative Binomial (NB) regression relaxes the assumption of equi-dispersion and is often considered as the default choice for an over-dispersion model. However, NB regression may not be the best choice for power-law data with long tails, or for highly skewed data with an excessive number of zeros due to the rare occurrence of non-zero events. The latter often requires the application of zero-inflated and hurdle models. These models adopt a mixture approach, whereby a two-component mixture with a point mass at zero and a count distribution for non-zeros is considered. That is, two data generation processes are considered: one that generates zero counts only and one that generates the other positive counts. However, these models can be complicated, with additional parameters to estimate. Thus, if one is not interested in distinguishing two different processes for the data generation, it would be more appropriate to consider a simple model that can cope with skewed data, such as those with excessive zeros.

NB regression cannot deal with under-dispersion. There have been some attempts to extend Poisson regression based models to under-dispersion, such as the generalised Poisson regression model (Efron, 1986; Famoye, 1993), COM-Poisson regression (Sellers and Shmueli, 2010) or hyper-Poisson regression models (Sáez-Castillo and Conde-Sánchez, 2013). However, these models have been shown to be rather complex in practice.

This paper aims to introduce a simple count regression model, namely the discrete Weibull (DW) regression model, to capture different levels of dispersion adaptively. This is a challenge faced by existing count regression models. Moreover, we show how the DW can capture power-law behaviour, excessive zeros or high skewness of the underlying distributions without the need for an additional mixture component. The motivation behind considering the DW distribution (Nakagawa and Osaki, 1975) stems from the vital role played by the continuous Weibull distribution in survival analysis and failure time studies. The estimation and inference for parameters of a DW distribution have been investigated in a small number of studies. Khan et al. (1989) proposed the method of proportion, whereas Kulasekera (1994) suggested maximum likelihood estimators of the DW parameters based on type I censored samples. Examples of count data applications of the DW include Englehardt and Li (2011) and Englehardt et al. (2012), who showed that the counts of living microbes (pathogens) in water are highly skewed and can be efficiently modelled using a DW distribution.

Section 2 provides a review and description of the DW distribution and its properties. Section 3 illustrates the ability of a DW distribution to model a variety of dispersion levels. The DW regression model is introduced in Section 4 to investigate the relationship between a count response and a set of covariates. This model is applied to a number of real datasets in Section 5, showing the ability of the model to capture over- and under-dispersion, as well as excessive zeros and skewness.

2. Discrete Weibull distribution
If Y follows a (type 1) DW distribution (Nakagawa and Osaki, 1975), then the cumulative distribution function of Y is given by

$$F(y; q, \beta) = \begin{cases} 1 - q^{(y+1)^{\beta}} & \text{for } y = 0, 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

and its probability mass function by

$$f(y; q, \beta) = \begin{cases} q^{y^{\beta}} - q^{(y+1)^{\beta}} & \text{for } y = 0, 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where the parameters 0 < q < 1 and β > 0.
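Equations 1 and 2 translate directly into code. The following is a minimal sketch in plain Python; the function names `dw_cdf` and `dw_pmf` and the parameter values in the demo are illustrative assumptions, not part of the paper.

```python
def dw_cdf(y, q, beta):
    """Equation 1: F(y) = 1 - q^((y+1)^beta) for y = 0, 1, 2, ...; 0 otherwise."""
    if y < 0 or y != int(y):
        return 0.0
    return 1.0 - q ** ((y + 1) ** beta)


def dw_pmf(y, q, beta):
    """Equation 2: f(y) = q^(y^beta) - q^((y+1)^beta) for y = 0, 1, 2, ...; 0 otherwise."""
    if y < 0 or y != int(y):
        return 0.0
    return q ** (y ** beta) - q ** ((y + 1) ** beta)


if __name__ == "__main__":
    q, beta = 0.5, 1.5  # illustrative values only
    # The probabilities telescope, so the partial sums approach 1.
    print([round(dw_pmf(y, q, beta), 4) for y in range(5)])
    print(sum(dw_pmf(y, q, beta) for y in range(200)))
```

Because consecutive terms cancel, the sum of the probability mass function over y = 0, ..., N - 1 equals F(N - 1), which is a convenient sanity check on any implementation.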

The first two moments of DW are given by

$$E(Y) = \sum_{y=1}^{\infty} q^{y^{\beta}}, \qquad E(Y^2) = 2 \sum_{y=1}^{\infty} y\, q^{y^{\beta}} - E(Y). \qquad (3)$$
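Since the moments in Equation 3 are infinite series, they have to be evaluated numerically. A minimal sketch is to truncate both sums once the terms become negligible; the function name `dw_moments`, the tolerance and the cap on the number of terms are assumptions made for illustration, and this is not the approximation of Barbiero (2015) cited later in the paper.

```python
def dw_moments(q, beta, tol=1e-12, max_terms=1_000_000):
    """Return (mean, variance) of DW(q, beta) from truncated versions of Equation 3."""
    s1 = 0.0  # running sum of q^(y^beta)
    s2 = 0.0  # running sum of y * q^(y^beta)
    for y in range(1, max_terms + 1):
        term = q ** (y ** beta)
        s1 += term
        s2 += y * term
        if y * term < tol:  # remaining terms are treated as negligible
            break
    mean = s1
    second_moment = 2.0 * s2 - s1
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    # With beta = 1 the DW is geometric, so the result can be compared
    # with the closed forms q / (1 - q) and q / (1 - q)^2.
    print(dw_moments(0.5, 1.0))
```

For small β and q close to 1 the tail is heavy and the series converges slowly, so the truncation above should be treated with caution in that region.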

A nice property of the DW distribution is that its τth (0 < τ < 1) quantile, that is the smallest value of y for which F(y) ≥ τ, has a closed-form expression, given by

$$Q(\tau) = \left\lceil \left( \frac{\log(1 - \tau)}{\log(q)} \right)^{1/\beta} - 1 \right\rceil, \qquad (4)$$

for τ ≥ 1 − q.
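The closed-form quantile also gives a simple way to draw DW variates by inverse transform. The sketch below assumes the convention that the quantile is 0 whenever τ ≤ 1 − q, since F(0) = 1 − q already reaches τ in that case; the names `dw_quantile` and `dw_sample` are illustrative.

```python
import math
import random


def dw_quantile(tau, q, beta):
    """Smallest y with F(y) >= tau, following Equation 4."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    if tau <= 1.0 - q:
        return 0  # F(0) = 1 - q already reaches tau
    ratio = math.log(1.0 - tau) / math.log(q)  # greater than 1 in this branch
    return math.ceil(ratio ** (1.0 / beta) - 1.0)


def dw_sample(n, q, beta, seed=0):
    """Draw n DW variates by applying the quantile function to uniform draws."""
    rng = random.Random(seed)
    return [dw_quantile(rng.random(), q, beta) for _ in range(n)]


if __name__ == "__main__":
    print(dw_quantile(0.5, 0.7, 1.3))  # median for illustrative parameters
    print(dw_sample(10, 0.7, 1.3))
```

Exactly at a jump of F the ceiling can be affected by floating-point rounding, so a production implementation might verify the result against the distribution function.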

2.1. Parameter estimation

For a sample y_1, y_2, . . . , y_n from Equation 2, the log-likelihood can be written as:

$$\ell = \sum_{i=1}^{n} \log\left( q^{y_i^{\beta}} - q^{(y_i+1)^{\beta}} \right),$$

from which the MLEs of q and β can be easily obtained by directly maximising this log-likelihood using any non-linear optimisation tool.
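As an illustration of direct maximisation, the sketch below evaluates the log-likelihood and maximises it with a basic compass (pattern) search over the unconstrained parameters (logit q, log β). The compass search is a stand-in chosen to keep the example dependency-free; it is not the optimiser used by the authors, and the names `dw_loglik` and `dw_mle`, the starting point and the toy data are assumptions.

```python
import math


def dw_loglik(q, beta, ys):
    """Log-likelihood of Section 2.1 for counts ys under DW(q, beta)."""
    total = 0.0
    for y in ys:
        p = q ** (y ** beta) - q ** ((y + 1) ** beta)
        if p <= 0.0:  # probability underflowed to zero
            return -math.inf
        total += math.log(p)
    return total


def dw_mle(ys, step=1.0, tol=1e-6):
    """Maximise the log-likelihood by compass search over (logit q, log beta)."""

    def objective(theta):
        try:
            q = 1.0 / (1.0 + math.exp(-theta[0]))
            beta = math.exp(theta[1])
            return dw_loglik(q, beta, ys)
        except OverflowError:  # extreme candidates are simply rejected
            return -math.inf

    theta = [0.0, 0.0]  # start at q = 0.5, beta = 1 (an arbitrary choice)
    best = objective(theta)
    while step > tol:
        improved = False
        for j in range(2):
            for delta in (step, -step):
                cand = list(theta)
                cand[j] += delta
                value = objective(cand)
                if value > best:
                    theta, best, improved = cand, value, True
        if not improved:
            step /= 2.0  # shrink the pattern once no neighbour is better
    q_hat = 1.0 / (1.0 + math.exp(-theta[0]))
    return q_hat, math.exp(theta[1]), best


if __name__ == "__main__":
    # Toy counts with roughly geometric frequencies (not data from the paper).
    ys = [0] * 64 + [1] * 32 + [2] * 16 + [3] * 8 + [4] * 4 + [5] * 2 + [6, 7]
    print(dw_mle(ys))
```

Working on the logit and log scales keeps q in (0, 1) and β positive without explicit constraints. In practice a general-purpose optimiser would normally replace the compass search.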

2.2. Properties and special cases of a discrete Weibull distribution

The DW may generalise different models as follows:
• It can be seen from Equation 2 that P(0) = 1 − q. Thus, when q is small, an excessive zero case occurs.
• The discrete Rayleigh distribution in Roy (2004) is a special case of a DW with β = 2 and q = θ.
• The geometric distribution is a special case of a DW, with β = 1. Moreover, for the geometric distribution, the variance is always greater than its mean. Therefore, a DW with β = 1 is a case of over-dispersion, regardless of the value of q. In particular, when β = 1 and q = e^{−λ}, the distribution is the discrete exponential distribution introduced by Sato et al. (1999).
• β can be considered as controlling the range of values of the variable. As β → ∞, the DW approaches a Bernoulli distribution with probability q.
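Two of these special cases follow in one line each from Equation 2. The short derivation below spells them out, using only the standard mean and variance of the geometric distribution on {0, 1, 2, ...}.

```latex
% beta = 1: geometric distribution and its dispersion
f(y; q, 1) = q^{y} - q^{y+1} = q^{y}(1 - q), \qquad
E(Y) = \frac{q}{1-q}, \quad
\operatorname{Var}(Y) = \frac{q}{(1-q)^{2}}, \quad
\frac{\operatorname{Var}(Y)}{E(Y)} = \frac{1}{1-q} > 1 .

% beta -> infinity: Bernoulli limit
P(Y = 0) = 1 - q, \qquad
P(Y = 1) = q - q^{2^{\beta}} \longrightarrow q, \qquad
P(Y \ge 2) = q^{2^{\beta}} \longrightarrow 0
\quad \text{as } \beta \to \infty .
```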


3. Dispersion level of discrete Weibull distribution
The dispersion of a distribution with mean µ and variance σ² is defined as the ratio of the variance to the mean

$$\mathrm{Disp} = \frac{\sigma^2}{\mu},$$

leading to the three cases of:
• under-dispersion, whereby the variance is less than the mean (Disp < 1),
• over-dispersion, whereby the variance is greater than the mean (Disp > 1),
• equi-dispersion, whereby the variance is equal to the mean (Disp = 1).

Thus, the ability of distributions to capture different levels of dispersion depends on the relationship of their means and variances. For instance, in a Poisson distribution, the mean and variance are identical, hence the dispersion is always equal to 1. In contrast to this, the variance of a NB distribution can be written as:

$$\sigma^2 = \mu + \frac{1}{k}\mu^2, \qquad (5)$$

with k > 0. That is, the variance of a NB is always greater than its mean, or the dispersion is always greater than 1, so the NB distribution is appropriate to capture over-dispersion, but not under-dispersion.

We will now demonstrate that the index of dispersion for the DW distribution can include all three types of dispersion. In order to do this, a simulation study is conducted with a sequence of values for β and q. For each combination of parameters, the distribution means and variances are obtained. These can be calculated empirically using the sample mean and sample variance or numerically using the approximated moments of the DW (Barbiero, 2015), with both methods returning the same results. Figure 1 displays the relationship between mean and variance for two different values of the parameter q and a sequence of values of the parameter β. Clearly, the mean-variance relationship depends on both parameters β and q. When q is small (q = 0.1, left plot), the mean is greater than the variance if β is greater than approximately 1.2. On the other hand, if q is large (q = 0.9, right plot), the mean is greater than the variance when β is greater than approximately 2. Furthermore, Figure 2 examines the empirical ratio σ²/µ for a range of values of the parameters β and q. It is evident that this ratio can be less than or greater than 1. As such, the DW distribution can capture both over- and under-dispersion, unlike the Poisson and NB distributions, which are only appropriate for equi-dispersion and over-dispersion, respectively. Generally, although there is no closed form for σ² and µ for the DW, these careful numerical analyses have shown that:
• if 0 < β ≤ 1, σ² > µ,
• if β ≥ 2, σ² < µ,
• for 1 < β < 2, σ² can be greater or less than µ, depending on q, as described in Table 1.
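One way such a numerical study could be set up is sketched below: the dispersion index is computed from truncated versions of the moment series in Equation 3 and each (q, β) pair is labelled "O" or "U". The function names, the truncation rule and the grid are assumptions made for illustration; the sketch is not the authors' code and makes no claim to reproduce their figures exactly.

```python
def dw_dispersion(q, beta, tol=1e-12, max_terms=1_000_000):
    """Variance-to-mean ratio of DW(q, beta) from truncated moment series."""
    s1 = 0.0  # sum of q^(y^beta), i.e. the mean
    s2 = 0.0  # sum of y * q^(y^beta)
    for y in range(1, max_terms + 1):
        term = q ** (y ** beta)
        s1 += term
        s2 += y * term
        if y * term < tol:
            break
    variance = 2.0 * s2 - s1 - s1 ** 2
    return variance / s1


def classify(q, beta):
    """Label a parameter pair as over-dispersed ("O") or under-dispersed ("U")."""
    return "O" if dw_dispersion(q, beta) > 1.0 else "U"


if __name__ == "__main__":
    # Rows are q = 0.1, ..., 0.9 and columns are beta = 1.2, 1.4, 1.6, 1.8.
    for k in range(1, 10):
        q = k / 10
        print(q, " ".join(classify(q, beta) for beta in (1.2, 1.4, 1.6, 1.8)))
```

For β = 1 the function can be checked against the geometric value 1/(1 − q), which is always above 1.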


Fig. 1: Relationship between mean and variance of a DW distribution for two values of q (panels: q = 0.1 and q = 0.9; horizontal axis: β; curves: mean and variance).

Fig. 2: Dispersion values for a DW distribution (horizontal axis: β; vertical axis: q).

Table 1: The dispersion level of DW(q, β), for 1 < β < 2, where "U" denotes under-dispersion and "O" over-dispersion.

q \ β   1.2   1.4   1.6   1.8
0.1      U     U     U     U
0.2      O     U     U     U
0.3      O     U     U     U
0.4      O     U     U     U
0.5      O     O     U     U
0.6      O     O     U     U
0.7      O     O     O     U
0.8      O     O     O     U
0.9      O     O     O     O

Fig. 3: Effect of q on the probability mass function of the DW for different βs (panels: β = 0.3 and β = 3; curves: q = 0.1, q = 0.5 and q = 0.9; horizontal axis: Y).

4. Discrete Weibull regression model
As mentioned earlier, although fitting a distribution to a single count variable is important, it is of more interest to analyse the relationship between a count response variable and a set of covariates. Hence, this section introduces a regression model for count data based on the DW distribution.

4.1. Model formulation

While Weibull regression is well known in survival analysis and lifetime modelling, here we introduce the regression model for count data based on the DW distribution as a discrete version of Weibull regression. Recalling that the distribution function of a continuous Weibull distribution is given by

$$F(y; \lambda, \beta) = 1 - e^{-\lambda y^{\beta}}, \qquad y \ge 0,$$

with scale parameter λ, one can see that the parameter q of a DW distribution is equivalent to e^{−λ} in the continuous case. Since Weibull regression imposes a link between the parameter λ and the predictors (Da Silva et al., 2008; Lee and Wang, 2003), the DW regression can be introduced via the parameter q. Moreover, the parameter q affects the shape of the probability mass function of the DW, as shown in Figure 3. Therefore, in order to introduce a DW regression model, we assume that, for i = 1, 2, . . . , n, the response Y_i has a DW conditional distribution given by f(y_i; q(x_i), β | x_i), where q(x_i) is the DW parameter related to the explanatory variables x_i through a link function. Similarly to the continuous Weibull regression, where log(λ_i) = x_i′α, and noting that q of DW is equivalent to e^{−λ}, the DW regression model can be introduced analogously as follows:

$$\log(-\log(q_i)) = \mathbf{x}_i'\boldsymbol{\alpha}, \qquad \mathbf{x}_i'\boldsymbol{\alpha} = \alpha_0 + x_{i1}\alpha_1 + \ldots + x_{iP}\alpha_P. \qquad (6)$$

This transforms q from the probability scale (i.e. the interval [0, 1]) to the interval [−∞, +∞] and ensures that the parameter q remains an element of [0, 1], as seen from Equation 6, from which q_i can be expressed as:

$$q_i = e^{-e^{\mathbf{x}_i'\boldsymbol{\alpha}}}. \qquad (7)$$

Then, the probability mass function of the response variable Y is given by:

$$f(y_i \mid \mathbf{x}_i) = \left( e^{-e^{\mathbf{x}_i'\boldsymbol{\alpha}}} \right)^{y_i^{\beta}} - \left( e^{-e^{\mathbf{x}_i'\boldsymbol{\alpha}}} \right)^{(y_i+1)^{\beta}}. \qquad (8)$$

Alternatively, one can choose different transformations for q to link Y with a set of covariates, for example, logit or probit.
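Equations 6 to 8 can be sketched in a few lines. The helper names `dw_q`, `dw_reg_pmf` and `dw_reg_loglik`, and the convention that `alpha[0]` is the intercept, are assumptions for illustration; the coefficient values in the demo are borrowed from the simulation study of Section 4.2 purely as an example.

```python
import math


def dw_q(x, alpha):
    """Equation 7: q_i = exp(-exp(x_i' alpha)); alpha[0] is the intercept."""
    eta = alpha[0] + sum(a * xj for a, xj in zip(alpha[1:], x))
    return math.exp(-math.exp(eta))


def dw_reg_pmf(y, x, alpha, beta):
    """Equation 8: conditional probability of count y given covariates x."""
    q = dw_q(x, alpha)
    return q ** (y ** beta) - q ** ((y + 1) ** beta)


def dw_reg_loglik(alpha, beta, xs, ys):
    """Log-likelihood of the DW regression model for paired data (xs, ys)."""
    return sum(math.log(dw_reg_pmf(y, x, alpha, beta)) for x, y in zip(xs, ys))


if __name__ == "__main__":
    alpha, beta = (2.2, -0.5), 1.7  # example values only
    for x in (0.0, 4.0, 10.0):
        q = dw_q((x,), alpha)
        print(x, round(q, 4), round(1.0 - q, 4))  # q and P(Y = 0 | x)
```

Because q_i decreases as x_i′α increases, a positive coefficient in this parameterisation raises the probability of a zero, P(Y = 0 | x) = 1 − q_i, and shifts the counts downwards. This is worth keeping in mind when comparing the signs of DW coefficients with those of log-link models such as Poisson or NB regression.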

Finally, in order to obtain the MLEs for the unknown parameters α and β, the log-likelihood of Equation 8 is maximised numerically using standard optimisation tools. After a DW regression model has been estimated, the following can be obtained:
• The fitted values for the central trend of the conditional distribution can be obtained by one of two methods:
  – mean: Equation 3, as mentioned earlier, can be calculated numerically using the approximated moments of the DW (Barbiero, 2015).
  – median: the quantile formula provided in Equation 4 can be applied. Due to the skewness, which is common for count data, the median is more appropriate than the mean. The fitted values via the median can be obtained easily from the closed-form expression of the quantiles of the DW, as

$$\left\lceil \left( -\frac{\log(2)}{\log(\hat{q}(x))} \right)^{1/\hat{\beta}} - 1 \right\rceil.$$

• The conditional quantile for any τ can be obtained from Equation 4.
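The fitted median and other conditional quantiles can be computed by plugging q(x) from Equation 7 into Equation 4. The sketch below assumes the convention that the quantile is 0 when τ ≤ 1 − q(x); the name `dw_reg_quantile` and the coefficient values in the demo (taken from the simulation settings of Section 4.2 as an example) are illustrative.

```python
import math


def dw_reg_quantile(tau, x, alpha, beta):
    """Conditional tau-quantile of Y given x under the DW regression model."""
    eta = alpha[0] + sum(a * xj for a, xj in zip(alpha[1:], x))
    minus_log_q = math.exp(eta)  # -log q(x), from Equation 7
    q = math.exp(-minus_log_q)
    if tau <= 1.0 - q:
        return 0  # P(Y = 0 | x) = 1 - q(x) already reaches tau
    ratio = -math.log(1.0 - tau) / minus_log_q
    return math.ceil(ratio ** (1.0 / beta) - 1.0)


def dw_reg_median(x, alpha, beta):
    """Fitted value via the conditional median (tau = 0.5)."""
    return dw_reg_quantile(0.5, x, alpha, beta)


if __name__ == "__main__":
    alpha, beta = (2.2, -0.5), 1.7  # example values only
    for x in (0.0, 5.0, 10.0):
        print(x, dw_reg_median((x,), alpha, beta))
```

Using −log q(x) = exp(x′α) directly avoids taking the logarithm of a value of q that may be very close to 0 or 1.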

4.2. Mixed-level dispersion

Regression for count data often displays a mixed level of dispersion over the range of covariates. In other words, the conditional distribution of a response variable can be over-dispersed for some values and ranges of a covariate and under-dispersed for other values and ranges of the covariates. So far, regression models for count data that can capture both over- and under-dispersion simultaneously are in the form of extended versions of Poisson regression, such as quasi-Poisson, COM-Poisson or hyper-Poisson (Sáez-Castillo and Conde-Sánchez, 2013), where the dispersion parameter can be assumed to be linked to covariates. However, in practice, most implementations fix the dispersion parameter and assume that only the mean is linked to the covariates, for reasons of better interpretation.

In this subsection, we conduct a simulation to demonstrate that the DW regression model can also capture these mixed levels of dispersion. We consider a random sample of size n = 150 from a uniform distribution on (0, 10) for the covariate X. The true value of the regression parameter is assumed to be α = (α_0, α_1) = (2.2, −0.5). In addition, the parameter β of the DW is assumed to be β = 1.7. This simulation is carried out for 1000 iterations. Subsequently, 1000 values of the MLEs and their 95% asymptotic confidence interval (CI) bounds are computed. Table 2 reports the average estimates of the regression parameters α_0 and α_1, as well as the distribution parameter β, together with the average bias and the mean squared error (MSE) over the 1000 iterations. Figure 4 shows that the ratios of the conditional variances to the conditional means, that is the conditional dispersion, can take values of 1, less than 1 and greater than 1 (equi-dispersion, under-dispersion and over-dispersion, respectively) over the range of values of x.

Fig. 4: Conditional dispersion in the simulation study (horizontal axis: simulated covariate; vertical axis: conditional dispersion).
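The data-generating side of this simulation can be sketched as follows, using the stated settings α = (2.2, −0.5), β = 1.7 and X uniform on (0, 10). The sketch draws one data set by inverse transform and evaluates the conditional dispersion along x from truncated moment series; the function names, the seed and the truncation rule are assumptions, and the estimation step over 1000 replications is not reproduced here.

```python
import math
import random

ALPHA = (2.2, -0.5)  # regression parameters stated in the simulation study
BETA = 1.7           # DW shape parameter stated in the simulation study


def q_of_x(x):
    """DW parameter q at covariate value x (Equation 7)."""
    return math.exp(-math.exp(ALPHA[0] + ALPHA[1] * x))


def conditional_dispersion(x, tol=1e-12):
    """Variance-to-mean ratio of Y given x, from truncated moment series."""
    q = q_of_x(x)
    s1 = s2 = 0.0
    y = 1
    while True:
        term = q ** (y ** BETA)
        s1 += term
        s2 += y * term
        if y * term < tol:
            break
        y += 1
    return (2.0 * s2 - s1 - s1 ** 2) / s1


def simulate(n=150, seed=1):
    """Draw (x, y) pairs with x ~ U(0, 10) and y from the DW regression model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        u = rng.random()
        q = q_of_x(x)
        if u <= 1.0 - q:
            y = 0
        else:
            ratio = math.log(1.0 - u) / math.log(q)
            y = math.ceil(ratio ** (1.0 / BETA) - 1.0)
        data.append((x, y))
    return data


if __name__ == "__main__":
    for x in (0.0, 2.0, 4.0, 6.0, 8.0, 10.0):
        print(x, round(conditional_dispersion(x), 3))
    print(simulate()[:5])
```

With these settings the conditional dispersion lies below 1 for moderate x and above 1 near the upper end of the covariate range, which is the mixed pattern the subsection describes.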

Table 2: Simulation study: DW parameter estimates.

        MLE       Bias      MSE      CI Length
α0      2.2850    0.0850    0.1168   1.2076
α1     -0.5178   -0.0178    0.0044   0.2318
β       1.7526    0.0526    0.0350   0.6470

5. Application to real data sets
To demonstrate the ability of the DW regression model to handle over- and under-dispersion automatically, in this section DW regression is applied to different data sets that show various types of dispersion. The first subsection includes two under-dispersed datasets, while the second includes two over-dispersed cases. The third subsection focuses on zero-inflated datasets. Finally, an illustrative example for the mixed level of dispersion is provided. Various popular count data regression models, namely Poisson regression (R function glm), NB regression (R function glm.nb), COM-Poisson regression (COMPoissonReg R package (Sellers and Lotze, 2011)), zero-inflated and hurdle models (pscl R package (Zeileis et al., 2008)), are applied and compared with DW regression by means of classical AIC and BIC criteria (Dayton, 2003).

Table 3: Maximum likelihood estimates, standard errors (in parentheses) and AICc from different regression models fitted to airfreight breakage data.

Model         intercept            number of box transfers   other                      AICc
Poisson       2.3529 (0.1317)      0.2638 (0.0792)           -                          52.1088
NB            2.3529 (0.1317)      0.2638 (0.0792)           k̂ = 1228880 (67721603)     56.3947
COM-Poisson   13.8247 (6.2369)     1.4838 (0.6888)           ν̂ = 5.7818 (2.5967)        47.2898
DW            -27.0839 (0.0074)    -2.5937 (0.0074)          β̂ = 10.991 (0.1203)        47.5964
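The information criteria used for the comparisons in this section are simple functions of the maximised log-likelihood, the number of estimated parameters k and the sample size n. The sketch below uses the standard definitions of AIC, BIC and the small-sample corrected AICc; the function names are illustrative.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k * log(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2.0 * loglik


def aicc(loglik, k, n):
    """AIC with the usual small-sample correction (requires n > k + 1)."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)


if __name__ == "__main__":
    loglik = -100.0  # placeholder value, not a result from the paper
    print(aic(loglik, 3), bic(loglik, 3, 108), aicc(loglik, 3, 108))
```

The gap BIC − AIC = k(log n − 2) does not depend on the likelihood, which gives a quick consistency check on reported tables. For example, for a three-parameter model fitted to the 108 monthly strikes observations it equals 3(log 108 − 2) ≈ 8.046, matching the difference between the BIC and AIC reported for the DW model in Table 5.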

5.1. The case of under-dispersion

5.1.1. Airfreight breakage data
The data are taken from Kutner et al. (2005) and concern 10 air shipments of biological and medical substances, each of which carries 1000 ampoules in boxes. The data report the number of times the box was transferred from one aircraft to another, which is taken as the predictor X, and the number of ampoules found broken upon arrival, which is taken as the response Y. The data were previously analysed using a COM-Poisson model (Sellers and Shmueli, 2010) and found to be under-dispersed. Table 3 shows the MLEs of DW regression and three other regression models, together with their standard errors (in parentheses) and with the AICc, which is better suited to these data due to the very small sample size. The table shows that COM-Poisson and DW regression give a similar fit to the data and provide a better fit than the other models. In addition, Figure 5 shows that the conditional dispersion takes values uniformly less than one, which indicates under-dispersion.

Fig. 5: Conditional dispersion based on DW regression model for airfreight breakage data (horizontal axis: number of box transfers; vertical axis: conditional dispersion).

5.1.2. Inhaler use data
These data are taken from Grunwald et al. (2011). They consist of 5209 observations and report the daily count of (albuterol) asthma inhaler use for 48 children aged between 6 and 13 during the school day, for a period of time at the Kunsberg School at National Jewish Health in Denver, Colorado. The main objective of this analysis is to investigate the relationship between inhaler use (representing asthma severity) and air pollution, which is recorded by four covariates: the percentage of humidity, the barometric pressure (in mmHg/1000), the average daily temperature (in degrees Fahrenheit/100) and the morning levels of PM2.5, which are small air particles less than 2.5 µm in diameter. The response variable, which is the inhaler use count, has a sample mean of 1.2705 and variance of 0.8433, thus pointing to under-dispersion. The results in Table 4 suggest that DW and COM-Poisson regression provide a better fit, according to both AIC and BIC, than both the Poisson and NB models, with COM-Poisson being marginally superior to DW regression. In addition, Figure 6 indicates under-dispersion for all covariates and across the full range. Here, for each covariate, we compute the conditional dispersion from the DW regression model for a range of values of that covariate and with all other covariates set to their average value.

Table 4: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to the inhaler use data.

Model         intercept          humidity           pressure           temperature        particles          other                    AIC        BIC
Poisson       -2.2132 (1.7115)   -0.1125 (0.0840)   4.0950 (2.7230)    -0.2035 (0.1293)   0.0225 (0.0129)    -                        13915.47   13948.26
NB            -2.2132 (1.7115)   -0.1125 (0.0840)   4.0950 (2.7230)    -0.2035 (0.1293)   0.0225 (0.0130)    k̂ = 31905.28 (65589)     13917.54   13956.89
COM-Poisson   -2.9301 (2.1203)   -0.1727 (0.1041)   6.2838 (3.3746)    -0.3127 (0.1604)   0.0347 (0.0161)    ν̂ = 1.9205 (0.0468)      13450.77   13490.12
DW            1.8112 (1.9731)    0.2233 (0.1013)    -5.6120 (3.1357)   0.3691 (0.1563)    -0.0289 (0.0156)   β̂ = 2.1277 (0.0259)      13484.36   13523.71

Fig. 6: Conditional dispersion based on DW regression model for inhaler use data (panels: percentage of humidity, barometric pressure, average daily temperature, morning levels of small particles; vertical axis: conditional dispersion).

Fig. 7: Conditional dispersion based on the DW regression model for the strikes data (horizontal axis: economic activity; vertical axis: conditional dispersion).

5.2. The case of over-dispersion

5.2.1. Strikes data
This dataset is available in the Ecdat R package (Croissant, 2015), under the name of StrikeNb. The response variable is the number of contract strikes in U.S. manufacturing observed monthly from January 1968 to December 1976. The predictor is the level of economic activity, which is measured as the cyclical departure of aggregate production from its trend level. The response variable has a sample mean of 5.2407 and variance of 14.0723, suggesting over-dispersion. Indeed, the best fitting DW distribution to the response data gives a dispersion of 2.6735, larger than 1. A comparison of the Poisson and NB distributions using a likelihood ratio test (lmtest R package (Zeileis and Hothorn, 2002)) also shows evidence of over-dispersion, with a chi-square test statistic of 73.024 and a p-value of < 0.001. After fitting three regression models and comparing them via AIC and BIC, Table 5 shows that the DW model is only marginally superior to NB, but both DW and NB give a much better fit to the data than the Poisson regression model. In addition, Figure 7 shows that the conditional dispersion from the DW regression model has values greater than 1 across the whole range of the predictor, thus indicating over-dispersion.

Table 5: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to the strikes data.

Model     intercept          economic activity   other                   AIC        BIC
Poisson   1.6539 (0.0422)    3.1342 (0.8032)     -                       627.9689   633.3332
NB        1.6538 (0.0686)    3.2250 (1.2841)     k̂ = 3.1849 (0.7390)     566.5969   574.6433
DW        -3.0706 (0.2910)   -5.2956 (1.9096)    β̂ = 1.6527 (0.1302)     564.157    572.2034

5.2.2. Doctor visits from German health survey data
The data are available from the COUNT R package (Hilbe, 2014a) under the name of badhealth. The data come from the German health survey and contain 1127 observations for the number of visits to certain doctors during 1998. In addition, the data include two other variables: an indicator variable representing patients claiming to be in bad health (1) or not (0), and the age of the patient. The response variable (number of visits) ranges from 0 to 40 visits to doctors throughout the year 1998 and has a sample mean of 2.3532 and variance of 11.9818, suggesting over-dispersion. The best-fitting DW distribution returns an (unconditional) dispersion of 4.0653, much larger than 1. Furthermore, the likelihood ratio test between Poisson and NB distributions returns a chi-square test statistic of 1593.3 and a p-value of < 0.001, suggesting once again the need for the dispersion parameter. As before, Table 6 shows very close results for the NB and DW regression models, both of which are preferable to the Poisson regression model, with a slight difference in favour of the DW regression model. Figure 8, showing the conditional dispersion based on the DW regression model, indicates consistent over-dispersion for all covariates and across the whole range. Here, for each covariate, we compute the conditional dispersion from the DW regression model for a range of values of that covariate and with the other covariate set to its average value in the case of age and to 0 in the case of bad health.

Fig. 8: Conditional dispersion based on the DW regression model for the doctor visits data from the German health survey (panels: bad health, age; vertical axis: conditional dispersion).

Table 6: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to the doctor visits data from the German health survey.

Model     intercept          bad health         age                other                   AIC        BIC
Poisson   0.4470 (0.0714)    1.1083 (0.0462)    0.0058 (0.0018)    -                       5638.552   5653.634
NB        0.4041 (0.1308)    1.1073 (0.1116)    0.0070 (0.0034)    k̂ = 0.9975 (0.0693)     4475.285   4495.394
DW        -0.6505 (0.1088)   -0.9656 (0.1036)   -0.0057 (0.0028)   β̂ = 0.9887 (0.0265)     4474.973   4495.083

5.3. The case of excessive zeros

The following two datasets illustrate the case of excessive zero counts. Thus, besides the Poisson, NB, and DW regression, we will also include zero-inflated and hurdle models in the comparison. For these, we consider the logit link function for the binomial distribution representing the probability of the extra zeros (R package pscl (Zeileis et al., 2008)).

5.3.1. Doctor visits from United States
These data are available from the Ecdat R package, under the name Doctor. The data consist of 485 observations from the United States in the year 1986, and contain four variables for each patient: the number of doctor visits, which is taken as the response, the number of children in the household, a measure of access to healthcare and a measure of health status (larger positive numbers are associated with poorer health). The response variable in this study, the number of doctor visits, has approximately 50% of zeros, and thus the data can be considered as having excessive zeros. This is, however, a special case of over-dispersion. Indeed, the response variable has a mean of 1.6103 and variance of 11.2011, the DW distribution returns a dispersion of 4.9397, larger than 1, and the likelihood ratio test between NB and Poisson returns a test statistic of 804.59 and a p-value of < 0.001. Table 7 shows that the DW regression model provides the best fit, according to both AIC and BIC. Figure 9 indicates consistent over-dispersion for all covariates. We attribute the superiority of DW over the zero-inflated and hurdle models to the fact that the parameter q of the DW model is directly linked to the probability of a zero, and thus the model can achieve a performance similar to that of zero-inflated and hurdle models without the need for additional parameters.

5.3.2. Doctor visits from Germany prior to health reform data
This dataset is available in the R package COUNT under the name rwm. The data come from the German health registry for the years 1984-1988 and contain health information for the years prior to the health reform. The data contain 27326 observations on the following four variables: number of visits to a doctor during a year (which ranges from 0 to 121), age (which ranges from 25 to 64), years of formal education (spanning from 7 to 18) and household yearly income (in DM/1000). The response variable, number of visits, has about 37% of zeros, a sample mean of 3.1835 and a variance of 32.3726 (over-dispersion). The dispersion from the DW distribution is 8.1987 > 1 and the LRT p-value between the NB and Poisson distributions is < 0.001, with a chi-squared statistic of 95653. Also in this case, Table 8 shows the best fit for the DW regression model. Moreover, Figure 10 shows consistent over-dispersion for all covariates.

Fig. 9: Conditional dispersion based on the DW regression model for the doctor visits (US) data (panels: number of children, access to healthcare, health status; vertical axis: conditional dispersion).

Table 7: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to the doctor visits (US) data. For the zero-inflated models, AIC and BIC are shown on the count model row; for the hurdle models, they are shown on the row of the corresponding count model.

Model                    intercept          children           access             health             other                    AIC        BIC
Poisson                  0.3751 (0.1102)    -0.1759 (0.0316)   0.9369 (0.1928)    0.2898 (0.0183)    -                        2179.487   2196.223
NB                       0.5607 (0.2118)    -0.1706 (0.0582)   0.4197 (0.3915)    0.3154 (0.0481)    k̂ = 0.5525 (0.0613)      1581.88    1602.801
logit-ZIP
  count model            0.9801 (0.1249)    -0.1498 (0.0362)   0.8053 (0.2121)    0.1736 (0.02054)   -                        1885.813   1919.287
  zero model             -0.3739 (0.3167)   0.0843 (0.0867)    -0.1048 (0.5754)   -0.4147 (0.0853)
logit-ZINB
  count model            0.5707 (0.2281)    -0.1414 (0.0673)   0.6491 (0.4018)    0.2239 (0.0535)    k̂ = 0.6869 (0.15129)     1578.5     1616.158
  zero model             -4.3372 (1.8604)   0.2465 (0.2764)    1.2085 (1.8727)    -2.0676 (1.2429)   -
hurdle models
  binomial-logit model   0.2060 (0.2752)    -0.1462 (0.0732)   0.4252 (0.5133)    0.4524 (0.0784)    -
  HP count model         0.9777 (0.1244)    -0.1506 (0.0364)   0.8143 (0.2114)    0.1733 (0.02049)   -                        1885.808   1919.281
  HNB count model        0.0989 (0.5070)    -0.1664 (0.0869)   0.5404 (0.5285)    0.2157 (0.0687)    k̂ = 0.2596 (0.6164)      1576.302   1613.959
DW                       -0.4913 (0.1436)   0.1024 (0.0362)    -0.2662 (0.2411)   -0.2157 (0.0344)   β̂ = 0.7823 (0.0370)      1575.796   1596.717

Fig. 10: Conditional dispersion based on the DW regression model for doctor visits (Germany) prior to health reform data (panels: age, education, income; vertical axis: conditional dispersion).

Table 8: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to doctor visits (Germany) prior to health reform data. For the zero-inflated models, AIC and BIC are shown on the count model row; for the hurdle models, they are shown on the row of the corresponding count model.

Model                    intercept          age                educ               income             other                    AIC        BIC
Poisson                  0.8523 (0.0255)    0.0213 (0.0003)    -0.0421 (0.0017)   -0.0532 (0.0022)   -                        209636.2   209669
NB                       0.9133 (0.0636)    0.0204 (0.0008)    -0.0460 (0.0042)   -0.0477 (0.0054)   k̂ = 0.5164 (0.0060)      120654     120695.1
logit-ZIP
  count model            1.5241 (0.0262)    0.0121 (0.0003)    -0.0269 (0.0018)   -0.0491 (0.0023)   -                        168860.6   168926.3
  zero model             -0.0387 (0.0876)   -0.0239 (0.0012)   0.0422 (0.0058)    0.0093 (0.0077)
logit-ZINB
  count model            1.0509 (0.0754)    0.0179 (0.0009)    -0.0410 (0.0050)   -0.0547 (0.0054)   k̂ = 0.5756 (0.0219)      120612.6   120686.5
  zero model             -1.1827 (0.6564)   -0.0514 (0.0070)   0.0797 (0.0410)    -0.1450 (0.0650)   -
hurdle models
  binomial-logit model   0.0358 (0.0856)    0.0247 (0.0012)    -0.0452 (0.0056)   -0.0164 (0.0074)   -
  HP count model         1.5216 (0.0262)    0.01217 (0.0003)   -0.0268 (0.0018)   -0.0494 (0.0023)   -                        168853.6   168919.4
  HNB count model        1.04937 (0.0779)   0.0158 (0.0009)    -0.0383 (0.0051)   -0.0586 (0.0061)   k̂ = 0.4914 (0.0343)      120520.2   120594.1
DW                       -0.5852 (0.0425)   -0.0141 (0.0005)   0.0287 (0.0028)    0.0275 (0.0034)    β̂ = 0.7359 (0.0042)      120335.4   120376.5

[Figure 11: three panels plotting the conditional dispersion against Bid price, Size and Regulator.]

Fig. 11: Conditional dispersion based on the DW regression model for the bids data.

Table 9: Maximum likelihood estimates, standard errors (in parentheses), AIC and BIC from different regression models fitted to bids data.

| Model | intercept | price | size | regulator | other | AIC | BIC |
|---|---|---|---|---|---|---|---|
| Poisson | 1.5318 (0.5043) | -0.7849 (0.3775) | 0.0362 (0.0175) | 0.0547 (0.1567) | - | 402.2602 | 413.6054 |
| NB | 1.5276 (0.5174) | -0.7824 (0.3870) | 0.0369 (0.0183) | 0.0544 (0.1610) | k̂=33.3289 (63.3) | 403.9481 | 418.1295 |
| DW | -3.3933 (0.7257) | 1.3119 (0.5006) | -0.1070 (0.0404) | -0.0568 (0.2216) | β̂=1.9403 (0.1365) | 395.1214 | 409.3028 |

5.4. The case of a mixed level of dispersion

In this section, we report the analysis of a dataset with a mixed level of dispersion, that is, one where the conditional distribution is over-dispersed over some range of a covariate but under-dispersed over another range, or for other covariates. The data are taken from Cameron and Johansson (1997) and are available in the Ecdat R package under the name Bids. The data record the number of bids received by 126 US firms that were targets of tender offers during a certain period of time. The dependent variable is the number of bids, with a mean of 1.7381 and a variance of 2.0509. The objective of the study is to investigate the effect of some variables on the number of bids. For this analysis, we consider the following variables: bid price, taken as the price at a particular week divided by the price 14 working days before the bid; size, the total book value of assets measured in billions of dollars; and regulator, a dummy variable which is 1 if there was an intervention by federal regulators and 0 otherwise. Since the unconditional variance exceeds the mean, the dataset as a whole appears over-dispersed. However, the conditional dispersion based on the DW regression model in Figure 11 shows a mixed level of dispersion, both across the three covariates and within individual covariates. In particular, the conditional distribution with respect to the covariate regulator remains under-dispersed across the whole range, whereas the other two covariates show ranges of under-, equi- and over-dispersion. Table 9 shows once again a very good fit of the DW regression model to these data: it attains the lowest AIC and BIC of the three models (395.12 and 409.30, respectively).
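The conditional dispersion plotted in Figure 11 is the ratio of the conditional variance to the conditional mean under the fitted model. The Python sketch below illustrates how such a ratio could be computed from the DW estimates in Table 9. It rests on two assumptions that this section does not restate: the type 1 discrete Weibull of Nakagawa and Osaki (1975), with P(Y ≥ y) = q^(y^β), and a log(-log q) link to the linear predictor. The function names and the covariate values are purely illustrative, and this is not the DWreg implementation.

```python
import math

# DW estimates from Table 9 (bids data): intercept, price, size, regulator.
THETA = (-3.3933, 1.3119, -0.1070, -0.0568)
BETA = 1.9403


def dw_q(price, size, regulator, theta=THETA):
    """Assumed link: log(-log(q)) = linear predictor, so q = exp(-exp(eta))."""
    eta = theta[0] + theta[1] * price + theta[2] * size + theta[3] * regulator
    return math.exp(-math.exp(eta))


def dw_mean_var(q, beta, tol=1e-12, max_terms=1_000_000):
    """Mean and variance from tail sums, using P(Y >= y) = q ** (y ** beta).

    E[Y]   = sum_{y>=1} P(Y >= y)
    E[Y^2] = sum_{y>=1} (2y - 1) P(Y >= y)
    The sums are truncated once a term drops below `tol`.
    """
    mean = 0.0
    second = 0.0
    for y in range(1, max_terms + 1):
        tail = q ** (y ** beta)
        mean += tail
        second += (2 * y - 1) * tail
        if (2 * y - 1) * tail < tol:
            break
    return mean, second - mean ** 2


def conditional_dispersion(price, size, regulator):
    """Variance-to-mean ratio of the fitted conditional distribution."""
    mean, var = dw_mean_var(dw_q(price, size, regulator), BETA)
    return var / mean


# Illustrative covariate values (hypothetical, not taken from the data).
for price in (1.0, 1.3, 1.6):
    ratio = conditional_dispersion(price, size=1.0, regulator=0)
    print(f"price={price:.1f}: dispersion={ratio:.3f}")
```

A ratio below 1 indicates conditional under-dispersion and a ratio above 1 over-dispersion. With β = 1 the distribution reduces to the geometric, for which the ratio is 1/(1 - q), which gives a quick check of the tail-sum formulas.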

5.5. Model diagnostics

We conclude the paper with a diagnostic analysis of the fitted DW regression models, in order to assess their goodness-of-fit. In particular, a graphical check based on the Pearson residuals is used to verify that the count data are consistent with the assumed DW regression model. The distribution of these residuals is unknown, so a simulated envelope is a useful diagnostic tool: it allows one to assess whether or not the observed residuals are consistent with the fitted model (Ferrari and Cribari-Neto, 2004; Garay et al., 2011; Sáez-Castillo and Conde-Sánchez, 2013; Atkinson, 1985). Specifically, we consider a 95% simulated envelope for the Pearson residuals and produce an empirical probability plot of the ordered residuals against their sampling distribution quantiles. Figure 12 shows the QQ-plot and the simulated envelope of the Pearson residuals for all the analyses conducted. Few points fall outside the envelope bounds, so there is no evidence of lack of fit for the DW regression model.
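As a rough illustration of the simulated envelope described above, the following Python sketch draws repeated samples from a fitted DW model, computes the ordered Pearson residuals of each sample, and takes pointwise quantiles as envelope bounds. Several points are assumptions rather than the authors' procedure: the type 1 discrete Weibull form P(Y ≥ y) = q^(y^β), the truncation of the moment sums at `ymax`, the hypothetical coefficients in the demo, and the simplification that the model is not refitted to each simulated sample (envelopes of this kind usually refit).

```python
import numpy as np


def dw_mean_sd(q, beta, ymax=500):
    """Mean and standard deviation per observation via truncated tail sums."""
    ys = np.arange(1, ymax + 1)
    tail = q[:, None] ** (ys[None, :] ** beta)   # P(Y >= y) = q^(y^beta)
    mean = tail.sum(axis=1)
    second = ((2 * ys - 1) * tail).sum(axis=1)   # E[Y^2]
    return mean, np.sqrt(second - mean ** 2)


def dw_sample(q, beta, rng):
    """Inverse-transform draw: Y = ceil((log U / log q)^(1/beta)) - 1."""
    u = 1.0 - rng.random(q.shape)                # uniform on (0, 1]
    t = (np.log(u) / np.log(q)) ** (1.0 / beta)
    return np.maximum(np.ceil(t) - 1.0, 0.0)


def simulated_envelope(q, beta, n_sim=199, level=0.95, seed=0):
    """Pointwise envelope for the ordered Pearson residuals.

    Simplification: parameters are held fixed, with no refit per simulation.
    """
    rng = np.random.default_rng(seed)
    mean, sd = dw_mean_sd(q, beta)
    sims = np.empty((n_sim, q.size))
    for b in range(n_sim):
        y_sim = dw_sample(q, beta, rng)
        sims[b] = np.sort((y_sim - mean) / sd)
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(sims, alpha, axis=0)
    upper = np.quantile(sims, 1.0 - alpha, axis=0)
    return lower, upper


# Demo on synthetic data with hypothetical coefficients and shape parameter.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
q = np.exp(-np.exp(-1.0 + 0.5 * x))
beta = 1.2
y = dw_sample(q, beta, rng)
mean, sd = dw_mean_sd(q, beta)
resid = np.sort((y - mean) / sd)
lower, upper = simulated_envelope(q, beta)
inside = np.mean((resid >= lower) & (resid <= upper))
print(f"share of ordered residuals inside the envelope: {inside:.2f}")
```

Plotting `resid`, `lower` and `upper` against reference quantiles would give a display of the kind shown in Figure 12; ordered residuals that leave the band over a long stretch would point to lack of fit.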

6. Conclusion

In this paper we introduce a regression model based on the DW distribution and show how it can be seen as a simple and unified model to capture different levels of dispersion in the data, namely under-dispersion and over-dispersion, including the common case of excessive zeros. This is an attractive feature of DW, analogous to the flexibility of the continuous Weibull distribution in adapting to a variety of hazard rates. In addition, the proposed DW regression model, unlike generalised linear models in which the conditional mean is central to the interpretation, has the advantage of modelling the whole conditional distribution, including all conditional quantiles, which can be easily extracted from the fitted model. This is particularly useful because count data are often highly skewed, in some cases following a power law.

A popular model for under-dispersion is the COM-Poisson regression model. However, the probability mass function of COM-Poisson is not available in closed form: it contains an infinite sum, which must be approximated numerically. In fact, the COM-Poisson implementation used for the examples in this paper required more computational time than the DW regression model, which uses a straightforward maximum likelihood estimation procedure. This is particularly beneficial in the case of large sample sizes.

While NB is the most widely applied model for over-dispersion, the DW regression model is shown to be an attractive alternative. In particular, several examples in this study show that DW regression provides the best fitting model in cases of both over- and under-dispersion, and that it is also able to capture situations with mixed levels of dispersion. In addition, the DW regression model can be applied to data with an excessive number of zero counts without the need for additional parameters, as required by zero-inflated or hurdle models.

The DW regression model has been implemented in the R package DWreg, currently available at http://people.brunel.ac.uk/~mastvvv/Software/.
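The conclusion notes that conditional quantiles follow directly from the fitted model. As a hedged illustration in Python (not the DWreg code), and assuming the type 1 discrete Weibull form P(Y ≤ y) = 1 - q^((y+1)^β), which this section does not restate, the τ-quantile has a closed form. The function names and the parameter values below are hypothetical.

```python
import math


def dw_cdf(y, q, beta):
    """P(Y <= y) = 1 - q ** ((y + 1) ** beta) for y = 0, 1, 2, ..."""
    return 1.0 - q ** ((y + 1) ** beta)


def dw_quantile(tau, q, beta):
    """Smallest non-negative integer y with P(Y <= y) >= tau.

    Solving 1 - q^((y+1)^beta) >= tau for y gives
    y >= (log(1 - tau) / log(q))^(1/beta) - 1.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    t = (math.log(1.0 - tau) / math.log(q)) ** (1.0 / beta) - 1.0
    return max(0, math.ceil(t))


# Hypothetical fitted values of q and beta for one covariate profile.
q, beta = 0.85, 1.9
for tau in (0.25, 0.5, 0.9):
    print(f"tau={tau}: quantile={dw_quantile(tau, q, beta)}")
```

In a regression setting q would be evaluated at the covariate values of interest, so that each covariate profile has its own set of conditional quantiles.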


[Figure 12: seven panels, one per dataset: Inhaler use, Strikes, Airfreight breakage, Doctor visits (Germany) prior to health reform, Doctor visits (US), Doctor visits (Germany) and Bids.]

Fig. 12: Empirical probability plot for the Pearson residuals against standard normal quantiles and 95% simulated envelope for the different datasets.


References

Atkinson, A. C. (1985). Plots, transformations, and regression: an introduction to graphical methods of diagnostic regression analysis. Clarendon Press Oxford.
Barbiero, A. (2015). DiscreteWeibull: Discrete Weibull Distributions (Type 1 and 3). R package version 1.0.1.
Cameron, A. C. and P. Johansson (1997). Count data regression using series expansions: with applications. Journal of Applied Econometrics 12, 203–223.
Cameron, A. C. and P. K. Trivedi (2013). Regression analysis of count data. Cambridge University Press.
Croissant, Y. (2015). Ecdat: Data Sets for Econometrics. R package version 0.2-9.
Da Silva, M. F., S. L. Ferrari, and F. Cribari-Neto (2008). Improved likelihood inference for the shape parameter in Weibull regression. Journal of Statistical Computation and Simulation 78 (9), 789–811.
Dayton, C. M. (2003). Model comparisons using information measures. Journal of Modern Applied Statistical Methods 2 (2), 2.
Efron, B. (1986). Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association 81 (395), 709–721.
Englehardt, J. D., N. J. Ashbolt, C. Loewenstine, E. R. Gadzinski, and A. Y. Ayenu-Prah (2012). Methods for assessing long-term mean pathogen count in drinking water and risk management implications. Journal of Water and Health 10 (2), 197–208.
Englehardt, J. D. and R. Li (2011). The discrete Weibull distribution: an alternative for correlated counts with confirmation for microbial counts in water. Risk Analysis 31 (3), 370–381.
Famoye, F. (1993). Restricted generalized Poisson regression model. Communications in Statistics-Theory and Methods 22 (5), 1335–1354.
Ferrari, S. and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31 (7), 799–815.
Garay, A. M., E. M. Hashimoto, E. M. Ortega, and V. H. Lachos (2011). On estimation and influence diagnostics for zero-inflated negative binomial regression models. Computational Statistics & Data Analysis 55 (3), 1304–1318.
Grunwald, G. K., S. L. Bruce, L. Jiang, M. Strand, and N. Rabinovitch (2011). A statistical model for under-or overdispersed clustered and longitudinal count data. Biometrical Journal 53 (4), 578–594.
Hilbe, J. M. (2014a). COUNT: Functions, data and code for count data. R package version 1.3.2.
Hilbe, J. M. (2014b). Modeling Count Data. Cambridge University Press.
Khan, M. A., A. Khalique, and A. Abouammoh (1989). On estimating parameters in a discrete Weibull distribution. IEEE Transactions on Reliability 38 (3), 348–350.
Kulasekera, K. (1994). Approximate MLE's of the parameters of a discrete Weibull distribution with type I censored data. Microelectronics Reliability 34 (7), 1185–1188.
Kutner, M. H., C. Nachtsheim, and J. Neter (2005). Applied Linear Statistical Models, Fifth Edition. McGraw-Hill/Irwin, New York.
Lee, E. T. and J. Wang (2003). Statistical methods for survival data analysis, Volume 476. John Wiley & Sons.
Nakagawa, T. and S. Osaki (1975). The discrete Weibull distribution. IEEE Transactions on Reliability 24 (5), 300–301.
Nelder, J. A. and R. W. Wedderburn (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A, 370–384.
Roy, D. (2004). Discrete Rayleigh distribution. IEEE Transactions on Reliability 53 (2), 255–260.
Sáez-Castillo, A. and A. Conde-Sánchez (2013). A hyper-Poisson regression model for overdispersed and underdispersed count data. Computational Statistics & Data Analysis 61, 148–157.
Sato, H., M. Ikota, A. Sugimoto, and H. Masuda (1999). A new defect distribution metrology with a consistent discrete exponential formula and its applications. IEEE Transactions on Semiconductor Manufacturing 12 (4), 409–418.
Sellers, K. and T. Lotze (2011). COMPoissonReg: Conway-Maxwell Poisson (COM-Poisson) Regression. R package version 0.3.4.
Sellers, K. F. and G. Shmueli (2010). A flexible regression model for count data. Annals of Applied Statistics 4 (2), 943–961.
Zeileis, A. and T. Hothorn (2002). Diagnostic checking in regression relationships. R News 2 (3), 7–10.
Zeileis, A., C. Kleiber, and S. Jackman (2008). Regression models for count data in R. Journal of Statistical Software 27.