A TEST FOR CORRECT MODEL SPECIFICATION IN INFLATED ...

A TEST FOR CORRECT MODEL SPECIFICATION IN INFLATED BETA REGRESSIONS TARCIANA LIBERAL PEREIRA AND FRANCISCO CRIBARI-NETO Abstract. Beta regression models are useful for modeling data that assume values in the standard unit interval (0, 1), such as rates and proportions. These models, however, cannot be used when the data contain observations that equal zero or one (the limits of the unit interval). Ospina and Ferrari (2010b) developed the class of inflated beta regressions to handle situations in which the data contain positive mass at zero and/or one. The model is quite general and contain three submodels: for the mean, for the probability that the variate equals zero or one, and for the precision. In this paper we propose a misspecification test for inflated beta regressions. In particular, we propose two variants of the test. In the first variant, we only add testing variables to the mean submodel. The second variant follows from adding testing variable to all three submodels. We perform extensive Monte Carlo simulations in order to assess the finite-sample properties of the tests (size and power), and also to gain insight on which variables should be used as testing variable and on which asymptotic testing procedure delivers the most reliable inferences. We consider a number of different misspecifications, namely: neglected nonlinearities, omitted independent variables, incorrect link functions, neglected spatial correlation and neglected variable dispersion. Finally, an empirical illustration is presented and discussed.

1. Introduction Oftentimes practitioners are interested in modeling the dependence of rates, proportions and other other variates that assume values in the standard unit interval on a set of explanatory variables. To that end, the beta regression model proposed by Ferrari and CribariNeto (2004) is particularly useful. The random variable of interest assumes values in (0, 1) and the underlying assumption is that it follows a beta law. Nonetheless, some datasets include observations on the limits of the interval, i.e., they include observations that equal zero and/or one. Ospina and Ferrari (2010a) introduced a family of distributions known as inflated beta distributions that are mixtures of beta and Bernoulli distributions in order to allow for positive probability mass at the limits of the interval. Their distributions allow users to model data that assume values in [0, 1), (0, 1] or [0, 1]. Ospina and Ferrari (2010b) introduced the class of inflated beta regression models in which the response follows a distribution that allow positive probability mass at zero or one. Their model include a regression submodel for the probability that the dependent variable equals one of the limits of the unit interval and also a submodel for the data dispersion. Given that the model encompasses three regression submodels, with three link functions and three linear predictors, model misspecification is likely to take place and practitioners should be careful when basing inferences on a given model. When performing any regression analysis one typically does not know whether the estimated regression model at hand provides an adequate representation for the phenomenon under study. When model specification is incorrect, inaccurate inferences are likely to Date: April 26, 2010. Key words and phrases. Beta distribution, beta regression, inflated beta regression, misspecification test. 1

TESTING MODEL SPECIFICATION IN INFLATED BETA REGRESSIONS

2

follow. The development and assessment of misspecification tests is thus an important research topic in statistics and econometrics. Ramsey (1969) introduced the ‘Regression Specification Error Test’ (RESET) for the linear regression model as a general misspecification test. His test was tailored to detect both omitted variables and incorrect functional form. The testing strategy lies in contrasting the distribution of the residuals under correct model specification to that under the alternative hypothesis that the model specification is in error. Under the null hypothesis of no misspecification, there will exist an efficient, consistent and asymptotically normal estimator of the regression parameters. Under the alternative hypothesis of model misspecification, however, this estimator will be biased and inconsistent (Hausman, 1978). Ramsey and Schmidt (1976) showed that the RESET test based on least squares residuals is equivalent to the test originally proposed by Ramsey (1969). Thus, the testing procedure consists of including in the model powers of one or more variables (which are called ‘testing variables’) and then assessing their significance using an exact F test. The underlying main idea is that when the model is correctly specified powers of the testing variables should not noticeably improve the model fit. In this article we propose two general RESET-like misspecification tests for inflated beta regressions. These tests can be used by practitioners to assess whether their selected model is correctly specified. The null and nonnull finite-sample behavior of the proposed tests are investigated via extensive Monte Carlo simulations. In the numerical evaluations we consider specification errors due to neglected nonlinearities, incorrectly chosen link functions, omitted variables, neglected spatial correlation and nonconstant dispersion. Different choices of testing variables are considered and evaluated. The results indicate that the proposed tests can be quite useful for detecting model misspecification. The paper unfolds as follows. The next section describes the class of inflated beta regression models with variable and fixed dispersion. We present the model log-likelihood function, the score functions, Fisher’s information matrix and its inverse, since these quantities are needed for the proposed misspecification tests, which are introduced in Section ??. In Section ?? we evaluate the finite-sample behavior of different implementations of the proposed misspecification tests using Monte Carlo simulations. An application to real data is presented in Section ??. Finally, Section ?? offers some concluding remarks.

2. The zero- or one-inflated beta model Ospina and Ferrari (2010a) proposed distributions that are mixtures between a beta distribution and a Bernoulli distribution degenerated at zero and/or one to accommodate data that are observed in (0, 1], [0, 1) and [0, 1]. The proposed models are said to be inflated since the they allow for positive probability mass at some points (zero and/or one), unlike the beta law. Ospina and Ferrari (2010b) proposed inflated beta regression models, which are natural extensions of the beta regression model introduced by Ferrari and Cribari-Neto (2004). The authors assume that the response distribution is inflated beta. The continuous part of the data is modeled by the beta law and the point mass (the discrete part) is modeled using a degenerate distribution at a known value c , where c equals zero or one. Under the parameterization used by Ferrari and Cribari-Neto (2004), the beta density function can be expressed as f (y ; µ, φ) =

Γ(φ) y µφ−1 (1 − y )(1−µ)φ−1 , Γ(µφ)Γ((1 − µ)φ)

0 < y < 1,


3

where 0 < µ < 1 and φ > 0. We say that y has beta distribution with mean µ and precision φ and write y ∼ B (µ, φ). Here, IE(y ) = µ and Var(y ) = V (µ)/(1 + φ), where V (µ) = µ(1 − µ) denotes the variance function. The cumulative distribution function of the zero- or one-inflated beta variate (inflation at c ) is (Ospina and Ferrari, 2010a) BIc (y ; α, µ, φ) = α1l{c } (y ) + (1 − α)F (y ; µ, φ), where 1l{c } (y ) is an indicator function that equals 1 if y = c and 0 otherwise, F (·; µ, φ) is the cumulative distribution function of the beta distribution B (µ, φ) and 0 < α < 1 is the mixture parameter given by α = Pr(y = c ). The function BIc has positive mass at y = c , and hence it is not absolutely continuous. Thus, with probability α the variable y is selected from a distribution degenerated at c , and with probability (1 − α) it is selected from a beta distribution. The corresponding probability density function is (Ospina and Ferrari, 2010a) α, y = c, bic (y ; α, µ, φ) = (1) (1 − α)f (y ; µ, φ), y ∈ (0, 1), where 0 < α < 1, 0 < µ < 1, φ > 0 and f (y ; µ, φ) is the B (µ, φ) density function. The density in (??) is inflated beta at the point c , where c = 0 or c = 1. Here, IE(y ) = αc + (1 − α)µ and Var(y ) = (1 − α)V (µ)/(φ + 1) + α(1 − α)(c − µ)2 . If c = 0, the distribution in (??) is called the inflated beta distribution at zero (BEZI) and we write y ∼ BEZI(α, µ, φ). If c = 1, it is known as the inflated beta distribution at one (BEOI) and we write y ∼ BEOI(α, µ, φ).1 Let y 1 , . . . , y n be a random sample from the inflated beta distribution at c (c = 0 ou c = 1). Suppose the conditional mean, the mixture parameter and the precision parameter satisfy the following functional relations: h(αt ) = g (µt ) = b (φt ) =

M X i =1 m X i =1 q X

z t i γi = ζt ,

(2)

x t i βi = ηt ,

(3)

s t i λi = κt ,

(4)

i =1

t = 1, . . . , n , where γ = (γ1 , . . . , γM )> , β = (β1 , . . . , βm )> and λ = (λ1 , . . . , λq )> are vectors of unknown regression parameters such that γ ∈ IRM , β ∈ IRm and λ ∈ IRq , x t 1 , . . . , x t m , z t 1 , . . . , z t M and s t 1 , . . . , s t q are fixed and known covariates that may coincide totally or partly. The link functions h : (0, 1) → IR, g : (0, 1) → IR and b : (0, ∞) → IR are strictly monotonic and twice differentiable. Several different link functions can be used, such as logit, probit, Cauchy, complementary log-log and log-log for the µ and α and log or square root for φ. Here, µt is the mean of y t conditional on y t ∈ (0, 1). The likelihood function for θ = (γ> , β > , λ> )> is given by L(θ ) =

n Y

bic (y t ; αt , µt , φt ) = L 1 (γ)L 2 (β , λ),

t =1 1For brevity, we do not consider simultaneous inflation at zero and one.


4

where L 1 (γ) =

n Y

1l

αt {c }

(y t )

(1 − αt )1−1l{c } (yt ) ,

t =1

Y

L 2 (β , λ) =

f (y t ; µt , φt ).

t :y t ∈(0,1)

The parameters αt , µt and φt are functions of γ, β and λ, respectively, through (??), (??) and (??), i.e., αt = h −1 (ζt ), µt = g −1 (ηt ) and φt = b −1 (κt ). The log-likelihood function for θ = (γ> , β > , λ> )> is given by `(θ ) = `1 (γ) + `2 (β , λ), where `1 (γ) =

n X

`t (αt )

X

`2 (β , λ) =

and

t =1

`t (µt , φt ),

t :y t ∈(0,1)

with `t (αt ) = 1l{c } (y t ) log αt + (1 − 1l{c } (y t )) log(1 − αt ), `t (µt , φt ) = log Γ(φt ) − log Γ(µt φt ) − log Γ((1 − µt )φt ) + (µt φt − 1) log y t + {(1 − µt )φt − 1} log(1 − y t ). Since the likelihood function factorizes into two terms, the parameters are separable and maximum likelihood inference on (β > , λ> )> can be performed separately from that carried out on γ, and vice-versa. From the separability of the vectors of parameters γ and (β > , λ> )> , it is possible to independently obtain the score functions for γ and for (β > , λ> )> . For R = 1, . . . , M , the score function for γ is n

UR = with

and

∂ `1 (γ) X ∂ `t (αt ) d αt ∂ ζt = , ∂ γR ∂ αt d ζt ∂ γR t =1 d αt d h −1 (ζt ) 1 = = 0 d ζt d ζt h (αt )

∂ `t (αt ) 1l{c } (y t ) 1 − 1l{c } (y t ) = − . ∂ αt αt (1 − αt )

Thus, UR =

n X 1l{c } (y t ) − αt t =1

αt (1 − αt )

1 h 0 (α

t)

ztR.

For r = 1 . . . , m , the score function for β is given by X ∂ `t (µt , φt ) d µt ∂ ηt ∂ `2 (β , λ) Ur = = , ∂ βr ∂ µt d η t ∂ βr t :y ∈(0,1) t

with

and

d µt d g −1 (ηt ) 1 = = 0 d ηt d ηt g (µt ) ∂ `t (µt , φt ) yt = φt log − ψ(µt φt ) − ψ((1 − µt )φt ) . ∂ µt 1 − yt


5

Let ¨ y t∗

=

log

yt 1−y t

0,

, y t ∈ (0, 1), otherwise,

and µ∗t = IE(y t∗ |1l{c } (y t ) = 0) = ψ(µt φt ) − ψ((1 − µt )φt ).

(5)

Then, Ur =

n X

(1 − 1l{c } (y t ))φt (y t∗ − µ∗t )

t =1

1

xt r . g 0 (µt )

Consider now the function score for λ. It follows that, for p = 1 . . . ,q , Up =

X ∂ `t (µt , φt ) d φt ∂ κt ∂ `2 (β , λ) = , ∂ λp ∂ φt d κt ∂ λp t :y ∈(0,1) t

with d φt d b −1 (κt ) 1 = = 0 d κt d κt b (φt ) and ∂ `t (µt , φt ) yt − ψ(µt φt ) − ψ((1 − µt )φt ) + log(1 − y t ) = µt log ∂ φt 1 − yt − ψ((1 − µt )φt ) + ψ(φt ). We then have that Up =

n X

(1 − 1l{c } (y t ))a t

t =1

1 b 0 (φ

t)

st p ,

n o y with a t = µt log 1−yt − ψ(µt φt ) − ψ((1 − µt )φt ) + s (y t ) − ψ((1 − µt )φt ) + ψ(φt ) and t s (y t ) = log(1 − y t ), if y t ∈ (0, 1) and s (y t ) = 0, if y t = 1. The score vectors for γ, β and λ can be conveniently written in matrix form as Uγ (γ) = Z > PG (y c − α∗ ), Uβ (β , λ) = X > ΦT H (y ∗ − µ∗ ), Uλ (β , λ) = S > V Ha , where Z is the n × M matrix with z t> = (z t 1 , . . . , z t M ), X is the n × m matrix with x t> = (x t 1 , . . . , x t m ), S is the n ×q matrix with s t> = (s t 1 , . . . , s t q ), y ∗ = (y 1∗ , . . . , y n∗ )> , µ∗ = (µ∗1 , . . . , µ∗n )> , > y c = 1l{c } (y 1 ), . . . , 1l{c } (y n ) , α∗ = (α1 , . . . , αn )> and a = (a 1 , . . . , a n )> . In the notation used above, the following are n ×n diagonal matrices: P = diag{1/ [α1 (1 − α1 )] , . . . , 1/ [αn (1 − αn )]}, H = diag{1−1l{c } (y 1 ), . . . , 1−1l{c } (y n )}, G = diag{d α1 /d ζ1 , . . . , d αn /d ζn }, T = diag{d µ1 /d η1 , . . . , d µn /d ηn }, V = diag{d φ1 /d κ1 , . . . , d φn /d κn } and Φ = diag{φ1 , . . . , φn }. The maximum likelihood estimator of γ is obtained as the solution of the nonlinear system Uγ (γ) = 0. Moreover, the maximum likelihood estimators of β and λ are obtained as the solution of the nonlinear system (Uβ (β , λ)> ,Uλ (β , λ)> ) = 0. Such estimators do not have closed-form and must be computed by numerically maximizing the log-likelihood function using a Newton or quasi-Newton nonlinear optimization algorithm.


6

We shall next obtain Fisher’s information matrix. For R = 1, . . . , M and O = 1, . . . , M , we have that n ∂ `t (αt ) d αt ∂ ζt d αt ∂ ζt ∂ 2 `1 (γ) X ∂ = URO = ∂ γR γO ∂ αt ∂ αt d ζt ∂ γR d ζt ∂ γO t =1 ¨ n X −1l(0,1) (y t ) 1l{c } (y t ) 1l{c } (y t ) 1l(0,1) (y t ) d αt 2 = − + − d ζt αt (1 − αt ) (1 − αt )2 α2t t =1 « ∂ d αt d αt z tOz t R. × ∂ αt d ζt d ζt Since, under the usual conditions of regularity, IE(∂ `t (αt )/∂ αt ) = 0, it follows that n ¨ « X −1l(0,1) (y t ) 1l{c } (y t ) d αt 2 IE(URO ) = IE z tOz t R − d ζt (1 − αt )2 α2t t =1 n X −1 1 d αt 2 = − z tOz t R. (1 − αt ) αt d ζt t =1 Let p t =

1 . αt (1−αt )

Then, IE(URO ) = −

n X

pt

t =1

1 0 h (αt )

2 z tOz t R.

For r = 1, . . . , m and o = 1, . . . , m , X ∂ 2 `2 (β , λ) ∂ `t (µt , φt ) d µt ∂ ηt d µt ∂ ηt ∂ Uro = = ∂ βr βo ∂ µt ∂ µt d ηt ∂ βr d ηt ∂ βo t :y t ∈(0,1) ¨ « X ∂ µ∗t d µt 2 ∂ d µt d µt ∗ ∗ = −φt + φt (y t − µt ) xt oxt r . ∂ µt d ηt ∂ µt d ηt d ηt t :y ∈(0,1) t

From (??) it follows that ∂ µ∗t ∂ µt

= −φt φt ψ0 (µt φt ) + ψ0 ((1 − µt )φt ) .

Let w t = ψ0 (µt φt ) + ψ0 ((1 − µt )φt ) . Then, ¨ « X d µt 2 ∂ d µt d µt 2 ∗ ∗ −φt w t + φt (y t − µt ) xt oxt r . Uro = d ηt ∂ µ t d ηt d ηt t :y ∈(0,1)

t

For cumulants involving the vectors of parameters β and λ one can use the fact that if τ : (0, 1) → IR is a continuous function, then (Ospina and Ferrari, 2010b)   n X X IE  τ(y t ) = (1 − αt )IE τ(y t )|1l{c } (y t ) = 0 . t :y t ∈(0,1)

t =1

Recall that IE y t∗ |1l{c } (y t ) = 0 = µ∗t . Also, IE ∂ `t (µt , φt )/∂ µt = 0. Thus,   ¨ « X d µt 2 ∂ d µt d µt 2 ∗ ∗ IE(Uro ) = IE  −φt w t + φt (y t − µt ) xt oxt r  d ηt ∂ µt d η t d η t t :y ∈(0,1) t


=−

n X

φt2 (1 − αt )w t

t =1

1 g 0 (µt )

7

2 xt oxt r .

Additionally, for p = 1, . . . ,q and l = 1, . . . ,q , X ∂ 2 `2 (β , λ) ∂ `t (µt , φt ) d φt ∂ κt d φt ∂ κt ∂ Up l = = ∂ λp λl ∂ φt ∂ φt d κt ∂ λp d κt ∂ λl t :y t ∈(0,1) ¨ « X ∂ `t (µt , φt ) ∂ d φt d φt ∂ 2 `t (µt , φt ) d φt 2 + st l st p , = d κt ∂ (φt ) ∂ φt d κt d κt ∂ φt2 t :y ∈(0,1) t

where ∂ 2 `t (µt , φt ) ∂ φt2

= −µt

∂ µ∗t ∂ φt

− (1 − µt )ψ0 ((1 − µt )φt ) + ψ0 (φt )

¦ © = − (1 − µt )2 ψ0 ((1 − µt )φt ) + µ2t ψ0 (µt φt ) − ψ0 (φt ) . Let d t = (1 − µt )2 ψ0 ((1 − µt )φt ) + µ2t ψ0 (µt φt ) − ψ0 (φt ). We then have that ¨ « X d φt 2 ∂ d φt d φt Up l = −d t +at st l st p . d κt ∂ φt d κt d κt t :y ∈(0,1) t

Ospina and Ferrari (2010a, Proposition 2.1) showed that Density (??) belongs to the expo nential family. Hence, IE log(1 − y t )|1l{c } (y t ) = 0 = ψ((1 − µt )φt ) − ψ(φt ). Under the usual regularity conditions, we have that IE ∂ `t (µt , φt )/∂ φt = 0 and, therefore,   X d φt 2 st l st p  IE(Up l ) = IE  −d t d κ t t :y t ∈(0,1) 2 n X 1 =− ) st l st p . (1 − αt )d t b 0 (φt t =1 Finally, for r = 1, . . . , m and p = 1, . . . ,q , X ∂ 2 `2 (β , λ) ∂ `2 (β , λ) d φt ∂ κt ∂ Ur p = = ∂ βr λp ∂ φt ∂ βr d κt ∂ λ p t :y t ∈(0,1) X = (y t∗ − µ∗t ) − φt ψ0 (µt φt )µt − ψ0 ((1 − µt )φt )(1 − µt ) t :y t ∈(0,1)

×

1

xt r g 0 (µt )

d φt ∂ κt . d κ t ∂ λp

By defining c t = φt µt w t − ψ0 ((1 − µt )φt ) , we can write X 1 d φt ∂ κt Ur p = − c t − (y t∗ − µ∗t ) 0 xt r . g (µt ) d κt ∂ λ p t :y ∈(0,1) t

Therefore,  IE(Ur p ) = IE 

X

−

c t − (y t∗ − µ∗t )

t :y t ∈(0,1)

=−

n X t =1

(1 − αt )c t

1 g 0 (µ

1 t

) b 0 (φ

t)

 1 d φt ∂ κt  xt r g 0 (µt ) d κ t ∂ λp

xt r st p .


8

From the separability of the vector of parameters γ and (β > , λ> ), we have that URp = Ur R = 0. The Fisher information matrix for the inflated beta distribution with nonconstant dispersion has the form K γ (γ) 0 K (θ ) = . 0 K ϑ (ϑ) Here, K γ (γ) = K γγ is the Fisher information matrix of γ given by K γγ = Z >QZ , where Q = G PG = diag{q1 , . . . ,qn } is a diagonal matrix, with qt = p t (d αt /d ζ)2 . Additionally, Kββ Kβλ K ϑ (β , λ) = K λβ K λλ > = is Fisher’s information matrix for ϑ = (β > , λ> ). Here, K β β = X > ∆T ΦW ΦT X , K β λ = K λβ

X > ∆C T V S and K λλ = S > V ∆DV S. Here, ∆ = diag{δ1 , . . . , δn }, W = diag{w 1 , . . . , w n }, D = diag{d 1 , . . . , d n } and C = {c 1 , . . . , c n }. For t = 1, . . . , n , we have that δt = 1 − αt . In what follows we shall simplify the above results by assuming fixed dispersion. The score vectors for γ and β become Uγ (γ) = Z > PG (y c − α∗ ), Uβ (β , φ) = φX > T H (y ∗ − µ∗ ). The score function for φ (common precision parameter) is obtained independently from that for γ. Let D ∗ = diag{d 1∗ , . . . , d n∗ }, with d t∗ = µt (y t∗ − µ∗t ) + s (y t ) + ψ(φ) − ψ((1 − µt )φ). Then, Uφ = tr(H D ∗ ), where tr(·) is the trace operator. The vector of parameters γ is orthogonal to (β > , φ)> . As a result, the corresponding score functions are uncorrelated and, asymptotically, the maximum likelihood estimator of γ is independent from the maximum likelihood estimators of β and φ. The maximum likelihood estimator of γ is obtained as the solution of the nonlinear system Uγ (γ) = 0 and those of β and φ are obtained as the solution of the nonlinear system (Uβ (β , φ)> ,Uφ (β , φ)> ) = 0. Such estimators do not have closed-form and are computed by numerically maximizing the log-likelihood function using a nonlinear optimization algorithm, such as a Newton or quasi-Newton algorithm. The Fisher information matrix for the inflated beta regression model with constant dispersion can be written as K γ (γ) 0 K (θ ) = , 0 K ϑ (ϑ) where K γ (γ) = K γγ is the Fisher information for γ given before. Additionally, Kββ Kβφ K ϑ (β , φ) = K φβ K φφ > is the Fisher information matrix for ϑ = (β > , φ), with K β β = φ 2 X > ∆T W T X , K β φ = K φβ =

X > ∆T c , K φφ = tr(∆D) and c = (c 1 , . . . , c n )> . For t = 1, . . . , n w t , c t and d t are as before with φt replaced by φ. 3. Testing for model misspecification in inflated beta regressions

In what follows our goal will be to determine whether an inflated beta regression is somehow misspecified. To that end, we shall propose two RESET-like misspecification tests. In the first test we focus on the regression submodel for µ and in the second approach we devise an strategy in which all three regression submodels are augmented.


9

At the outset, consider the model h(α) = Z γ, g (µ) = X β , where γ and β are M × 1 and m × 1 vectors, respectively, Z and X are n × M and n × m matrices of explanatory variables, respectively, and µ and α are n × 1 vectors. Additionally, when dispersion is not constant, the precision parameter is given by b (φ) = Sλ, where λ is a q × 1 vector of unknown regression parameters, S is an n ×q matrix and φ is an n × 1 vector. The test we propose is carried out in two steps. First, one estimates the parameter vectors and select a set of ‘testing variables’, which are used to define the following ‘augmented model’: g (µ) = X β + A 1 τ1 , (6) where A 1 is an n × u 1 matrix of testing variables and τ1 is an u 1 × 1 vector of parameters. Second, we estimate (??) and test the null hypothesis H0 : τ1 = 0 (the model is correctly specified) against the alternative hypothesis H1 : τ1 6= 0 (the model is not correctly specified). The underlying main idea is that when the model is correctly specified such testing variables should not provide additional contribution to the quality of the model fit. The practical implementation of the test depends on two factors. First, one should to select the testing variables to be used in the augmented regression. Second, it is necessary to select a testing procedure for the exclusion of the added variables. The testing variables used can be selected as powers of the fitted linear predictor or powers of the predicted values. The exclusion test can be performed using, e.g., the likelihood ratio, score or Wald test. > > > > Let ν = (τ> 1 , γ , β , λ ) . The likelihood ratio test statistic is ξ1 = 2{`(ˆ ν ) − `(ν˜ )},

(7)

ˆ > )> is the (unrestricted) maximum likelihood estimators of ν and ˆ> , βˆ > , λ where νˆ = (τˆ1 > , γ > > > > > ˜ ˜ ν˜ = (0 , γ˜ , β , λ ) is the maximum likelihood estimator of ν obtained by imposing the null hypothesis, i.e., the restricted maximum likelihood estimator. The score test statistic can be written as > ˜ ββ ˜ ξ2 = U˜ 1β K 11 U1β , ββ

where U1β is the u 1 -vector with the first u 1 elements of the score function Uβ (β , λ) and K 11 is the u 1 × u 1 matrix formed out of the first u 1 rows and the first u 1 columns of the inverse of K β β . Tildes denote evaluation at the restricted maximum likelihood estimator. The Wald test statistic is β β −1 ˆ > Kˆ ˆ1 . τ ξ3 = τ 1

11

Here, hats denote evaluation at the unrestricted maximum likelihood estimator. Under the usual regularity conditions and under H0 , ξ1 , ξ2 and ξ3 converge in distribution to χu2 1 . Thus, the test can be performed using approximate (asymptotic) critical values from that distribution. We shall now introduce a second misspecification testing procedure. We now augment the three submodels, i.e., the regression submodels for µ, α and φ. The augmented model is thus g (µ) = X β + A 1 τ1 ,


10

h(α) = Z γ + A 2 τ2 , d (φ) = Sλ + A 3 τ3 , where A 1 , A 2 and A 3 are n × u 1 , n × u 2 and n × u 3 matrices, respectively, that contain the testing variables and τ1 , τ2 and τ3 are parameter vectors of dimensions u 1 × 1, u 2 × 1 and u 3 × 1, respectively. The null hypothesis to be tested is H0 : τ1 = τ2 = τ3 = 0 (the model is correctly specified) against the alternative hypothesis that at least one τi , i = 1, 2, 3, is nonzero (the model is not correctly specified). The likelihood ratio test statistic is as given in Equation (??) and the score and Wald test statistics, given below, can be written as sums of three quadratic forms: > ˜ γγ ˜ > ˜ ββ ˜ > ˜ λλ ˜ ξ2 = U˜ 1γ K 11 U1γ + U˜ 1β K 11 U1β + U˜ 1λ K 11 U1λ ,

−1 γγ −1 > ˆββ ˆ ˆ λλ −1 τ ˆ3 , ˆ> ˆ ˆ ˆ2 + τ ˆ> ξ3 = τ K τ + τ K τ 1 1 2 3 K 11 11 11 where U1γ is the u 2 -vector with the first u 2 elements of the score function Uγ (γ) and U1λ γγ is the u 3 -vector with the first u 3 elements of the score function Uλ (β , λ). Similarly, K 11 is defined as the u 2 × u 2 matrix that contains the first u 2 rows and the first u 2 columns of λλ the inverse of K γγ and K 11 is defined as the u 3 × u 3 matrix formed out of the first u 3 rows and the first u 3 columns of the inverse matrix of K λλ . Under the usual regularity conditions and under H0 , ξ1 , ξ2 and ξ3 converge in distribution to χu2 1 +u 2 +u 3 . Hence, the tests can be carried out using approximate (asymptotic) critical values from that distribution.

4. Numerical evaluation We shall now present the results of a set of Monte Carlo simulations on the finite-sample behavior of the proposed misspecification tests for inflated beta regressions. The evaluating criteria are the empirical sizes (null rejection rates) and the empirical powers (nonnull rejection rates) of the tests. The zero- or one -inflated beta regression model used in the simulations is h(αt ) = γ0 + γ1 z t 1 , g (µt ) = β0 + β1 x t 1 ,

(8)

b (φt ) = λ0 + λ1 s t 1 , t = 1, . . . , n , where h(·), g (·) and b (·) are link functions and z t 1 , x t 1 and s t 1 are covariates. The covariates values were selected as follows. For each covariate, we randomly generated 40 observations from U (0, 1) and replicated them in order to obtain larger samples. All simulations were performed using R (see http://www.R-project.org) and were based on 10,000 replications. In each replication we generated a random sample of the dependent variable y = (y 1 , . . . , y n )> with y t ∼ BIU(αt , µt , φt ), where αt = h −1 (ζt ), µt = g −1 (ηt ) and φt = b −1 (κt ). The specification errors considered in the power simulations are neglected nonlinearities, incorrect link function, omitted variables, neglected spatial correlation and neglected nonconstant dispersion. We use as testing variables the square of the estimated linear predictor and also the square of the predicted values; we denote the vector of predicted values by µ† . We have also considered other testing variables (including powers of the regressors) but such results are not displayed for brevity.


11

All tests are carried out considering three nominal levels, namely: 1%, 5% and 10%. In order for the powers comparisons to be meaningful, they were performed using exact critical values estimated from the corresponding size simulations, and not asymptotic critical values. Thus, we shall compare the powers of tests that share the same size. The test performed by only augmenting the submodel for the mean shall be denoted by R µ and the test based on the augmentation of the three submodels shall be denoted by R θ , with θ = (µ, α, φ)> representing the vector that contains the three parameters. 4.1. Size simulations. In the size simulations, the response is generated according to (??). For brevity, we shall only present the results of simulations performed using the logit link function for µ and α and the log link for φ. The true parameter values are γ0 = −2.0, γ1 = 1.5, β0 = −1.0, β1 = 3.5, λ0 = 5.1 and λ1 = −2.8. Table ?? contains the rejection rates (%) of the null hypothesis that the model is correctly specified. We present results for the likelihood ratio (LR), score and Wald implementations of the R µ test and also for the likelihood ratio implementation of the R θ misspecification test, which was the best performing testing procedure for it. The figures in Table ?? show that the Wald implementation of the R µ test has the worst performance. For example, when n = 120 and the testing variable is ηˆ2 , the empirical sizes of the likelihood ratio, score and Wald tests at the 5% nominal level are, respectively, 5.58%, 5.05% and 6.37%. Overall, the score implementation of R µ was the best performing impleˆ †2 , the estimated sizes of the likelihood mentation. When n = 80 and the test is based on µ ratio, score and Wald tests at the 10% nominal level are 11.70%, 10.34% and 12.84%, respectively. We also note that the best strategy is to use ηˆ2 as testing variable. We now move to the R θ test. Recall that we now augment the three submodels. According ˆ 2 ) indicates that the variable ηˆ2 is added to the regression to the notation we use, (ηˆ2 , ζˆ2 , κ ˆ 2 is added submodel for µ, the variable ζˆ2 is added to the submodel for α and the variable κ †2 †2 †2 †2 ˆ ,µ ˆ ,µ ˆ ) indicates that µ ˆ was used as testing varito the submodel for φ. Accordingly, (µ able in all three submodels. The figures in Table ?? show that the estimated sizes achieved with the different testing variables are similar. Overall, the numerical evidence shows that the least size-distorted test is the score implementation of the R µ misspecification test with ηˆ2 used as testing variable. 4.2. Power Simulations. Power simulations for the R µ and R θ tests were performed considering that only the submodel for the mean is misspecified and under misspecification of all three submodels (except under neglected variable dispersion). In what follows, we shall only present numerical resuls for the score implementation of the R µ test, which is the least size-distorted. Also, we shall only report numerical results for the best and worst choices ˆ 2 ) under neglected sçatial of testing variables for the R θ test, namely (ηˆ2 , ηˆ2 , ηˆ2 ) e (ηˆ2 , ζˆ2 , κ 2 2 2 †2 †2 †2 ˆ ,µ ˆ ,µ ˆ ) in all remaining cases. correlation and (ηˆ , ηˆ , ηˆ ) e (µ 4.2.1. Nonlinearity. Under neglected nonlinearity, the data generating process includes δ g (µt ) = β0 + β1 x t 1 , with δ = 1.8, β0 = 1.7 and β1 = −1.7. When misspecification also takes place in the other two submodels, we generated the data using δ h(αt ) = γ0 + γ1 z t 1 , b (φt ) = (λ0 + λ1 s t 1 )δ ,

Test LR LR LR LR LR LR

ˆ2) (ηˆ2 , ζˆ2 , κ 2 2 (ηˆ , ηˆ , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ 2 2 ˆ †2 ) (ηˆ , ηˆ , µ ˆ †2 , µ ˆ †2 ) (ηˆ2 , µ ˆ †2 , ηˆ2 ) (ηˆ2 , µ

LR score Wald LR score Wald

Test

Added regressors

ˆ †2 µ

ηˆ2

Added regressors

1% 3.37 2.40 2.60 2.06 2.14 2.29

1% 1.88 0.78 3.45 2.01 0.98 4.15

10% 13.44 10.93 15.86 14.10 11.46 17.37

10% 17.69 15.39 15.21 14.91 14.43 15.12

n = 40 5% 7.55 5.36 9.82 7.83 5.87 10.57 n = 40 5% 10.34 8.77 8.71 8.31 7.86 8.23

R µ test n = 80 1% 5% 10% 1.3 6.15 11.45 0.98 5.2 10.45 1.98 7.15 12.69 1.40 6.15 11.70 1.01 5.27 10.34 2.11 7.28 12.84 R θ test n = 80 1% 5% 10% 1.70 6.99 12.88 1.63 6.73 12.86 1.57 6.92 13.12 1.72 6.79 12.38 1.64 6.66 12.47 1.52 6.66 12.52 1% 1.37 1.32 1.38 1.22 1.23 1.32

1% 1.26 1.01 1.57 1.44 1.17 1.84

Table 1. Null rejection rates (%).

n = 120 5% 10% 6.31 11.9 6.11 11.55 6.34 12.24 6.27 11.49 6.15 11.66 5.95 11.57

n = 120 5% 10% 5.58 10.81 5.05 10.15 6.37 11.40 5.82 10.94 5.26 10.22 6.48 11.72

1% 1.15 1.06 1.19 1.12 1.10 0.97

1% 1.21 0.95 1.35 1.02 0.83 1.25

n = 200 5% 10% 5.68 10.67 5.72 10.77 6.05 10.97 5.59 10.58 5.63 11.01 5.60 10.96

n = 200 5% 10% 4.96 10.22 4.57 9.75 5.22 10.64 5.75 10.65 5.44 10.34 6.15 11.25

TESTING MODEL SPECIFICATION IN INFLATED BETA REGRESSIONS 12


13

where γ0 = 0.7, γ1 = −0.7, λ0 = 3.0 and λ1 = −1.0. Correct model specification would require δ = 1. Our interest lies in assessing whether the proposed test be can identify that the model specification is in error. Table ?? presents the rejection rates under neglected nonlinearity. We note that considerably higher powers are obtained when ηˆ2 is used as testing variable. For instance, when n = 80 and at the 5% nominal level, the nonnull rejection rates of the score test is 96.85% ˆ †2 . It is notewhen ηˆ2 is used as testing variable and 35.39% when the testing variable is µ worthy that when the mean submodel is the only incorrectly specified submodel the R µ test with ηˆ2 as testing variable is more powerful than the R θ test (in which all three submodels are augmented). When all three submodels are incorrectly specified the R µ test with ηˆ2 as testing variable and R θ based on (ηˆ2 , ηˆ2 , ηˆ2 ) are equally powerful. We also note that the choice of testing ˆ †2 is used as variable can have a decisive impact on power. As before, the power of R µ when µ 2 testing variable is substantially lower than when based on ηˆ . Additionally, the best choice of testing variables for R θ is (ηˆ2 , ηˆ2 , ηˆ2 ). 4.2.2. Incorrect link function. In order to consider specification error in the link function, data are generated from Model (??) but using the complementary log-log link for µ. Here, β0 = −1.5 and β1 = 2.5. The estimated model incorrectly uses the logit link for µ. When all three submodels are incorrectly specified, we generate the data using the complementary log-log link for α and the inverse link for φ. Here, γ0 = −1.7, γ1 = 1.0, λ0 = 0.03 and λ1 = −0.02. The estimated model incorrectly uses the logit link for α and the log link for φ. Table ?? presents the nonnull rejection rates (%) under misspecified link function(s). The R µ test with ηˆ2 as testing variable outperforms R θ . Notice that the power of R µ is considerˆ †2 . The maximal power achieved ably higher when we use ηˆ2 as testing variable, instead of µ by using the latter is 29.03% whereas the corresponding figure for the former is 99.95%. When the three link functions are incorrectly specified, the R µ test with ηˆ2 as testing variable again outperforms R θ . 4.2.3. Omitted variables. We shall now move to the situation in which the practitioner fails to include in the regression an important regressor. As before, we consider two scenarios. First, misspecification only takes place in the mean submodel, which is estimated using x t 1 as a regressor and yet the true data generating process is given by g (µt ) = β0 + β1 x t 1 + β2 x t 2 , with β0 = −0.1, β1 = 1.5 and β2 = −2.0. Second, when all three submodels are in error, we have, in addition, that the data generating mechanism is such that h(αt ) = γ0 + γ1 z t 1 + γ2 z t 2 , d (φt ) = λ0 + λ1 s t 1 + λ2 s t 2 , where γ0 = −0.5, γ1 = 0.5, γ2 = −1.5, λ0 = 3.5, λ1 = 1.5 and λ2 = −3.5. The estimated model does not include the covariates x t 2 , z t 2 and s t 2 . The results displayed in Table ?? lead to important conclusions. First, the R µ test (in which we only augment the mean submodel) is considerably more powerful when ηˆ2 is used as testing variable. Second, once again R µ is more powerful than R θ when misspecification occurs in the mean submodel and also in all three submodels, especially when the sample size is small. For instance, when n = 40 and at the 10% nominal level, the powers of R µ and of the best performing R θ test, when all three

LR LR

Tests score score Tests LR LR

ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

Added regressors

ηˆ2 ˆ †2 µ

Added regressors

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

ηˆ2 )

Tests

Added regressors

ηˆ2 ,

score score

ηˆ2 ˆ †2 µ

(ηˆ2 ,

Tests

Added regressors

1% 59.29 5.54

1% 68.69 7.07

1% 31.89 4.63

1% 48.36 6.87

Specification errors in the submodel for the mean R µ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 10% 74.46 84.46 88.79 96.85 98.72 98.26 99.76 99.91 19.22 28.14 19.08 35.39 47.06 28.52 48.69 60.48 R θ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 10% 59.39 71.34 75.63 91.45 95.61 93.69 98.76 99.59 14.16 23.20 12.84 26.85 36.84 19.46 36.57 48.18 Specification errors in the three submodels R µ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 10% 86.79 92.09 99.16 99.80 99.91 99.98 100.00 100.00 18.08 25.55 25.32 35.94 42.01 34.87 43.53 49.57 R θ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 10% 86.55 92.84 99.43 99.85 99.91 99.99 100.00 100.00 17.92 26.59 26.63 36.83 43.64 34.77 44.42 50.66

Table 2. Nonnull rejection rates (%): neglected nonlinearity.

1% 100.00 41.98

1% 100.00 41.25

1% 99.87 31.34

1% 99.98 47.24

10% 99.99 66.16

n = 200 5% 10% 100.00 100.00 49.89 55.58

n = 200 5% 10% 100.00 100.00 50.61 56.99

n = 200 5% 99.97 52.70

n = 200 5% 10% 100.00 100.00 69.19 79.29




Added regressors ηˆ2 ˆ †2 µ

Added regressors

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

Added regressors ηˆ2 ˆ †2 µ

Added regressors

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

Specification errors in the submodel for the mean R µ test n = 40 n = 80 n = 120 1% 5% 10% 1% 5% 10% 1% 5% 29.81 53.61 66.23 64.79 85.48 91.95 86.81 96.24 2.19 8.14 14.99 3.55 11.75 19.67 4.24 14.00 R θ test n = 40 n = 80 n = 120 1% 5% 10% 1% 5% 10% 1% 5% 17.87 37.58 51.11 46.87 70.88 81.07 70.50 89.08 2.22 8.61 15.86 3.33 12.79 20.54 5.0 15.09 Specification errors in the three submodels R µ test n = 40 n = 80 n = 120 1% 5% 10% 1% 5% 10% 1% 5% 22.14 45.73 57.82 48.78 74.81 84.11 75.10 90.45 1.86 7.41 13.36 2.28 8.05 14.30 2.83 9.13 R θ test n = 40 n = 80 n = 120 1% 5% 10% 1% 5% 10% 1% 5% 10.44 29.16 40.76 29.72 55.94 68.34 55.80 77.34 1.93 8.05 15.40 3.35 12.12 19.49 4.81 14.42 10% 85.94 22.93

10% 94.55 15.75

10% 94.07 24.92

10% 98.34 23.06

Table 3. Nonnull rejection rates (%): incorrect link function.

1% 86.52 7.02

1% 95.58 3.26

1% 95.18 8.22

1% 98.83 8.53

n = 200 5% 10% 95.06 97.74 19.55 29.95

n = 200 5% 10% 98.68 99.33 10.33 16.93

n = 200 5% 10% 98.88 99.54 21.97 33.59

n = 200 5% 10% 99.85 99.95 18.86 29.03



16

submodels are in error, are 59.83% and 36.83%, respectively. Third, the R θ test performs best when (ηˆ2 , ηˆ2 , ηˆ2 ) are used as testing variables. 4.2.4. Neglected spatial correlation. Oftentimes practitioners fail to account for existing spatial correlation in their modeling strategies. It is thus important to determine whether the proposed tests can reliably detect that the model is misspecified for not taking such correlation into account. We have performed simulations in which the data are spatially correlated. To that end, we generated latent Gaussian spatially correlated effects and included them in the model linear predictor. The model is estimated without taking into account the spatial correlation. The mean submodel parameter values are β0 = −0.8 and β1 = 3.5. When all three submodels are in error of specification, we have, in addition, that γ0 = −2.0 and γ1 = 1.5 (submodel for α) and λ0 = 6.0 and λ1 = −3.5 (submodel for φ). The simulation results are presented in Table ??. Important conclusions can be drawn from these results. First, the powers of all tests are low when the sample is small. Large samples are needed to reliably detect neglected spatial correlation. That is why we now report results for sample sizes that range from 40 to 1,000 observations. Second, once again the R µ test was considerably more powerful when the mean submodel was augmented using ηˆ2 . When n = 200 and at the 5% nominal level, the power of the score implementations ˆ †2 is 6.62%. Third, of R µ based on ηˆ2 is 40.68% whereas the corresponding power based on µ unlike the previous misspecification scenarios, R µ is now less powerful than R θ . For instance, when n = 200 and at the 10% nominal level, the powers of the best performing R µ and R θ tests are, respectively, 62.22% and 96.31% when misspecification only takes place in the mean submodel, and 35.06% and 96.23% under misspecification in all three submodels. Nonetheless, we notice that R µ has good power when the sample size is large. 4.2.5. Nonconstant dispersion. An important property of a misspecification test lies in its ability to identify the lack of fit due to nonconstant dispersion when the model is estimated under the assumption that dispersion is fixed, i.e., that φ is constant across observations. We have performed simulations to assess whether our test is able to detect such a lack of fit. Here, we only consider inference based on the R µ test. The responses were generated using the following parameter values: β0 = −0.8, β1 = 3.5, γ0 = −2.0, γ1 = 1.5, λ0 = 6.0 and λ1 = −6.0. Parameter estimation was carried out assuming that φ is constant. The results presented in Table ?? show that the score implementation delivers the most powerful test. The results also show that the test based on ηˆ2 is once again considerably more powerful ˆ †2 . than that based on µ 4.3. Size and power simulations under constant dispersion. A simulation study, similar to that performed for the inflated beta model with variable dispersion, was carried out for inflated beta regression models with constant dispersion. The constant dispersion inflated beta regression model used in the simulations was h(αt ) = γ0 + γ1 z t 1 , g (µt ) = β0 + β1 x t 1 , t = 1, . . . , n , where h(·) and g (·) are link functions and z t 1 and x t 1 are covariates. In all simulations, φ = 70. We randomly generated 40 observations of each covariate from the standard uniform distribution and then replicated them twice, three and five times in order to obtain samples of 80, 120 and 200 observations. The reported results are based on 10, 000 Monte Carlo replications. The specification errors considered in the power simulations are



Added regressors

ηˆ2 ˆ †2 µ

Added regressors

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

Added regressors

ηˆ2 ˆ †2 µ

Added regressors

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ †2 , µ ˆ †2 , µ ˆ †2 ) (µ

1% 7.54 0.72

1% 23.78 1.17

1% 38.34 0.99

1% 59.87 0.89

Specification errors in the submodel for the mean R µ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 80.03 89.77 94.14 98.76 99.55 99.52 99.96 3.77 8.23 1.17 4.45 8.60 1.10 4.06 R θ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 66.95 79.14 84.58 95.62 97.92 97.97 99.60 4.63 9.61 1.33 4.84 9.60 1.34 4.90 Specification errors in the three submodels R µ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 46.16 59.83 55.13 79.47 86.88 80.04 93.12 5.30 10.50 1.69 5.97 11.20 1.76 6.33 R θ test n = 40 n = 80 n = 120 5% 10% 1% 5% 10% 1% 5% 24.04 36.83 29.05 53.14 66.15 51.15 72.88 3.94 8.71 1.40 6.03 10.96 1.52 6.46 10% 82.35 12.07

10% 96.56 11.68

10% 99.88 9.09

10% 99.98 8.75

Table 4. Nonnull rejection rates (%): omitted variables.

1% 81.82 2.08

1% 97.16 2.28

1% 99.98 1.15

1% 99.99 1.04

10% 99.67 13.73

10% 96.56 14.11

n = 200 5% 99.17 7.97 n = 200 5% 93.71 8.18

n = 200 5% 10% 99.99 100.00 4.78 9.51

n = 200 5% 10% 100.00 100.00 4.03 8.71


Tests score score Tests

Added regressors

ηˆ2 ˆ †2 µ

Added regressors

ˆ2) (ηˆ2 , ζˆ2 , κ

LR LR

LR LR

(ηˆ2 , ηˆ2 , ηˆ2 ) ˆ2) (ηˆ2 , ζˆ2 , κ

ηˆ2 )

Tests

Added regressors

ηˆ2 ,

score score

ηˆ2 ˆ †2 µ

(ηˆ2 ,

Tests

Added regressors 10% 7.25 3.45

10% 21.39 9.12

10% 4.68 6.52

5% 16.18 1.65

10% 26.33 4.91

n = 40

5% 1.45 2.87

n = 40

5% 11.51 3.90

n = 40

5% 2.79 1.37

10% 19.54 5.91

5% 10% 16.97 32.79 2.93 8.46 R θ test n = 120

10% 62.22 15.38

n = 200

5% 40.68 6.62

10% 10.67 3.42

5% 39.95 9.51

10% 55.08 18.17

n = 80

5% 4.18 1.16

5% 63.84 16.04

10% 77.27 27.10

5% 10% 7.87 18.22 1.20 3.96 R θ test n = 120

10% 35.06 5.93

5% 91.22 32.90

10% 96.23 48.29

n = 200

5% 19.04 2.38

5% 10% 5% 10% 5% 10% 32.37 48.74 57.56 73.69 89.84 96.31 14.04 25.27 26.34 41.47 57.24 73.52 Specification errors in the three submodels R µ test n = 80 n = 120 n = 200

n = 80

5% 8.58 2.02

10% 100.00 98.51

10% 69.53 10.23

5% 99.96 73.82

10% 99.99 84.58

n = 400

5% 51.47 5.06

n = 400

5% 99.96 95.66

n = 400

10% 93.91 34.44

n = 400

5% 85.22 20.09

Specification errors in the submodel for the mean R µ test n = 40 n = 80 n = 120 n = 200

Table 5. Nonnull rejection rates (%): neglected spatial correlation.

10% 100.00 82.89

10% 100.00 100.00

10% 99.01 29.30

5% 100.00 99.69

10% 100.00 99.85

n = 1000

5% 96.46 16.16

n = 1000

5% 100.00 100.00

n = 1000

5% 99.98 68.30

n = 1000 TESTING MODEL SPECIFICATION IN INFLATED BETA REGRESSIONS 18

ˆ †2 µ

ηˆ2

Added regressors LR score Wald LR score Wald

Tests 1% 39.00 42.07 32.49 17.93 20.34 14.06

5% 59.18 61.01 56.10 34.34 34.32 32.46

n = 40 10% 69.20 70.01 68.08 44.88 44.62 43.98

1% 69.76 71.67 66.77 39.61 40.24 36.79

5% 84.31 84.49 83.70 58.60 58.33 57.54

R µ test n = 80 10% 89.26 89.50 88.97 68.41 67.93 68.01

1% 86.65 87.95 85.63 57.33 56.58 56.12

5% 94.09 94.30 94.04 73.48 73.57 73.0

n = 120 10% 96.41 96.47 96.33 82.29 82.02 82.06

Table 6. Nonnull rejection rates (%): nonconstant dispersion.

1% 97.69 97.71 97.51 78.38 78.52 77.92

5% 99.27 99.29 99.30 91.48 91.20 91.39

n = 200 10% 99.66 99.65 99.66 95.0 94.82 95.02



20

neglected nonlinearity in the linear predictor, omitted variables, incorrect choice of link function and neglected spatial correlation. In the size simulations the response values are generated using logit links, β0 = −1.0, β1 = 3.5, γ0 = −1.5 and γ1 = 1.0. Under unaccounted nonlinearity, data generation used δ g (µt ) = β0 + β1 x t 1 , with δ = 1.8, β0 = 1.9 and β1 = −1.9. When misspecification takes place in both the submodels (for µ and α), we have, additionally, that δ h(αt ) = γ0 + γ1 z t 1 , where γ0 = 0.7 and γ1 = −0.7. Here, we shall use the following notation: θ 0 = (µ, α)> . Hence, R θ 0 denotes the test in which we augment the two submodels (µ and α). Under incorrect choice of link function, we generate data using the complementary loglog link function for µ and estimate the model assuming that the link is logit. Here, β0 = −1.5 and β1 = 2.5. When, in addition, the submodel for α is incorrectly specified, data generation uses the complementary log-log link and estimation is based on the logit link for that submodel as well. In that case, γ0 = −1.7 and γ1 = 1.0. Under omitted variable, the data generating process includes a second regressor, x t 2 , in the mean submodel. The parameter values are β0 = −0.1, β1 = 1.5 and β2 = −2.0. The estimated model does not include x t 2 . When misspecification reaches the submodel for α, data generation includes an additional regressor, z t 2 , which is not in the estimated model. Here, γ0 = −0.5, γ1 = 0.5 and γ2 = −1.5. Finally, we generate data with spatial correlation and estimate the regression parameters without accounting for such a correlation. Here, β0 = −1.0 and β1 = 3.5. When the submodel for α is also in error, we use γ0 = −1.5 and γ1 = 1.0. For brevity, we only report results for the score implementation of R µ test using ηˆ2 as testing variable and for the likelihood ratio implementation of R θ 0 based on (ηˆ2 , ηˆ2 ), i.e., when ηˆ2 is used to augment the submodel for µ and the submodel for α; these are the best performing tests. The Monte Carlo simulation results are presented in Table ??. We note that the size distortions of the two tests are small, with slight advantage for R µ . We also conclude that the tests are satisfactorily powerful, except under neglected spatial correlation, in which case the sample size must be large in order for the type II error probability to be small. Interestingly and unlike what we observed under variable dispersion, R µ is more powerful than R θ 0 when one fails to account for spatial correlation. (Recall that under variable dispersion the test we called R θ is the equivalent of R θ 0 here.) Indeed, under both misspecification only in the mean submodel and misspecification in the two submodels (µ and α), R µ typically outperforms R θ 0 . 5. An empirical application We shall now apply the proposed test to a real (not simulated) dataset. The data used refer to efficiency indices for municipalities in the state of São Paulo, Brazil. Pereira and CribariNeto (2009) estimated beta and inflated beta regressions with spatial effects to model such efficiency scores. They have also estimated linear regressions and considered inference via quasi-likelihood. Since the data contain ones (which correspond to fully efficient public administrations), the authors estimated beta regressions after replacing such observations by a slightly smaller value. The inflated beta regression model proved to be the best model for explaining the variation in efficiency scores. The data contain 427 municipalities of the

Tests score LR Tests score LR score LR Tests score LR score LR Tests score LR score LR Tests score LR score LR

Scenario R µ test R θ 0 test Scenario

R µ test with error in µ R θ 0 test with error in µ R µ test with error in µ and α R θ 0 test with error in µ and α Scenario



R µ test with error in µ R θ 0 test with error in µ R µ test with error in µ and α R θ 0 test with error in µ and α

Size n = 40 n = 80 Added regresors 5% 10% 5% 10% 5.68 11.07 5.37 10.57 ηˆ2 (ηˆ2 , ηˆ2 ) 6.96 13.31 5.70 11.01 Power: Nonlinearity n = 40 n = 80 Added regressors 5% 10% 5% 10% ηˆ2 89.19 94.19 99.58 99.90 (ηˆ2 , ηˆ2 ) 82.90 90.51 99.25 99.64 2 ηˆ 65.56 76.68 93.28 96.70 (ηˆ2 , ηˆ2 ) 58.86 71.40 90.32 94.89 Power: Incorrect Link Function n = 40 n = 80 Added regressors 5% 10% 5% 10% ηˆ2 58.00 70.63 88.06 93.46 (ηˆ2 , ηˆ2 ) 46.70 60.41 81.83 89.14 ηˆ2 58.24 70.83 88.33 93.34 (ηˆ2 , ηˆ2 ) 47.16 60.76 82.19 89.24 Power: Omitted Variable n = 40 n = 80 Added regressors 5% 10% 5% 10% ηˆ2 86.85 92.89 99.43 99.78 (ηˆ2 , ηˆ2 ) 79.27 88.73 98.54 99.50 ηˆ2 83.18 90.56 98.93 99.50 (ηˆ2 , ηˆ2 ) 75.07 85.22 97.81 99.21 Power: Neglected Spatial Correlation n = 40 n = 80 Added regressors 5% 10% 5% 10% 2 ˆ η 9.08 19.39 26.64 41.93 (ηˆ2 , ηˆ2 ) 6.19 13.24 17.53 30.90 ηˆ2 8.32 19.30 28.86 46.94 ηˆ2 , (ηˆ2 ) 6.63 13.73 22.24 37.41 n = 200 5% 10% 4.77 9.97 5.40 10.49 n = 200 5% 10% 100.00 100.00 100.00 100.00 99.98 99.99 99.98 99.99 n = 200 5% 10% 99.83 99.97 99.58 99.81 99.85 99.95 99.51 99.81 n = 200 5% 10% 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 n = 200 5% 10% 75.78 88.35 60.98 77.32 84.55 93.13 74.48 87.45

n = 120 5% 10% 5.16 10.08 5.45 11.01 n = 120 5% 10% 99.99 100.00 99.98 99.99 99.16 99.65 98.49 99.35 n = 120 5% 10% 97.30 98.92 94.50 97.34 97.38 99.00 84.87 94.61 n = 120 5% 10% 99.96 99.99 99.90 99.96 99.95 99.98 99.84 99.95 n = 120 5% 10% 44.75 63.45 31.59 48.15 51.51 71.24 39.70 57.68

Table 7. Null and nonnull rejection rates (%) under fixed dispersion.



22

Table 8. Continuous variables. Variable EFIC E1 E3 E4 E 16 E 20 E 21

Description Efficiency scores DEA-CCR Personnel expenses % heads earns up to 1 minimum wage Average earnings Updatedness of the real state register Demographic density Urbanization rate

Mean (standard error) 0.66 (0.19) 1.95e+07 (9.32e+07) 2.01 (1.84) 700.15 (232.37) 0.91 (0.12) 306.89 (1146.85) 82.45 (15.67)

Table 9. Dummy variables. Variable E6 E7 E8 E9 E 10 E 11 E 12 E 13 E 17 E 18 E 19 E 22 E 25 MT R2

Description PFL PMDB PSDB PT PPS PPB PTB PDT Participation in inter-municipal consortia Degree of computer utilization Decision power of municipal councils Alvorada Project Municipality age Touristy municipalities Royalties

Frequency of ones (%) 52 (12.18) 76 (17.80) 118 (27.63) 30 (7.02) 24 (5.62) 21 (4.92) 50 (11.71) 14 (3.28) 75 (17.56) 390 (91.33) 217 (50.82) 3 (0.70) 36 (8.43) 72 (16.86) 55 (12.88)

state of São Paulo for the year 2000. Tables ?? and ?? present a brief description of the variables along with some descriptive measures along some descriptive statistics. Note that all variables in Table ?? are of the dummy kind, i.e., they only assume values 0 and 1. Table ?? displays the maximum likelihood estimates of the parameters that index the inflated beta model; the logit link is used for µ and α and the log link is used for φ. The dependent variable are the efficiency scores of the São Paulo municipalities. Our goal here is to test whether the estimated model is correctly specified. To that end, we apply our misspecification test proposed to the inflated beta regression model estimated by Pereira and Cribari-Neto (2009). The score R µ test based on ηˆ2 and the likelihood ratio R θ test based on (ηˆ2 , ηˆ2 , ηˆ2 ) were used. The score R µ test statistic equals 4.94 and the likelihood ratio R θ test statistic equals 6.70. Therefore, correct model specification is not rejected at the 1% signficance level when we use R µ and at the 5% nominal level when inference is based on R θ . 6. Concluding remarks In this paper we proposed a general misspecification test for inflated beta regressions. In particular, we proposed two variants of the test. In the first variant we only augment the


23

Table 10. Parameter estimates of the variable dispersion inflated beta model using the efficiency scores dataset.

Variable constant W1EFIC E4 E9 E21

Submodel for µ Estimate Standard error 0.1403 0.2012 0.0142 0.0080 −0.0007 0.0001 0.2776 0.1495 0.0095 0.0026

Variable constant E1 E3 E4 E17 E18

Submodel for α Estimate Standard error 7.774e-01 1.271e+00 9.363e-09 4.915e-09 −5.244e-01 2.267e-01 −2.835e-03 1.511e-03 1.357e+00 4.296e-01 −1.068e+00 5.248e-01

Variable constant W1EFIC E4 E18

Submodel for φ Estimate Standard error 2.0711 0.3028 −0.0384 0.0139 0.0012 0.0003 −0.7416 0.2538

mean submodel whereas in the second one all three submodels are augmented with testing variables. The null hypothesis that the model is correctly specified is rejected whenever the added variables noticeably improve the quality of the model fit. We have considered different testing strategies and compared their finite-sample performances using Monte Carlo simulation. We performed simulations under both the null (correct specification) and nonnull (misspecification) hypotheses. Different sources of model misspecification were considered, namely: neglected nonlinearity, incorrect choice of link function, omitted regressors, neglected variable dispersion, and neglected spatial correlation. The results showed that the proposed test has good power in small to moderate sample sizes, except when the practitioner fails to account for spatial correlation, in which case large sample sizes are needed to achieve small type II error probabilities. The numerical evicence has also shown that the simpler of the two tests, i.e., that in which only the mean submodel is augmented, is typically the best performing one. This test is carried out using the squared values of the fitted linear predictor as testing variable. We believe the proposed test can be quite useful in practical situations and we recommend that practitioners that model data using inflated beta regressions use it to assess whether their models are correctly specified.

Acknowlegements FCN gratefully acknowledges financial support from CNPq.


24

References [1] Cribari-Neto, F; Lima, L. B. (2007). A misspecification test for beta regressions. Working paper, Universidade Federal de Pernambuco. [2] Ferrari, S. L. P.; Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799–815. [3] Hausman, J. A. (1978). Specification tests in econometrics Econometrica, 46, 1251–1271. [4] Ospina, R.; Ferrari, S. L. P. (2010a). Inflated beta distributions. Statistical Papers, 51, 111–126. [5] Ospina, R.; Ferrari, S. L. P. (2010b). Inflated beta regression models. Working paper, Universidade Federal de Pernambuco and Universidade de São Paulo. [6] Pereira, T. L.; Cribari-Neto, F (2009). Explaining efficiency scores of Brazilian municipalities. Working paper, Universidade Federal de Pernambuco. [7] Ramsey, J. B. (1969). Tests for specification erros in classical linear least squares regression analysis. Journal of the Royal Statistical Society B, 31, 350–371. [8] Ramsey, J. B.; Schmidt, P. (1976). Some further results on the use of OLS and BLUS residuals in specification error tests. Journal of the American Statistical Association, 71, 389–390. Departamento de Estatística, Universidade Federal da Paraíba, Cidade Universitária, João Pessoa/PB, 58059– 900, Brazil E-mail address, T.L. Pereira: [email protected] Departamento de Estatística, Universidade Federal de Pernambuco, Cidade Universitária, Recife/PE, 50740– 540, Brazil E-mail address, F. Cribari-Neto: [email protected] URL, F. Cribari-Neto: http://www.de.ufpe.br/~cribari

A TEST FOR CORRECT MODEL SPECIFICATION IN INFLATED ...

A TEST FOR CORRECT MODEL SPECIFICATION IN INFLATED ...

Suggest Documents

A test for model specification of diffusion processes - Project Euclid

Test Specification

A hierarchical model for zero-inflated biomass data ...

A spatial zero-inflated poisson regression model for oak regeneration

A Poisson-Gamma Model for Zero Inflated Rainfall Data

A Joint Specification Test for Response Probabilities in ... - MDPI

A uniform model for the specification, representation

A DSL for EER Data Model Specification

Comparison of Test Statistic for Zero-Inflated Negative ...

SPES-3 Test Specification

Test Generation for Subtractive Specification Errors - CiteSeerX

Java specification Extension for Automated Test Development

Hahn-Hausman Test as a Specification Test - Semantic Scholar

A Provably Correct Compiler for Efficient Model Checking of Mobile ...

A zero-inflated Poisson model with correlated ...

A Pareto scale-inflated outlier model and its Bayesian analysis

A Note on the Characterization of Zero-Inflated Poisson Model

A simple zero inflated bivariate negative binomial regression model ...

A spatial zero-inflated poisson regression model ... - Purdue University

A Consistent LM Type Specification Test for Semiparametric

ConData: A Tool for Automating Specification-Based Test Case ...

ConData: A Tool for Automating Specification-Based Test Case ...

Model-free forecasting outperforms the correct mechanistic model for ...

A specification test for nonlinear nonstationary models - arXiv