Monografías del Seminario Matemático García de Galdeano 33, 229–236 (2006)

AN MM-ALGORITHM FOR A CLASS OF OVERDISPERSED REGRESSION MODELS

S. Dossou-Gbété, C. Demétrio and C. C. Kokonendji

Abstract. The aim of this paper is to provide an algorithm for computing the regression parameter estimates in the framework of generalized linear models for count data. The regression parameters are estimated through the minimization of a quasi-likelihood, and the main feature of the algorithm, which relies on the MM method, is that it does not resort to matrix inversion, as the Newton-Raphson algorithm and the Fisher-scoring method do.

Keywords: Count data, exponential dispersion models, Hinde-Demétrio models, generalized linear model, minimization, quasi-likelihood, auxiliary function, MM-algorithm.

§1. Introduction

The Poisson model is a linchpin of the count data statistical modelling toolkit. If the counts are observed along with covariates, generalized linear models are at the core of modelling the expected counts as functions of the covariates. The main property of the Poisson distributions is the equality of their means and their variances; this property characterizes the Poisson family among the exponential families of count distributions. It frequently occurs, however, that Poisson models cannot handle the variability exhibited by the data at hand. There are several ways to overcome this lack of fit of the Poisson models. One of them is to use negative binomial models when there is evidence of overdispersion. The negative binomial model is generally stated by postulating that the distribution variance σ² is a quadratic function of its mean m, expressed as σ² = m + (1/φ)m². An alternative way to handle the negative binomial model is through a linear relationship between the variance and the mean, σ² = ϕµ, where ϕ is the Fisher index of dispersion [2, 7]. We consider in this paper an alternative to the negative binomial models for the case where a quadratic function of the mean is not compatible with the variability shown by the data.

§2. Hinde-Demétrio models for overdispersed count data

2.1. The Hinde-Demétrio models for count data ([5])

Let us consider the function V_p defined as µ ↦ V_p(µ) = µ + µ^p and indexed by p ≥ 1. It is shown in [9] that for each p ≥ 1, the function V_p is the unit variance function of an additive exponential dispersion model ([8]), named the Hinde-Demétrio model in [9], with support S_p = ℕ + pℕ. A probability distribution belonging to a Hinde-Demétrio model indexed by some p is characterized by its mean value m > 0 and a dispersion parameter φ, in such a way that its variance σ² is linked to the mean by the relation

\[
\sigma^2 = m + \varphi^{1-p}\, m^{p} = \varphi\, V_p\!\left(\frac{m}{\varphi}\right). \tag{1}
\]
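For illustration, the two equivalent forms of relation (1) can be checked numerically; the following minimal sketch (ours, not part of the paper) uses hypothetical function names:

```python
import numpy as np

def unit_variance(mu, p):
    """Unit variance function V_p(mu) = mu + mu^p of a Hinde-Demetrio model."""
    return mu + mu ** p

def hd_variance(m, phi, p):
    """Variance of a Hinde-Demetrio distribution with mean m and dispersion phi,
    written in the two equivalent forms of relation (1)."""
    direct = m + phi ** (1.0 - p) * m ** p
    scaled = phi * unit_variance(m / phi, p)
    assert np.isclose(direct, scaled)  # sanity check of the identity in (1)
    return direct
```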

2.2. The quasi-likelihood for Hinde-Demétrio models

The quasi-likelihood of the mean m and the dispersion parameter φ with respect to a distribution belonging to a Hinde-Demétrio model indexed by p is defined as

\[
Q_p(m, \varphi \mid y) = -\int_y^m \frac{y - u}{u + \varphi^{1-p} u^{p}}\, du.
\]

The change of variable u = φv results in the equation below:

\[
Q_p(m, \varphi \mid y) = -\varphi \int_{y/\varphi}^{m/\varphi} \frac{y/\varphi - v}{\varphi\, V_p(v)}\, dv = -\int_{y/\varphi}^{m/\varphi} \frac{y/\varphi - v}{v + v^{p}}\, dv.
\]

Simple algebra then yields

\[
Q_p(m, \varphi \mid y) = -\frac{y}{\varphi} \int_{y/\varphi}^{m/\varphi} \frac{1}{v + v^{p}}\, dv + \varphi \int_{y/\varphi}^{m/\varphi} \frac{1}{1 + v^{p-1}}\, dv
\]
\[
= -\frac{y}{\varphi} \left\{ \ln\!\left(\frac{m + \varphi^{1-p} m^{p}}{y + \varphi^{1-p} y^{p}}\right) - \frac{p}{p-1}\, \ln\!\left(\frac{\varphi^{p-1} + m^{p-1}}{\varphi^{p-1} + y^{p-1}}\right) \right\} + \varphi \int_{y/\varphi}^{m/\varphi} \frac{1}{1 + v^{p-1}}\, dv.
\]

This result allows us to state the following:

Lemma 1.
\[
Q_p(m, \varphi \mid y) = -\frac{y}{\varphi} \ln\!\left[\frac{m\,\big(1 + \varphi^{1-p} m^{p-1}\big)^{-\frac{1}{p-1}}}{y\,\big(1 + \varphi^{1-p} y^{p-1}\big)^{-\frac{1}{p-1}}}\right] + \varphi \int_{y/\varphi}^{m/\varphi} \frac{1}{1 + v^{p-1}}\, dv.
\]

Moreover, Q_p is a convex positive function on the means domain of the Hinde-Demétrio models, and for any m̂ in that domain

\[
Q_p(m, \varphi \mid y) = Q_p(\hat m, \varphi \mid y) - \frac{y}{\varphi} \ln\!\left[\frac{m\,\big(1 + \varphi^{1-p} m^{p-1}\big)^{-\frac{1}{p-1}}}{\hat m\,\big(1 + \varphi^{1-p} \hat m^{p-1}\big)^{-\frac{1}{p-1}}}\right] + \varphi \int_{\hat m/\varphi}^{m/\varphi} \frac{1}{1 + v^{p-1}}\, dv.
\]

By noticing that 1/(1 + v^{p−1}) ≤ 1/v^{p−1} for v > 0 and that x ↦ ln(x) is a concave function on ]0, +∞[, one can readily show the lemma below.

Lemma 2. Let p ≥ 2. For any m and m̂ belonging to the means domain of the Hinde-Demétrio models,
\[
Q_p(m, \varphi \mid y) \le Q_p(\hat m, \varphi \mid y) - \frac{y}{\varphi} \ln\!\left(\frac{m}{\hat m}\right) + \frac{y\, \varphi^{-p}\left(m^{p-1} - \hat m^{p-1}\right)}{(p-1)\left(1 + \varphi^{1-p} \hat m^{p-1}\right)} + \varphi^{1-p}\, \frac{m^{2-p} - \hat m^{2-p}}{2 - p}.
\]
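As a quick numerical illustration (ours, not the paper's), the defining integral of Q_p can be evaluated directly by quadrature:

```python
import numpy as np
from scipy.integrate import quad

def Qp(m, phi, p, y):
    """Quasi-likelihood Q_p(m, phi | y) = -int_y^m (y-u)/(u + phi^(1-p) u^p) du,
    evaluated numerically from the definition in Section 2.2."""
    integrand = lambda u: (y - u) / (u + phi ** (1.0 - p) * u ** p)
    value, _ = quad(integrand, y, m)
    return -value

# Q_p is nonnegative and vanishes at m = y, consistent with Lemma 1.
print(Qp(m=3.0, phi=1.5, p=2.5, y=2.0))  # > 0
print(Qp(m=2.0, phi=1.5, p=2.5, y=2.0))  # 0.0
```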


§3. Estimation method for the generalized linear models

3.1. Generalized linear regression models

Let y_i, i = 1:n, be count data, where each response y_i is observed along with a vector x_i = (x_ij)_{j=1:k} of k covariate values. We assume that the responses are realisations of independent random variables distributed according to distributions from the same Hinde-Demétrio model, with known index parameter p and unknown dispersion parameter φ. Conditionally on the covariate vector x_i, the mean of the distribution underlying the response y_i is related to x_i through a function m(x_i, β), where β is a vector of k unknown parameters. In the generalized linear models framework it is assumed that there is an injective function g, defined on the means domain of the distributions underlying the responses y_i, such that

\[
g\big(m(x_i, \beta)\big) = \sum_{j=1}^{k} x_{ij} \beta_j = {}^t x_i \beta.
\]

Since the responses y_i are modelled by means of distributions belonging to a Hinde-Demétrio family indexed by a fixed p, with the same dispersion parameter φ, the quasi-likelihood of the unknown model parameters φ and β_j, j = 1:k, with respect to the data (y_i, x_i), i = 1:n, is

\[
Q_p\big(\beta, \varphi \mid \{(y_i, x_i),\ i = 1\,{:}\,n\}\big) = \sum_{i=1}^{n} Q_p\big(m(x_i, \beta), \varphi \mid y_i\big).
\]

For an exponential dispersion family of distributions, the minimization of the quasi-likelihood amounts to the maximization of the log-likelihood. As a consequence of the above, the quasi-likelihood can be computed without explicit knowledge of the cumulant function of the Hinde-Demétrio exponential dispersion models. A tentative algorithm for the estimation of the means m_i and the dispersion parameter φ is then as follows:

Algorithm 1. General procedure for regression parameter estimation.
Repeat until convergence within a numerical tolerance:

1. Hold φ fixed. Minimize Q_p(β, φ | {(y_i, x_i), i = 1:n}) with respect to β.

2. Solve the moment equation
\[
\sum_{i=1}^{n} \frac{\big(y_i - m(x_i, \beta)\big)^2}{m(x_i, \beta) + \varphi^{1-p}\big[m(x_i, \beta)\big]^{p}} = \text{degrees of freedom}
\]
with respect to φ.

End(Repeat)
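Step 2 is a one-dimensional root-finding problem in φ. A minimal sketch (ours; the bracketing interval is an assumption and must contain a sign change):

```python
import numpy as np
from scipy.optimize import brentq

def solve_dispersion(y, m, p, dof, phi_lo=1e-6, phi_hi=1e6):
    """Step 2 of Algorithm 1: find phi so that the Pearson-type statistic
    sum_i (y_i - m_i)^2 / (m_i + phi^(1-p) * m_i^p) equals the degrees of
    freedom, with the fitted means m held fixed."""
    def moment_gap(phi):
        return np.sum((y - m) ** 2 / (m + phi ** (1.0 - p) * m ** p)) - dof
    # brentq requires a bracket [phi_lo, phi_hi] on which moment_gap changes sign.
    return brentq(moment_gap, phi_lo, phi_hi)
```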

Using a surrogate objective function in place of the actual function to be optimized proves to be a convenient way of designing algorithms that are efficient and easy to implement ([1]). Such surrogate objective functions are sometimes called auxiliary functions ([4]). Since auxiliary functions play a key role in solving the minimization of the quasi-likelihoods dealt with in this paper, let us introduce two basic properties of such functions.


3.2. Introducing function optimization with auxiliary functions

Definition 1. Let L denote a function on a domain X ⊂ R^p. An auxiliary function for L is a function A defined on X × X for which the following properties hold: (i) ∀x, x′ ∈ X, L(x) − L(x′) ≤ A(x, x′); (ii) ∀x ∈ X, A(x, x) = 0.

Although auxiliary functions can be defined in many different ways, what should be borne in mind is that they define pointwise upper bounds on the gap between two values of the function L. Given a value x′, if one can find a value x such that A(x, x′) ≤ 0, then L(x) ≤ L(x′), and a sequence can be carried out, a cluster point of which might be a stationary point of the function L.

Lemma 3. Let A be an auxiliary function for L and x̂ ∈ X. If A(x, x̂) ≥ 0 for all x ∈ X and A is differentiable in a neighbourhood of x̂, then ∂A/∂x(x, x̂)|_{x=x̂} = 0.

Proof. Since A(x̂, x̂) = 0 and, ∀x ∈ X, A(x, x̂) ≥ 0, then x̂ = arg min{A(x, x̂), x ∈ X} and the lemma holds.

Proposition 4. Let A be an auxiliary function for the function L and let (x_t)_{t∈N} denote a sequence such that x_{t+1} = arg min{A(x, x_t), x ∈ X}. The sequence {L(x_t)}_{t∈N} is non-increasing. Furthermore, if A is differentiable in a neighbourhood of a cluster point x̂ of the sequence (x_t)_{t∈N}, then ∂A/∂x(x, x̂)|_{x=x̂} = 0.

Proof. Since 0 = A(x_t, x_t) ≥ A(x_{t+1}, x_t) ≥ L(x_{t+1}) − L(x_t), the sequence {L(x_t)}_{t∈N} is non-increasing. Let x̂ be a cluster point of the sequence (x_t)_{t∈N}; then A(x, x_t) ≥ A(x_{t+1}, x_t) ≥ L(x_{t+1}) − L(x_t). Taking limits and applying the continuity of L and of A_x(u) = A(x, u) leads to A(x, x̂) ≥ 0 and x̂ = arg min{A(x, x̂), x ∈ X}.

Corollary 5. Let (x_t)_{t∈N} denote a sequence such that x_{t+1} = arg min{A(x, x_t), x ∈ X}. If A is differentiable and
\[
\frac{\partial A}{\partial x}(x, x_0)\Big|_{x = x_0} = 0 \implies \frac{\partial L}{\partial x}(x)\Big|_{x = x_0} = 0,
\]
then any cluster point x̂ of (x_t)_{t∈N} is a stationary point of the function L.

Proof. The statement follows readily from the fact that ∂A/∂x(x, x̂)|_{x=x̂} = 0 implies ∂L/∂x(x)|_{x=x̂} = 0.

As a consequence of the above corollary, x_{t+1} = arg min{A(x, x_t), x ∈ X} might give an appropriate update rule as the core of a minimization algorithm for the function L. Note that Proposition 4 remains true, as does Corollary 5, if one considers a sequence (x_t)_{t∈N} such that A(x_{t+1}, x_t) ≤ 0 instead of x_{t+1} = arg min{A(x, x_t), x ∈ X}.
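A generic sketch of the resulting MM scheme (ours; `argmin_A` is a hypothetical user-supplied map x_t ↦ arg min_x A(x, x_t)):

```python
def mm_minimize(L, argmin_A, x0, tol=1e-8, max_iter=500):
    """Generic MM descent of Proposition 4: iterating
    x_{t+1} = argmin_x A(x, x_t) makes L(x_t) non-increasing."""
    x = x0
    for _ in range(max_iter):
        x_new = argmin_A(x)
        if abs(L(x) - L(x_new)) < tol:  # descent has stalled
            return x_new
        x = x_new
    return x
```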


3.3. Generalized linear regression with concave link function and Hinde-Demétrio models

Let us assume the link function g is concave with values in R. Then its inverse h = g^{−1} is a convex function, and it follows that the functions β ↦ [h(ᵗxβ)]^{p−1}/(p−1) and β ↦ −ln(h(ᵗxβ)) are convex functions on the parameter domain R^k.

Lemma 6. Let (α_j)_{j=1:k} be a vector in R^k with positive entries such that Σ_{j=1}^k α_j = 1, and let β̂ = (β̂_j)_{j=1:k} ∈ R^k. Then

\[
\frac{\big[h({}^t x \beta)\big]^{p-1}}{p-1} = \frac{1}{p-1}\left[h\!\left(\sum_{j=1}^{k} \alpha_j \left\{\frac{x_j}{\alpha_j}\big(\beta_j - \hat\beta_j\big) + {}^t x \hat\beta\right\}\right)\right]^{p-1} \le \sum_{j=1}^{k} \frac{\alpha_j}{p-1}\left[h\!\left(\frac{x_j}{\alpha_j}\big(\beta_j - \hat\beta_j\big) + {}^t x \hat\beta\right)\right]^{p-1},
\]

\[
-\ln\big(h({}^t x \beta)\big) = -\ln\left[h\!\left(\sum_{j=1}^{k} \alpha_j \left\{\frac{x_j}{\alpha_j}\big(\beta_j - \hat\beta_j\big) + {}^t x \hat\beta\right\}\right)\right] \le -\sum_{j=1}^{k} \alpha_j \ln h\!\left(\frac{x_j}{\alpha_j}\big(\beta_j - \hat\beta_j\big) + {}^t x \hat\beta\right).
\]

By invoking the concavity of the function v ↦ v^{2−p}/(2−p), the forthcoming lemma is straightforward.

Lemma 7.
\[
Q_p\big(\beta, \varphi \mid \{(y_i, x_i),\ i = 1\,{:}\,n\}\big) \le Q_p\big(\hat\beta, \varphi \mid \{(y_i, x_i),\ i = 1\,{:}\,n\}\big) + \sum_{i=1}^{n} A_{p,i}\big(\beta, \hat\beta\big),
\]
where
\[
\begin{aligned}
A_{p,i}\big(\beta, \hat\beta\big) ={}& -\frac{y_i}{\varphi} \sum_{j=1}^{k} \alpha_{ij} \ln h\!\left(\frac{x_{ij}}{\alpha_{ij}}\big(\beta_j - \hat\beta_j\big) + {}^t x_i \hat\beta\right) + \frac{y_i}{\varphi} \ln m\big(x_i, \hat\beta\big) \\
&+ \frac{y_i \varphi^{-p}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}} \sum_{j=1}^{k} \frac{\alpha_{ij}}{p-1} \left[h\!\left(\frac{x_{ij}}{\alpha_{ij}}\big(\beta_j - \hat\beta_j\big) + {}^t x_i \hat\beta\right)\right]^{p-1} \\
&- \frac{y_i \varphi^{-p}\big[m(x_i, \hat\beta)\big]^{p-1}}{(p-1)\left(1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}\right)} + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{1-p} \sum_{j=1}^{k} x_{ij}\big(\beta_j - \hat\beta_j\big).
\end{aligned}
\]

Proposition 8. A_p(β, β̂) = Σ_{i=1}^n A_{p,i}(β, β̂) is an auxiliary function for the minimization of Q_p(β, φ).

One can notice that the parameters β_j, j = 1:k, are separated in the function A_p. This makes the minimization of A_p with respect to β easy to carry out, without matrix inversion. The above suggests that the regression parameters β ∈ R^k and φ > 0 could be computed by running the following algorithm:

Algorithm 2. Algorithm for generalized linear regression parameter estimation in the case of Hinde-Demétrio models.
Repeat until convergence within a numerical tolerance:

1. Hold φ fixed. Repeat until convergence within a numerical tolerance:
   Solve ∂A_p/∂β_j(β, β̂) = 0, j = 1:k.
   End(Repeat)

2. Solve the moment equation
\[
\sum_{i=1}^{n} \frac{\big(y_i - h({}^t x_i \hat\beta)\big)^2}{h({}^t x_i \hat\beta) + \varphi^{1-p}\big[h({}^t x_i \hat\beta)\big]^{p}} = \text{degrees of freedom}
\]
with respect to φ.

End(Repeat)

3.3.1. Solving the system of equations ∂A_p/∂β_j(β, β̂) = 0, j = 1:k

One can easily show that
\[
\frac{\partial A_p}{\partial \beta_j}\big(\beta, \hat\beta\big)\Big|_{\beta = \hat\beta} = -\frac{1}{\varphi} \sum_{i=1}^{n} y_i x_{ij} \frac{h'({}^t x_i \hat\beta)}{h({}^t x_i \hat\beta)} + \sum_{i=1}^{n} \frac{y_i \varphi^{-p} x_{ij}\, h'({}^t x_i \hat\beta) \big[h({}^t x_i \hat\beta)\big]^{p-2}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}} + \varphi^{1-p} \sum_{i=1}^{n} \big[m(x_i, \hat\beta)\big]^{1-p} x_{ij}
\]
and
\[
\begin{aligned}
\frac{\partial^2 A_p}{\partial \beta_j^2}\big(\beta, \hat\beta\big)\Big|_{\beta = \hat\beta} ={}& -\frac{1}{\varphi} \sum_{i=1}^{n} \frac{y_i x_{ij}^2}{\alpha_{ij}} \frac{h''({}^t x_i \hat\beta)}{h({}^t x_i \hat\beta)} + \frac{1}{\varphi} \sum_{i=1}^{n} \frac{y_i x_{ij}^2}{\alpha_{ij}} \frac{\big[h'({}^t x_i \hat\beta)\big]^2}{\big[h({}^t x_i \hat\beta)\big]^2} \\
&+ \sum_{i=1}^{n} \frac{y_i \varphi^{-p} x_{ij}^2}{\alpha_{ij}} \frac{h''({}^t x_i \hat\beta) \big[h({}^t x_i \hat\beta)\big]^{p-2}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}} + (p-2) \sum_{i=1}^{n} \frac{y_i \varphi^{-p} x_{ij}^2}{\alpha_{ij}} \frac{\big[h'({}^t x_i \hat\beta)\big]^2 \big[h({}^t x_i \hat\beta)\big]^{p-3}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}}.
\end{aligned}
\]


One can solve the equation ∂A_p/∂β_j(β, β̂) = 0 by using the following update rule:
\[
\hat\beta_j^{\text{new}} = \hat\beta_j - \frac{\dfrac{\partial A_p}{\partial \beta_j}\big(\beta, \hat\beta\big)\Big|_{\beta=\hat\beta}}{\dfrac{\partial^2 A_p}{\partial \beta_j^2}\big(\beta, \hat\beta\big)\Big|_{\beta=\hat\beta}}.
\]

3.3.2. Log-linear regression with Hinde-Demétrio models

In this section we focus on the log link function. Then
\[
\frac{\partial A_p}{\partial \beta_j}\big(\beta, \hat\beta\big)\Big|_{\beta=\hat\beta} = -\frac{1}{\varphi} \sum_{i=1}^{n} y_i x_{ij} + \varphi^{-p} \sum_{i=1}^{n} \frac{y_i x_{ij} \big[m(x_i, \hat\beta)\big]^{p-1}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}} + \varphi^{1-p} \sum_{i=1}^{n} \big[m(x_i, \hat\beta)\big]^{1-p} x_{ij}
\]
and
\[
\frac{\partial^2 A_p}{\partial \beta_j^2}\big(\beta, \hat\beta\big)\Big|_{\beta=\hat\beta} = (p-1)\,\varphi^{-p} \sum_{i=1}^{n} \frac{y_i x_{ij}^2}{\alpha_{ij}} \frac{\big[m(x_i, \hat\beta)\big]^{p-1}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}}.
\]
The computation of the regression parameters is achieved by means of the algorithm below:

Algorithm 3. Algorithm for log-linear regression with Hinde-Demétrio models.
Repeat until convergence within a numerical tolerance:

1. Hold φ fixed. Repeat until convergence within a numerical tolerance: for j = 1:k,
\[
\hat\beta_j^{\text{new}} = \hat\beta_j - \frac{-\dfrac{1}{\varphi} \displaystyle\sum_{i=1}^{n} y_i x_{ij} + \varphi^{-p} \displaystyle\sum_{i=1}^{n} \dfrac{y_i x_{ij} \big[m(x_i, \hat\beta)\big]^{p-1}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}} + \varphi^{1-p} \displaystyle\sum_{i=1}^{n} \big[m(x_i, \hat\beta)\big]^{1-p} x_{ij}}{(p-1)\,\varphi^{-p} \displaystyle\sum_{i=1}^{n} \dfrac{y_i x_{ij}^2}{\alpha_{ij}} \dfrac{\big[m(x_i, \hat\beta)\big]^{p-1}}{1 + \varphi^{1-p}\big[m(x_i, \hat\beta)\big]^{p-1}}}.
\]
End(Repeat)

2. Solve the moment equation
\[
\sum_{i=1}^{n} \frac{\big(y_i - \exp({}^t x_i \hat\beta)\big)^2}{\exp({}^t x_i \hat\beta) + \varphi^{1-p}\big[\exp({}^t x_i \hat\beta)\big]^{p}} = \text{degrees of freedom}
\]
with respect to φ.

End(Repeat)
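A minimal NumPy sketch (ours) of the inner loop of Algorithm 3, with φ held fixed. The uniform weights α_ij = 1/k are one admissible choice (positive, summing to one over j), and a single simultaneous Newton step per sweep stands in for the inner "repeat until convergence":

```python
import numpy as np

def hd_loglinear_mm(y, X, p, phi, beta0=None, n_sweeps=200, tol=1e-8):
    """MM updates for log-linear Hinde-Demetrio regression (step 1 of
    Algorithm 3), dispersion phi held fixed. Since A_p separates across
    the coordinates beta_j, all k coordinates are updated from the same
    current iterate, with no matrix inversion."""
    n, k = X.shape
    beta = np.zeros(k) if beta0 is None else beta0.astype(float).copy()
    alpha = np.full((n, k), 1.0 / k)  # positive weights, rows sum to one
    for _ in range(n_sweeps):
        m = np.exp(X @ beta)          # fitted means m(x_i, beta_hat)
        r = m ** (p - 1) / (1.0 + phi ** (1.0 - p) * m ** (p - 1))
        grad = (-(X * y[:, None]).sum(axis=0) / phi
                + phi ** (-p) * (X * (y * r)[:, None]).sum(axis=0)
                + phi ** (1.0 - p) * (X * (m ** (1.0 - p))[:, None]).sum(axis=0))
        hess = (p - 1) * phi ** (-p) * ((X ** 2 / alpha) * (y * r)[:, None]).sum(axis=0)
        step = grad / hess            # coordinate-wise Newton step on A_p
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Step 2 can then reuse a scalar root-finder such as the `solve_dispersion` sketch above, with `m = np.exp(X @ beta)`, alternating the two steps until both β and φ stabilize.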


§4. Concluding remarks

The use of surrogate objective functions for the optimization of loss functions is gaining interest in computational statistics, extending the EM algorithms that are very popular for missing data problems. This paper aims to emphasize this trend by proposing computation schemes, in the framework of the regression analysis of count data, that help avoid the computational burden of regression parameter estimation. This preliminary work should be completed by a study of the rates of convergence of the algorithms and of the statistical performance of the parameter estimates.

References

[1] Becker, M. P., Yang, I., and Lange, K. EM algorithms without missing data. Statistical Methods in Medical Research 6 (1997), 38–54.

[2] Cameron, A. C., and Trivedi, P. K. Regression Analysis of Count Data. Cambridge University Press, 1998.

[3] de Leeuw, J., and Michailidis, G. Block relaxation algorithms in statistics. http://www.stat.ucla.edu/deleew/block.pdf (1998).

[4] Della Pietra, V., et al. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 4 (1997), 380–393.

[5] Hinde, J., and Demétrio, C. G. B. Overdispersion: Models and Estimation. Associação Brasileira de Estatística, São Paulo, 1998.

[6] Hunter, D. R., and Lange, K. A tutorial on MM algorithms. The American Statistician 58, 1 (2004), 30–37.

[7] Jansakul, N., and Hinde, J. P. Linear mean-variance negative binomial models for analysis of orange tissue-culture data. Songklanakarin J. Sci. Technol. 26, 5 (2004), 683–696.

[8] Jørgensen, B. Exponential Dispersion Models. Chapman & Hall, London, 1997.

[9] Kokonendji, C. C., Demétrio, C. G. B., and Dossou-Gbété, S. Some discrete exponential dispersion models: Poisson-Tweedie and Hinde-Demétrio classes. SORT 28, 2 (2004).
