Likelihood based inference for power distributions
Arthur Pewsey · Héctor W. Gómez · Heleno Bolfarine
Abstract This paper considers likelihood based inference for the family of power distributions. Widely applicable results are presented which can be used to conduct inference for all three parameters of the general location-scale extension of the family. More specific results are given for the special case of the power normal model. The analysis of a large data set, formed from density measurements for a certain type of pollen, illustrates the application of the family and the results for likelihood based inference. Throughout, comparisons are made with analogous results for the direct parametrisation of the skew-normal distribution.
Keywords Generalized Gaussian distribution · Kurtosis · Lehmann alternatives · Power normal model · Skew-normal distribution · Skewness
Mathematics Subject Classification (2000) 60E05 · 62F10 · 62F12
Arthur Pewsey
Department of Mathematics, Escuela Politécnica, University of Extremadura, Avenida de la Universidad s/n, 10003 Cáceres, Spain
Héctor W. Gómez
Mathematics Department, Faculty of Basic Sciences, University of Antofagasta, Antofagasta, Chile
Heleno Bolfarine
Statistics Department, IME, University of Sao Paulo, Brazil

1 Introduction
In recent years there has been considerable interest within the statistical literature in flexible families of distributions capable of modelling
varying degrees of skewness and kurtosis. Noteworthy constructions include the perturbation/conditioning approach of Azzalini (1985), the two-piece construction revisited by Fernández and Steel (1998) and the transformation based approach of Jones and Pewsey (2009). Much of this renewed interest can be traced back to the first of these papers, in which a general approach to obtaining the density of a skew-symmetric distribution, based on the perturbation of a base symmetric density, was introduced. Such densities have the form
\[ \varphi_{fG}(z;\lambda) = 2f(z)G(\lambda z), \qquad z, \lambda \in \mathbb{R}, \tag{1} \]
where $f$ is a density which is symmetric about zero, $G$ is an absolutely continuous distribution function which is also symmetric about zero, and $\lambda$ is an asymmetry parameter. For the case in which $f$ is the standard normal density, $\phi$, and $G$ the standard normal distribution function, $\Phi$, the so-called skew-normal distribution with density
\[ \varphi_{\phi\Phi}(z;\lambda) = 2\phi(z)\Phi(\lambda z), \qquad z, \lambda \in \mathbb{R}, \tag{2} \]
is obtained. We will use the notation $Z \sim SN(\lambda)$ to denote that the random variable $Z$ has this density. The skew-normal distribution has been studied in detail by Azzalini (1985, 1986), Henze (1986), Arnold et al. (1993), Chiogna (1997) and Pewsey (2000), amongst others. Various important issues associated with the more general family with density (1) are considered in Pewsey (2006).
Some considerable time prior to these developments, Lehmann (1953) proposed a family of distributions with distribution function
\[ F_F(z;\alpha) = \{F(z)\}^{\alpha}, \qquad z \in \mathbb{R}, \tag{3} \]
where $F$ is itself a distribution function and $\alpha$ is an integer or rational number. Given its role in this construction, we will refer to $F$ as the generating distribution function. Clearly, when $\alpha$ is an integer, (3) can be thought of as the distribution function of the largest value in a sample of size $\alpha$. More generically, (3) can be viewed as defining a new family of distributions through the power transformation of a generating distribution function. Distributions with distribution function (3) are referred to in the literature as being Lehmann alternatives. Lehmann (1953) specifically referred to the cases of (3) obtained using the distribution functions of the standard normal, standard exponential and U(0,1) distributions.
Without any reference to Lehmann (1953), Durrans (1992) extended the definition of (3) by allowing $\alpha \in \mathbb{R}^+$, referring to the resulting distributions as being those of fractional order statistics. Assuming $F$ to be absolutely continuous with density $f = dF$ in this extension of (3), the density of a random variable, $Z$, from such a distribution is trivially
\[ \varphi_F(z;\alpha) = \alpha f(z)\{F(z)\}^{\alpha-1}, \qquad z \in \mathbb{R},\ \alpha \in \mathbb{R}^+. \tag{4} \]
Employing the more generally accepted terminology, we will refer to $Z$ as having a power distribution and write $Z \sim P_F(\alpha)$ to denote the fact.
Durrans (1992) considered the case of (4) in which $F$ is the distribution function of the standard normal distribution, $\Phi$, referring to the resulting distribution, $P_\Phi(\alpha)$, with density
\[ \varphi_\Phi(z;\alpha) = \alpha\phi(z)\{\Phi(z)\}^{\alpha-1}, \qquad z \in \mathbb{R},\ \alpha \in \mathbb{R}^+, \tag{5} \]
as being generalized Gaussian. Gupta and Gupta (2008) also considered the $P_\Phi(\alpha)$ distribution in some detail. Although they referred to Lehmann (1953) (in fact, incorrectly, to Lehman (1953)), they made no mention of the work of Durrans (1992). They referred to the class of distributions with density (5) as the power normal model, and this is the terminology we will employ in the remainder of the paper. Between them, Lehmann (1953), Durrans (1992) and Gupta and Gupta (2008) can be consulted for the fundamental properties of the $P_\Phi(\alpha)$ distribution.
There are no closed-form expressions for the moments of $Z \sim P_\Phi(\alpha)$ and numerical integration must be used to compute them. After a suitable change of variable, the $n$th moment can be computed using the alternative representation $E(Z^n) = \alpha\int_0^1 [\Phi^{-1}(z)]^n z^{\alpha-1}\,dz$. For $\alpha$-values ranging between 0.0005 and 50000, the coefficients of skewness and kurtosis, $\sqrt{\beta_1}$ and $\beta_2$, for $Z \sim P_\Phi(\alpha)$ lie in the intervals $[-0.6115, 0.9007]$ and $[1.7170, 4.3556]$, respectively. Hence the power normal class with density (5) contains distributions that range from those exhibiting relatively low levels of negative skewness, through the symmetric normal distribution, to distributions with moderate levels of positive skewness. In comparison, for the skew-normal class with density (2), the coefficient of skewness takes values in $(-0.9953, 0.9953)$ and the coefficient of kurtosis values in $[3, 3.8692)$. Thus, whilst the skew-normal distribution is able to model a wider range of skewness, the power normal class includes distributions that are more leptokurtic as well as others that are more platykurtic. A comparison of (2) and (5) reveals that $SN(0) \equiv P_\Phi(1) \equiv N(0,1)$ and $SN(1) \equiv P_\Phi(2)$. Moreover, Gupta and Gupta (2008) have shown that cases of the skew-normal distribution with low levels of positive skewness can be closely approximated by the power normal model.
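As an illustration of the numerical integration referred to above, the following Python sketch computes moments of the standard power normal distribution and, from them, the coefficients of skewness and kurtosis. It is only a sketch under stated assumptions: it integrates $z^k$ directly against density (5) with Simpson's rule rather than using the change-of-variable form, the integration range $[-12, 12]$ is adequate only for moderate values of $\alpha$ (such as those plotted in Figure 1), and the function names are illustrative rather than taken from any of the cited works.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function (erfc keeps lower-tail accuracy)."""
    return 0.5 * math.erfc(-z / SQRT2)

def pn_density(z, alpha):
    """Density (5) of the standard power normal distribution."""
    return alpha * phi(z) * Phi(z) ** (alpha - 1.0)

def raw_moment(k, alpha, lo=-12.0, hi=12.0, steps=4800):
    """E(Z^k) by composite Simpson's rule on [lo, hi]; steps must be even.

    Assumption: the range [lo, hi] holds essentially all of the mass,
    which is reasonable for moderate alpha but not for alpha near zero.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        if i == 0 or i == steps:
            weight = 1.0
        else:
            weight = 4.0 if i % 2 else 2.0
        total += weight * z ** k * pn_density(z, alpha)
    return total * h / 3.0

def skewness_kurtosis(alpha):
    """Coefficients of skewness and kurtosis computed from the raw moments."""
    m1, m2, m3, m4 = (raw_moment(k, alpha) for k in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
    mu4 = m4 - 4.0 * m1 * m3 + 6.0 * m1 ** 2 * m2 - 3.0 * m1 ** 4
    return mu3 / var ** 1.5, mu4 / var ** 2

mean_two = raw_moment(1, 2.0)            # mean of P_Phi(2), i.e. of SN(1)
skew_one, kurt_one = skewness_kurtosis(1.0)   # the normal case
skew_two, kurt_two = skewness_kurtosis(2.0)
print(mean_two, skew_one, kurt_one, skew_two, kurt_two)
```

For $\alpha = 1$ the sketch should recover the normal values (skewness 0, kurtosis 3), and for $\alpha = 2$ the mean should agree with that of $SN(1)$, which gives a simple check on the equivalence $SN(1) \equiv P_\Phi(2)$.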
Five examples of the density (5), corresponding to $\alpha$-values of 0.5, 0.75, 1, 2 and 4, are displayed in Figure 1. It can be seen that the parameter $\alpha$ affects the location, dispersion, skewness and kurtosis of the distribution. The power normal distribution is also a special case of the so-called beta-normal distribution of Eugene et al. (2002). Jones (2004) extended the construction of Eugene et al. (2002) to include any generating distribution function, not just the standard normal. Even more generally, power distributions with density (4) are special cases of the generalised Lehmann alternatives considered by Miura and Tsukahara (1993).

Fig. 1 Examples of the $P_\Phi(0, 1, \alpha)$ density (5) with, corresponding to increasing modal height: $\alpha = 0.5, 0.75$ (short dashed lines), $\alpha = 1$ (solid line), $\alpha = 2, 4$ (long dashed lines)

In Section 2, we provide general results which can be used to conduct likelihood based inference for the parameters of distributions with densities of the form (4) and their location-scale extensions. More specifically, we present the details of the score equations and the observed and expected information matrices. Section 3 is devoted to analogous results for the special case of the $P_\Phi(\alpha)$ class of distributions. There we also show that the expected information matrix for the $P_\Phi(\alpha)$ class is non-singular when the underlying distribution is normal. An illustrative example is presented in Section 4. The paper ends with the concluding remarks of Section 5.
2 The general case
Here we present general results for likelihood based estimation of the parameters of a power distribution. We start by considering standard models with densities of the form given in (4) before proceeding to the more practically relevant problem of inference for their location-scale extensions.

2.1 Standard models
Consider a random sample of size $n$, $z = (z_1, \ldots, z_n)$, from a standard $P_F(\alpha)$ distribution with density (4) for which both $f$ and $F$ are assumed not to involve any unknown parameters. Clearly, with this assumption, such distributions have just a single unknown parameter, $\alpha$. It then follows directly from (4) that the log-likelihood function for $\alpha$ given $z$ is
\[ \ell(\alpha; z) = n\log(\alpha) + \sum_{i=1}^{n} \log\{f(z_i)\} + (\alpha - 1)\sum_{i=1}^{n} \log\{F(z_i)\}. \]
Thus,
\[ \frac{\partial \ell(\alpha; z)}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^{n} \log\{F(z_i)\}, \]
and hence the maximum likelihood (ML) estimator of $\alpha$ is $\hat{\alpha} = -n/\sum_{i=1}^{n} \log\{F(z_i)\}$. Note that this estimator is a function of the complete and sufficient statistic for $\alpha$, $T(z) = \sum_{i=1}^{n} \log\{F(z_i)\}$. As
\[ \frac{\partial^2 \ell(\alpha; z)}{\partial \alpha^2} = -\frac{n}{\alpha^2}, \]
the asymptotic variance of $\hat{\alpha}$ is $\alpha^2/n$.

2.2 Location-scale extensions
In practice, one will generally be interested in fitting location-scale extensions of the standard distributions with density (4). Trivially, if $Z$ is a random variable from a standard $P_F(\alpha)$ distribution then the location-scale extension of $Z$, i.e. $X = \xi + \eta Z$, where $\xi \in \mathbb{R}$ and $\eta \in \mathbb{R}^+$, has a density given by
\[ \varphi_F(x; \xi, \eta, \alpha) = \frac{\alpha}{\eta}\, f\!\left(\frac{x-\xi}{\eta}\right) \left\{ F\!\left(\frac{x-\xi}{\eta}\right) \right\}^{\alpha-1}. \tag{6} \]
We will denote the fact using the notation $X \sim P_F(\xi, \eta, \alpha)$. Clearly, $P_F(\alpha) \equiv P_F(0, 1, \alpha)$.
As in Section 2.1, we will assume that neither $f$ nor $F$ in the definition of the base $P_F(\alpha)$ distribution involves any unknown parameters. It is also necessary to assume that the support of (6) does not depend on the three parameters being estimated. This extra assumption implies that the results presented are not applicable to the location-scale extensions of (4) for distributions such as the uniform and exponential distributions. Additional assumptions regarding the smoothness of the distributions involved will also be made as we progress.
Under the above assumptions, the log-likelihood function of $\theta = (\xi, \eta, \alpha)$ given a random sample of size $n$, $x = (x_1, \ldots, x_n)$, from a $P_F(\xi, \eta, \alpha)$ distribution can be expressed as
\[ \ell(\theta; x) = n\{\log(\alpha) - \log(\eta)\} + \sum_{i=1}^{n} \log\{f(z_i)\} + (\alpha - 1)\sum_{i=1}^{n} \log\{F(z_i)\}, \tag{7} \]
where $z_i = (x_i - \xi)/\eta$.

2.2.1 Score equations
Assuming that $f'$ exists, the first-order partial derivatives of the log-likelihood function with respect to each of the parameters are:
\[ \begin{aligned}
\frac{\partial \ell(\theta; x)}{\partial \xi} &= -\frac{1}{\eta}\left\{ \sum_{i=1}^{n} \frac{f'(z_i)}{f(z_i)} + (\alpha - 1)\sum_{i=1}^{n} \frac{f(z_i)}{F(z_i)} \right\}, \\
\frac{\partial \ell(\theta; x)}{\partial \eta} &= -\frac{1}{\eta}\left\{ n + \sum_{i=1}^{n} z_i\frac{f'(z_i)}{f(z_i)} + (\alpha - 1)\sum_{i=1}^{n} z_i\frac{f(z_i)}{F(z_i)} \right\}, \\
\frac{\partial \ell(\theta; x)}{\partial \alpha} &= \frac{n}{\alpha} + \sum_{i=1}^{n} \log\{F(z_i)\}.
\end{aligned} \]
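The three partial derivatives above can be checked numerically against finite differences of the log-likelihood (7). The Python sketch below does so for one hypothetical choice of generating distribution, the standard logistic, whose support is the whole real line as the assumptions of this section require. The logistic choice, the function names, the sample size and the parameter values are all assumptions made here purely for illustration; data are simulated by inversion, using the fact that $\{F(Z)\}^{\alpha}$ is uniformly distributed on $(0, 1)$.

```python
import math
import random

# Hypothetical generating distribution for illustration: the standard logistic.
def F(z):
    """Distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def f(z):
    """Density, f = F(1 - F)."""
    p = F(z)
    return p * (1.0 - p)

def df(z):
    """Derivative of the density, f' = f(1 - 2F)."""
    return f(z) * (1.0 - 2.0 * F(z))

def loglik(theta, x):
    """Log-likelihood (7) for theta = (xi, eta, alpha)."""
    xi, eta, alpha = theta
    z = [(xj - xi) / eta for xj in x]
    return (len(x) * (math.log(alpha) - math.log(eta))
            + sum(math.log(f(zi)) for zi in z)
            + (alpha - 1.0) * sum(math.log(F(zi)) for zi in z))

def score(theta, x):
    """Analytic partial derivatives of (7) with respect to xi, eta and alpha."""
    xi, eta, alpha = theta
    n = len(x)
    z = [(xj - xi) / eta for xj in x]
    v = [df(zi) / f(zi) for zi in z]       # f'(z_i)/f(z_i)
    w = [f(zi) / F(zi) for zi in z]        # f(z_i)/F(z_i)
    d_xi = -(sum(v) + (alpha - 1.0) * sum(w)) / eta
    d_eta = -(n + sum(zi * vi for zi, vi in zip(z, v))
              + (alpha - 1.0) * sum(zi * wi for zi, wi in zip(z, w))) / eta
    d_alpha = n / alpha + sum(math.log(F(zi)) for zi in z)
    return [d_xi, d_eta, d_alpha]

def numeric_gradient(theta, x, h=1e-6):
    """Central finite differences of the log-likelihood."""
    grad = []
    for k in range(3):
        up, dn = list(theta), list(theta)
        up[k] += h
        dn[k] -= h
        grad.append((loglik(up, x) - loglik(dn, x)) / (2.0 * h))
    return grad

def simulate(n, xi, eta, alpha, rng):
    """Draw from P_F(xi, eta, alpha) by inversion."""
    sample = []
    for _ in range(n):
        p = rng.random() ** (1.0 / alpha)               # p = F(Z)
        sample.append(xi + eta * math.log(p / (1.0 - p)))  # logistic quantile
    return sample

rng = random.Random(1)
x = simulate(200, 1.0, 2.0, 1.5, rng)
theta = (0.8, 2.2, 1.3)          # an arbitrary point, not the ML solution
analytic = score(theta, x)
numeric = numeric_gradient(theta, x)
print(analytic)
print(numeric)
```

Agreement between the two printed vectors, up to finite-difference error, supports the expressions for the partial derivatives; swapping in another smooth $f$, $f'$ and $F$ with parameter-free support only requires replacing the three functions at the top.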
The score equations are obtained by equating these partial derivatives to zero. Proceeding as in Chiogna (1997) and Pewsey (2006), and letting $w_i = f(z_i)/F(z_i)$, $v_i = f'(z_i)/f(z_i)$ and $u_i = \log\{F(z_i)\}$, it follows immediately that the solutions to the score equations satisfy $\bar{v} = (1 - \alpha)\bar{w}$, $1 + \overline{zv} = (1 - \alpha)\overline{zw}$ and $\alpha = -1/\bar{u}$, where $\bar{v} = \sum_{i=1}^{n} v_i/n$, $\overline{zv} = \sum_{i=1}^{n} z_i v_i/n$, etc. Generally, this system of equations must be solved numerically.

2.2.2 Observed information matrix
The elements of the observed information matrix are minus the second-order partial derivatives of the log-likelihood with respect to the parameters. We will denote them by $j_{\xi\xi}, j_{\xi\eta}, \ldots, j_{\alpha\alpha}$. Assuming that $f''$ exists and letting $t_i = f''(z_i)/f(z_i)$ and $s_i = f'(z_i)/F(z_i)$, they can be written, after some algebraic simplification, as:
\[ \begin{aligned}
j_{\xi\xi} &= n\{\overline{v^2} - \bar{t} + (\alpha - 1)(\overline{w^2} - \bar{s})\}/\eta^2, \\
j_{\xi\eta} &= n\{\overline{zv^2} - \bar{v} - \overline{zt} + (\alpha - 1)[\overline{zw^2} - \overline{zs} - \bar{w}]\}/\eta^2, \\
j_{\xi\alpha} &= n\bar{w}/\eta, \\
j_{\eta\eta} &= n\{\overline{z^2v^2} - 1 - 2\overline{zv} - \overline{z^2t} - (\alpha - 1)[2\overline{zw} + \overline{z^2s} - \overline{z^2w^2}]\}/\eta^2, \\
j_{\eta\alpha} &= n\overline{zw}/\eta, \\
j_{\alpha\alpha} &= n/\alpha^2.
\end{aligned} \]
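Assuming the ML estimates $\hat{\xi}$, $\hat{\eta}$ and $\hat{\alpha}$ to be solutions of the score equations, three of the elements of the observed information matrix evaluated at the ML solution simplify further due to the restrictions $\bar{v} = (1 - \alpha)\bar{w}$ and $1 + \overline{zv} = (1 - \alpha)\overline{zw}$. These are:
\[ \begin{aligned}
j_{\hat{\xi}\hat{\eta}} &= n\{\overline{zv^2} - \overline{zt} + (\hat{\alpha} - 1)[\overline{zw^2} - \overline{zs}]\}/\hat{\eta}^2, \\
j_{\hat{\xi}\hat{\alpha}} &= n\bar{v}/\{\hat{\eta}(1 - \hat{\alpha})\}, \\
j_{\hat{\eta}\hat{\eta}} &= n\{1 + \overline{z^2v^2} - \overline{z^2t} - (\hat{\alpha} - 1)[\overline{z^2s} - \overline{z^2w^2}]\}/\hat{\eta}^2,
\end{aligned} \]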
where, here, $z_i = (x_i - \hat{\xi})/\hat{\eta}$ and the $w_i$, $v_i$, etc. are evaluated accordingly.

2.2.3 Expected information matrix
The elements of the expected information matrix are the expected values of their corresponding elements of the observed information matrix. Denoting $n^{-1}$ times their values by $i_{\xi\xi}, i_{\xi\eta}, \ldots, i_{\alpha\alpha}$, and letting $w = f(z)/F(z)$, $v = f'(z)/f(z)$, $t = f''(z)/f(z)$ and $s = f'(z)/F(z)$:
\[ \begin{aligned}
i_{\xi\xi} &= \{E(v^2) - E(t) + (\alpha - 1)[E(w^2) - E(s)]\}/\eta^2 \\
&= \{(c_{02} - d_0) + (\alpha - 1)(a_{02} - b_0)\}/\eta^2, \\
i_{\xi\eta} &= \{E(zv^2) - E(v) - E(zt) + (\alpha - 1)[E(zw^2) - E(zs) - E(w)]\}/\eta^2 \\
&= \{(c_{12} - c_{01} - d_1) + (\alpha - 1)(a_{12} - b_1 - a_{01})\}/\eta^2, \\
i_{\xi\alpha} &= E(w)/\eta = a_{01}/\eta, \\
i_{\eta\eta} &= \{E(z^2v^2) - 1 - 2E(zv) - E(z^2t) - (\alpha - 1)[2E(zw) + E(z^2s) - E(z^2w^2)]\}/\eta^2 \\
&= \{(c_{22} - d_2 - 2c_{11} - 1) + (\alpha - 1)(a_{22} - b_2 - 2a_{11})\}/\eta^2, \\
i_{\eta\alpha} &= E(zw)/\eta = a_{11}/\eta, \\
i_{\alpha\alpha} &= 1/\alpha^2,
\end{aligned} \]
where $a_{kj} = E\{z^k (f(z)/F(z))^j\}$, $b_k = E\{z^k (f'(z)/F(z))\}$, $c_{kj} = E\{z^k (f'(z)/f(z))^j\}$ and $d_k = E\{z^k (f''(z)/f(z))\}$, $k = 0, 1, 2$ and $j = 1, 2$, must, in general, be calculated using numerical integration.
Note that when $\alpha = 1$, i.e. when $\varphi_F(x; \xi, \eta, 1) = (1/\eta) f((x - \xi)/\eta)$ (the density of the location-scale extension of $f$), $n^{-1}$ times the expected information matrix reduces to
\[ \begin{pmatrix}
(c_{02} - d_0)/\eta^2 & (c_{12} - c_{01} - d_1)/\eta^2 & a_{01}/\eta \\
(c_{12} - c_{01} - d_1)/\eta^2 & (c_{22} - d_2 - 2c_{11} - 1)/\eta^2 & a_{11}/\eta \\
a_{01}/\eta & a_{11}/\eta & 1
\end{pmatrix}. \]
The properties of this matrix, as well as those of the expected information matrix for the more general situation in which $\alpha \neq 1$, will depend on the properties of the chosen $f$ and $F$ used within the general formulation.
As is generally the case, the observed and expected information matrices are useful for a number of important reasons. Firstly, they can be employed within optimisation methods such as Fisher's method of scoring to obtain the ML estimates. Secondly, their inverses evaluated at the ML solution provide large-sample approximations to the variances of, and covariances between, the ML estimators. In turn, the elements of the inverse of the information matrices can be used together with the asymptotic normality of the ML estimators to conduct other forms of inference for the parameters such as hypothesis testing and confidence set construction. Alternatively, confidence sets can be obtained using the profile log-likelihood of the parameters of interest together with standard asymptotic chi-squared theory for the likelihood-ratio test statistic.

3 The power normal distribution
Here we consider likelihood based inference for the location-scale extension of the $P_\Phi$ model with density (5). Durrans (1992) considered moment based approaches to parameter estimation for this model. Gupta and Gupta (2008) gave details for ML based point estimation of the parameters including the score equations. The results we present here complement and extend their results.
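As a concrete instance of the numerical integration needed for the expectations $a_{kj}$ of Subsection 2.2.3, the Python sketch below takes the power normal case $f = \phi$, $F = \Phi$ and evaluates $a_{01}$ and $a_{11}$ at $\alpha = 1$, together with the determinant of the resulting $\alpha = 1$ matrix for $\eta = 1$. For the normal density $c_{02} - d_0 = 1$, $c_{12} - c_{01} - d_1 = 2E(z) = 0$ and $c_{22} - d_2 - 2c_{11} - 1 = 3E(z^2) - 1 = 2$, which fixes the remaining entries. The use of Simpson's rule on $[-12, 12]$ and the function names are assumptions of this sketch, not the quadrature used by the authors.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function (erfc keeps lower-tail accuracy)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def a_kj(k, j, alpha=1.0, lo=-12.0, hi=12.0, steps=4800):
    """E{ z^k (phi(z)/Phi(z))^j } under the standard P_Phi(alpha) density,
    approximated by composite Simpson's rule (steps must be even)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        if i == 0 or i == steps:
            weight = 1.0
        else:
            weight = 4.0 if i % 2 else 2.0
        density = alpha * phi(z) * Phi(z) ** (alpha - 1.0)
        total += weight * z ** k * (phi(z) / Phi(z)) ** j * density
    return total * h / 3.0

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

a01 = a_kj(0, 1)
a11 = a_kj(1, 1)

# n^{-1} times the expected information at alpha = 1, with eta set to 1.
info = [[1.0, 0.0, a01],
        [0.0, 2.0, a11],
        [a01, a11, 1.0]]
determinant = det3(info)
print(a01, a11, determinant)
```

A strictly positive determinant indicates that the matrix is non-singular; for general $\eta$ the determinant scales as $1/\eta^4$, since each of the first two rows and columns carries a factor $1/\eta$.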
It follows directly from (6) that the density for this model is given by
\[ \varphi_\Phi(x; \xi, \eta, \alpha) = \frac{\alpha}{\eta}\, \phi\!\left(\frac{x-\xi}{\eta}\right) \left\{ \Phi\!\left(\frac{x-\xi}{\eta}\right) \right\}^{\alpha-1}. \tag{8} \]
We will use the notation $X \sim P_\Phi(\xi, \eta, \alpha)$ to denote that $X$ has this distribution. Assuming an underlying population with density (8), the log-likelihood function (7) becomes
\[ \ell(\theta; x) = n\{\log(\alpha) - \log(\eta) - (1/2)\log(2\pi)\} - \sum_{i=1}^{n} z_i^2/2 + (\alpha - 1)\sum_{i=1}^{n} \log\{\Phi(z_i)\}. \tag{9} \]
Note the missing constant term in Equation 4.2 of Gupta and Gupta (2008).

Fig. 2 Profile log-likelihood of $\alpha$, $\ell(\alpha)$, assuming an underlying $P_\Phi(\xi, \eta, \alpha)$ population (first row) and the corresponding profile log-likelihood of $\lambda$, $\ell(\lambda)$, assuming an underlying $SN(\xi, \eta, \lambda)$ population (second row) for samples of size 20 (left), 50 (centre) and 100 (right) simulated from the $P_\Phi(1) \equiv SN(0) \equiv N(0, 1)$ distribution

Figures 2 and 3 present profile log-likelihood functions for $\alpha$ for simulated random samples of size 20, 50 and 100 drawn from the $P_\Phi(1) \equiv SN(0) \equiv N(0, 1)$ and $P_\Phi(2) \equiv SN(1)$ distributions, respectively. Also included are the corresponding profile log-likelihood functions of the skewness parameter $\lambda$ assuming that the underlying distribution is the location-scale extension of Azzalini's skew-normal. Under its so-called direct parametrisation, the density of this distribution is given by
\[ \varphi_{\phi\Phi}(x; \xi, \eta, \lambda) = \frac{2}{\eta}\, \phi\!\left(\frac{x-\xi}{\eta}\right) \Phi\!\left\{ \lambda\left(\frac{x-\xi}{\eta}\right) \right\}, \tag{10} \]
where $x, \xi, \lambda \in \mathbb{R}$ and $\eta \in \mathbb{R}^+$. We will use the notation $SN(\xi, \eta, \lambda)$ when referring to this distribution.
As can be seen from these two figures, the shapes of the profile log-likelihoods for $\alpha$ are generally far more regular than their counterparts for the shape parameter of the skew-normal distribution, $\lambda$. They exhibit none of the undesirable features known to exist for profile log-likelihoods for $\lambda$, such as saddlepoints at $\lambda = 0$, multimodality and slowly increasing profiles leading out to infinite ML estimates for $\lambda$ (Azzalini 1985; Pewsey 2000). An important consequence of this more regular behaviour is that the identification of the ML solution for an assumed underlying power normal distribution will generally be a more straightforward optimisation problem than its counterpart for an assumed underlying skew-normal distribution. Indeed, in the latter scenario, the convergence of optimisation routines to a point close to $\lambda = 0$, and not to the true ML solution, may well occur if they are run from only a single starting point in the parameter space. Five of the six profile log-likelihoods for $\lambda$ in Figures 2 and 3 exhibit saddlepoints at $\lambda = 0$. Saddlepoints such as these arise as a consequence of the fact that there is always a solution to the score equations associated with $\lambda = 0$ (Arnold et al. 1993).
Note also that the profile log-likelihoods for $\alpha$ all have maxima associated with $\alpha$-values in close neighbourhoods of the true values of $\alpha$. The same cannot be said for the profile log-likelihoods for $\lambda$, which sometimes have maxima corresponding to $\lambda$-values far away from the true values. Thus, the profile log-likelihoods in Figures 2 and 3 also provide empirical evidence that the ML estimates of $\alpha$ will be more precise than those of $\lambda$.

Fig. 3 Profile log-likelihood of $\alpha$, $\ell(\alpha)$, assuming an underlying $P_\Phi(\xi, \eta, \alpha)$ population (first row) and the corresponding profile log-likelihood of $\lambda$, $\ell(\lambda)$, assuming an underlying $SN(\xi, \eta, \lambda)$ population (second row) for samples of size 20 (left), 50 (centre) and 100 (right) simulated from the $P_\Phi(2) \equiv SN(1)$ distribution

3.1 Score equations
The first-order partial derivatives of (9) can be written as:
\[ \begin{aligned}
\frac{\partial \ell(\theta; x)}{\partial \xi} &= n\{\bar{z} - (\alpha - 1)\bar{w}\}/\eta, \\
\frac{\partial \ell(\theta; x)}{\partial \eta} &= -n\{1 - \overline{z^2} + (\alpha - 1)\overline{zw}\}/\eta, \\
\frac{\partial \ell(\theta; x)}{\partial \alpha} &= n\{\bar{u} + 1/\alpha\},
\end{aligned} \]
where, here, $w_i = \phi(z_i)/\Phi(z_i)$ and $u_i = \log\{\Phi(z_i)\}$. Equating these partial derivatives to zero, the solutions to the score equations satisfy $\bar{z} = (\alpha - 1)\bar{w}$, $1 - \overline{z^2} = (1 - \alpha)\overline{zw}$ and $\alpha = -1/\bar{u}$. As for the general situation described in Subsection 2.2.1, this system of equations must be solved numerically. In our investigations we made use of the L-BFGS-B method of Byrd et al. (1995), as implemented in R's optim function, to obtain the ML estimates. It uses a limited-memory modification of the quasi-Newton method which allows for box constraints.

3.2 Observed information matrix
Proceeding as in Subsection 2.2.2, the elements of the observed information matrix can be expressed as:
\[ \begin{aligned}
j_{\xi\xi} &= n\{1 + (\alpha - 1)(\overline{w^2} + \overline{zw})\}/\eta^2, \\
j_{\xi\eta} &= n\{2\bar{z} + (\alpha - 1)[\overline{zw^2} + \overline{z^2w} - \bar{w}]\}/\eta^2, \\
j_{\xi\alpha} &= n\bar{w}/\eta, \\
j_{\eta\eta} &= n\{3\overline{z^2} - 1 - (\alpha - 1)[2\overline{zw} - \overline{z^3w} - \overline{z^2w^2}]\}/\eta^2, \\
j_{\eta\alpha} &= n\overline{zw}/\eta, \\
j_{\alpha\alpha} &= n/\alpha^2.
\end{aligned} \]
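The elements above can be checked against minus a finite-difference Jacobian of the score vector of Subsection 3.1. The Python sketch below does this for a simulated power normal sample; the seed, sample size, evaluation point and function names are illustrative assumptions, and the data are generated by inversion as $x = \xi + \eta\,\Phi^{-1}(U^{1/\alpha})$ with $U$ uniform on $(0, 1)$.

```python
import math
import random
from statistics import NormalDist

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def sample_means(theta, x):
    """Sample means of z, z^2, w, zw, ... used in Subsections 3.1 and 3.2."""
    xi, eta, _ = theta
    n = len(x)
    z = [(xj - xi) / eta for xj in x]
    w = [phi(zi) / Phi(zi) for zi in z]
    def mean(values):
        return sum(values) / n
    return {
        "z": mean(z), "z2": mean([zi ** 2 for zi in z]), "w": mean(w),
        "w2": mean([wi ** 2 for wi in w]),
        "zw": mean([zi * wi for zi, wi in zip(z, w)]),
        "zw2": mean([zi * wi ** 2 for zi, wi in zip(z, w)]),
        "z2w": mean([zi ** 2 * wi for zi, wi in zip(z, w)]),
        "z3w": mean([zi ** 3 * wi for zi, wi in zip(z, w)]),
        "z2w2": mean([(zi * wi) ** 2 for zi, wi in zip(z, w)]),
        "u": mean([math.log(Phi(zi)) for zi in z]),
    }

def score(theta, x):
    """Partial derivatives of the log-likelihood (9)."""
    _, eta, alpha = theta
    n, m = len(x), sample_means(theta, x)
    return [n * (m["z"] - (alpha - 1.0) * m["w"]) / eta,
            -n * (1.0 - m["z2"] + (alpha - 1.0) * m["zw"]) / eta,
            n * (m["u"] + 1.0 / alpha)]

def observed_information(theta, x):
    """Observed information matrix, ordered (xi, eta, alpha)."""
    _, eta, alpha = theta
    n, m, a = len(x), sample_means(theta, x), theta[2] - 1.0
    j_xx = n * (1.0 + a * (m["w2"] + m["zw"])) / eta ** 2
    j_xe = n * (2.0 * m["z"] + a * (m["zw2"] + m["z2w"] - m["w"])) / eta ** 2
    j_xa = n * m["w"] / eta
    j_ee = n * (3.0 * m["z2"] - 1.0
                - a * (2.0 * m["zw"] - m["z3w"] - m["z2w2"])) / eta ** 2
    j_ea = n * m["zw"] / eta
    j_aa = n / alpha ** 2
    return [[j_xx, j_xe, j_xa], [j_xe, j_ee, j_ea], [j_xa, j_ea, j_aa]]

def numeric_information(theta, x, h=1e-5):
    """Minus the central-difference Jacobian of the score vector."""
    info = [[0.0] * 3 for _ in range(3)]
    for c in range(3):
        up, dn = list(theta), list(theta)
        up[c] += h
        dn[c] -= h
        s_up, s_dn = score(up, x), score(dn, x)
        for r in range(3):
            info[r][c] = -(s_up[r] - s_dn[r]) / (2.0 * h)
    return info

rng = random.Random(2)
std_normal = NormalDist()
x = [1.0 + 2.0 * std_normal.inv_cdf(rng.random() ** 0.5) for _ in range(300)]
theta = (0.7, 2.3, 1.6)          # an arbitrary point, not the ML solution
analytic = observed_information(theta, x)
numeric = numeric_information(theta, x)
print(analytic)
print(numeric)
```

Because the check is made at an arbitrary parameter value, it exercises the unsimplified expressions; the simplified forms that hold only at the ML solution are not tested by this sketch.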
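Assuming the ML estimates $\hat{\xi}$, $\hat{\eta}$ and $\hat{\alpha}$ to be solutions to the score equations, all but the last of these elements evaluated at the ML solution simplify further due to the restrictions $\bar{z} = (\alpha - 1)\bar{w}$ and $1 - \overline{z^2} = (1 - \alpha)\overline{zw}$. The first five elements become:
\[ \begin{aligned}
j_{\hat{\xi}\hat{\xi}} &= n\{\overline{z^2} + (\hat{\alpha} - 1)\overline{w^2}\}/\hat{\eta}^2, \\
j_{\hat{\xi}\hat{\eta}} &= n\{\bar{z} + (\hat{\alpha} - 1)[\overline{zw^2} + \overline{z^2w}]\}/\hat{\eta}^2, \\
j_{\hat{\xi}\hat{\alpha}} &= n\bar{z}/\{\hat{\eta}(\hat{\alpha} - 1)\}, \\
j_{\hat{\eta}\hat{\eta}} &= n\{1 + \overline{z^2} + (\hat{\alpha} - 1)[\overline{z^3w} + \overline{z^2w^2}]\}/\hat{\eta}^2, \\
j_{\hat{\eta}\hat{\alpha}} &= n(1 - \overline{z^2})/\{\hat{\eta}(1 - \hat{\alpha})\},
\end{aligned} \]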
where, here, $z_i = (x_i - \hat{\xi})/\hat{\eta}$, etc.

3.3 Expected information matrix
Given the elements of the observed information matrix, and proceeding as in Subsection 2.2.3, the elements of the expected information matrix, times $n^{-1}$, can be expressed as:
\[ \begin{aligned}
i_{\xi\xi} &= \{1 + (\alpha - 1)(a^*_{02} + a^*_{11})\}/\eta^2, \\
i_{\xi\eta} &= \{2E(z) + (\alpha - 1)[a^*_{12} + a^*_{21} - a^*_{01}]\}/\eta^2, \\
i_{\xi\alpha} &= a^*_{01}/\eta, \\
i_{\eta\eta} &= \{3E(z^2) - 1 + (\alpha - 1)[a^*_{22} + a^*_{31} - 2a^*_{11}]\}/\eta^2, \\
i_{\eta\alpha} &= a^*_{11}/\eta, \\
i_{\alpha\alpha} &= 1/\alpha^2,
\end{aligned} \]
where $a^*_{kj} = E\{z^k (\phi(z)/\Phi(z))^j\}$. There are no closed-form expressions for the $a^*_{kj}$, which must be calculated numerically. We made use of R's integrate function, based on Piessens et al. (1983), to compute them.
Note that when $\alpha = 1$ (i.e. the underlying distribution is normal), $n^{-1}$ times the expected information matrix simplifies to
\[ \begin{pmatrix}
1/\eta^2 & 0 & a^*_{01}/\eta \\
0 & 2/\eta^2 & a^*_{11}/\eta \\
a^*_{01}/\eta & a^*_{11}/\eta & 1
\end{pmatrix}
=
\begin{pmatrix}
1/\eta^2 & 0 & 0.903197/\eta \\
0 & 2/\eta^2 & -0.595636/\eta \\
0.903197/\eta & -0.595636/\eta & 1
\end{pmatrix}. \tag{11} \]
The determinant of (11) is $0.013688/\eta^4$. Thus, the expected information matrix for the $P_\Phi(\alpha)$ model is non-singular for the special case of an underlying normal distribution. In contrast, under the direct parametrisation of the skew-normal distribution, both the expected and observed information matrices are known to be singular for this same special case (Azzalini 1985; Pewsey 2006). Considering the inverse of (11), it is evident that, under normality, and asymptotically, all three parameters are correlated. This reflects the fact that, as observed in the Introduction, $\alpha$ affects the location and dispersion of the distribution as well as its shape.

4 Illustrative example
In order to illustrate the application of the results derived and discussed in Sections 2 and 3, here we consider fitting the power normal distribution to a data set available from the web address http://lib.stat.cmu.edu/datasets/pollen.data. More specifically, we analyze the 3848 observations of the variable "density" in the data file POLLEN5.DAT. This variable measures a geometric characteristic of a specific type of pollen.
For these data, $\bar{x} = 0.00$, $s = 3.14$, $\sqrt{b_1} = 0.11$ and $b_2 = 3.20$. Given the values of the last two of these summaries, it would appear that the underlying distribution is slightly positively skewed and marginally more peaked than the normal distribution. A histogram of the data appears in Figure 4.

Fig. 4 Histogram for the pollen data. The superimposed density corresponds to the maximum likelihood solutions for the power normal distribution and Azzalini's skew-normal distribution, which are effectively indistinguishable

In addition to fitting the power normal distribution, $P_\Phi(\xi, \eta, \alpha)$, we compared its fit with that obtained for the direct parametrisation of Azzalini's skew-normal distribution with density (10). Table 1 presents the ML estimates for the parameters of the two fitted models. The figures between brackets are the asymptotic standard errors of the estimates obtained by inverting the expected information matrices for the two fits evaluated at their respective ML solutions. The maximised log-likelihood values correspond to densities which are effectively indistinguishable from one another.

Table 1 Maximum likelihood estimates for the parameters of the power normal ($P_\Phi$) distribution and the direct parametrisation of Azzalini's skew-normal ($SN$) distribution when fitted to the pollen data. The figures between brackets are the standard errors of the estimates. The value of the log-likelihood for each maximum likelihood solution is also given.
Model      Location                      Scale                        Shape                           Log-likelihood
$P_\Phi$   $\hat{\xi} = -1.74$ (0.68)    $\hat{\eta} = 3.69$ (0.21)   $\hat{\alpha} = 1.77$ (0.37)    $-9863.37$
$SN$       $\hat{\xi} = -2.04$ (0.24)    $\hat{\eta} = 3.75$ (0.14)   $\hat{\lambda} = 0.93$ (0.14)   $-9863.42$

Remember, now, that a normal distribution is obtained when $\alpha = 1$ in the power normal model and when $\lambda = 0$ in the skew-normal model. Given that the normal distribution is a special case of both models, we can apply the usual likelihood-ratio test to investigate whether the underlying distribution for the data could in fact be normal. The maximised log-likelihood value for an assumed normal distribution is easily calculated to be $-9867.94$. Minus twice the differences between this value and the corresponding maximised log-likelihood values for the power normal and skew-normal fits are 9.15 and 9.05, respectively. Comparing these values with the critical values of the asymptotic $\chi^2_1$ distribution, the null hypothesis of an underlying normal distribution is
rejected in both cases with a p-value lower than 0.005. Hence, the two models appear to be roughly equally sensitive to the apparent slight departure from normality evident in the data.
Alternatively, one can consider the construction of confidence intervals for the shape parameters of the two models. Using their respective point estimates, standard errors and asymptotic normal theory, nominally 95% confidence intervals for $\alpha$ and $\lambda$, symmetric about $\hat{\alpha}$ and $\hat{\lambda}$, are (1.05, 2.49) and (0.65, 1.21), respectively. Both confidence intervals again provide us with evidence that the underlying distribution is not normal (i.e. that $\alpha \neq 1$ or $\lambda \neq 0$).

Fig. 5 Profile log-likelihood of $\alpha$, $\ell(\alpha)$, assuming a $P_\Phi$ model (left) and the profile log-likelihood of $\lambda$, $\ell(\lambda)$, assuming an $SN$ model (right) for the pollen data

Figure 5 portrays the profile log-likelihood functions for $\alpha$, assuming an underlying power normal model, and for $\lambda$, assuming an underlying skew-normal model. As was the case for the profiles displayed in Figures 2 and 3, the profile log-likelihood for $\alpha$ is far more regular than that for $\lambda$, the latter having a saddlepoint at $\lambda = 0$. Nominally 95% confidence intervals for $\alpha$ and $\lambda$ calculated from their profile log-likelihoods together with standard asymptotic chi-squared theory for the likelihood-ratio test statistic are (1.22, 2.63) and (0.61, 1.17), respectively. Both are similar to their asymptotic normal theory based counterparts quoted above and, again, are supportive of non-normality for the underlying distribution.
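The likelihood-ratio tests and the asymptotic normal theory intervals of this section can be reproduced, approximately, from the rounded summaries reported above. The Python sketch below is only an illustration: the variable names are hypothetical, the inputs are the two-decimal values from Table 1 and the text, and so the results can differ from the quoted figures in the last digit. The upper-tail probability of the $\chi^2_1$ distribution is obtained from the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def chi2_1_sf(x):
    """Upper-tail probability of the chi-squared distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def wald_interval(estimate, std_error, z=1.959964):
    """Nominal 95% interval, symmetric about the estimate."""
    return estimate - z * std_error, estimate + z * std_error

# Rounded maximised log-likelihood values as reported in the text and Table 1.
loglik_normal = -9867.94
loglik_fitted = {"power normal": -9863.37, "skew-normal": -9863.42}

# Likelihood-ratio statistics for the null hypothesis of normality.
lr = {name: 2.0 * (value - loglik_normal)
      for name, value in loglik_fitted.items()}
p_values = {name: chi2_1_sf(stat) for name, stat in lr.items()}

# Intervals for the shape parameters from the estimates and standard errors.
ci_alpha = wald_interval(1.77, 0.37)
ci_lambda = wald_interval(0.93, 0.14)

for name in lr:
    print(name, round(lr[name], 2), p_values[name])
print("alpha:", ci_alpha)
print("lambda:", ci_lambda)
```

Normality corresponds to $\alpha = 1$ and $\lambda = 0$, so the relevant question for each interval is whether it excludes that value.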
5 Concluding remarks
In this paper we have presented general results for likelihood based inference for the power family of distributions. As was noted in Section 2, those results assume that the only parameters requiring estimation are $\xi$, $\eta$ and $\alpha$ and that the support does not depend on any of them. When the generating distribution function involves extra unknown parameters our results must be extended to accommodate them. Likelihood based inference for power distributions for which the support depends on the parameters being estimated requires case by case consideration as the usual regularity conditions no longer apply.
Our results, suitably adapted, also apply to the power family of distributions with positive support, null location, and density
\[ \varphi_F(z; \eta, \alpha) = \frac{\alpha}{\eta}\, f\!\left(\frac{z}{\eta}\right) \left\{ F\!\left(\frac{z}{\eta}\right) \right\}^{\alpha-1}, \qquad z, \eta, \alpha \in \mathbb{R}^+, \]
where $f$ and $F$ also have positive support. This density provides a model for reliability and survival analysis data. As an example, when $f$ and $F$ are the density and distribution functions of the standard exponential distribution, one obtains a particular case of the generalized exponential model studied by Gupta and Kundu (1999); see also Mudholkar and Freimer (1995).
As one of the referees has suggested, instead of densities of form (4) one might contemplate employing power models of the form $\varphi_{Gf}(z; \alpha) \propto f(z)\{G(z)\}^{\alpha-1}$, where $G$ is a distribution function but not the distribution function associated with the density $f$. Whilst this construction provides an extension of (4), the normalising constant, $1/\int_{-\infty}^{\infty} f(z)\{G(z)\}^{\alpha-1}\,dz$, will generally not have a closed form and, in most cases, will have to be computed numerically. One special case is when $\alpha = 2$, for which the normalising constant is 2. This corresponds to a special case of the Azzalini densities in (1).
For the particular case of the power normal distribution, we have contrasted results obtained for it with their analogues for the direct parametrisation of the skew-normal distribution. In many respects, likelihood inference for the power normal distribution will be more straightforward. First, as was illustrated in Section 3, the shapes of profile log-likelihoods for the power normal distribution's $\alpha$ are generally more regular than those for the skew-normal distribution's $\lambda$. The irregular behaviour of the latter is primarily due to the presence of the saddlepoints which arise as a consequence of $\lambda = 0$ always being a solution to the score equations.
Secondly, the expected information matrix for the power normal distribution is of full rank under normality, whereas the observed and expected information matrices for the direct parametrisation of the skew-normal distribution are both singular under normality. These saddlepoint and singularity problems can be circumvented by employing the so-called centred parametrisation of the skew-normal distribution for numerical work. In this alternative parametrisation the mean, standard deviation and coefficient of skewness, $\mu$, $\sigma$ and $\gamma_1$, are used instead of the direct parameters $\xi$, $\eta$ and $\lambda$. For analytical work, however, the direct parametrisation is generally the more appealing. If well understood, this dichotomy need not be a source of confusion. However, there is both anecdotal evidence and evidence within the literature that, even amongst people working in the field, the different roles of the two parametrisations are still not fully appreciated and are sometimes confused.
Finally, a second referee drew our attention to the important inferential issue of goodness-of-fit testing. As s/he pointed out, chi-squared goodness-of-fit testing is always available although, as is well known, the results obtained using it depend on the class intervals chosen to calculate the observed and expected frequencies. More objectively, parametric bootstrap tests based on the Kolmogorov-Smirnov statistic, or other disparity measures between the empirical distribution function and its fitted power family counterpart, can be employed. A detailed investigation of the performance of the latter, however, falls beyond the scope of the present paper.

Acknowledgements We are most grateful to two anonymous referees for their careful reading of a previous draft of the paper and constructive suggestions towards improving it. Financial support for the research which led to the production of this paper was received from: Spanish Ministry of Science and Education grant MTM2009-07302 and Junta de Extremadura grant PRI08A094 (Pewsey); FONDECYT grant 1090411 (Gómez); CNPq-Brasil (Bolfarine).
References
1. Arnold BC, Beaver RJ, Groeneveld RA, Meeker WQ (1993) The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58:471–488
2. Azzalini A (1985) A class of distributions which includes the normal ones. Scand J Stat 12:171–178
3. Azzalini A (1986) Further results on a class of distributions which includes the normal ones. Statistica 46:199–208
4. Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algorithm for bound constrained optimization. SIAM J Sci Comp 16:1190–1208
5. Chiogna M (1997) Notes on estimation problems with scalar skew-normal distributions. Tech Rep 15, Dept Stat Sci, Univ Padua
6. Durrans SR (1992) Distributions of fractional order statistics in hydrology. Water Resour Res 28:1649–1655
7. Eugene N, Lee C, Famoye F (2002) Beta-normal distribution and its applications. Commun Stat Theory Meth 31:497–512
8. Fernández C, Steel MFJ (1998) On Bayesian modelling of fat tails and skewness. J Am Stat Assoc 93:359–371
9. Gupta RD, Gupta RC (2008) Analyzing skewed data by power normal model. Test 17:197–210
10. Gupta RD, Kundu D (1999) Generalized exponential distributions. Aust NZ J Stat 41:173–188
11. Henze N (1986) A probabilistic representation of the skew-normal distribution. Scand J Stat 13:271–275
12. Jones MC (2004) Families of distributions arising from distributions of order statistics (with discussion). Test 13:1–43
13. Jones MC, Pewsey A (2009) Sinh-arcsinh distributions. Biometrika 96:761–780
14. Lehmann EL (1953) The power of rank tests. Ann Math Stat 24:23–43
15. Miura R, Tsukahara H (1993) Nonparametric estimation for generalized Lehmann alternatives. Stat Sinica 3:83–101
16. Mudholkar GS, Freimer M (1995) The exponentiated Weibull family: a reanalysis of the bus motor failure data. Technometrics 37:436–445
17. Pewsey A (2000) Problems of inference for Azzalini's skew-normal distribution. J Appl Stat 27:859–870
18. Pewsey A (2006) Some observations on a simple means of generating skew distributions. In: Balakrishnan N, Castillo E, Sarabia JM (eds) Advances in Distribution Theory, Order Statistics, and Inference. Birkhäuser, Boston, pp 75–84
19. Piessens R, de Doncker-Kapenga E, Uberhuber C, Kahaner D (1983) QUADPACK: A Subroutine Package for Automatic Integration. Springer, New York