Modelling Compositional Data Using Dirichlet Regression Models

Rafiq H. Hijazi† and Robert W. Jernigan‡

† Department of Statistics, United Arab Emirates University, P.O. Box 17555, Al-Ain, UAE. Email: [email protected]

‡ Department of Mathematics and Statistics, American University, Washington, DC 20016, USA. Email: [email protected]

June 25, 2007

Abstract

Compositional data are non-negative proportions with unit sum. Such data arise whenever we classify objects into disjoint categories and record their resulting relative frequencies, or partition a whole measurement into percentage contributions from its various parts. Under the unit-sum constraint, the elementary concepts of covariance and correlation are misleading. Therefore, compositional data are rarely analyzed with the usual multivariate statistical methods. Aitchison (1986) introduced logratio analysis to model compositional data. Campbell and Mosimann (1987a) suggested the Dirichlet covariate model as a null model for such data. In this paper we investigate the Dirichlet covariate model and compare it to logratio analysis. Maximum likelihood estimation methods are developed and the sampling distributions of the estimates are investigated. Measures of total variability and goodness of fit are proposed to assess the adequacy of the suggested models in analyzing compositional data.

Key words: compositional data, Dirichlet distribution, logratio analysis, Aitchison's distance, maximum likelihood estimation.

1 Introduction

Compositional data are proportions or percentages of disjoint categories adding to one. This simple unit-sum constraint severely complicates analysis. The first clear and unified approach to the statistical analysis of compositional data came with the landmark work of Aitchison (1986). Aitchison developed a range of methods based on the "simple comment that information in compositional vectors is concerned with relative, not absolute magnitudes" (Aitchison, 1986). Logratios emerged as the preferred method to deal with the unit-sum constraint. The resulting set of tools and applications has contributed greatly to the understanding of countless compositional problems and datasets. In Aitchison's approach, any meaningful function of a composition must be expressible in terms of ratios of the components of the composition. But even this view has not proved universally applicable, since such a modelling approach emphasizes those components with high relative variation. In many datasets these are exactly the components with low absolute percentages and variation. This can give greater importance to components that contribute little to a meaningful understanding of the composition. Such was the concern of Beardah et al. (2003) and Baxter et al. (2005) in examining the elemental components of glass. In that setting, concentration on trace components distracts from more fundamental and practical considerations of those components that reflect meaningful differences in glass-making technology or choice of raw materials. Meaningful compositional interpretations can be limited by an emphasis on ratios. Beardah et al. (2003) and Baxter et al. (2005) address this problem with a weighted analysis, down-weighting the influence of components with low absolute percentages. At the extreme, when such trace components are not found at all, an analysis based on logratios is not possible: logarithms are undefined at zero, as are ratios with zero denominators.
One approach to retaining a logratio analysis takes the down-weighting to an extreme, amalgamating a zero component with another, nonzero one. There are alternatives to logratio analysis: approaches that stay within the simplex defined by the unit-sum constraint on nonnegative data. For example, Rayens and Srinivasan (1994) and Smith and Rayens (2002) have developed generalized Liouville and conditional Liouville distributions for use on the simplex. We propose another such approach using a Dirichlet covariate model. We offer this as an additional tool in the analysis of compositional data. We recognize that logratio analysis is in all ways more comprehensive than what is presented here, but the tools proposed below can provide an additional view of compositional structure.

Connor and Mosimann (1969) originally proposed the Dirichlet distribution as a null model for compositional data. The Dirichlet distribution has a strong implied independence structure, which would seem to make it an inappropriate candidate for modelling compositional data. However, a simulation study by Brehm et al. (1998) found that when components are influenced by common covariates, Dirichlet modelling was as successful as logratio methods. Campbell and Mosimann (1987a, 1987b) extended the Dirichlet distribution to a class of Dirichlet covariate models, produced by reparameterizing the Dirichlet parameters in terms of an observed covariate. They showed that this class of models can accommodate different variational behavior of compositional data. Further, they showed that the covariance structure of this class is not necessarily negative, as a simple Dirichlet model would suggest. The class can thus be useful in modelling compositional data with many different covariance structures.

The parameters in Dirichlet models are estimated here by the method of maximum likelihood. Closed form solutions are unavailable, and numerical optimization techniques are used to maximize the likelihood function. In these techniques, the choice of starting values plays an important role in achieving reliable and rapid convergence. Our work involves developing an algorithm for good initial values in Dirichlet models, deriving and investigating the properties of the maximum likelihood estimates, developing measures of goodness-of-fit and total variability, and comparing the performance of logratio and Dirichlet models in modelling compositional data.

The remainder of this paper is organized as follows. In section 2, we introduce the Dirichlet regression models and an efficient algorithm for choosing starting values in the numerical optimization of the likelihood function. The asymptotic properties of the maximum likelihood estimates are considered in section 3, where formulas for the bias and the skewness of the estimates are derived. In section 4, we propose two measures of explained variability in compositional data to assess the adequacy of fit of logratio and Dirichlet models. An illustrative example comparing the logratio and Dirichlet models is given in section 5. Section 6 summarizes the main results of the paper.

2 Dirichlet Regression

The Dirichlet model with constant parameters can accommodate some shapes of compositional data, but we will see that Dirichlet regression gives a much richer class. This modelling is flexible enough to explain the different trends and covariance structures in compositional data, and it does not require the strong form of independence objected to by Aitchison (Aitchison, 1986). The covariance structure of the Dirichlet distribution is necessarily negative (Aitchison, 1986); however, this is not the case for Dirichlet regression models (Campbell and Mosimann, 1987b). In the Dirichlet density function, data enter in proportion form only. This means that given two vectors x and x* = cx, both are treated as y = C(x) = C(x*), where C is the closure operation (Aitchison, 1986). Therefore, Dirichlet models have the scale-invariance property.
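As a concrete illustration of the closure operation and the resulting scale invariance, the following minimal Python sketch (the helper name `closure` is ours, not from any package) checks that C(x) = C(cx):

```python
import numpy as np

def closure(x):
    """Closure operation C(x): rescale a positive vector to unit sum."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

x = np.array([2.0, 3.0, 5.0])
y = closure(x)             # C(x)
y_star = closure(7.5 * x)  # C(x*) with x* = c x, here c = 7.5

# Scale invariance: both closures yield the same composition.
assert np.allclose(y, y_star)
print(y)  # [0.2 0.3 0.5]
```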

2.1 Dirichlet Models

Let $x = (x_1, \ldots, x_D)$ be a $1 \times D$ positive vector having a Dirichlet distribution with positive parameters $(\lambda_1, \ldots, \lambda_D)$ and density function
$$f(x) = \frac{\Gamma(\lambda)}{\prod_{i=1}^{D} \Gamma(\lambda_i)} \prod_{i=1}^{D} x_i^{\lambda_i - 1} \qquad (2.1)$$
where $\sum_{i=1}^{D} x_i = 1$ and $\lambda = \sum_{i=1}^{D} \lambda_i$.

A Dirichlet regression model is readily obtained by allowing the parameters of a Dirichlet distribution to change with a covariate. For a given covariate $s$, the parameters of a Dirichlet distribution $D(\lambda_1, \ldots, \lambda_D)$ can be written as positive-valued functions $g_j(s)$ of the covariate. Besides the exponential family, the family of polynomials is a suitable candidate here, since it exhibits the desired positivity over a suitable range. A different Dirichlet distribution is thus modelled for every value of the covariate, resulting in a conditional Dirichlet distribution $x | s \sim D(g_1(s), \ldots, g_D(s))$.
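The conditional structure above can be simulated directly. The sketch below draws compositions from a different Dirichlet for each covariate value, using an illustrative degree-one polynomial (linear) parameterization; the coefficient values are made up for the example and must keep every $g_j(s)$ positive on the covariate range:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear parameterization g_j(s) = beta0_j + beta1_j * s
# (coefficients are assumptions for this sketch; they keep g_j(s) > 0
# for s in [0, 1]).
beta0 = np.array([2.0, 3.0, 1.5])   # intercepts, one per component
beta1 = np.array([4.0, -1.0, 2.0])  # slopes

def dirichlet_params(s):
    """lambda_j(s) = beta0_j + beta1_j * s, required to be positive."""
    lam = beta0 + beta1 * s
    assert np.all(lam > 0), "parameterization must stay positive"
    return lam

# For each covariate value, draw from a different Dirichlet:
s_values = rng.uniform(0.0, 1.0, size=100)
X = np.array([rng.dirichlet(dirichlet_params(s)) for s in s_values])

print(X.shape)                          # (100, 3)
print(np.allclose(X.sum(axis=1), 1.0))  # compositions sum to one
```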

2.2 Estimation in Dirichlet Regression

Let $x_i = (x_{i1}, \ldots, x_{iD})$ be an observed vector of proportions with $x_{ij} > 0$ and $\sum_{j=1}^{D} x_{ij} = 1$, and let $s_i$ be the corresponding observed covariate for $i = 1, \ldots, n$. The conditional distributions of $X_i$ given $s_i$ are mutually independent with $X_i | s_i \sim D(\lambda_1(s_i), \ldots, \lambda_D(s_i))$. The likelihood function given the covariates is
$$L = \prod_{i=1}^{n} \left\{ \Gamma(\Lambda(s_i)) \prod_{j=1}^{D} \frac{x_{ij}^{\lambda_j(s_i) - 1}}{\Gamma(\lambda_j(s_i))} \right\}$$
where $\Lambda(s_i) = \sum_{j=1}^{D} \lambda_j(s_i)$.
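For a concrete linear parameterization $\lambda_j(s) = \beta_{j0} + \beta_{j1} s$, the log of the likelihood above can be coded directly. This is a sketch using scipy; the function name and array layout are our own choices, and the result is meant to be handed to a generic numerical optimizer:

```python
import numpy as np
from scipy.special import gammaln

def neg_log_lik(beta, X, s):
    """Negative Dirichlet-regression log-likelihood under a linear
    parameterization lambda_j(s_i) = beta[j, 0] + beta[j, 1] * s_i.

    beta : (D, 2) array of coefficients (or anything reshapable to it)
    X    : (n, D) compositions (rows sum to 1, entries > 0)
    s    : (n,) covariate
    """
    beta = np.asarray(beta).reshape(X.shape[1], 2)
    lam = beta[:, 0] + np.outer(s, beta[:, 1])  # (n, D) parameters
    if np.any(lam <= 0):
        return np.inf  # outside the valid parameter region
    Lam = lam.sum(axis=1)                        # Lambda(s_i)
    # log L = sum_i [ log Gamma(Lambda_i) - sum_j log Gamma(lam_ij)
    #                 + sum_j (lam_ij - 1) log x_ij ]
    ll = (gammaln(Lam)
          - gammaln(lam).sum(axis=1)
          + ((lam - 1.0) * np.log(X)).sum(axis=1)).sum()
    return -ll
```

With zero slopes this reduces to the constant-parameter Dirichlet log-likelihood, which gives a convenient check against `scipy.stats.dirichlet`.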

Now consider the Dirichlet regression model with parameters $\lambda_j$ depending on the covariate $s$ through polynomial functions, as introduced by Campbell and Mosimann (1987a). In the new parameterization, the parameters can be written as
$$\lambda_{ij} = \lambda_j(s_i) = \sum_{m=0}^{p} \beta_{jm} s_i^m \quad \text{and} \quad \Lambda(s_i) = \sum_{j=1}^{D} \lambda_{ij} = \sum_{m=0}^{p} \beta_m s_i^m, \quad \text{where } \beta_m = \sum_{j=1}^{D} \beta_{jm}.$$

2.2.1 Selection of Starting Values in Dirichlet Regression Models

Numerical optimization of such a likelihood function with constant parameters is straightforward. We use the S-Plus function ms() to optimize the likelihood (Venables and Ripley, 2002). This Newton-type algorithm takes initial estimates and revises them using the first and second derivatives of the likelihood function. As starting values for this constant-parameter fitting, we used method of moments estimates. For more details on the estimation of Dirichlet parameters and the selection of starting values, see Ronning (1989), Narayanan (1990, 1992) and Hariharan and Velu (1993). The difficulty arises when we attempt to extend this to Dirichlet regression models. Starting values must be carefully chosen for the optimization algorithm to converge. We developed an algorithm to select starting values for Dirichlet regression models based on MLEs numerically obtained from a collection of Dirichlet distributions with constant parameters. We regress these estimates, obtained over a range of covariate values, on the covariate to yield starting values for the parameters in Dirichlet regression. This method is efficient and leads to convergence in most cases, especially in linear models with large sample sizes.

Algorithm 1 (Selection of starting values in Dirichlet regression). The method works as follows:

1. Choose k samples with replacement, each of size m (m ≤ n), from the data.

2. For each sample, fit a Dirichlet model with constant parameters using the method of moments estimates as starting values, and compute the mean of the corresponding covariate sample. This results in a k × (D + 1) matrix V, where the first D columns (V_1, ..., V_D) hold the ML estimates from the k samples and the last column V_{D+1} holds the mean of the covariate in each sample.

3. Fit by least squares the D models V_i = f(V_{D+1}), i = 1, ..., D, where f(·) is the desired reparameterization of the Dirichlet parameters.

4. Use the coefficients from these regression models as starting values.

Simulation studies using S-Plus indicate that the highest probability of convergence occurs for n/m approximately equal to 3, suggesting m ≈ [n/3]. A value of k close to 20 has given good regression estimates.
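Algorithm 1 can be sketched in Python as follows. This is our own rendering under simplifying assumptions: step 2 stops at the method of moments estimates rather than iterating to the full constant-parameter MLE, and f(·) is taken to be a polynomial fitted with `np.polyfit`; all function names are ours:

```python
import numpy as np

def mom_dirichlet(X):
    """Method of moments estimates of constant Dirichlet parameters,
    from the mean vector and the variance of the first component."""
    mean = X.mean(axis=0)
    v1 = X[:, 0].var(ddof=1)
    total = mean[0] * (1.0 - mean[0]) / v1 - 1.0  # estimate of lambda
    return mean * total

def starting_values(X, s, k=20, degree=1, seed=0):
    """Algorithm 1: starting values for Dirichlet regression.
    X: (n, D) compositions; s: (n,) covariate."""
    n, D = X.shape
    m = max(n // 3, D + 1)          # subsample size, m ~ [n/3]
    rng = np.random.default_rng(seed)
    V = np.empty((k, D + 1))        # the k x (D+1) matrix of step 2
    for r in range(k):
        idx = rng.choice(n, size=m, replace=True)
        V[r, :D] = mom_dirichlet(X[idx])   # constant-parameter fit
        V[r, D] = s[idx].mean()            # mean covariate of sample
    # Step 3: regress each estimated parameter on the mean covariate.
    return np.array([np.polyfit(V[:, D], V[:, j], degree)
                     for j in range(D)])   # (D, degree+1) coefficients
```

The returned coefficient matrix is then passed to the numerical optimizer as the initial point.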

3 The Asymptotic Properties of Maximum Likelihood Estimates

3.1 The Asymptotic Bias

The maximum likelihood estimates of the Dirichlet parameters are biased. Theorem 3.1 below provides an estimate of this bias. Bias-reduced estimators can then be formed by subtracting the estimated bias from the estimates (Barndorff-Nielsen and Cox, 1989). The theorem gives the second order approximation of the bias of the maximum likelihood estimates in the Dirichlet models.

Theorem 3.1 Let $\theta = (\theta_1, \ldots, \theta_p)'$ be a $p \times 1$ vector of the Dirichlet parameters, $H(\theta)$ be the $p \times p$ Hessian matrix and $I = -E[H(\theta)]$ be the $p \times p$ information matrix. The second order approximation of the bias is given by
$$B(\hat{\theta}) = \frac{1}{2}\, I^{-1}\, \frac{\partial\, \mathrm{vec}\, H(\theta)'}{\partial \theta}\, \mathrm{vec}\left(I^{-1}\right) \qquad (3.1)$$

Proof: Let $\hat{\theta}$ be a $p \times 1$ vector of the maximum likelihood estimates of the Dirichlet parameters, $u(\theta)$ be the $p \times 1$ vector of first order derivatives of the log-likelihood function $\ell$ with respect to $\theta$, $H(\theta)$ be the $p \times p$ matrix of second order derivatives and $I = -E[H(\theta)]$ be the $p \times p$ information matrix. Cadigan (1994) gives the second order approximation of the bias in the maximum likelihood estimates $\hat{\theta}$ as
$$B(\hat{\theta}) = I^{-1}\left[E\left\{H(\theta)\, I^{-1}\, u(\theta)\right\} + \frac{1}{2}\, E\left(\frac{\partial\, \mathrm{vec}\, H(\theta)'}{\partial \theta}\right) \mathrm{vec}\left(I^{-1}\right)\right] \qquad (3.2)$$
The second order derivatives of the log-likelihood function of the Dirichlet distribution are constant, and hence so are the matrices $H$ and $I$. This means that $I = -H$, and the first term in (3.2) is then
$$E\left[H(\theta)\, I^{-1}\, u(\theta)\right]\Big|_{\theta=\hat{\theta}} = -E\left[I_p\, u(\theta)\right]\Big|_{\theta=\hat{\theta}} = -E\left[u(\theta)\right]\Big|_{\theta=\hat{\theta}} = 0 \qquad (3.3)$$
where $I_p$ is the $p \times p$ identity matrix. The second term of (3.2) reduces to
$$\frac{1}{2}\, E\left(\frac{\partial\, \mathrm{vec}\, H(\theta)'}{\partial \theta}\right) \mathrm{vec}\left(I^{-1}\right) = \frac{1}{2}\, \frac{\partial\, \mathrm{vec}\, H(\theta)'}{\partial \theta}\, \mathrm{vec}\left(I^{-1}\right) \qquad (3.4)$$
due to the constant Hessian matrix. Finally, combining (3.3) and (3.4) gives the result in (3.1). □
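For the constant-parameter Dirichlet model, formula (3.1) can be evaluated numerically, since the Hessian and third derivatives of the log-likelihood are polygamma functions of the parameters. The sketch below is our own implementation (the function name is an assumption, not from the paper):

```python
import numpy as np
from scipy.special import polygamma

def dirichlet_mle_bias(lam, n):
    """Second order bias approximation (3.1) for the MLE of a
    constant-parameter Dirichlet with parameters lam, sample size n."""
    lam = np.asarray(lam, dtype=float)
    p = lam.size
    Lam = lam.sum()
    # Hessian of log-likelihood: H_ij = n*(psi'(Lam) - delta_ij*psi'(lam_i))
    H = n * (polygamma(1, Lam) - np.diag(polygamma(1, lam)))
    I = -H                        # information matrix (H is constant)
    I_inv = np.linalg.inv(I)
    # Third derivatives: dH_ij/dlam_a = n*(psi''(Lam) - delta_aij*psi''(lam_i))
    T = np.full((p, p, p), n * polygamma(2, Lam))
    for i in range(p):
        T[i, i, i] -= n * polygamma(2, lam[i])
    # B = (1/2) * I^{-1} * (d vecH'/d theta) * vec(I^{-1})
    M = T.reshape(p, p * p)       # row a holds vec of dH/dlam_a
    return 0.5 * I_inv @ (M @ I_inv.reshape(p * p))

print(dirichlet_mle_bias([2.0, 3.0, 4.0], n=50))
```

Since $H \propto n$ and $I^{-1} \propto 1/n$, the approximate bias scales exactly as $1/n$, which provides a quick sanity check on the implementation.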

3.2 The Asymptotic Skewness

The following theorem gives the asymptotic skewness $\sqrt{\beta_1(\hat{\theta})}$ of the distribution of the maximum likelihood estimates in the Dirichlet models.

Theorem 3.2 Let $\hat{\theta}_a$ be the maximum likelihood estimate of $\theta_a$ in a Dirichlet model. The asymptotic skewness of $\hat{\theta}_a$ is given by
$$\sqrt{\beta_1(\hat{\theta}_a)} = 2 \sum_{\alpha_1=1}^{p} \sum_{\alpha_2=1}^{p} \sum_{\alpha_3=1}^{p} \left\{ \prod_{s=1}^{3} \frac{\mathrm{Cov}(\hat{\theta}_a, \hat{\theta}_{\alpha_s})}{\sqrt{\mathrm{Var}(\hat{\theta}_a)}} \right\} [\alpha_1 \alpha_2 \alpha_3] \qquad (3.5)$$
where
$$[\alpha_1 \alpha_2 \alpha_3] = E\left[\frac{\partial^3 \ell}{\partial \theta_{\alpha_1} \partial \theta_{\alpha_2} \partial \theta_{\alpha_3}}\right]$$

Proof: Bowman and Shenton (2000) give the following asymptotic skewness of the maximum likelihood estimate $\hat{\theta}_a$:
$$\sqrt{\beta_1(\hat{\theta}_a)} = \sum_{\alpha_1=1}^{p} \sum_{\alpha_2=1}^{p} \sum_{\alpha_3=1}^{p} \left\{ \prod_{s=1}^{3} \frac{\mathrm{Cov}(\hat{\theta}_a, \hat{\theta}_{\alpha_s})}{\sqrt{\mathrm{Var}(\hat{\theta}_a)}} \right\} \left\{ [\alpha_1, \alpha_2, \alpha_3] + 3[\alpha_1 \alpha_2 \alpha_3] + 6[\alpha_1 \alpha_2, \alpha_3] \right\} \qquad (3.6)$$
where the "square bracket terms" are
$$[\alpha_1, \alpha_2, \alpha_3] = E\left[\frac{\partial \ell}{\partial \theta_{\alpha_1}} \frac{\partial \ell}{\partial \theta_{\alpha_2}} \frac{\partial \ell}{\partial \theta_{\alpha_3}}\right], \qquad [\alpha_1 \alpha_2 \alpha_3] = E\left[\frac{\partial^3 \ell}{\partial \theta_{\alpha_1} \partial \theta_{\alpha_2} \partial \theta_{\alpha_3}}\right]$$
and
$$[\alpha_1 \alpha_2, \alpha_3] = E\left[\frac{\partial^2 \ell}{\partial \theta_{\alpha_1} \partial \theta_{\alpha_2}} \frac{\partial \ell}{\partial \theta_{\alpha_3}}\right]$$
For the Dirichlet distribution, the quantity $\partial^2 \ell / \partial \theta_{\alpha_1} \partial \theta_{\alpha_2}$ is constant and independent of the data. This means that the third term in (3.6), evaluated at the maximum likelihood estimates $\hat{\theta}$, is
$$[\alpha_1 \alpha_2, \alpha_3]\Big|_{\theta=\hat{\theta}} = E\left[\frac{\partial^2 \ell}{\partial \theta_{\alpha_1} \partial \theta_{\alpha_2}} \frac{\partial \ell}{\partial \theta_{\alpha_3}}\right]_{\theta=\hat{\theta}} = \frac{\partial^2 \ell}{\partial \theta_{\alpha_1} \partial \theta_{\alpha_2}}\, E\left[\frac{\partial \ell}{\partial \theta_{\alpha_3}}\right]_{\theta=\hat{\theta}} = 0$$
The last equality follows from the fact that the first derivatives of the log-likelihood are 0 at the maximum likelihood estimates. A further simplification of (3.6) follows from the equality $[\alpha_1 \alpha_2 \alpha_3] = -[\alpha_1, \alpha_2, \alpha_3]$, reducing $[\alpha_1, \alpha_2, \alpha_3] + 3[\alpha_1 \alpha_2 \alpha_3]$ to $2[\alpha_1 \alpha_2 \alpha_3]$. Thus the skewness given in (3.6) reduces to (3.5). □

4 Diagnostic Measures

After estimating the Dirichlet model and investigating the distribution of the maximum likelihood estimates, we focus on assessing the fit of the model to the compositional data. Different models can provide an adequate fit to the data and give reasonable results, but which model is best depends on the criteria chosen. The generalized likelihood ratio test can be used to choose between the constant and the covariate models (Casella and Berger, 2002). Hijazi (2007) proposed several diagnostics and goodness-of-fit measures for use in Dirichlet regression. In this section, we propose further model-fit diagnostics to assess fitted Dirichlet regression models.

4.1 R2-Type Measure

Analogous to linear regression, we can develop a numerical measure of model performance like the usual R2. This measure estimates the percentage of variation explained by a proposed model. Different R2 measures have been proposed for evaluating nonlinear models such as logistic and probit models (Cameron and Windmeijer, 1996). These pseudo-R2 measures quantify the proportional improvement in the log-likelihood function due to the explanatory variable in the model, compared to the minimal "constant" model. When introducing logratio analysis, Aitchison (1986) suggested a measure of total variability based on the variation matrix of the logratio-transformed data, $\mathbf{T}(x)$, defined as
$$\mathbf{T}(x) = [\tau_{ij}] = [\mathrm{var}\{\log(x_i/x_j)\}]$$
Clearly, $\mathbf{T}(x)$ is symmetric with zero diagonal elements. Aitchison's total variability measure totvar(x) is defined as
$$\mathrm{totvar}(x) = \frac{1}{d} \sum_{i<j} \mathrm{var}\{\log(x_i/x_j)\}$$
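The variation matrix and total variability measure can be computed directly from a sample of compositions. The sketch below is our own; it assumes Aitchison's convention that $d = D - 1$ for compositions with $D$ parts, and the function names are illustrative:

```python
import numpy as np

def variation_matrix(X):
    """Aitchison variation matrix T(x), tau_ij = var{log(x_i / x_j)}.
    X: (n, D) array of compositions (rows sum to 1, entries > 0)."""
    logX = np.log(X)
    D = X.shape[1]
    T = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            T[i, j] = np.var(logX[:, i] - logX[:, j], ddof=1)
    return T

def totvar(X):
    """Total variability: (1/d) * sum over i<j of tau_ij."""
    T = variation_matrix(X)
    d = X.shape[1] - 1  # assumption: d = D - 1 (Aitchison's notation)
    return T[np.triu_indices_from(T, k=1)].sum() / d

# Example on simulated Dirichlet data:
rng = np.random.default_rng(0)
X = rng.dirichlet([2.0, 3.0, 4.0], size=200)
T = variation_matrix(X)
print(np.round(T, 3))
print(totvar(X))
```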
