Austrian Journal of Statistics, Volume 40 (2011), Number 1 & 2, 103-113

Exploring Compositional Data with the CoDa-Dendrogram

Vera Pawlowsky-Glahn¹ and Juan Jose Egozcue²
¹ University of Girona, Spain
² Technical University of Catalonia, Barcelona, Spain

Abstract: Within the special geometry of the simplex, the sample space of compositional data, compositional orthonormal coordinates allow the application of any multivariate statistical approach. The search for meaningful coordinates has suggested balances (between two groups of parts), based on a sequential binary partition of a D-part composition, and a representation in the form of a CoDa-dendrogram. Projected samples are represented in a dendrogram-like graph showing: (a) the way of grouping parts; (b) the explanatory role of subcompositions generated in the partition process; (c) the decomposition of the variance; (d) the center and quantiles of each balance. The representation is useful for interpreting balances and for describing the sample in a single diagram, independently of the number of parts. Also, samples of two or more populations, as well as several samples from the same population, can be represented in the same graph, as long as they have the same parts registered. The approach is illustrated with an example of food consumption in Europe.

Keywords: Aitchison Geometry, Euclidean Vector Space, Orthonormal Coordinates.

1 Introduction

The sample space of D-part compositional data, the simplex, being a subset of the real space $\mathbb{R}^D$, has a real Euclidean vector space structure (Billheimer, Guttorp, and Fagan, 2001; Pawlowsky-Glahn and Egozcue, 2001). The easiest way to study data whose sample space is a real Euclidean space is to represent them in coordinates with respect to an orthonormal basis. Coordinates behave like real random vectors (Kolmogorov and Fomin, 1957) and thus, as discussed in Pawlowsky-Glahn (2003), any usual statistical technique can be applied.

In any Euclidean space, an infinite number of orthonormal bases exist, and the simplex is no exception. Different techniques can be used to build such a basis. The best known techniques in mathematics use the Gram-Schmidt orthonormalisation process or a Singular Value Decomposition (Egozcue, Pawlowsky-Glahn, Mateu-Figueras, and Barceló-Vidal, 2003). These mathematically straightforward methods, however, do not always lead to easy-to-interpret coordinates.

The analysis of problems related to the amalgamation of parts, and the search for dimension-reducing techniques related to subcompositions, suggested a new strategy: balances. Balances are a specific kind of orthonormal coordinates associated with groups of parts (Egozcue and Pawlowsky-Glahn, 2005b). They are based on a sequential binary partition of a D-part composition into non-overlapping groups. This approach is very intuitive and the resulting coordinates are frequently easy to interpret. Moreover, it leads to a decomposition of the total variance into marginal variances which can be assigned either to intra-group (subcompositional) variability, or to inter-group variability (relative variability between two groups of parts). To visualise this, together with other univariate characteristics, a specific tool, the CoDa-dendrogram, has been developed.
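As a rough illustration of what a balance is, the following Python sketch computes the coordinates of a toy composition for one sequential binary partition. It assumes the usual definition of a balance as a normalised log-ratio of the geometric means of the two groups of parts; that formula is not spelled out in this section, and the 4-part composition, the partition `sbp` and all function names are hypothetical choices made only for the example.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(x, plus, minus):
    """Balance between the groups of parts indexed by `plus` and `minus`.

    Assumed definition: sqrt(r*s/(r+s)) * log(g(x_plus) / g(x_minus)),
    where r and s are the group sizes and g is the geometric mean.
    """
    r, s = len(plus), len(minus)
    g_plus = geometric_mean([x[i] for i in plus])
    g_minus = geometric_mean([x[i] for i in minus])
    return math.sqrt(r * s / (r + s)) * math.log(g_plus / g_minus)

# Hypothetical 4-part composition and sequential binary partition:
# step 1 splits parts {0, 1} from {2, 3}; steps 2 and 3 split each pair.
x = [0.1, 0.2, 0.3, 0.4]
sbp = [([0, 1], [2, 3]), ([0], [1]), ([2], [3])]

# A D-part composition yields D - 1 balances, one per partition step.
coords = [balance(x, plus, minus) for plus, minus in sbp]
print(coords)
```

Because each balance is a log-ratio, multiplying `x` by any positive constant leaves `coords` unchanged, and the squared coordinates add up to the squared centred log-ratio norm of `x`, which is the property behind the variance decomposition mentioned above.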

2 A Compositional Data Set

To present the approach from an intuitive perspective, let us consider the following problem: to decide on his business strategy, a merchant, a leader in the food industry, wants to compare the food consumption habits in the former Eastern and the Western countries. To do so, he wants to analyse data published by Eurostat (Peña, 2002) and reproduced in Table 1. These data are percentages of consumption of 9 different kinds of food in 25 countries in Europe in the early eighties.

Table 1: Food consumption expenditure in the East (E) and the West (W), published by Eurostat, in percent. The sample size is 25. Legend: RM: red meat; WM: white meat; F: fish; E: eggs; M: milk; C: cereals; S: starch; N: nuts; FV: fruit and vegetables.

  RM    WM    E     M     F     C     S    N    FV   group  country
 10.1   1.4  0.5   8.9   0.2  42.3  0.6  5.5  1.7    E     Albania
  8.9  14.0  4.3  19.9   2.1  28.0  3.6  1.3  4.3    W     Austria
 13.5   9.3  4.1  17.5   4.5  26.6  5.7  2.1  4.0    W     Belgium
  7.8   6.0  1.6   8.3   1.2  56.7  1.1  3.7  4.2    E     Bulgaria
  9.7  11.4  2.8  12.5   2.0  34.3  5.0  1.1  4.0    E     Czech Rep.
 10.6  10.8  3.7  25.0   9.9  21.9  4.8  0.7  2.4    W     Denmark
  9.5   4.9  2.7  33.7   5.8  26.3  5.1  1.0  1.4    W     Finland
 18.0   9.9  3.3  19.5   5.7  28.1  4.8  2.4  6.5    W     France
  9.3   4.6  2.1  16.6   3.0  43.6  6.4  3.4  2.9    E     FSU
  8.4  11.6  3.7  11.1   5.4  24.6  6.5  0.8  3.6    E     Germany (E)
 11.4  12.5  4.1  18.8   3.4  18.6  5.2  1.5  3.8    W     Germany (W)
 10.2   3.0  2.8  17.6   5.9  41.7  2.2  7.8  6.5    W     Greece
  5.3  12.4  2.9   9.7   0.3  40.1  4.0  5.4  4.2    E     Hungary
 13.9  10.0  4.7  25.8   2.2  24.0  6.2  1.6  2.9    W     Ireland
  9.0   5.1  2.9  13.7   3.4  36.8  2.1  4.3  6.7    W     Italy
  9.4   4.7  2.7  23.3   9.7  23.0  4.6  1.6  2.7    W     Norway
  6.9  10.2  2.7  19.3   3.0  36.1  5.9  2.0  6.6    E     Poland
  6.2   3.7  1.1   4.9  14.2  27.0  5.9  4.7  7.9    W     Portugal
  6.2   6.3  1.5  11.1   1.0  49.6  3.1  5.3  2.8    E     Rumania
  7.1   3.4  3.1   8.6   7.0  29.2  5.7  5.9  7.2    W     Spain
  9.9   7.8  3.5  24.7   7.5  19.5  3.7  1.4  2.0    W     Sweden
 13.1  10.1  3.1  23.8   2.3  25.6  2.8  2.4  4.9    W     Switzerland
  9.5  13.6  3.6  23.4   2.5  22.4  4.2  1.8  3.7    W     The Netherlands
 17.4   5.7  4.7  20.6   4.3  24.3  4.7  3.4  3.3    W     United Kingdom
  4.4   5.0  1.2   9.5   0.6  55.9  3.0  5.7  3.2    E     Yugoslavia

A preliminary question is which is the relevant information
in this data set and which is the sample space of the data. Although the data are presented here as percentages of expenditure, it is not clear what total expenditure means or how it was measured. Moreover, the data vectors do not add up to 100%. This means that there is an implicitly defined additional component, which we call other, that completes the total, i.e. 100%. Furthermore, we have doubts about the units of expenditure: if they are measured in different currencies, how have the reference prices been established? Also, if the units were tons of food of each type, what would the meaning of the above percentages be? What does a percentage of tons of a total, made of tons of meat plus tons of nuts, mean? These questions lead to two important conclusions:

• The definition of the total is irrelevant, both with respect to its units and to the reported components constituting the data vector.
• The information to be extracted from such a data set is not related to the units in which the original components were registered.

These points match the so-called principles of compositional data analysis (Aitchison, 1986; Aitchison and Egozcue, 2005; Egozcue, 2009). They can be summarized as scale invariance and subcompositional coherence. The first one states that a change of units should not alter compositional information. The second one advocates that a change of scale should be applicable to any subset of two or more components, called a subcomposition; also, that conclusions obtained from a subcomposition should not be in contradiction with those obtained from a composition including it. For instance, if an analyst studies the parts of animal-based food (meat, fish, ...), he should not reach a conclusion which stands in contradiction with the conclusions of another analyst dealing with the whole composition.
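The two principles can be checked numerically on a single row of Table 1. The short Python sketch below uses the Albania row; the variable names and the choice of the subcomposition (RM, WM, C) are illustrative assumptions, not part of the original analysis.

```python
# Albania row of Table 1, column order RM, WM, E, M, F, C, S, N, FV.
albania = [10.1, 1.4, 0.5, 8.9, 0.2, 42.3, 0.6, 5.5, 1.7]

# The reported parts add up to about 71.2, not 100: the remainder is the
# implicitly defined component called "other" in the text.
reported_total = sum(albania)
other = 100.0 - reported_total

# Ratio of red meat to cereals in the data as reported.
ratio_rm_c = albania[0] / albania[5]

# Scale invariance: a change of units (here an arbitrary factor of 1000)
# leaves the ratio unchanged.
rescaled = [1000.0 * v for v in albania]
ratio_rescaled = rescaled[0] / rescaled[5]

# Subcompositional coherence: the same ratio is obtained from the
# subcomposition (RM, WM, C) re-expressed in percent of its own total.
sub = [albania[0], albania[1], albania[5]]
sub_percent = [100.0 * v / sum(sub) for v in sub]
ratio_sub = sub_percent[0] / sub_percent[2]

print(reported_total, other)
print(ratio_rm_c, ratio_rescaled, ratio_sub)
```

The percentages themselves change when the total changes (red meat is a much larger share of the three-part subcomposition than of the nine reported parts), whereas the ratio between red meat and cereals stays the same in all three versions.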
The key point of these principles is that the only information conveyed by compositional data is contained in the ratios between the different parts of the observed composition. This is the case for the data set presented in Table 1, which should therefore be considered a compositional data set. Consequently, the sample space of the food consumption data is the 9-part simplex, $\mathcal{S}^9$. In order to represent the data set in the simplex, the closure operation is used: if $x = (x_1, x_2, \ldots, x_9)$ is one of the data vectors and $t = \sum_{j=1}^{9} x_j$, then the closed vector is $\mathcal{C}x = (x_1/t, x_2/t, \ldots, x_9/t)$, so that its components, called parts, add to 1. In this case, they are expressed in parts per unit. To obtain a different closure constant, the resulting closed vector has to be multiplied by the corresponding constant; e.g. $\kappa = 100$ gives percentages. The vectors $x$ and $\mathcal{C}x$ are said to be compositionally equivalent (Barceló-Vidal, Martín-Fernández, and Pawlowsky-Glahn, 2001). The importance of the closure is only apparent: the whole compositional analysis is based on scale invariance, and all characteristics of a composition are invariant under multiplication by a positive constant.

Another important point in compositional analysis is that the distances in the simplex, called Aitchison distances, are invariant under perturbation. Perturbation of a composition of D parts, $x$, by a D-vector with positive components, $p$, is defined as the composition $x \oplus p = \mathcal{C}(x_1 p_1, x_2 p_2, \ldots, x_D p_D)$ in $\mathcal{S}^D$. Perturbation is the addition in the Aitchison geometry of the simplex and can be viewed as a shift of $x$ by $p$.
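Closure and perturbation, as defined above, can be sketched in a few lines of Python. The function names `closure` and `perturb` and the perturbation vector `p` are illustrative assumptions; the data vector is the Albania row of Table 1.

```python
def closure(x, kappa=1.0):
    """Rescale a vector of positive components so that it adds to kappa."""
    total = sum(x)
    return [kappa * v / total for v in x]

def perturb(x, p):
    """Perturbation: componentwise product followed by closure."""
    return closure([xi * pi for xi, pi in zip(x, p)])

# Albania row of Table 1, column order RM, WM, E, M, F, C, S, N, FV.
x = [10.1, 1.4, 0.5, 8.9, 0.2, 42.3, 0.6, 5.5, 1.7]

cx = closure(x)                    # parts per unit
percent = closure(x, kappa=100.0)  # percentages of the nine reported parts

# Perturbing by a vector of equal components only changes the scale, so the
# closed result coincides with cx: such a vector acts as the neutral element.
shifted = perturb(x, [2.0] * len(x))

# Perturbing by p and then by its componentwise inverse recovers cx.
p = [1.0, 2.0, 0.5, 1.0, 3.0, 1.0, 1.0, 4.0, 1.0]  # hypothetical shift
back = perturb(perturb(x, p), [1.0 / v for v in p])

print(cx)
print(percent)
```

The last two checks reflect the role of perturbation as the addition of the simplex: it has a neutral element and every perturbation can be undone.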


The Aitchison geometry of the simplex provides a distance between two compositions,

$$d_a^2(x, y) = \frac{1}{D} \sum_{i<j} \left( \log\frac{x_i}{x_j} - \log\frac{y_i}{y_j} \right)^2 .$$
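A minimal Python sketch of this distance follows, assuming the standard pairwise log-ratio form of the Aitchison distance (sum over pairs i < j, divided by D, then the square root). It also checks numerically that the distance is unaffected by a change of scale and by a common perturbation. Function names and the perturbation vector `p` are hypothetical; the two data vectors are the Albania and Austria rows of Table 1.

```python
import math

def closure(x):
    """Rescale a vector of positive components so that it adds to 1."""
    total = sum(x)
    return [v / total for v in x]

def perturb(x, p):
    """Perturbation: componentwise product followed by closure."""
    return closure([xi * pi for xi, pi in zip(x, p)])

def aitchison_distance(x, y):
    """Aitchison distance, assuming the pairwise log-ratio form."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            diff = math.log(x[i] / x[j]) - math.log(y[i] / y[j])
            total += diff * diff
    return math.sqrt(total / d)

# Rows of Table 1, column order RM, WM, E, M, F, C, S, N, FV.
albania = [10.1, 1.4, 0.5, 8.9, 0.2, 42.3, 0.6, 5.5, 1.7]
austria = [8.9, 14.0, 4.3, 19.9, 2.1, 28.0, 3.6, 1.3, 4.3]

dist = aitchison_distance(albania, austria)

# Scale invariance: rescaling either vector does not change the distance.
dist_scaled = aitchison_distance([3.0 * v for v in albania], closure(austria))

# Invariance under perturbation: shifting both compositions by the same p.
p = [1.0, 2.0, 0.5, 1.0, 3.0, 1.0, 1.0, 4.0, 1.0]  # hypothetical shift
dist_perturbed = aitchison_distance(perturb(albania, p), perturb(austria, p))

print(dist, dist_scaled, dist_perturbed)
```

Only ratios of parts enter the computation, which is why neither the closure constant nor the implicit "other" component needs to be fixed before comparing two countries on the nine reported parts.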