Journal of Classification 29:144-169 (2012) DOI: 10.1007/s00357-012-9105-4
Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data Javier Palarea-Albaladejo Biomathematics & Statistics Scotland, UK
Josep Antoni Martín-Fernández, Universitat de Girona, Spain
Jesús A. Soto, Universidad Católica San Antonio, Spain
Abstract: Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements—compositions—are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed on the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature. Keywords: Fuzzy clustering; FCM; Compositional data; Closed data; Simplex space; Aitchison distance.
This research has been supported by the Scottish Government; by the Spanish Ministry of Science and Innovation under the project "CODA-RSS", Ref. MTM2009-13272; and by the Agència de Gestió d'Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under the project Ref. 2009SGR424. We are indebted to the editor and the referees for their helpful comments and suggestions on an earlier version of this paper. Corresponding Author's Address: Javier Palarea-Albaladejo, Biomathematics and Statistics Scotland, JCMB, The King's Buildings, Edinburgh, EH9 3JZ, UK, e-mail: javier@bioss.ac.uk. Published online 30 May 2012.
1. Introduction
The goal of any clustering method is to recognize homogeneous groups or clusters on the assumption that an underlying true group structure exists. Rather than the hard partitioning of classical clustering methods, where objects belong to a unique cluster only, fuzzy methods reflect the uncertainty with which objects are assigned to the different clusters, providing a degree of membership to each one of them and facilitating the identification of overlapping groups. Most clustering algorithms are designed to deal with data on the whole real space. However, this is not always the case, e.g. data on a bounded interval, binary data or directional data, to name just a few. Ignoring the geometric structure of the space of features could lead to results that are without significance or logically inconsistent, although sometimes apparently suitable or reasonable. In this paper we focus on objects described by vectors comprising parts of some whole, which are termed compositional or closed data. This kind of data is typically measured in proportions, percentages, parts per million, or similar. Some illustrative examples are the chemical composition of rocks, the nutritional composition of foods, multiparty electoral results, time allocation in a day, distribution of household budget shares, investment portfolio composition, proportions of red, blue and green defining the color of a pixel, and so on. Formally, a composition is a vector $\mathbf{x}$ defined on the $(D-1)$-dimensional simplex space
$$S^D = \left\{\mathbf{x} = [x_1, x_2, \ldots, x_D] : x_1 > 0, x_2 > 0, \ldots, x_D > 0;\ \sum_{i=1}^{D} x_i = \kappa\right\},$$
where $\kappa$ is an arbitrary positive constant—usually 1 (parts per one), 100 (percentages), or $10^6$ (ppm). A transformation-based methodology to deal with compositional data began to emerge in the early 1980s, culminating in the monograph by Aitchison (1986). The point is that compositions provide information only about the relative magnitudes of their components. Thus the data analysis must be focused on the ratios between components, which implies that the value of the closure constant $\kappa$ is irrelevant. This idea was summarized in the principle known as scale invariance (Aitchison 1986). After Aitchison's monograph, knowledge of the algebraic-geometric structure of the simplex has increased significantly. It is now known that the simplex $S^D$ has a metric vector space structure (see Billheimer, Guttorp and Fagan 2001; Pawlowsky-Glahn and Egozcue 2001). Such a structure allows us to define orthonormal bases from which any element of $S^D$ may be obtained. As detailed in the next section, the coordinates of a composition with respect to an orthonormal basis can be used to conduct the data analysis in a well-founded way.
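The closure operation and the scale-invariance principle can be illustrated with a short Python sketch (the helper name `closure` is ours, not from the paper):

```python
import numpy as np

def closure(x, kappa=1.0):
    """C: rescale a vector of positive parts so that it sums to kappa."""
    x = np.asarray(x, dtype=float)
    return kappa * x / x.sum()

raw = [1.0, 2.0, 7.0]
proportions = closure(raw)          # kappa = 1 (parts per one)
percentages = closure(raw, 100.0)   # kappa = 100 (percentages)

# Scale invariance: ratios between components do not depend on kappa.
assert np.allclose(proportions[1] / proportions[0],
                   percentages[1] / percentages[0])
```

Any analysis built on ratios of components therefore gives identical answers whatever closure constant is used.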
As shown throughout this work, the constraints of the simplex space make it in general inappropriate to apply the ordinary approach to clustering within its confines. The Euclidean distance is beyond doubt the one most widely used in cluster analysis, but in any application one must investigate whether the measure used is coherent with the underlying space of features. To illustrate this point we here consider an example about votes in political elections. Suppose that we have collected the results in two electoral districts—E1 and E2—in two different elections: 2004 and 2008. In these elections people can vote for party D or party R, or simply abstain. Table 1 shows a particular example. The results for district E1 in 2004 form a composition from $S^3$: E104 = [0.1, 0.2, 0.7]. It is easy to see that if one only wants to analyze the poll percentages (people who vote), a subvector—subcomposition—must be considered. In this case, the subcomposition SE104 = [1/3, 2/3] from $S^2$ is formed. Concerning the results in district E1, the number of people who voted for party D in 2008 was double that of 2004, whereas the number who voted for R was 50 percent lower. The level of abstention was the same in both cases (70%). On the other hand, in district E2 party D increased its vote by 33%, whereas party R suffered a decrease of 25%. The same proportion of abstention was observed in the two elections (30%). A sensible measure of similarity should detect a larger difference between the electoral results in district E1 than between the results in E2. Nevertheless, when the classical Euclidean distance is calculated, the same difference (0.14; see Table 3) is obtained in both cases. The reason is that this measure is based upon the same vector of absolute differences [0.1, −0.1, 0]. Note, however, that the respective vectors of between-elections ratios—[2, 1/2, 1] and [4/3, 3/4, 1]—reflect such differences in a better way.
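The point is easy to reproduce; a minimal Python check with the Table 1 figures:

```python
import numpy as np

# Compositions [D, R, Abstention] from Table 1.
E104, E108 = np.array([0.1, 0.2, 0.7]), np.array([0.2, 0.1, 0.7])
E204, E208 = np.array([0.3, 0.4, 0.3]), np.array([0.4, 0.3, 0.3])

# The Euclidean distance is identical (0.14) for both districts...
d_E1 = np.linalg.norm(E108 - E104)
d_E2 = np.linalg.norm(E208 - E204)
assert np.isclose(d_E1, d_E2)

# ...whereas the between-elections ratio vectors clearly differ.
print(E108 / E104)   # ratios 2, 1/2, 1
print(E208 / E204)   # ratios 4/3, 3/4, 1
```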
An additional problem is related to the principle known as subcompositional coherence (Aitchison 1986). This principle can be illustrated from several perspectives, but we focus here on the one related to distances—called subcompositional dominance. In our electoral example, subcompositional dominance implies that the distance between any two elections based on the subcomposition Poll = [D, R] cannot be greater than the distance based on the full composition Election = [D, R, Abst.]. In addition, recall that, in each district, the two results have the same amount of abstention. Therefore, results from any data analysis based only on Poll must be the same as those based on Election. As will be seen in Section 3 (Table 3), the Euclidean distance does not fulfill the subcompositional dominance principle. In the present paper the adequacy of a number of distances proposed for compositional data is discussed on the basis of compositional principles, and a coherent framework for applying the widely-used fuzzy c-means
Table 1. Electoral results in districts E1 and E2 for two elections: 2004 and 2008.

Election    D     R     Abstent.
E104       0.1   0.2    0.7
E108       0.2   0.1    0.7
E204       0.3   0.4    0.3
E208       0.4   0.3    0.3

Poll        D     R
SE104      1/3   2/3
SE108      2/3   1/3
SE204      3/7   4/7
SE208      4/7   3/7
(FCM) algorithm (Bezdek 1981) is introduced. In addition to theoretical results, our proposal is numerically evaluated on simulated data. Finally, an application example is worked out in detail.

2. Mathematical Framework
Two basic operations are defined in the simplex: perturbation and powering. For any two compositions $\mathbf{x}, \mathbf{x}^* \in S^D$, the perturbation operation $\oplus$ is defined as $\mathbf{x} \oplus \mathbf{x}^* = C(x_1 x^*_1, x_2 x^*_2, \ldots, x_D x^*_D)$, whereas the powering operation $\otimes$ is given by $\alpha \otimes \mathbf{x} = C(x_1^\alpha, x_2^\alpha, \ldots, x_D^\alpha)$, with $\alpha \in \mathbb{R}$. Here $C$ denotes the closure operation, which scales the resulting vector to the constant sum $\kappa$. From now on we will consider $\kappa = 1$ without loss of generality, thus $C$ divides each component $x_i$ by $\sum_{i=1}^{D} x_i$. These operations have a role in the simplex analogous to that of sum and scalar multiplication in real space. Of special relevance here is the vector of ratios known as the perturbation difference, defined as $\mathbf{x} \ominus \mathbf{x}^* = \mathbf{x} \oplus ((-1) \otimes \mathbf{x}^*) = C(x_1/x^*_1, \ldots, x_D/x^*_D)$. As the above electoral example suggests, any scalar measure of difference between compositions should be expressible in terms of the perturbation difference (Aitchison 1992). Perturbation and powering define a vector space on the simplex, perturbation being a commutative group operation and powering the external product. An inner product is defined as
$$\langle \mathbf{x}, \mathbf{x}^* \rangle_a = \frac{1}{D} \sum_{i<j} \log\frac{x_i}{x_j}\, \log\frac{x^*_i}{x^*_j}.$$

For membership values $u > 0.7$, it can be seen that the FCM clearly confuses the boundaries between Cluster 4 and the other ones—especially at the extremes of the ellipsoidal clouds of points characterizing the latter. Nevertheless, more surprising behavior arises in the vertices (see e.g. Cluster 1), where it assigns points with a very high probability ($u > 0.9$). Loosely speaking, the FCM is very confident about the membership of those compositions. But recall that near the vertices a small movement implies a great relative change and, consequently, a higher uncertainty about cluster memberships is expected. This fact is taken into account by the FCM-C, and therefore it is in the vertices—as well as on the sides of the triangle—where the differences between the two are most evident.
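The simplex operations introduced in Section 2 can be sketched in Python (helper names are ours; `aitchison_dist` computes the distance induced by the inner product through the perturbation difference):

```python
import numpy as np

def closure(x):
    """C: rescale positive parts to sum to 1 (kappa = 1)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation x (+) y: componentwise product, then closure."""
    return closure(np.asarray(x, float) * np.asarray(y, float))

def power(alpha, x):
    """Powering alpha (x) x: componentwise power, then closure."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def aitchison_inner(x, y):
    """<x, y>_a = (1/D) sum_{i<j} log(x_i/x_j) log(y_i/y_j)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = len(x)
    return sum(np.log(x[i] / x[j]) * np.log(y[i] / y[j])
               for i in range(D) for j in range(i + 1, D)) / D

def aitchison_dist(x, y):
    """d_a(x, y): norm of the perturbation difference x (-) y."""
    d = perturb(x, power(-1.0, y))
    return np.sqrt(aitchison_inner(d, d))

# Perturbation invariance: d_a(x (+) p, y (+) p) = d_a(x, y).
x, y, p = closure([1, 2, 7]), closure([2, 1, 7]), closure([3, 1, 1])
assert np.isclose(aitchison_dist(perturb(x, p), perturb(y, p)),
                  aitchison_dist(x, y))
```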
Figure 2. α-cuts (α = 0.5, 0.7 and 0.9) for Scenario A: left-side plots refer to FCM and right-side plots to FCM-C.
5.2 Application to Nutritional Data: Mammals’ Milk Composition
Our approach is now applied to a classic data set that has been widely used as test data in clustering. The data are taken from Hartigan (1975) and describe the percentage composition of 24 mammals’ milk on the basis of 5 different constituents (water, protein, fat, lactose and ash). A 4-group solution has been considered as the optimal grouping of such data. The membership probabilities from the FCM-C approach are shown in Table 7. The last two columns indicate the cluster to which each mammal is assigned—highest membership probability—and the Aitchison distance to the respective prototype. This latter is shown graphically in Figure 4,
Figure 3. α-cuts (α = 0.5, 0.7 and 0.9) for Scenario B: left-side plots refer to FCM and right-side plots to FCM-C.
which informs us about the homogeneity of the clusters. Some mammals are clearly assigned to a unique cluster, such as the deer and reindeer to Cluster 3, whereas this does not occur with others such as the elephant, due to its low highest membership probability (0.36). Even more problematic is the case of the hippo, whose highest and second highest membership probabilities are almost the same, thus revealing a great uncertainty in relation to its belonging to either Cluster 2 or 4. Note that there is not necessarily an inverse relationship between membership probability and distance from the prototype, as might appear at first sight. For example, the dolphin is classified in Cluster 3 with probability 0.64, whereas the rat, which is closer to the prototype, is classified in the same cluster with a smaller probability (0.45). Recall that we are taking into account the position of a mammal with
Table 7. Membership probabilities, assigned cluster number and Aitchison distance to the prototype.

                  Membership probabilities
                  1       2       3       4     Cluster  d_a(x_i, ν_k)
1. Horse        0.0723  0.7267  0.0268  0.1741     2       0.6229
2. Orangutan    0.1067  0.3695  0.0363  0.4875     4       0.7131
3. Monkey       0.0766  0.5621  0.0253  0.3360     2       0.5727
4. Donkey       0.0131  0.9355  0.0046  0.0468     2       0.2133
5. Hippo        0.1960  0.3407  0.1185  0.3448     4       1.9405
6. Camel        0.0525  0.0651  0.0087  0.8738     4       0.2079
7. Bison        0.1941  0.4059  0.0516  0.3484     2       0.9180
8. Buffalo      0.7162  0.0490  0.0570  0.1778     1       0.3931
9. Guinea Pig   0.7272  0.0409  0.1337  0.0982     1       0.4473
10. Cat         0.6683  0.0714  0.0976  0.1628     1       0.5479
11. Fox         0.8492  0.0277  0.0231  0.1000     1       0.2507
12. Llama       0.1110  0.1526  0.0185  0.7178     4       0.3472
13. Mule        0.0637  0.6576  0.0189  0.2597     2       0.4650
14. Pig         0.7998  0.0420  0.0515  0.1068     1       0.3669
15. Zebra       0.0521  0.0437  0.0094  0.8948     4       0.2048
16. Sheep       0.7967  0.0368  0.0357  0.1308     1       0.3137
17. Dog         0.5155  0.0504  0.3299  0.1042     1       0.6984
18. Elephant    0.3636  0.1255  0.2597  0.2512     1       1.3294
19. Rabbit      0.1033  0.0188  0.8443  0.0336     3       0.4086
20. Rat         0.3872  0.0527  0.4549  0.1052     3       0.8149
21. Deer        0.0256  0.0047  0.9609  0.0088     3       0.1910
22. Reindeer    0.0265  0.0052  0.9588  0.0095     3       0.2072
23. Whale       0.0323  0.0080  0.9460  0.0137     3       0.2853
24. Dolphin     0.1792  0.0705  0.6444  0.1059     3       1.2760
respect to each and every one of the clusters. Thus a membership probability cannot generally be understood as a measure of typicality. On the other hand, we may say that the FCM-C provides a more precise grouping. For example, the mammals numbered from 1 to 7 are usually classified into the same cluster, as can be checked using common clustering software. However, the FCM-C sorts out horse, monkey, donkey and bison (in Cluster 2) from orangutan, hippo and camel (in Cluster 4). In Table 8 it can be observed that the prototypes of both clusters are in fact very similar, but the FCM-C is able to detect that fat contents are slightly higher in the latter. Now, the nutritional pattern characterizing the groups is investigated. We first analyze the relative weight of each milk component in every group. For this, the data are centered by calculating the difference $\mathbf{x} \ominus \mathbf{g}$, with $\mathbf{g}$ denoting the closed geometric mean, which is obtained as the closure of the vector of geometric means of each component. In this way, the geometric center of the observed data is moved to the center of the simplex space—[1/5, 1/5, 1/5, 1/5, 1/5] in our case. Next, we compute the cluster
Figure 4. Aitchison distance between mammals and their cluster prototypes.
Table 8. Final cluster prototypes (in percentage).

           Water   Protein    Fat    Lactose   Ash
Cluster 1  81.25     6.76     6.86     4.06    1.07
Cluster 2  89.73     2.04     1.73     6.07    0.43
Cluster 3  66.44    10.79    19.26     2.12    1.38
Cluster 4  87.23     3.04     3.82     5.26    0.64
prototypes for the centered data and compare them with that center by division, in accordance with the multiplicative nature of movements in the simplex. These divisions account for the relative weights of the components and are shown in Figure 5A—expressed in logarithms for easy visualization. Observe that Cluster 1 corresponds to a central group that does not clearly stand out in any component. Cluster 2 is constituted by mammals with relatively high levels of water and lactose and relatively low levels of protein and fat. Cluster 3 is characterized by relatively high levels of protein and fat and relatively low levels of water and lactose. Lastly, Cluster 4 is similar to Cluster 2, but with more fat and less protein, thus confirming what was pointed out above. Additionally, a multivariate analysis of variance (MANOVA), using cluster membership as the factor, is carried out in the space of coordinates to test the adequacy of the clustering results. The MANOVA outcomes indicate statistically significant differences between the cluster prototypes based on the Wilks' Lambda statistic, commonly used when there are more
Figure 5. Cluster profile: (A) relative weights of milk components for every cluster and (B) similarity between clusters.
than two groups ($\lambda_W = 0.003$, $F = 1613.527$, $p$-value < 0.0001). Besides, a dendrogram of the cluster prototypes based on the single linkage method is also generated (Figure 5B). It can again be seen that Clusters 2 and 4 are the groups with the most similar characteristics, and also that Cluster 3 is the most different one. Hence, well-separated groups are recognized by the FCM-C algorithm. Finally, we look for the components which help the most to distinguish the groups using a biplot display. This graphical tool (Gabriel 1971) represents both the objects and the components of the vector of features in the space of the first two principal components—see Aitchison and Greenacre (2002) for its adaptation to compositions—and shows the separation of the groups on the basis of their milk components (Figure 6A). The biplot explains 91% of the total variability in the data, thus the 2-dimensional representation is well-supported. The clusters are basically defined by their projections along the first principal component, that is, according to their contents in lactose, water and fat. In fact, if we represent that subcomposition in a ternary plot (Figure 6B), the clusters are clearly distinguished—both the data and the grid lines have been centered. Note that the position of the groups with respect to the components agrees with the characterization from Figures 5A and 6A.

6. Discussion
We have shown that most commonly-used distances for compositions do not agree with the main geometric principles of the simplex sample space: scale invariance and subcompositional dominance. Such principles are satisfied by two measures: a distance derived from the geometry of the simplex itself, the Aitchison distance; and a measure of divergence, the C-KL dissimilarity. As proven here, the C-KL dissimilarity is closely related to the
Figure 6. (A) Biplot representation of groups and components. (B) Ternary plot for subcomposition [Water, Fat, Lactose] (centered data).
Aitchison distance. The role of the main log-ratio transformations has also been discussed. Focusing our interest on fuzzy clustering, we adopt an approach based upon expressing compositions in terms of their real coordinates with respect to an orthonormal basis of the simplex. In this way, the Aitchison distance is implicitly used and, at the same time, post-clustering analysis may be suitably conducted in the space of coordinates. Considering two opposite scenarios, numerical results highlight some practical problems concerning ordinary Euclidean FCM clustering and give support to the FCM-C algorithm proposed. Only when compositions are formed by components taking similar values may both approaches provide similar results. The mammals' milk data set illustrates how to apply the approach in a practical setting. The resulting grouping is consistent with previous results, whilst particular features are now stressed. Some other works on clustering compositional data can be found in the literature. For instance, Greenacre (1988) introduces an approach based on the chi-squared distance for clustering rows and columns of a contingency table, which could be applied to compositional data. Unfortunately, the property of subcompositional coherence is not satisfied in this case. DeSarbo, Ramaswamy and Lenk (1993) propose a parametric clustering procedure via a mixture of Dirichlet distributions. As already pointed out by Aitchison (1986), a Dirichlet model implies a restrictive independence assumption for the ratios between components. In addition, the covariances are all negative—which narrows the range of possible applications—and not perturbation invariant. Chacón, Mateu-Figueras and Martín-Fernández (2011) provide further discussion on these points.
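The approach summarized above can be sketched in a few lines of Python. This is our illustrative reading, not the authors' MATLAB implementation: `ilr_basis` builds one particular orthonormal basis, and `fcm` is the plain Bezdek (1981) algorithm with fuzzifier m = 2. Running FCM on the ilr coordinates is what makes the Aitchison distance implicit.

```python
import numpy as np

def clr(X):
    """Centered log-ratio: log-parts minus their row mean."""
    L = np.log(np.atleast_2d(X))
    return L - L.mean(axis=1, keepdims=True)

def ilr_basis(D):
    """One orthonormal basis (D x (D-1)) of the zero-sum clr hyperplane."""
    Psi = np.zeros((D, D - 1))
    for j in range(1, D):
        Psi[:j, j - 1] = np.sqrt(1.0 / (j * (j + 1)))
        Psi[j, j - 1] = -np.sqrt(j / (j + 1.0))
    return Psi

def ilr(X):
    """Isometric log-ratio coordinates: the Euclidean distance between
    ilr vectors equals the Aitchison distance between compositions."""
    X = np.atleast_2d(X)
    return clr(X) @ ilr_basis(X.shape[1])

def fcm(Y, c, m=2.0, n_iter=200, seed=0):
    """Standard FCM (Bezdek 1981) on real coordinates Y (n x p)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=Y.shape[0])   # initial memberships
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ Y) / W.sum(axis=0)[:, None]       # cluster prototypes
        d2 = ((Y[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)     # membership update
    return U, V

# FCM-C: cluster compositions through their ilr coordinates.
X = np.array([[0.10, 0.20, 0.70], [0.12, 0.18, 0.70],
              [0.40, 0.30, 0.30], [0.38, 0.32, 0.30]])
U, V = fcm(ilr(X), c=2)
```

Prototypes found in coordinates can be mapped back to the simplex by inverting the ilr transformation, i.e. closing `exp(V @ Psi.T)`.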
On the other hand, Templ, Filzmoser and Reimann (2008) empirically test different clustering strategies and transformations on data sets from geochemistry. In particular, the log-transformation is considered. This transformation is commonly used in many scientific disciplines, but it is not coherent with the geometrical structure of the simplex. It can be checked that the Euclidean distance between log-transformed data is generally greater than the Aitchison distance. However, from a purely practical point of view, the Euclidean distance applied to log-scaled data is expected to produce results very close to the Aitchison distance when working with high-dimensional data, e.g. microarray data, as shown in Vêncio, Varuzza, Pereira, Brentani, and Shmulevich (2007). In this latter work it is also exemplified that the pitfalls of using the Euclidean distance on raw compositions persist in high-dimensional data problems. Even though under certain circumstances (typically data rather centered in the simplex) clustering based on either the Euclidean or the Aitchison geometry may lead to similar results, our experience is that such cases are not commonplace in practice; and still the problems related to covariances and correlations between compositions are a strong argument in favour of adopting a genuine log-ratio approach. From our point of view, the importance of working within a proper theoretical framework cannot be neglected. If the conditions for applying FCM (as in our case) are not fulfilled, then another approach or extension may be preferable. But this is the case regardless of whether we are dealing with compositional data or with any other type of data. Finally, note that in practice compositions frequently include very small values that are registered as false zeros, usually due to the existence of detection limits or round-off errors.
The presence of zeros prevents us from applying any measure or technique based upon ratios of components—such as the Aitchison distance and the coordinate representation. This important practical problem has not been dealt with here; instead we refer the reader to Martín-Fernández, Barceló-Vidal and Pawlowsky-Glahn (2003), Palarea-Albaladejo, Martín-Fernández and Gómez-García (2007), and Palarea-Albaladejo and Martín-Fernández (2008) for recent proposals on the subject.

Appendix A: Proofs of Some Results in Section 3

1. Perturbation invariance for $d_a$ and $d_{C\text{-}KL}$: $d(\mathbf{x} \oplus \mathbf{x}', \mathbf{x}^* \oplus \mathbf{x}') = d(\mathbf{x}, \mathbf{x}^*)$, $\forall\, \mathbf{x}, \mathbf{x}^*, \mathbf{x}' \in S^D$. By using the standard properties of the logarithm function and the fact that both measures are defined in terms of ratios of components, it is straightforward to show that the perturbation invariance property is satisfied in both cases.
2. Subcompositional dominance for $d_a$ and $d_{C\text{-}KL}$: $d(\mathbf{s}_x, \mathbf{s}_{x^*}) \le d(\mathbf{x}, \mathbf{x}^*)$, for any $C$-component subcompositions $\mathbf{s}_x$ and $\mathbf{s}_{x^*}$ in $S^C$. Aitchison (1992) includes a proof of the subcompositional dominance property for the Aitchison distance. The proof for $d_{C\text{-}KL}$ is as follows. Given that $d_{C\text{-}KL}$ (as well as $d_a$) is invariant under permutation of the components of the compositions, we can assume without loss of generality that the subcompositions $\mathbf{s}_x$ and $\mathbf{s}_{x^*}$ are those resulting from leaving out a certain number of the last components of the compositions $\mathbf{x}$ and $\mathbf{x}^*$ in $S^D$. Additionally, given the way subcompositions are built, it is enough to prove that
$$d(\mathbf{x}, \mathbf{x}^*) \ge d(\mathbf{x}_{D-1}, \mathbf{x}^*_{D-1}),$$
where $\mathbf{x}_{D-1}$ and $\mathbf{x}^*_{D-1}$ are the subcompositions resulting from removing the $D$th component. Given the definition of $d_{C\text{-}KL}$, this is equivalent to proving that
$$D \log\left[\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\right)\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x^*_i}{x_i}\right)\right] \ge (D-1) \log\left[\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x_i}{x^*_i}\right)\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x^*_i}{x_i}\right)\right].$$
Operating on the left-hand side and taking into account the fact that the logarithm is a concave function, it results that
$$D \log\left[\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\right)\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x^*_i}{x_i}\right)\right] = D \log\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\right) + D \log\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x^*_i}{x_i}\right)$$
$$= D \log\left(\frac{1}{D}\,\frac{x_D}{x^*_D} + \frac{D-1}{D}\cdot\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x_i}{x^*_i}\right) + D \log\left(\frac{1}{D}\,\frac{x^*_D}{x_D} + \frac{D-1}{D}\cdot\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x^*_i}{x_i}\right)$$
$$\ge D\left[\frac{1}{D}\log\frac{x_D}{x^*_D} + \frac{D-1}{D}\log\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x_i}{x^*_i}\right)\right] + D\left[\frac{1}{D}\log\frac{x^*_D}{x_D} + \frac{D-1}{D}\log\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x^*_i}{x_i}\right)\right].$$
Note that the terms $\frac{1}{D}\log(x_D/x^*_D)$ and $\frac{1}{D}\log(x^*_D/x_D)$ cancel out and $D$ simplifies. Then,
$$D \log\left[\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\right)\left(\frac{1}{D}\sum_{i=1}^{D}\frac{x^*_i}{x_i}\right)\right] \ge (D-1)\left[\log\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x_i}{x^*_i}\right) + \log\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x^*_i}{x_i}\right)\right]$$
$$= (D-1)\log\left[\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x_i}{x^*_i}\right)\left(\frac{1}{D-1}\sum_{i=1}^{D-1}\frac{x^*_i}{x_i}\right)\right].$$
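The dominance inequality can be checked numerically on the electoral data of Table 1. This is a sketch assuming the squared form $d^2_{C\text{-}KL}(\mathbf{x}, \mathbf{x}^*) = \frac{D}{2}\log\big[\big(\frac{1}{D}\sum_i x_i/x^*_i\big)\big(\frac{1}{D}\sum_i x^*_i/x_i\big)\big]$ used in this appendix; the helper names are ours.

```python
import numpy as np

def closure(x):
    """Rescale positive parts to sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def d_ckl(x, y):
    """C-KL dissimilarity, from the squared form used in Appendix A."""
    x, y = closure(x), closure(y)
    return np.sqrt(0.5 * len(x) * np.log(np.mean(x / y) * np.mean(y / x)))

# Full compositions (Election) and subcompositions (Poll) from Table 1.
full = d_ckl([0.1, 0.2, 0.7], [0.2, 0.1, 0.7])
sub = d_ckl([1 / 3, 2 / 3], [2 / 3, 1 / 3])
assert sub <= full   # subcompositional dominance holds
```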
3. $d_a(\mathbf{x}, \mathbf{x}^*) \approx \sqrt{2}\, d_{C\text{-}KL}(\mathbf{x}, \mathbf{x}^*)$, $\forall\, \mathbf{x}, \mathbf{x}^* \in S^D$.

By squaring the definition of $d_{C\text{-}KL}$ in Table 2 and after some algebraic manipulations, we can express this distance in terms of the sample covariance between the ratios $\mathbf{x}/\mathbf{x}^*$ and $\mathbf{x}^*/\mathbf{x}$:
$$d^2_{C\text{-}KL}(\mathbf{x}, \mathbf{x}^*) = \frac{D}{2}\log\left(\frac{1}{D^2}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\sum_{i=1}^{D}\frac{x^*_i}{x_i}\right) = \frac{D}{2}\log\left(\frac{1}{D^2}\sum_{i=1}^{D}\frac{x_i}{x^*_i}\sum_{i=1}^{D}\frac{x^*_i}{x_i} - 1 + 1\right) = \frac{D}{2}\log\left(1 - \operatorname{cov}(\mathbf{x}/\mathbf{x}^*, \mathbf{x}^*/\mathbf{x})\right).$$
It is well-known that the first-order MacLaurin polynomial approximation to the function $\log(1-x)$ is $-x$. Applying this to the last expression, it turns out that
$$d^2_{C\text{-}KL}(\mathbf{x}, \mathbf{x}^*) \approx -\frac{D}{2}\operatorname{cov}(\mathbf{x}/\mathbf{x}^*, \mathbf{x}^*/\mathbf{x}).$$
On the other hand, the squared Aitchison distance can also be expressed in terms of the sample covariance between $\mathbf{x}/\mathbf{x}^*$ and $\mathbf{x}^*/\mathbf{x}$ as follows:
$$d^2_a(\mathbf{x}, \mathbf{x}^*) = \sum_{i=1}^{D}\left(\log\frac{x_i}{g(\mathbf{x})} - \log\frac{x^*_i}{g(\mathbf{x}^*)}\right)^2 = \frac{1}{2}\left[\sum_{i=1}^{D}\left(\log\frac{x_i/x^*_i}{g(\mathbf{x})/g(\mathbf{x}^*)}\right)^2 + \sum_{i=1}^{D}\left(\log\frac{x^*_i/x_i}{g(\mathbf{x}^*)/g(\mathbf{x})}\right)^2\right]$$
$$= \frac{D}{2}\left[\operatorname{var}\left(\log(\mathbf{x}/\mathbf{x}^*)\right) + \operatorname{var}\left(\log(\mathbf{x}^*/\mathbf{x})\right)\right],$$
since $\log(g(\mathbf{x})/g(\mathbf{x}^*)) = \frac{1}{D}\sum_{i=1}^{D}\log(x_i/x^*_i)$. Using $\operatorname{var}(u) + \operatorname{var}(v) = \operatorname{var}(u + v) - 2\operatorname{cov}(u, v)$ and the fact that $\log(\mathbf{x}/\mathbf{x}^*) + \log(\mathbf{x}^*/\mathbf{x}) = \mathbf{0}$,
$$d^2_a(\mathbf{x}, \mathbf{x}^*) = \frac{D}{2}\left[\operatorname{var}\left(\log(\mathbf{x}/\mathbf{x}^*) + \log(\mathbf{x}^*/\mathbf{x})\right) - 2\operatorname{cov}\left(\log(\mathbf{x}/\mathbf{x}^*), \log(\mathbf{x}^*/\mathbf{x})\right)\right] = -D\operatorname{cov}\left(\log(\mathbf{x}/\mathbf{x}^*), \log(\mathbf{x}^*/\mathbf{x})\right).$$
Again using a first-order MacLaurin polynomial approximation, the function $\log x$ can be approximated by $x - 1$. Then,
$$d^2_a(\mathbf{x}, \mathbf{x}^*) \approx -D\operatorname{cov}(\mathbf{x}/\mathbf{x}^* - 1, \mathbf{x}^*/\mathbf{x} - 1) = -D\operatorname{cov}(\mathbf{x}/\mathbf{x}^*, \mathbf{x}^*/\mathbf{x}).$$
Consequently, it results that $-\frac{2}{D}\, d^2_{C\text{-}KL}(\mathbf{x}, \mathbf{x}^*) \approx -\frac{1}{D}\, d^2_a(\mathbf{x}, \mathbf{x}^*)$, and the relationship between both measures holds for the whole simplex space.
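A quick numerical illustration of this result (a sketch with our own helper names; `d_a` is computed via clr coordinates and `d_ckl` uses the squared form derived in this appendix):

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def d_a(x, y):
    """Aitchison distance via centered log-ratio coordinates."""
    cx = np.log(closure(x)); cx -= cx.mean()
    cy = np.log(closure(y)); cy -= cy.mean()
    return np.linalg.norm(cx - cy)

def d_ckl(x, y):
    """C-KL dissimilarity from its squared form."""
    x, y = closure(x), closure(y)
    return np.sqrt(0.5 * len(x) * np.log(np.mean(x / y) * np.mean(y / x)))

# Nearby compositions: component ratios are close to 1, so the
# first-order approximations in the proof are accurate.
x, y = [0.30, 0.30, 0.40], [0.31, 0.29, 0.40]
lhs, rhs = d_a(x, y), np.sqrt(2.0) * d_ckl(x, y)
assert abs(lhs - rhs) / lhs < 1e-3   # d_a ~= sqrt(2) * d_CKL
```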
4. $d_a(\mathbf{x}, \mathbf{x}^*) \approx D\, d_e(\mathbf{x}, \mathbf{x}^*)$ for any two compositions $\mathbf{x}$ and $\mathbf{x}^*$ near the barycenter of $S^D$.

Two compositions near the barycenter $\mathbf{b}_D = [1/D, \ldots, 1/D]$ of $S^D$ can be expressed as $\mathbf{x} = \mathbf{b}_D \oplus \mathbf{n}_x$ and $\mathbf{x}^* = \mathbf{b}_D \oplus \mathbf{n}_{x^*}$, where $\mathbf{n}_x$ and $\mathbf{n}_{x^*}$ are vectors with all components approximately equal to $1/D$. Then, the perturbation invariance property implies that $d_a(\mathbf{x}, \mathbf{x}^*) = d_a(\mathbf{n}_x, \mathbf{n}_{x^*})$. On the other hand, under a Euclidean perspective, two points near the barycenter can be written as $\mathbf{x} = \mathbf{b}_D + \boldsymbol{\delta}_x$ and $\mathbf{x}^* = \mathbf{b}_D + \boldsymbol{\delta}_{x^*}$, where $\boldsymbol{\delta}_x$ and $\boldsymbol{\delta}_{x^*}$ are two vectors tending toward the zero vector in $\mathbb{R}^D$. Thus, $d_e(\mathbf{x}, \mathbf{x}^*) = d_e(\boldsymbol{\delta}_x, \boldsymbol{\delta}_{x^*})$. Then, taking $\boldsymbol{\delta}_x = \mathbf{n}_x - \mathbf{b}_D$ and $\boldsymbol{\delta}_{x^*} = \mathbf{n}_{x^*} - \mathbf{b}_D$ and applying Taylor's expansion of the logarithm, the approximate relation $d_a(\mathbf{x}, \mathbf{x}^*) \approx D\, d_e(\mathbf{x}, \mathbf{x}^*)$ holds.

Appendix B: Simulated Data Generation for Scenarios A and B

The data sets have been generated using MATLAB from the following mixtures of 2-dimensional normal models:
$$\mathbf{Y}_A \sim 0.325\,N(\boldsymbol{\mu}^A_1, \Sigma^A_1) + 0.250\,N(\boldsymbol{\mu}^A_2, \Sigma^A_2) + 0.275\,N(\boldsymbol{\mu}^A_3, \Sigma^A_3) + 0.150\,N(\boldsymbol{\mu}^A_4, \Sigma^A_4)$$
$$\mathbf{Y}_B \sim 0.325\,N(\boldsymbol{\mu}^B_1, \Sigma^B_1) + 0.250\,N(\boldsymbol{\mu}^B_2, \Sigma^B_2) + 0.275\,N(\boldsymbol{\mu}^B_3, \Sigma^B_3) + 0.150\,N(\boldsymbol{\mu}^B_4, \Sigma^B_4),$$
where
$$\begin{aligned}
\boldsymbol{\mu}^A_1 &= \begin{bmatrix} 0.15 \\ -1.66 \end{bmatrix}, &
\Sigma^A_1 &= \begin{bmatrix} 0.28 & -0.02 \\ -0.02 & 0.14 \end{bmatrix}, &
\boldsymbol{\mu}^A_2 &= \begin{bmatrix} -0.86 \\ -0.29 \end{bmatrix}, &
\Sigma^A_2 &= \begin{bmatrix} 0.15 & 0.07 \\ 0.07 & 0.25 \end{bmatrix}, \\
\boldsymbol{\mu}^A_3 &= \begin{bmatrix} 0.89 \\ -0.57 \end{bmatrix}, &
\Sigma^A_3 &= \begin{bmatrix} 0.16 & -0.05 \\ -0.05 & 0.27 \end{bmatrix}, &
\boldsymbol{\mu}^A_4 &= \begin{bmatrix} 0 \\ -0.65 \end{bmatrix}, &
\Sigma^A_4 &= \begin{bmatrix} 0.06 & 0 \\ 0 & 0.05 \end{bmatrix}, \\
\boldsymbol{\mu}^B_1 &= \begin{bmatrix} -0.53 \\ -1.75 \end{bmatrix}, &
\Sigma^B_1 &= \begin{bmatrix} 0.68 & -0.17 \\ -0.17 & 0.34 \end{bmatrix}, &
\boldsymbol{\mu}^B_2 &= \begin{bmatrix} -1.66 \\ 1.36 \end{bmatrix}, &
\Sigma^B_2 &= \begin{bmatrix} 0.60 & 0.22 \\ 0.22 & 0.36 \end{bmatrix}, \\
\boldsymbol{\mu}^B_3 &= \begin{bmatrix} 1.77 \\ -0.06 \end{bmatrix}, &
\Sigma^B_3 &= \begin{bmatrix} 0.28 & -0.01 \\ -0.01 & 0.33 \end{bmatrix}, &
\boldsymbol{\mu}^B_4 &= \begin{bmatrix} 0 \\ 0.01 \end{bmatrix}, &
\Sigma^B_4 &= \begin{bmatrix} 0.11 & -0.01 \\ -0.01 & 0.26 \end{bmatrix}.
\end{aligned}$$
Note that the parameter values have been set according to the specific characteristics of each scenario. The cluster sizes are [480, 364, 426, 230].

References

AITCHISON, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman & Hall; reprinted in 2003 by Blackburn Press.
AITCHISON, J. (1992), "On Criteria for Measures of Compositional Difference," Mathematical Geology, 24, 365-379.
AITCHISON, J., BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J.A., and PAWLOWSKY-GLAHN, V. (2000), "Logratio Analysis and Compositional Distance," Mathematical Geology, 32, 271-275.
AITCHISON, J., and GREENACRE, M. (2002), "Biplots for Compositional Data," Journal of the Royal Statistical Society, Series C, 51, 375-392.
BAXTER, M.J., and FREESTONE, I.C. (2006), "Log-ratio Compositional Data Analysis in Archaeometry," Archaeometry, 48, 511-531.
BERGET, I., MEVIK, B.-H., and NAES, T. (2008), "New Modifications and Applications of Fuzzy C-Means Methodology," Computational Statistics & Data Analysis, 52, 2403-2418.
BEZDEK, J. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.
BILLHEIMER, D., GUTTORP, P., and FAGAN, W. (2001), "Statistical Interpretation of Species Composition," Journal of the American Statistical Association, 96, 1205-1214.
CHACÓN, J.E., MATEU-FIGUERAS, G., and MARTÍN-FERNÁNDEZ, J.A. (2011), "Gaussian Kernels for Density Estimation with Compositional Data," Computers & Geosciences, 37, 702-711.
DESARBO, W.S., RAMASWAMY, V., and LENK, P. (1993), "A Latent Class Procedure for the Structural Analysis of Two-Way Compositional Data," Journal of Classification, 10, 159-193.
DÖRING, C., LESOT, M.-J., and KRUSE, R. (2006), "Data Analysis with Fuzzy Clustering Methods," Computational Statistics & Data Analysis, 51, 192-214.
EGOZCUE, J.J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G., and BARCELÓ-VIDAL, C. (2003), "Isometric Logratio Transformations for Compositional Data Analysis," Mathematical Geology, 35, 279-300.
EGOZCUE, J.J., and PAWLOWSKY-GLAHN, V. (2005), "CoDa-Dendrogram: A New Exploratory Tool," in Proceedings of the Second Compositional Data Analysis Workshop - CoDaWork'05, Girona, Spain.
GABRIEL, K.R. (1971), "The Biplot Graphic Display of Matrices with Application to Principal Component Analysis," Biometrika, 58, 453-467.
GAVIN, D.G., OSWALD, W.W., WAHL, E.R., and WILLIAMS, J.W. (2003), "A Statistical Approach to Evaluating Distance Metrics and Analog Assignments for Pollen Records," Quaternary Research, 60, 356-367.
GREENACRE, M. (1988), "Clustering the Rows and Columns of a Contingency Table," Journal of Classification, 5, 39-51.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley & Sons.
HÖPPNER, F., KLAWONN, F., KRUSE, R., and RUNKLER, T. (1999), Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition, Chichester: John Wiley & Sons.
LEGENDRE, P., and GALLAGHER, E.D. (2001), "Ecologically Meaningful Transformations for Ordination of Species Data," Oecologia, 129, 271-280.
MARTÍN, M.C. (1996), "Performance of Eight Dissimilarity Coefficients to Cluster a Compositional Data Set," in Abstracts of the Fifth Conference of the International Federation of Classification Societies (Vol. 1), Kobe, Japan, pp. 215-217.
MARTÍN-FERNÁNDEZ, J.A., BREN, M., BARCELÓ-VIDAL, C., and PAWLOWSKY-GLAHN, V. (1999), "A Measure of Difference for Compositional Data Based on Measures of Divergence," in Proceedings of the Fifth Annual Conference of the International Association for Mathematical Geology (Vol. 1), Trondheim, Norway, pp. 211-215.
MARTÍN-FERNÁNDEZ, J.A., BARCELÓ-VIDAL, C., and PAWLOWSKY-GLAHN, V. (2003), "Dealing with Zeros and Missing Values in Compositional Data Sets," Mathematical Geology, 35, 253-278.
MILLER, W.E. (2002), "Revisiting the Geometry of a Ternary Diagram with the Half-Taxi Metric," Mathematical Geology, 34, 275-290.
PALAREA-ALBALADEJO, J., MARTÍN-FERNÁNDEZ, J.A., and GÓMEZ-GARCÍA, J. (2007), "A Parametric Approach for Dealing with Compositional Rounded Zeros," Mathematical Geology, 39, 625-645.
PALAREA-ALBALADEJO, J., and MARTÍN-FERNÁNDEZ, J.A. (2008), "A Modified EM alr-Algorithm for Replacing Rounded Zeros in Compositional Data Sets," Computers & Geosciences, 34, 902-917.
PAWLOWSKY-GLAHN, V., and EGOZCUE, J.J. (2001), "Geometric Approach to Statistical Analysis on the Simplex," Stochastic Environmental Research and Risk Assessment, 15, 384-398.
PAWLOWSKY-GLAHN, V. (2003), "Statistical Modelling on Coordinates," in Proceedings of the First Compositional Data Analysis Workshop - CoDaWork'03, Girona, Spain.
PAWLOWSKY-GLAHN, V., and EGOZCUE, J.J. (2008), "Compositional Data and Simpson's Paradox," in Proceedings of the Third Compositional Data Analysis Workshop - CoDaWork'08, Girona, Spain.
SOTO, J., FLORES-SINTAS, A., and PALAREA-ALBALADEJO, J. (2008), "Improving Probabilities in a Fuzzy Clustering Partition," Fuzzy Sets & Systems, 159, 406-421.
TEMPL, M., FILZMOSER, P., and REIMANN, C. (2008), "Cluster Analysis Applied to Regional Geochemical Data: Problems and Possibilities," Applied Geochemistry, 23, 2198-2213.
VÊNCIO, R., VARUZZA, L., PEREIRA, C., BRENTANI, H., and SHMULEVICH, I. (2007), "Simcluster: Clustering Enumeration Gene Expression Data on the Simplex Space," BMC Bioinformatics, 8, 246.
WAHL, E.R. (2004), "A General Framework for Determining Cut-off Values to Select Pollen Analogs with Dissimilarity Metrics in the Modern Analog Technique," Review of Palaeobotany and Palynology, 128, 263-280.
WANG, H., LIU, Q., MOK, H.M.K., FU, L., and TSE, W.M. (2007), "A Hyperspherical Transformation Forecasting Model for Compositional Data," European Journal of Operational Research, 179, 459-468.
WATSON, D.F., and PHILIP, G.M. (1989), "Measures of Variability for Geological Data," Mathematical Geology, 21, 233-254.