Intelligent Data Analysis 7 (2003) 405–425
IOS Press

Non-Euclidean c-means clustering algorithms

Nicolaos B. Karayiannis (a) and Mary M. Randolph-Gips (b)

(a) Department of Electrical and Computer Engineering, N308 Engineering Building 1, University of Houston, Houston, TX 77204-4005, USA
(b) Department of Electrical and Computer Engineering, University of Houston-Clear Lake, 2700 Bay Area Boulevard, Houston, TX 77058, USA

Received 26 September 2002
Revised 7 December 2002
Accepted 11 January 2003

Abstract. This paper introduces non-Euclidean c-means clustering algorithms. These algorithms rely on weighted norms to measure the distance between the feature vectors and the prototypes that represent the clusters. The proposed algorithms are developed by solving a constrained minimization problem in an iterative fashion. The norm weights are determined from the data in an attempt to produce partitions of the feature vectors that are consistent with the structure of the feature space. A series of experiments on three different data sets reveal that the proposed non-Euclidean c-means algorithms provide an attractive alternative to Euclidean c-means clustering in applications that involve data sets containing clusters of different shapes and sizes.

Keywords: c-means clustering, generalized mean, non-Euclidean norm, reformulation, weighted norm

1. Introduction

Consider the set X formed by M real feature vectors of size n × 1, that is, X = {x_1, x_2, ..., x_M}, with x_i ∈ R^{n×1}, 1 ≤ i ≤ M. Clustering is the process of partitioning the M feature vectors into c < M clusters, which are represented by the prototypes v_j ∈ V ⊂ R^{n×1}, 1 ≤ j ≤ c. Clustering algorithms can be classified as hard or fuzzy, depending on the strategy they employ for assigning feature vectors to clusters [2]. The c-means (or k-means) algorithm is a typical example of a hard or crisp clustering algorithm. The c-means algorithm assigns the feature vectors to clusters based on the nearest-prototype condition, that is, each feature vector is assigned to the cluster represented by its closest prototype. The nearest-prototype condition makes the c-means algorithm intuitively appealing and easy to use. On the other hand, the nearest-prototype condition is responsible for most of the disadvantages of the c-means algorithm, such as its dependence on its initialization. This motivated the development of fuzzy and, more recently, soft clustering algorithms. Soft clustering algorithms can be seen as the essential generalization of fuzzy clustering algorithms and include fuzzy clustering algorithms as special cases [13–15,17,18,21–23]. Fuzzy and soft clustering algorithms typically outperform hard clustering algorithms because they quantify the uncertainty associated with the partition of feature vectors into clusters and they exploit this uncertainty to benefit cluster formation. On the other hand, fuzzy and soft clustering algorithms are computationally more demanding than hard clustering algorithms. An alternative approach to balancing the tradeoff between computational complexity and performance is to


retain the nearest-prototype partition of the feature vectors and focus instead on the distance measure employed during the clustering process. Such an approach is supported by the fact that the use of the Euclidean norm to measure the distance between the feature vectors and their prototypes is sensible only if the feature vectors are organized in hyperspherical clusters. Since this is rarely the case in practice, the performance of clustering algorithms can be improved by using data-dependent non-Euclidean norms as distance measures.

Clustering algorithms were traditionally developed to solve a constrained minimization problem involving two sets of unknowns, namely, the membership functions that assign feature vectors to clusters and the prototypes. The solution of such problems is often determined using alternating optimization [2,9,12]. This particular approach makes the development of non-Euclidean clustering algorithms a particularly challenging optimization problem. The development of fuzzy non-Euclidean clustering algorithms was attempted by replacing the Euclidean distance, corresponding to the norm L_2, by distance measures generated by the norm family L_p, with p ≠ 2 [4,9–11]. Non-Euclidean fuzzy clustering algorithms were also developed by using weighted norms to measure the distance between the feature vectors and the prototypes. These approaches differ in terms of the constraints they impose on the norm weights [7,25]. However, the extension of design methodologies developed for fuzzy non-Euclidean algorithms to the development of hard non-Euclidean clustering algorithms is neither straightforward nor guaranteed. This can be attributed to the fact that the development of fuzzy clustering algorithms employing weighted norms relied on the fuzzy scatter matrix. The fuzzy scatter matrix is defined in terms of the membership functions and is also referred to as the "fuzzy covariance" matrix [2]. In the case of hard clustering, the fuzzy scatter matrix reduces to the sample covariance matrix, which is not particularly useful for the development of non-Euclidean clustering algorithms.

The development of non-Euclidean soft and fuzzy clustering algorithms was attempted recently by relying on reformulation [21,22]. Reformulation is a design methodology that can reduce the development of soft and fuzzy clustering algorithms into the unconstrained minimization of a reformulation function. A reformulation function involves only one set of unknowns, namely, the prototypes [14,17–19]. This paper extends reformulation to develop hard clustering algorithms. The approach proposed in this paper allows the development of single-norm and multi-norm non-Euclidean c-means algorithms. This is accomplished by solving a constrained minimization problem, with the constraints imposed on the norm weights. The proposed algorithms are as simple to implement as the Euclidean c-means algorithm but they considerably outperform the c-means algorithm when tested on data sets containing clusters of different shapes and sizes.

2. Reformulating c-means clustering

Let X be the finite set X = {x_1, x_2, ..., x_M} ⊂ R^{n×1}. A family of c ∈ [2, M) subsets X_j, 1 ≤ j ≤ c, of X is a crisp or hard c-partition of X if ∪_{j=1}^{c} X_j = X, X_i ∩ X_j = ∅ for 1 ≤ i ≠ j ≤ c, and ∅ ⊂ X_j ⊂ X, 1 ≤ j ≤ c. Each subset X_j is assigned a characteristic function or indicator function u_{ij} = u_j(x_i), defined as u_{ij} = 1 if x_i ∈ X_j and u_{ij} = 0 if x_i ∉ X_j. According to this definition, the M × c matrix U = [u_{ij}] is a hard or crisp c-partition in the set U_c defined as

\mathcal{U}_c = \left\{ \mathbf{U} \in \mathbb{R}^{M \times c} : u_{ij} \in \{0, 1\}, \ \forall i, j; \ \sum_{j=1}^{c} u_{ij} = 1, \ \forall i; \ 0 < \sum_{i=1}^{M} u_{ij} < M, \ \forall j \right\}.   (1)

The c-means algorithm can be derived using alternating optimization to solve the minimization problem [2]

\min_{\mathcal{U}_c \times \mathbb{R}^{n \times c}} \left\{ H(\mathbf{U}, \mathbf{V}) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{c} u_{ij} \, \|x_i - v_j\|^2 \right\},   (2)

where V is a matrix whose column vectors are the prototypes v_1, v_2, ..., v_c, that is, V = [v_1 v_2 ... v_c] ∈ R^{n×c}, and the M × c matrix U = [u_{ij}] ∈ U_c defines a hard c-partition of X = {x_1, x_2, ..., x_M}. The coupled necessary conditions for solutions (U, V) ∈ U_c × R^{n×c} of the minimization problem in Eq. (2) are [2]:

u_{ij}^{*} = \begin{cases} 1, & \text{if } \|x_i - v_j\|^2 < \|x_i - v_\ell\|^2, \ \forall \ell \neq j, \\ 0, & \text{otherwise}, \end{cases} \quad 1 \le i \le M; \ 1 \le j \le c,   (3)

and

v_j = \frac{\sum_{i=1}^{M} u_{ij}^{*} \, x_i}{\sum_{i=1}^{M} u_{ij}^{*}}, \quad 1 \le j \le c.   (4)

If the indicator functions u_{ij} = u_{ij}^{*} are obtained according to Eq. (3) and U* = [u_{ij}^{*}], then H(U*, V) = H*(V), where

H^{*}(\mathbf{V}) = \frac{1}{M} \sum_{i=1}^{M} \min_{1 \le \ell \le c} \|x_i - v_\ell\|^2.   (5)
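To make the role of Eq. (5) concrete, the reformulation function can be evaluated directly from a set of feature vectors and candidate prototypes. The short sketch below (Python/NumPy, with made-up toy data; the function and variable names are illustrative, not from the paper) computes H*(V) as the average squared distance of each feature vector to its nearest prototype.

```python
import numpy as np

def reformulation_function(X, V):
    """H*(V) of Eq. (5): average squared Euclidean distance of each
    feature vector to its nearest prototype (a function of V only)."""
    # squared distances ||x_i - v_j||^2 for all pairs (i, j)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # shape (M, c)
    return d2.min(axis=1).mean()

# toy example: M = 6 two-dimensional feature vectors, c = 2 prototypes
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
V = np.array([[0.1, 0.1], [5.0, 5.0]])
print(reformulation_function(X, V))   # small value: prototypes match the two groups
```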

According to Eq. (5), H*(V) is only a function of the prototypes. In fact, H*(V) is the reformulation function corresponding to the minimization problem in Eq. (2). The reformulation function in Eq. (5) measures the average discrepancy associated with the representation of each feature vector by its closest prototype. This is consistent with the form of the resulting indicator function in Eq. (3), which implies that in this case clustering relies on a nearest-prototype partition of the training vectors. According to the reformulation methodology, the c-means algorithm developed by solving the constrained minimization problem in Eq. (2) can also be developed by using gradient descent to perform unconstrained minimization of the reformulation function in Eq. (5) with respect to the prototypes. However, minimization of H*(V) entails certain analytical difficulties due to its dependence on the minimum operator. This problem can be overcome by interpreting the minimum operator as the special case of the generalized mean, which is perhaps the most popular aggregation operator [5,6,24]. The generalized mean of a_1, a_2, ..., a_c is defined as [6]

M_p(a_1, a_2, \ldots, a_c) = \left( \frac{1}{c} \sum_{\ell=1}^{c} a_\ell^{p} \right)^{\frac{1}{p}},   (6)

with p ∈ R − {0}. As p → −∞, the generalized mean in Eq. (6) approaches the minimum of a_1, a_2, ..., a_c, that is,

\lim_{p \to -\infty} M_p(a_1, a_2, \ldots, a_c) = \min_{1 \le \ell \le c} \{a_\ell\}.   (7)


For p = −1, the generalized mean in Eq. (6) coincides with the harmonic mean of a_1, a_2, ..., a_c,

M_H(a_1, a_2, \ldots, a_c) = \left( \frac{1}{c} \sum_{\ell=1}^{c} \frac{1}{a_\ell} \right)^{-1}.   (8)

As p → 0, the generalized mean in Eq. (6) approaches the geometric mean of a_1, a_2, ..., a_c,

M_G(a_1, a_2, \ldots, a_c) = \left( \prod_{\ell=1}^{c} a_\ell \right)^{\frac{1}{c}}.   (9)

For p = 1, the generalized mean in Eq. (6) coincides with the arithmetic mean of a_1, a_2, ..., a_c,

M_A(a_1, a_2, \ldots, a_c) = \frac{1}{c} \sum_{\ell=1}^{c} a_\ell.   (10)

As p → ∞, the generalized mean in Eq. (6) approaches the maximum of a_1, a_2, ..., a_c, that is,

\lim_{p \to \infty} M_p(a_1, a_2, \ldots, a_c) = \max_{1 \le \ell \le c} \{a_\ell\}.   (11)
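The limiting cases in Eqs (7)–(11) are easy to check numerically. The sketch below (Python/NumPy, illustrative values only) implements the generalized mean of Eq. (6) and shows that large negative and large positive exponents approach the minimum and the maximum, with the harmonic, geometric, and arithmetic means as intermediate cases.

```python
import numpy as np

def generalized_mean(a, p):
    """Generalized mean M_p(a_1, ..., a_c) of Eq. (6), for p != 0."""
    a = np.asarray(a, dtype=float)
    return (np.mean(a ** p)) ** (1.0 / p)

a = [1.0, 2.0, 4.0, 8.0]
print(generalized_mean(a, -50))        # close to min(a) = 1            (Eq. (7))
print(generalized_mean(a, -1))         # harmonic mean                  (Eq. (8))
print(np.exp(np.mean(np.log(a))))      # geometric mean, the p -> 0 limit (Eq. (9))
print(generalized_mean(a, 1))          # arithmetic mean = 3.75         (Eq. (10))
print(generalized_mean(a, 50))         # close to max(a) = 8            (Eq. (11))
```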

According to the properties of the generalized mean and the definition of H*(V) in Eq. (5),

H^{*}(\mathbf{V}) = \lim_{p \to -\infty} R_p(\mathbf{V}),   (12)

where

R_p(\mathbf{V}) = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \|x_i - v_\ell\|^2 \right)^{p} \right)^{\frac{1}{p}}.   (13)

The c-means algorithm can be obtained by using gradient descent to minimize R_p(V) in the limit p → −∞. For p ∈ (−∞, 0), minimization of R_p(V) produces the fuzzy c-means algorithm, which can be used to generate fuzzy c-partitions of X = {x_1, x_2, ..., x_M} [14,18,19].

3. Reformulation functions based on weighted norms

The features that compose a feature vector are not equally suited for cluster formation [26]. The suitability of the features depends mainly on their variance. More specifically, the features with the smallest variance provide a more reliable basis for cluster formation. Clustering algorithms relying on the Euclidean norm place an equal importance on all the features that compose the feature vectors regardless of their variance. Therefore, the use of the Euclidean norm in clustering is appropriate only if the feature vectors are organized in hyperspherical clusters, which is rarely the case in practice. This is a disadvantage of all clustering algorithms relying on the Euclidean norm. In general, the distance between the feature vectors {x_i ∈ R^{n×1}} and the prototypes {v_j ∈ R^{n×1}} can be measured by the norm

\|x_i - v_j\|_{\mathbf{A}}^{2} = (x_i - v_j)^{T} \mathbf{A} \, (x_i - v_j),   (14)


where "T" is used to denote the transpose of vectors/matrices and A ∈ R^{n×n} is a norm-inducing matrix that is required to be positive definite. Depending on the choice of A, Eq. (14) leads to popular distance measures such as the Mahalanobis and Euclidean norms. The Euclidean norm ‖x_i − v_j‖² = (x_i − v_j)^T (x_i − v_j) can be obtained as the special case of Eq. (14) that corresponds to A = I, where I denotes the identity matrix. If A = W^T W, then the weighted norm in Eq. (14) becomes

\|x_i - v_j\|_{\mathbf{W}}^{2} = (x_i - v_j)^{T} \mathbf{W}^{T} \mathbf{W} \, (x_i - v_j) = \|\mathbf{W} (x_i - v_j)\|^{2}.   (15)

Using the weighted norm in Eq. (15) to perform clustering of the original set of feature vectors {x_i ∈ X ⊂ R^{n×1}} is equivalent to using the Euclidean norm to perform clustering of a new set of vectors {x̃_i ∈ X̃ ⊂ R^{n×1}} produced through the linear transformation x̃_i = W x_i.

The norm employed by clustering algorithms has a significant impact on the partition of the feature vectors based on the nearest-prototype condition. Suppose the distance between the feature vectors and each prototype is measured by the same weighted norm and let W be the weight matrix. If the feature space contains only two prototypes v_i and v_j, the boundary between the two clusters produced by a nearest-prototype partition of the feature vectors is determined by ‖x − v_i‖²_W = ‖x − v_j‖²_W, which can also be written as

2 \, (\mathbf{W} v_j - \mathbf{W} v_i)^{T} (\mathbf{W} x) + \|\mathbf{W} v_i\|^{2} - \|\mathbf{W} v_j\|^{2} = 0.   (16)

For x ∈ R^{n×1} and any diagonal matrix W ∈ R^{n×n}, including the matrix W = I corresponding to the Euclidean norm, Eq. (16) defines a line for n = 2, a plane for n = 3, and a hyperplane for n > 3. If the feature space contains more than two prototypes, a nearest-prototype partition of the feature space produces Voronoi regions. For n = 2, the Voronoi regions are produced by intersecting lines and may include polygons. For n = 3, the Voronoi regions are produced by intersecting planes and may include polyhedra. For n > 3, the Voronoi regions are produced by intersecting hyperplanes.

Suppose each of the two prototypes v_i and v_j is assigned a distinct weighted norm. Let W_i and W_j be the weight matrices corresponding to the prototypes v_i and v_j, respectively. In this case, the boundary between the two clusters represented by the prototypes v_i and v_j is determined by ‖x − v_i‖²_{W_i} = ‖x − v_j‖²_{W_j}. If W_i ≠ W_j, this equation is quadratic in x ∈ R^{n×1}. As an example, for n = 2 the boundary defined by ‖x − v_i‖²_{W_i} = ‖x − v_j‖²_{W_j} is a conic, that is, an ellipse, a parabola, or a hyperbola. This is an indication that the use of multiple norms produces fundamentally different nearest-prototype partitions of the feature vectors.

Clustering algorithms are developed in this paper by minimizing reformulation functions that rely on adjustable weighted norms to measure the distance between the feature vectors and their prototypes. Single-norm clustering algorithms can be obtained by employing a single weighted norm as a distance measure. More specifically, single-norm algorithms are developed by minimizing the reformulation function H*(V, W) = lim_{p→−∞} R_p(V, W), where R_p(V, W) is obtained by generalizing Eq. (13) as

R_p(\mathbf{V}, \mathbf{W}) = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \|x_i - v_\ell\|_{\mathbf{W}}^{2} \right)^{p} \right)^{\frac{1}{p}},   (17)

with ‖x_i − v_ℓ‖²_W = ‖W (x_i − v_ℓ)‖². For simplicity, it is assumed that W ∈ D^{n×n} ⊂ R^{n×n}, where D^{n×n} denotes the set of all n × n real diagonal matrices. If W = diag{w_1, w_2, ..., w_n}, then the diagonal entries of W constitute a new set of parameters that must be determined by the algorithm.
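For a diagonal weight matrix, the weighted norm of Eq. (15) simply rescales each feature, so clustering with ‖x − v‖²_W amounts to Euclidean clustering of the transformed vectors Wx. A minimal numerical check of this equivalence (Python/NumPy; the vectors and weights are arbitrary illustrations):

```python
import numpy as np

w = np.array([2.0, 0.5, 1.0])           # diagonal entries of W
x = np.array([1.0, 4.0, -2.0])
v = np.array([0.0, 1.0, 1.0])

# weighted norm ||x - v||^2_W = (x - v)^T W^T W (x - v), Eq. (15)
weighted = np.sum((w * (x - v)) ** 2)

# Euclidean norm of the linearly transformed vectors W x and W v
transformed = np.sum((w * x - w * v) ** 2)

print(weighted, transformed)             # identical values
```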


This can be accomplished by imposing certain constraints on the norm weights {w_k}. The weight matrix W = diag{w_1, w_2, ..., w_n} can be determined by requiring that w_k > 0, 1 ≤ k ≤ n, and that the generalized mean of w_1, w_2, ..., w_n be constant, that is,

M_r(w_1, w_2, \ldots, w_n) = \left( \frac{1}{n} \sum_{k=1}^{n} w_k^{r} \right)^{\frac{1}{r}} = L,   (18)

where r ∈ R − {0}. The constant L > 0 can be determined by requiring that Eq. (18) be compatible with the Euclidean norm, which corresponds to W = I. If W = I, then w_k = 1, 1 ≤ k ≤ n, and Eq. (18) is satisfied with L = 1. For L = 1 and r ∈ (0, 1], Eq. (18) takes the form \sum_{k=1}^{n} w_k^{r} = n. If L = 1 and r → 0, Eq. (18) becomes \prod_{k=1}^{n} w_k = 1. Given this set of constraints, hard clustering algorithms can be developed by minimizing R_p = R_p(V, W) in the limit p → −∞ for (V, W) ∈ R^{n×c} × D_r^{n×n}, where

\mathcal{D}_r^{n \times n} = \left\{ \mathbf{W} \in \mathcal{D}^{n \times n} : w_k > 0, \ 1 \le k \le n; \ M_r(w_1, w_2, \ldots, w_n) = 1 \right\}, \quad r \in (0, 1].   (19)

Multi-norm clustering algorithms can be obtained by assigning a distinct weighted norm to each of the prototypes. More specifically, multi-norm algorithms are developed by minimizing the reformulation function H*(V, W_1, ..., W_c) = lim_{p→−∞} R_p(V, W_1, ..., W_c), where R_p(V, W_1, ..., W_c) is obtained by generalizing Eq. (17) as

R_p(\mathbf{V}, \mathbf{W}_1, \ldots, \mathbf{W}_c) = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2} \right)^{p} \right)^{\frac{1}{p}},   (20)

with W_ℓ ∈ D_r^{n×n}, 1 ≤ ℓ ≤ c. In such a case, hard clustering algorithms can be developed by minimizing R_p = R_p(V, W_1, ..., W_c) in the limit p → −∞ for (V, W_1, ..., W_c) ∈ R^{n×c} × D_r^{n×n} × ... × D_r^{n×n}.

3.1. Determining the prototypes

The prototypes for multi-norm clustering algorithms can be determined by assuming that the weight matrices {W_j} are fixed and using gradient descent to minimize

R_p(\mathbf{V}) = \frac{1}{M} \sum_{i=1}^{M} (S_i)^{\frac{1}{p}},   (21)

where S_i = (1/c) \sum_{\ell=1}^{c} (\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2})^{p}, and p approaches −∞. The gradient ∇_{v_j} R_p ≡ [∂R_p/∂v_{j1} ∂R_p/∂v_{j2} ... ∂R_p/∂v_{jn}]^T of R_p with respect to the prototype v_j = [v_{j1} v_{j2} ... v_{jn}]^T can be obtained as

\nabla_{v_j} R_p = \frac{1}{Mc} \sum_{i=1}^{M} \alpha_{ij}(p) \, \nabla_{v_j} \left( \|x_i - v_j\|_{\mathbf{W}_j}^{2} \right),   (22)

where {α_{ij}(p)} are the competition functions, defined as

\alpha_{ij}(p) = (S_i)^{\frac{1}{p} - 1} \left( \|x_i - v_j\|_{\mathbf{W}_j}^{2} \right)^{p-1}.   (23)


Since the weight matrices {W_j} are diagonal, ∇_{v_j}(‖x_i − v_j‖²_{W_j}) = −2 W_j² (x_i − v_j), and Eq. (22) gives

\nabla_{v_j} R_p = - \frac{2}{Mc} \sum_{i=1}^{M} \alpha_{ij}(p) \, \mathbf{W}_j^{2} (x_i - v_j).   (24)

The update equation for each prototype v_j = [v_{j1} v_{j2} ... v_{jn}]^T can be obtained using gradient descent as Δv_{jk} = −η_{jk} ∂R_p/∂v_{jk}, which yields

\Delta v_{jk} = \theta_{jk} \, w_{jk}^{2} \sum_{i=1}^{M} \alpha_{ij}(p) \, (x_{ik} - v_{jk}), \quad 1 \le k \le n,   (25)

where θ_{jk} = [2/(Mc)] η_{jk} is the learning rate for the kth entry of the prototype v_j. Since W_j² = diag{w_{j1}², w_{j2}², ..., w_{jn}²}, Eq. (25) can be written as

\Delta v_j = \mathbf{\Theta}_j \, \mathbf{W}_j^{2} \sum_{i=1}^{M} \alpha_{ij}(p) \, (x_i - v_j),   (26)

where Θ_j = diag{θ_{j1}, θ_{j2}, ..., θ_{jn}}. The competitive learning scheme described by the update equations in Eq. (26) can be implemented in an iterative fashion. If {v_{j,ν−1}} are the prototypes obtained after the (ν − 1)th iteration, the new set of prototypes {v_{j,ν}} can be determined at the νth iteration according to Eq. (26) as

v_{j,\nu} = v_{j,\nu-1} + \mathbf{\Theta}_{j,\nu} \, \mathbf{W}_{j,\nu}^{2} \sum_{i=1}^{M} \alpha_{ij,\nu}(p) \, (x_i - v_{j,\nu-1}).   (27)

The competitive learning scheme described by Eq. (27) can be implemented as a batch clustering algorithm by constraining the learning rates in such a way that the new prototype v_{j,ν} is obtained only in terms of the training vectors x_i ∈ X. According to Eq. (27), this condition is satisfied if

\mathbf{\Theta}_{j,\nu} \, \mathbf{W}_{j,\nu}^{2} = \left( \sum_{i=1}^{M} \alpha_{ij,\nu}(p) \right)^{-1} \mathbf{I}.   (28)

If Eq. (28) is satisfied, then the update equation Eq. (27) reduces to the 'centroid' formula

v_{j,\nu} = \frac{\sum_{i=1}^{M} \alpha_{ij,\nu}(p) \, x_i}{\sum_{i=1}^{M} \alpha_{ij,\nu}(p)}.   (29)

This formula is the common ingredient of a variety of batch clustering algorithms developed using alternating optimization [2].

The competition functions {α_{ij}} for multi-norm non-Euclidean c-means algorithms can be obtained from Eq. (23) in the limit p → −∞. Since S_i = (1/c) \sum_{\ell=1}^{c} (\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2})^{p},

(S_i)^{\frac{1}{p}} = M_p\!\left( \|x_i - v_1\|_{\mathbf{W}_1}^{2}, \ldots, \|x_i - v_c\|_{\mathbf{W}_c}^{2} \right),   (30)


and

(S_i)^{-1} \left( \|x_i - v_j\|_{\mathbf{W}_j}^{2} \right)^{p} = \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2}}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} \right)^{p} \right)^{-1}.   (31)

Using Eqs (30) and (31), the competition functions in Eq. (23) take the form

\alpha_{ij}(p) = \frac{M_p\!\left( \|x_i - v_1\|_{\mathbf{W}_1}^{2}, \ldots, \|x_i - v_c\|_{\mathbf{W}_c}^{2} \right)}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2}}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} \right)^{p} \right)^{-1}.   (32)

The analysis that follows evaluates the limit of α_{ij}(p) as p → −∞. It can be shown that

\lim_{p \to -\infty} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2}}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} \right)^{p} \right)^{-1} = c \lim_{p \to -\infty} \left( 1 + \sum_{\ell \neq j} \left( \frac{\|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2}}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} \right)^{p} \right)^{-1} = c \, u_{ij}^{*},   (33)

where {u_{ij}^{*}} are the indicator functions associated with the multi-norm non-Euclidean c-means algorithm, defined as

u_{ij}^{*} = \begin{cases} 1, & \text{if } \|x_i - v_j\|_{\mathbf{W}_j}^{2} < \|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2}, \ \forall \ell \neq j, \\ 0, & \text{otherwise}, \end{cases} \quad 1 \le i \le M; \ 1 \le j \le c.   (34)

Using the properties of the generalized mean,

\lim_{p \to -\infty} M_p\!\left( \|x_i - v_1\|_{\mathbf{W}_1}^{2}, \ldots, \|x_i - v_c\|_{\mathbf{W}_c}^{2} \right) = \min_{1 \le \ell \le c} \left\{ \|x_i - v_\ell\|_{\mathbf{W}_\ell}^{2} \right\}.   (35)

If ‖x_i − v_j‖²_{W_j} < ‖x_i − v_ℓ‖²_{W_ℓ}, ∀ℓ ≠ j, then

\lim_{p \to -\infty} \frac{M_p\!\left( \|x_i - v_1\|_{\mathbf{W}_1}^{2}, \ldots, \|x_i - v_c\|_{\mathbf{W}_c}^{2} \right)}{\|x_i - v_j\|_{\mathbf{W}_j}^{2}} = 1.   (36)

Combining Eqs (33) and (36) with Eq. (32) gives

\lim_{p \to -\infty} \alpha_{ij}(p) = c \, u_{ij}^{*}.   (37)

Thus, the centroid formula Eq. (29) takes the form

v_{j,\nu} = \frac{\sum_{i=1}^{M} u_{ij,\nu}^{*} \, x_i}{\sum_{i=1}^{M} u_{ij,\nu}^{*}}.   (38)

The prototypes for single-norm algorithms can be obtained as a special case of the analysis presented above by setting W_j = W, 1 ≤ j ≤ c. If W_j = I, 1 ≤ j ≤ c, the nearest-prototype condition in Eq. (34) and the centroid formula in Eq. (38) can be used to implement the conventional c-means algorithm that employs the Euclidean norm as a distance measure. In this case, the analysis presented above can be seen as an alternative derivation of the c-means algorithm based on the reformulation methodology.
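The limit in Eq. (37) can also be checked numerically: for a large negative p, the competition functions of Eqs (23) and (32) are already very close to c·u*_ij, i.e., they single out the nearest prototype. A small sketch with Euclidean norms (W_j = I) and made-up data (Python/NumPy; the names are illustrative):

```python
import numpy as np

def competition_functions(X, V, p):
    """alpha_ij(p) of Eq. (23) with Euclidean norms (W_j = I)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # ||x_i - v_j||^2
    S = np.mean(d2 ** p, axis=1, keepdims=True)                # S_i as in Eq. (21)
    return S ** (1.0 / p - 1.0) * d2 ** (p - 1.0)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.1, 5.0], [5.2, 4.8]])
V = np.array([[0.1, 0.1], [5.0, 5.0]])

alpha = competition_functions(X, V, p=-40.0)
print(np.round(alpha, 3))   # each row is close to c * u*_ij, i.e. [2, 0] or [0, 2], Eq. (37)
```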


3.2. Weight matrix for single-norm algorithms

For r ∈ (0, 1], the weight matrix W ∈ D_r^{n×n} can be determined by assuming that the prototypes {v_j} are fixed and minimizing R_p = R_p(W) for W ∈ D_r^{n×n} in the limit p → −∞. Using the method of Lagrange multipliers, this constrained minimization problem can be converted into the unconstrained minimization of

\tilde{R}_p(\mathbf{W}, \lambda) = R_p(\mathbf{W}) - \lambda \left( \sum_{k=1}^{n} w_k^{r} - n \right),   (39)

where R_p(W) is the reformulation function defined in Eq. (17) and λ is the Lagrange multiplier. This problem can be solved by eliminating the Lagrange multiplier λ based on the conditions

\frac{\partial \tilde{R}_p(\mathbf{W}, \lambda)}{\partial w_k} = \frac{\partial R_p(\mathbf{W})}{\partial w_k} - \lambda \, r \, w_k^{r-1} = 0,   (40)

and

\frac{\partial \tilde{R}_p(\mathbf{W}, \lambda)}{\partial \lambda} = n - \sum_{k=1}^{n} w_k^{r} = 0.   (41)

Using the definition of the reformulation function R_p(W) in Eq. (17),

\lim_{p \to -\infty} \frac{\partial R_p(\mathbf{W})}{\partial w_k} = 2 \, w_k \, s_k^{2},   (42)

where {s_k²} are defined in terms of the competition functions α_{ij} = c u_{ij}^{*} as

s_k^{2} = \frac{1}{Mc} \sum_{i=1}^{M} \sum_{j=1}^{c} \alpha_{ij} \, (x_{ik} - v_{jk})^{2}.   (43)

For r ∈ (0, 1], the diagonal entries {w_k} of W can be determined by eliminating the Lagrange multiplier λ between Eqs (40) and (41) as

w_k(r) = \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( \frac{s_\ell^{2}}{s_k^{2}} \right)^{\frac{r}{r-2}} \right)^{-\frac{1}{r}}.   (44)

For r = 1, the weights given in Eq. (44) can also be written in terms of the harmonic mean M_H(s_1², s_2², ..., s_n²) as w_k(1) = M_H(s_1², s_2², ..., s_n²)/s_k². The weights corresponding to the constraint \prod_{k=1}^{n} w_k = 1 can be obtained from Eq. (44) in the limit r → 0 as

w_k(0) = \frac{1}{s_k} \left( \prod_{\ell=1}^{n} s_\ell \right)^{\frac{1}{n}},   (45)

which can also be written in terms of the geometric mean M_G(s_1, s_2, ..., s_n) as w_k(0) = M_G(s_1, s_2, ..., s_n)/s_k.
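The weight update of Eqs (44) and (45) depends only on the per-feature quantities s_k². The sketch below (Python/NumPy, with illustrative s_k² values) computes the diagonal weights for a given r, handles the r → 0 limit separately, and can be used to confirm that the constraint M_r(w_1, ..., w_n) = 1 of Eq. (18) is satisfied.

```python
import numpy as np

def single_norm_weights(s2, r):
    """Diagonal norm weights of Eq. (44) for r in (0, 1]; Eq. (45) as r -> 0."""
    s2 = np.asarray(s2, dtype=float)
    if r == 0.0:
        # w_k(0) = geometric mean of s_1, ..., s_n divided by s_k
        s = np.sqrt(s2)
        return np.exp(np.mean(np.log(s))) / s
    q = r / (r - 2.0)
    return (np.mean((s2[None, :] / s2[:, None]) ** q, axis=1)) ** (-1.0 / r)

s2 = np.array([0.5, 2.0, 8.0])          # per-feature residuals s_k^2 of Eq. (43)
for r in (1.0, 0.5, 0.0):
    w = single_norm_weights(s2, r)
    print(r, np.round(w, 4))
# for r > 0 the constraint ((1/n) sum w_k^r)^(1/r) = 1 holds; for r = 0, prod(w_k) = 1
```

For r = 1 the result reduces to w_k = M_H(s_1², ..., s_n²)/s_k², and for r = 0 to w_k = M_G(s_1, ..., s_n)/s_k, as stated above.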


3.3. Weight matrices for multi-norm algorithms

The weight matrices {W_j ∈ D_r^{n×n}} can be determined for r ∈ (0, 1] by assuming that the prototypes {v_j} are fixed and minimizing R_p = R_p(W_1, ..., W_c) for (W_1, ..., W_c) ∈ D_r^{n×n} × ... × D_r^{n×n} in the limit p → −∞. This constrained minimization problem can be converted to an unconstrained minimization problem by introducing the set of Lagrange multipliers {λ_1, λ_2, ..., λ_c} to form the objective function

\tilde{R}_p(\mathbf{W}_1, \ldots, \mathbf{W}_c, \lambda_1, \ldots, \lambda_c) = R_p(\mathbf{W}_1, \ldots, \mathbf{W}_c) - \sum_{\ell=1}^{c} \lambda_\ell \left( \sum_{k=1}^{n} w_{\ell k}^{r} - n \right).   (46)

The weight matrices {W_j} can be determined by eliminating the Lagrange multipliers {λ_1, λ_2, ..., λ_c} based on the conditions

\frac{\partial \tilde{R}_p(\mathbf{W}_1, \ldots, \mathbf{W}_c, \lambda_1, \ldots, \lambda_c)}{\partial w_{jk}} = \frac{\partial R_p(\mathbf{W}_1, \ldots, \mathbf{W}_c)}{\partial w_{jk}} - \lambda_j \, r \, w_{jk}^{r-1} = 0,   (47)

and

\frac{\partial \tilde{R}_p(\mathbf{W}_1, \ldots, \mathbf{W}_c, \lambda_1, \ldots, \lambda_c)}{\partial \lambda_j} = n - \sum_{k=1}^{n} w_{jk}^{r} = 0, \quad 1 \le j \le c.   (48)

Using the definition of the reformulation function R_p(W_1, ..., W_c) in Eq. (20),

\lim_{p \to -\infty} \frac{\partial R_p(\mathbf{W}_1, \ldots, \mathbf{W}_c)}{\partial w_{jk}} = 2 \, w_{jk} \, s_{jk}^{2}, \quad 1 \le j \le c,   (49)

where s_{jk}² are defined in terms of the competition functions α_{ij} = c u_{ij}^{*} as

s_{jk}^{2} = \frac{1}{M} \sum_{i=1}^{M} \alpha_{ij} \, (x_{ik} - v_{jk})^{2}, \quad 1 \le j \le c.   (50)

Eliminating the Lagrange multipliers {λ_j} between Eqs (47) and (48) gives

w_{jk} = \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( \frac{s_{j\ell}^{2}}{s_{jk}^{2}} \right)^{\frac{r}{r-2}} \right)^{-\frac{1}{r}}, \quad 1 \le j \le c.   (51)

For r = 1, the weights given in Eq. (51) can be written in terms of the harmonic mean M_H(s_{j1}², s_{j2}², ..., s_{jn}²) as w_{jk}(1) = M_H(s_{j1}², s_{j2}², ..., s_{jn}²)/s_{jk}², 1 ≤ j ≤ c. The weights corresponding to the constraints \prod_{k=1}^{n} w_{\ell k} = 1, 1 ≤ ℓ ≤ c, can be obtained from Eq. (51) in the limit r → 0 as

w_{jk}(0) = \frac{1}{s_{jk}} \left( \prod_{\ell=1}^{n} s_{j\ell} \right)^{\frac{1}{n}}, \quad 1 \le j \le c,   (52)

which can also be written in terms of the geometric mean M_G(s_{j1}, s_{j2}, ..., s_{jn}) as w_{jk}(0) = M_G(s_{j1}, s_{j2}, ..., s_{jn})/s_{jk}, 1 ≤ j ≤ c.


4. Error analysis

This section investigates the effect of the free parameter r and the statistics of the data on the residual error, that is, the error remaining after a certain iterate of the proposed algorithms. The results of this analysis are used to establish some rules that can be used in practice to select the value of the free parameter r for a given data set.

4.1. Single-norm algorithms

The residual error corresponding to single-norm algorithms can be obtained using Eq. (17) as

E = \lim_{p \to -\infty} R_p(\mathbf{V}, \mathbf{W}),   (53)

where V and W are optimally determined as indicated above. If the competition functions {α_{ij}}, the prototypes {v_j} and the weight matrix W are determined at each iteration as indicated in the previous sections, then the residual error after this iteration can be measured as

E = \frac{1}{Mc} \sum_{i=1}^{M} \sum_{j=1}^{c} \alpha_{ij} \, \|x_i - v_j\|_{\mathbf{W}}^{2}.   (54)

Since ‖x_i − v_j‖²_W = \sum_{k=1}^{n} w_k^{2} (x_{i,k} − v_{j,k})², the total residual error can be obtained at each iteration as

E = \sum_{k=1}^{n} E_k,   (55)

where E_k = w_k² s_k², and s_k² is defined in Eq. (43). According to Eq. (55), E_k represents the contribution of the kth feature to the total residual error E. Thus, E_k is referred to as the residual error due to the kth feature. The analysis which follows investigates the impact of the parameter r on the contribution of different features to the total residual error.

For r ∈ (0, 1], the residual error E_k(r) = w_k² s_k² due to the kth feature can be obtained using Eq. (44) as

E_k(r) = \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( s_\ell^{2} \right)^{\frac{r}{r-2}} \right)^{\frac{r-2}{r}} \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( \frac{s_\ell^{2}}{s_k^{2}} \right)^{\frac{r}{r-2}} \right)^{-1}.   (56)

As r → 0, the residual error E_k(0) due to the kth feature can be obtained from Eq. (56) as E_k(0) = M_G(s_1², s_2², ..., s_n²). Since E_1(0) = E_2(0) = ... = E_n(0), all features have a uniform effect on the total residual error E(0). Unlike the case where r → 0, the residual errors E_1(r), E_2(r), ..., E_n(r) are not equal for r ∈ (0, 1]. Since r/(r − 2) < 0 for r ∈ (0, 1], Eq. (56) also indicates that E_k(r) is a decreasing function of s_k². For r = 1, in particular, E_k(1) = M_H²(s_1², s_2², ..., s_n²)/s_k². This implies that as r increases from 0 to 1 the features corresponding to large values of s_k² have a progressively diminishing impact on the partition produced by the corresponding clustering algorithms.

Using Eq. (56), the total residual error E(r) = \sum_{k=1}^{n} E_k(r) corresponding to r ∈ (0, 1] takes the form

E(r) = n \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( s_\ell^{2} \right)^{\frac{r}{r-2}} \right)^{\frac{r-2}{r}} = n \, M_p(s_1^{2}, s_2^{2}, \ldots, s_n^{2}),   (57)


where M_p(s_1², s_2², ..., s_n²) is the generalized mean of s_1², s_2², ..., s_n², and p = r/(r − 2). As r → 0, p → 0 and the residual error E(0) corresponding to the constraint \prod_{k=1}^{n} w_k = 1 can be obtained as E(0) = n M_G(s_1², s_2², ..., s_n²). For r = 1, p = −1 and the residual error E(1) corresponding to the constraint \sum_{k=1}^{n} w_k = n can be obtained as E(1) = n M_H(s_1², s_2², ..., s_n²). Note also that the residual error E_I corresponding to the Euclidean norm can be obtained for W = I as E_I = n M_A(s_1², s_2², ..., s_n²).

The effect of the value of r ∈ (0, 1] on the residual error can be revealed by the behavior of E(r) as r spans the interval (0, 1]. As r increases from 0 to 1, p = r/(r − 2) decreases from 0 to −1. Moreover, the arithmetic mean M_A(s_1², s_2², ..., s_n²) can be obtained from the generalized mean M_p(s_1², s_2², ..., s_n²) for p = 1. Since the generalized mean M_p(s_1², s_2², ..., s_n²) is an increasing function of p, the residual errors E(r) corresponding to W ≠ I and different values of r and the residual error E_I corresponding to W = I (Euclidean norm) satisfy the inequality

E(1) \le E(r) \le E(0) \le E_I, \quad \forall r \in (0, 1),   (58)

with the equalities holding if s_1² = s_2² = ... = s_n².

4.2. Multi-norm algorithms

The residual error corresponding to multi-norm algorithms can be obtained using Eq. (20) as

E = \lim_{p \to -\infty} R_p(\mathbf{V}, \mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_c),   (59)

where V and {W_j} are optimally determined as indicated above. The residual error after an iterate of the multi-norm c-means algorithm can be measured as

E = \frac{1}{Mc} \sum_{i=1}^{M} \sum_{j=1}^{c} \alpha_{ij} \, \|x_i - v_j\|_{\mathbf{W}_j}^{2}.   (60)

Since ‖x_i − v_j‖²_{W_j} = \sum_{k=1}^{n} w_{jk}^{2} (x_{i,k} − v_{j,k})², Eq. (60) takes the form

E = \sum_{k=1}^{n} E_k,   (61)

where E_k is the residual error due to the kth feature, defined as

E_k = \frac{1}{c} \sum_{j=1}^{c} w_{jk}^{2} \, s_{jk}^{2},   (62)

where s_{jk}² is defined in Eq. (50). For r ∈ (0, 1], the residual error E_k(r) due to the kth feature can be obtained using Eq. (51) as

E_k(r) = \frac{1}{c} \sum_{j=1}^{c} \left( \frac{1}{n} \sum_{\ell=1}^{n} \left( \frac{s_{j\ell}^{2}}{s_{jk}^{2}} \right)^{\frac{r}{r-2}} \right)^{-1} M_p(s_{j1}^{2}, s_{j2}^{2}, \ldots, s_{jn}^{2}),   (63)

where M_p(s_{j1}², s_{j2}², ..., s_{jn}²) is the generalized mean of s_{j1}², s_{j2}², ..., s_{jn}², and p = r/(r − 2). If each prototype is assigned a distinct weighted norm, each cluster has a different contribution to the residual error E_k(r) due to the kth feature. According to the above analysis, the residual error E_k(r) corresponding to r ∈ (0, 1] depends on the relative values of {s_{j1}², s_{j2}², ..., s_{jn}²} with respect to each of the clusters represented by the prototypes v_j, 1 ≤ j ≤ c. If the clustering algorithm relies on the Euclidean norm, then W_j = I, 1 ≤ j ≤ c, and Eq. (62) gives

E_k^{I} = \frac{1}{c} \sum_{j=1}^{c} s_{jk}^{2} = \bar{s}_k^{2}.   (64)

The total residual error E(r) corresponding to r ∈ (0, 1] can be obtained using Eq. (63) as

E(r) = \frac{n}{c} \sum_{j=1}^{c} M_p(s_{j1}^{2}, s_{j2}^{2}, \ldots, s_{jn}^{2}).   (65)

In the limit r → 0, Eq. (65) gives E(0) = (n/c) \sum_{j=1}^{c} M_G(s_{j1}², s_{j2}², ..., s_{jn}²). For r = 1, Eq. (65) gives E(1) = (n/c) \sum_{j=1}^{c} M_H(s_{j1}², s_{j2}², ..., s_{jn}²). The residual error corresponding to the Euclidean norm can be obtained using Eq. (64) as E_I = (n/c) \sum_{j=1}^{c} M_A(s_{j1}², s_{j2}², ..., s_{jn}²). For r ∈ (0, 1], p = r/(r − 2) ∈ [−1, 0). Moreover, the value of p = r/(r − 2) decreases from 0 to −1 as the value of r increases from 0 to 1. Since the generalized mean M_p(s_{j1}², s_{j2}², ..., s_{jn}²) is an increasing function of p, Eq. (65) indicates that

E(1) \le E(r) \le E(0) \le E_I, \quad \forall r \in (0, 1).   (66)

4.3. Variance analysis

The above analysis indicated that the selection of the value of r for a given data set depends rather critically on the relative sizes of {s_k²} for single-norm algorithms and {s_{1k}², s_{2k}², ..., s_{ck}²} for multi-norm algorithms. The analysis which follows establishes a relationship between the residual error and the sample variances of the features. Let I_j be the set of indices of the feature vectors x_i ∈ X assigned to the jth cluster by a nearest-prototype partition. According to the analysis presented above, α_{ij} = c u_{ij}^{*}, where u_{ij}^{*} = 1 if i ∈ I_j and u_{ij}^{*} = 0 if i ∉ I_j. In such a case, each prototype v_j = [v_{j1} v_{j2} ... v_{jn}]^T is the centroid of {x_i}_{i ∈ I_j}, that is,

v_j = \frac{1}{|I_j|} \sum_{i \in I_j} x_i,   (67)

where |I_j| denotes the cardinality of the set I_j. In addition,

s_{jk}^{2} = c \, \frac{|I_j|}{M} \, \sigma_{jk}^{2},   (68)

where σ_{jk}² denotes the sample variance of the kth feature computed over the feature vectors assigned to the jth cluster, defined as

\sigma_{jk}^{2} = \frac{1}{|I_j|} \sum_{i \in I_j} (x_{i,k} - v_{j,k})^{2}, \quad 1 \le j \le c.   (69)

According to Eq. (69), {s_{1k}², s_{2k}², ..., s_{ck}²} are computed in terms of the prototypes v_1, v_2, ..., v_c. If the clustering process is initialized using a randomly generated set of prototypes, the corresponding variances in Eq. (69) may not be reliable. This problem can be overcome by replacing s_{1k}², s_{2k}², ..., s_{ck}² by their average \bar{s}_k^{2} = (1/c) \sum_{j=1}^{c} s_{jk}^{2} in order to estimate the value of r for the implementation of multi-norm algorithms. Under this assumption, the implementation of both single-norm and multi-norm algorithms relies on the values of s_1², s_2², ..., s_n². The value of s_k² can be roughly estimated in practice as s_k² = σ_k², where σ_k² is the sample variance of the kth feature computed over the entire feature set.

The above analysis indicated that the residual error E = E(r) corresponding to both single-norm and multi-norm c-means algorithms is a decreasing function of r. Thus, selecting the value of r involves a tradeoff between reducing the total residual error and balancing the effect of the features on the total residual error. A value of r close to 0 tends to balance the contributions of the residual errors due to different features to the total residual error. On the other hand, a value of r close to 0 tends to increase the value of the total residual error. The total residual error can be reduced by increasing the value of r. However, the effect of different features on the residual error becomes increasingly non-uniform as the value of r increases from 0 to 1. If there are significant differences among the values {s_k²} corresponding to different features, increasing the value of r from 0 to 1 would progressively diminish the role of the features corresponding to the largest values of s_k². This can be prevented by selecting a value of r close to 0, which would equalize the effect of all features on the total residual error. If the values of s_k² are similar, then all features have an approximately uniform effect on the total residual error. In such a case, the value of r can be selected to reduce the value of the total residual error. This can be accomplished if r approaches 1.

Given a data set, the value of r can be selected in practice by relying on the variances σ_1², σ_2², ..., σ_n² of the features computed over the entire feature set. The range of these variances can be quantified by computing the ratio θ = σ_max²/σ_min², where σ_max² = max_{1≤k≤n} {σ_k²} and σ_min² = min_{1≤k≤n} {σ_k²}. A value of θ close to 1 indicates that the values of the feature variances σ_1², σ_2², ..., σ_n² are similar. This would allow the reduction of the total residual error by selecting a value of r close to 1. The result of such a choice is that the features with the smallest variances have a more significant impact on the clustering process. A value of θ considerably higher than 1 indicates that there are significant differences among the values of the variances σ_1², σ_2², ..., σ_n² corresponding to different features. In such a case, the effect of different features on the total residual error can be equalized by selecting a value of r close to 0. Such a choice would prevent the features with the largest variances from dominating all the rest.
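The rule of thumb described above can be summarized as a simple heuristic: compute the feature variances over the entire data set, form θ = σ_max²/σ_min², and favor a value of r close to 1 when θ is close to 1 and a value close to 0 when θ is large. The sketch below (Python/NumPy) is one possible reading of that rule; the cut-off value and the two candidate values of r are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def suggest_r(X, theta_threshold=10.0):
    """Heuristic choice of r from the feature variances, following Section 4.3.

    theta = sigma^2_max / sigma^2_min over the whole feature set; a value of
    theta close to 1 favours r close to 1, while a large theta favours r close
    to 0. The threshold of 10 is an assumed, illustrative cut-off.
    """
    variances = X.var(axis=0)
    theta = variances.max() / variances.min()
    return (1.0 if theta < theta_threshold else 0.0), theta

# example with one feature whose variance dominates the others
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1.0, 200), rng.normal(0, 30.0, 200)])
print(suggest_r(X))    # large theta, so a value of r close to 0 is suggested
```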
5. Experimental results

This section presents an evaluation of the proposed algorithms on three data sets, which differ in terms of the data structure and the dimensionality of the feature vectors. The proposed single-norm and multi-norm non-Euclidean (NE) c-means algorithms, which are summarized in Tables 1 and 2, respectively, were compared with the c-means clustering algorithm that employs the Euclidean norm as a distance measure. All clustering algorithms were tested in 100 trials, each initialized by a different set of randomly generated prototypes. A trial was considered successful if the majority of the feature vectors belonging to a certain physical class were represented by the same prototype, called the majority prototype for this class, while each physical class had a different majority prototype. The performance of all clustering algorithms was evaluated based on the number of clustering errors, i.e., the number of feature vectors that are assigned to a wrong physical class by the terminal nearest-prototype partition of the data.


Table 1
Single-norm non-Euclidean c-means algorithm

1. Select c and ε; fix N; set ν = 0.
2. Generate an initial set of prototypes V = {v_1, v_2, ..., v_c}.
3. Initialize the weight matrix W = I.
4. Set ν = ν + 1.
   – u*_{ij} = 1 if ‖x_i − v_j‖²_W < ‖x_i − v_ℓ‖²_W, ∀ℓ ≠ j, and u*_{ij} = 0 otherwise, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   – α_{ij} = c u*_{ij}, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   – s_k² = (1/(Mc)) Σ_{i=1}^{M} Σ_{j=1}^{c} α_{ij} (x_{ik} − v_{jk})², 1 ≤ k ≤ n.
   – w_k = [ (1/n) Σ_{ℓ=1}^{n} (s_ℓ²/s_k²)^{r/(r−2)} ]^{−1/r}, 1 ≤ k ≤ n.
   – W = diag{w_1, w_2, ..., w_n}.
   – v_j = Σ_{i=1}^{M} α_{ij} x_i / Σ_{i=1}^{M} α_{ij}, 1 ≤ j ≤ c.
   – E_ν = (1/(Mc)) Σ_{i=1}^{M} Σ_{j=1}^{c} α_{ij} ‖x_i − v_j‖²_W.
   – If ν > 1, then compute E_ν^rel = (E_{ν−1} − E_ν)/E_{ν−1}.
5. If ν < N and E_ν^rel > ε, then go to step 4.

Table 2
Multi-norm non-Euclidean c-means algorithm

1. Select c and ε; fix N; set ν = 0.
2. Generate an initial set of prototypes V = {v_1, v_2, ..., v_c}.
3. Initialize the weight matrices W_j = I, 1 ≤ j ≤ c.
4. Set ν = ν + 1.
   – u*_{ij} = 1 if ‖x_i − v_j‖²_{W_j} < ‖x_i − v_ℓ‖²_{W_ℓ}, ∀ℓ ≠ j, and u*_{ij} = 0 otherwise, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   – α_{ij} = c u*_{ij}, 1 ≤ i ≤ M; 1 ≤ j ≤ c.
   – s_{jk}² = (1/M) Σ_{i=1}^{M} α_{ij} (x_{ik} − v_{jk})², 1 ≤ j ≤ c; 1 ≤ k ≤ n.
   – w_{jk} = [ (1/n) Σ_{ℓ=1}^{n} (s_{jℓ}²/s_{jk}²)^{r/(r−2)} ]^{−1/r}, 1 ≤ j ≤ c; 1 ≤ k ≤ n.
   – W_j = diag{w_{j1}, w_{j2}, ..., w_{jn}}, 1 ≤ j ≤ c.
   – v_j = Σ_{i=1}^{M} α_{ij} x_i / Σ_{i=1}^{M} α_{ij}, 1 ≤ j ≤ c.
   – E_ν = (1/(Mc)) Σ_{i=1}^{M} Σ_{j=1}^{c} α_{ij} ‖x_i − v_j‖²_{W_j}.
   – If ν > 1, then compute E_ν^rel = (E_{ν−1} − E_ν)/E_{ν−1}.
5. If ν < N and E_ν^rel > ε, then go to step 4.
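Read as code, the steps of Table 1 may be easier to follow. The sketch below (Python/NumPy) is a minimal, illustrative implementation of the single-norm NE c-means algorithm, not the authors' own code: initialization, tie handling, and numerical safeguards (e.g., zero s_k² for constant features) are simplified, and the r → 0 weight formula of Eq. (45) is treated as a special case.

```python
import numpy as np

def single_norm_ne_cmeans(X, c, r=0.5, eps=1e-4, max_iter=100, seed=0):
    """Single-norm non-Euclidean c-means (Table 1). X: (M, n) float array."""
    rng = np.random.default_rng(seed)
    M, n = X.shape
    V = X[rng.choice(M, size=c, replace=False)].astype(float)  # step 2: initial prototypes
    w = np.ones(n)                                              # step 3: W = I
    E_prev = None
    for _ in range(max_iter):                                   # steps 4-5
        # nearest-prototype indicator functions u*_ij under the weighted norm
        d2 = ((w * (X[:, None, :] - V[None, :, :])) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        u = np.zeros((M, c))
        u[np.arange(M), labels] = 1.0
        alpha = c * u                                           # competition functions
        # per-feature residuals s_k^2, Eq. (43)
        diff2 = (X[:, None, :] - V[None, :, :]) ** 2            # shape (M, c, n)
        s2 = (alpha[:, :, None] * diff2).sum(axis=(0, 1)) / (M * c)
        # norm weights, Eq. (44) (Eq. (45) in the limit r = 0)
        if r == 0.0:
            s = np.sqrt(s2)
            w = np.exp(np.mean(np.log(s))) / s
        else:
            q = r / (r - 2.0)
            w = (np.mean((s2[None, :] / s2[:, None]) ** q, axis=1)) ** (-1.0 / r)
        # prototype update: centroid of the assigned feature vectors, Eq. (38)
        for j in range(c):
            members = labels == j
            if members.any():
                V[j] = X[members].mean(axis=0)
        # residual error E_nu and relative-change stopping rule
        d2 = ((w * (X[:, None, :] - V[None, :, :])) ** 2).sum(axis=2)
        E = (alpha * d2).sum() / (M * c)
        if E_prev is not None and (E_prev - E) / E_prev <= eps:
            break
        E_prev = E
    labels = ((w * (X[:, None, :] - V[None, :, :])) ** 2).sum(axis=2).argmin(axis=1)
    return V, np.diag(w), labels
```

A call such as single_norm_ne_cmeans(X, c=2, r=1.0) mirrors, in spirit, the single-norm configurations evaluated in the experiments below; the multi-norm variant of Table 2 follows the same pattern with one weight vector per cluster.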

5.1. ROCK data

The ROCK data set consists of 201 two-dimensional feature vectors belonging to two physical classes. This data set was produced by an experiment designed to simulate a martian subsurface ecosystem in an attempt to explore possible biogenic features of the martian meteorite ALH84001. Water from a deep basaltic aquifer was used to cultivate unweathered basaltic rock. The resulting mineralized features were measured with an electron microscope for the longest and shortest dimensions. The results can be divided into two classes, known bacteria and filaments, which may or may not be biogenic in origin. The "filament" and "bacteria" classes contain 108 and 93 feature vectors, respectively.

Table 3
Performance of the c-means and NE c-means clustering algorithms tested with different values of r on the ROCK data: number of clustering errors (E) recorded on average in 100 trials, standard deviation (σ_E), percentage of failed trials (F), and average number of iterations (N)

Algorithm                              E      σ_E    F [%]   N
c-means (Euclidean)                    –      –      100     5.25
Single-Norm NE c-means (r = 0.00)      –      –      100     7.36
Single-Norm NE c-means (r = 0.01)      –      –      100     7.33
Single-Norm NE c-means (r = 0.05)      –      –      100     7.41
Single-Norm NE c-means (r = 0.10)      –      –      100     7.11
Single-Norm NE c-means (r = 0.50)      17.00  0.00   97      6.53
Single-Norm NE c-means (r = 1.00)      15.03  0.23   24      10.21
Multi-Norm NE c-means (r = 0.00)       3.91   0.29   26      8.75
Multi-Norm NE c-means (r = 0.01)       3.92   0.27   27      8.47
Multi-Norm NE c-means (r = 0.05)       3.00   0.00   22      9.20
Multi-Norm NE c-means (r = 0.10)       3.00   0.00   22      7.98
Multi-Norm NE c-means (r = 0.50)       5.04   0.19   23      5.47
Multi-Norm NE c-means (r = 1.00)       12.54  2.12   18      7.12

Figure 1(a) shows the distribution of the feature vectors that represent the physical classes "filament" and "bacteria" in the 2-D feature space defined by the features "length of crystal" (first feature) and "width of crystal" (second feature). The variances of the two features are σ_1² = 5.4210 × 10^5 and σ_2² = 0.3906 × 10^5. The ratio between the largest and smallest feature variances is θ = σ_max²/σ_min² = 13.9.

The feature vectors from the ROCK data set were clustered in c = 2 clusters by the c-means algorithm, and the single-norm and multi-norm NE c-means algorithms, tested with various values of r. The results of these experiments are summarized in Table 3. The c-means algorithm failed in all 100 trials, which is not surprising given the nature of the ROCK data set. In fact, the distribution of the feature vectors in the feature space indicates that this data set would be particularly challenging for most, if not all, clustering algorithms relying on the Euclidean norm. According to Table 3, the single-norm NE c-means algorithm failed for values of r between 0 and 0.5 but its performance improved considerably as the value of r increased above 0.5 and approached 1. The multi-norm NE c-means algorithm outperformed both the c-means and the single-norm NE c-means algorithms, especially for values of r between 0 and 0.5. For values of r in this interval, the number of clustering errors produced by the multi-norm NE c-means algorithm fluctuated between 3 and 5. However, the performance of the multi-norm NE c-means algorithm degraded slightly for values of r between 0.5 and 1.

Figure 1 shows the boundary between the two clusters and the feature vectors that are equidistant from the two prototypes v_1 and v_2 produced by the c-means algorithm, and the single-norm and multi-norm NE c-means algorithms tested with various values of r. The feature vectors that are equidistant from each of the prototypes produced by the c-means algorithm belong to circles ‖x − v_j‖² = k², due to the use of the Euclidean norm, while the boundary between the two clusters is a straight line. Figure 1 reveals that the partition of the feature space produced by the c-means algorithm is not consistent with the distribution of the data. This can be attributed to the fact that the algorithm attempts to create "circular" clusters in a data set that contains no such clusters. The feature vectors that are equidistant from each of the prototypes v_1 and v_2 produced by the single-norm NE c-means algorithm belong to ellipses ‖x − v_j‖²_W = k², while the boundaries between the clusters are straight lines. In this case, the ratio along the main directions of the ellipses corresponding to both prototypes is the same.

[Figure 1 appears here: eight scatter plots of the ROCK data, plotting width of crystal (nm) against length of crystal (nm), with the classes "Bacteria" (o) and "Filament" (+) and the cluster boundaries produced by each algorithm; see the caption below.]
Fig. 1. (a) The ROCK data. Boundary between the two clusters and feature vectors equidistant from the prototypes v1 and v2 produced for the ROCK data by (b) the c-means algorithm, (c) the single-norm NE c-means algorithm tested with r = 0, (d) the multi-norm NE c-means algorithm tested with r = 0, (e) the single-norm NE c-means algorithm tested with r = 0.5, (f) the multi-norm NE c-means algorithm tested with r = 0.5, (g) the single-norm NE c-means algorithm tested with r = 1, (h) the multi-norm NE c-means algorithm tested with r = 1.


The value of r affected the slope of the boundary between the two clusters, as is evident from Fig. 1. According to Fig. 1, the feature vectors that were equidistant from the prototypes produced by the multi-norm NE c-means algorithm belong to ellipses ‖x − v_j‖²_{W_j} = k². Due to the use of a distinct weighted norm for each of the prototypes, the ratios among the main directions of the ellipses corresponding to the two prototypes were different, while the boundary between the two clusters was a conic. This is a particularly useful feature of the multi-norm NE c-means clustering algorithm, especially for data sets containing clusters of different shapes and sizes.

5.2. IRIS data

The IRIS data set contains 150 feature vectors of dimension four, which belong to three physical classes representing different IRIS subspecies [1]. Each class contains 50 feature vectors. One of the three classes is well separated from the other two, which are not easily separable due to the overlapping of their convex hulls. The variances of the four features of the IRIS data set are σ_1² = 0.686, σ_2² = 0.190, σ_3² = 3.116, and σ_4² = 0.581. The ratio between the largest and smallest feature variances is θ = σ_max²/σ_min² = 16.4.

Table 4 summarizes the performance on the IRIS data set of the c-means algorithm and the single-norm and multi-norm NE c-means clustering algorithms. On average, the c-means algorithm produced about 17 clustering errors on the IRIS data set, which is typical for a broad variety of clustering algorithms relying on the Euclidean norm [3,16,20]. The c-means algorithm was outperformed by the single-norm NE c-means algorithm, which produced about 10 clustering errors on average for values of r between 0 and 0.1. The number of clustering errors produced by the single-norm NE c-means algorithm decreased considerably for values of r between 0.5 and 1, while the algorithm produced identical nearest-prototype partitions in all successful trials. The multi-norm NE c-means algorithm produced almost the same number of clustering errors as the single-norm NE c-means algorithm for values of r between 0 and 0.1. Compared with the single-norm NE c-means algorithm, the multi-norm NE c-means algorithm resulted in fewer successful trials. Finally, the performance of the multi-norm NE c-means algorithm degraded for values of r close to 1.

5.3. WINE data

The WINE data set consists of 13 continuous-valued features belonging to three physical classes. This data set was obtained by chemical analysis of wine produced by three different cultivators from the same region of Italy. This data set contains 178 feature vectors, with 59 in class 1, 71 in class 2, and 48 in class 3. For this particular data set, the feature variances span a wide range of values between σ_min² = σ_8² = 1.5489 × 10^{-2} and σ_max² = σ_13² = 9.9167 × 10^4, and correspond to a ratio θ = σ_max²/σ_min² = 6.4 × 10^6. With the exception of the feature variances σ_5² = 203.99, σ_4² = 11.153, σ_10² = 5.3744, and σ_2² = 1.2480, the variances of the rest of the features take values below 1. The distribution of the feature variances indicates that the variance corresponding to the thirteenth feature is clearly an outlier. Nevertheless, this feature was kept in the data in order to evaluate the ability of the proposed non-Euclidean c-means clustering algorithms to produce satisfactory partitions of feature vectors when the feature variances span a very wide range of values.
The feature vectors from the WINE data set were clustered by the c-means algorithm and the proposed single-norm and multi-norm NE c-means algorithms, tested with various values of r . The results of these experiments are summarized in Table 5. On average, the c-means algorithm clustered incorrectly about 53 out of 178 feature vectors included in the WINE data set. According to Table 5, the proposed NE c-means algorithms performed considerably better than the c-means algorithm for values of r in


Table 4
Performance of the c-means and NE c-means algorithms tested with different values of r on the IRIS data: number of clustering errors (E) recorded on average in 100 trials, standard deviation (σ_E), percentage of failed trials (F), and average number of iterations (N)

Algorithm                              E      σ_E    F [%]   N
c-means (Euclidean)                    16.60  0.49   19      9.25
Single-Norm NE c-means (r = 0.00)      9.55   4.14   20      7.44
Single-Norm NE c-means (r = 0.01)      9.59   4.15   21      7.36
Single-Norm NE c-means (r = 0.05)      9.59   4.15   21      7.23
Single-Norm NE c-means (r = 0.10)      9.29   4.11   21      6.99
Single-Norm NE c-means (r = 0.50)      6.00   0.00   20      7.33
Single-Norm NE c-means (r = 1.00)      7.00   0.00   22      6.94
Multi-Norm NE c-means (r = 0.00)       9.43   4.08   35      7.03
Multi-Norm NE c-means (r = 0.01)       9.43   4.08   35      6.98
Multi-Norm NE c-means (r = 0.05)       9.43   4.08   35      7.00
Multi-Norm NE c-means (r = 0.10)       9.31   4.06   35      6.97
Multi-Norm NE c-means (r = 0.50)       8.47   3.71   38      6.32
Multi-Norm NE c-means (r = 1.00)       12.12  11.43  33      6.53

Table 5
Performance of the c-means and NE c-means algorithms tested with different values of r on the WINE data: number of clustering errors (E) recorded on average in 100 trials, standard deviation (σ_E), percentage of failed trials (F), and average number of iterations (N)

Algorithm                              E      σ_E    F [%]   N
c-means (Euclidean)                    53.00  0.00   22      8.58
Single-Norm NE c-means (r = 0.00)      9.94   1.10   0       7.56
Single-Norm NE c-means (r = 0.01)      10.12  1.02   0       7.44
Single-Norm NE c-means (r = 0.05)      11.07  0.92   0       8.08
Single-Norm NE c-means (r = 0.10)      13.76  1.07   0       8.07
Single-Norm NE c-means (r = 0.50)      30.94  4.74   3       9.96
Single-Norm NE c-means (r = 1.00)      –      –      100     8.32
Multi-Norm NE c-means (r = 0.00)       15.47  2.16   4       8.18
Multi-Norm NE c-means (r = 0.01)       16.06  1.84   4       7.94
Multi-Norm NE c-means (r = 0.05)       17.69  1.75   4       7.66
Multi-Norm NE c-means (r = 0.10)       18.49  2.00   4       8.17
Multi-Norm NE c-means (r = 0.50)       28.86  8.35   23      8.15
Multi-Norm NE c-means (r = 1.00)       –      –      100     7.99

a certain range. The number of clustering errors produced by the single-norm NE c-means algorithm increased from 9 to 13 as the value of r increased from 0 to 0.1. For values of r in this interval, all trials were successful. The performance of these algorithms degraded as the value of r increased above 0.1, while the single-norm NE c-means algorithm failed to cluster the WINE data for values of r above 0.5. According to Table 5, the multi-norm NE c-means algorithm exhibited its best performance for the same values of r that resulted in the best performance of the single-norm NE c-means algorithm. Compared with the single-norm NE c-means clustering algorithm, the multi-norm NE c-means clustering algorithm reduced the fluctuation of the number of clustering errors caused by changing the value of r .


6. Conclusions

This paper showed that reformulation can be the basis for the development of hard or crisp clustering algorithms. More specifically, it was shown that the c-means algorithm can be derived by minimizing a reformulation function that relies on the generalized mean in the limit where the generalized mean approaches the minimum. This alternative derivation of the c-means algorithm allows the development of hard or crisp clustering algorithms that rely on a single weighted norm or multiple weighted norms to measure the distance between the feature vectors and their prototypes. Such algorithms were developed in this paper by solving a constrained minimization problem, with the constraints imposed on the weights involved in the definition of the weighted norms. The clustering algorithms produced by this approach were evaluated and compared with the Euclidean c-means algorithm on three data sets that differ in terms of the data structure and the dimensionality of the feature vectors. This experimental study indicated that the proposed non-Euclidean c-means algorithms provide an attractive alternative to Euclidean c-means clustering in applications that involve data sets containing clusters of different shapes and sizes. In particular, this experimental study revealed the flexibility and versatility of the multi-norm NE c-means algorithm proposed in this paper. It is also remarkable that the advantages offered by the proposed algorithms can be realized while keeping the computational overhead low. This is due to the fact that the proposed algorithms rely on the nearest-prototype condition to assign the feature vectors to clusters. This is the same condition employed for feature vector assignment by the Euclidean c-means algorithm. Thus, the performance gains associated with the proposed NE c-means algorithms can be attributed to the data-dependent computational procedures they employ to compute the weighted norms used to measure the distances between the feature vectors and the prototypes.

Acknowledgments

We would like to thank Sue Wentworth, of Lockheed Space Operations, for providing the ROCK data set. The WINE data set was acquired from the University of California at Irvine web site at http://www.ics.uci.edu/~mlearn/MLRepository.html.

References

[1] E. Anderson, The IRISes of the Gaspe Peninsula, Bulletin of the American IRIS Society 59 (1939), 2–5.
[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.
[3] J.C. Bezdek and N.R. Pal, Two soft relatives of learning vector quantization, Neural Networks 8(5) (1995), 729–743.
[4] L. Bobrowski and J.C. Bezdek, c-means clustering with the ℓ1 and ℓ∞ norms, IEEE Transactions on Systems, Man, and Cybernetics 21(3) (1991), 545–554.
[5] D. Dubois and H. Prade, A review of fuzzy set aggregation connectives, Information Sciences 36(1–2) (1985), 85–121.
[6] H. Dyckhoff and W. Pedrycz, Generalized means as a model of compensative connectives, Fuzzy Sets and Systems 14(2) (1984), 143–154.
[7] D.E. Gustafson and W. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in: Proceedings of IEEE Conference on Decision and Control, San Diego, CA, 1979, pp. 761–766.
[8] R.J. Hathaway and J.C. Bezdek, NERF c-means: Non-Euclidean relational fuzzy clustering, Pattern Recognition 27(3) (1994), 429–437.
[9] R.J. Hathaway and J.C. Bezdek, Optimization of clustering criteria by reformulation, IEEE Transactions on Fuzzy Systems 3 (1995), 241–246.
[10] R.J. Hathaway, J.C. Bezdek and Y. Hu, Generalized fuzzy c-means clustering strategies using Lp norm distances, IEEE Transactions on Fuzzy Systems 8(5) (2000), 576–582.
[11] K. Jajuga, L1-norm based fuzzy clustering, Fuzzy Sets and Systems 39 (1991), 43–50.
[12] N.B. Karayiannis, Fuzzy partition entropies and entropy constrained fuzzy clustering algorithms, Journal of Intelligent and Fuzzy Systems 5(2) (1997), 103–111.
[13] N.B. Karayiannis, Reformulating learning vector quantization and radial basis neural networks, Fundamenta Informaticae 37 (1999), 137–175.
[14] N.B. Karayiannis, An axiomatic approach to soft learning vector quantization based on reformulation, IEEE Transactions on Neural Networks 10(5) (1999), 1153–1165.
[15] N.B. Karayiannis, From aggregation operators to soft learning vector quantization and clustering algorithms, in: Kohonen Maps, E. Oja and S. Kaski, eds, Elsevier, Amsterdam, 1999, pp. 47–56.
[16] N.B. Karayiannis, Generalized fuzzy c-means algorithms, Journal of Intelligent and Fuzzy Systems 8(1) (2000), 63–81.
[17] N.B. Karayiannis, Soft learning vector quantization and clustering algorithms based on ordered weighted aggregation operators, IEEE Transactions on Neural Networks 11(5) (2000), 1093–1105.
[18] N.B. Karayiannis, Soft learning vector quantization and clustering algorithms based on mean-type aggregation operators, International Journal of Fuzzy Systems 4(3) (2002), 739–751.
[19] N.B. Karayiannis and J.C. Bezdek, An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering, IEEE Transactions on Fuzzy Systems 5(4) (1997), 622–628.
[20] N.B. Karayiannis, J.C. Bezdek, N.R. Pal, R.J. Hathaway and P.-I. Pai, Repairs to GLVQ: A new family of competitive learning schemes, IEEE Transactions on Neural Networks 7(5) (1996), 1062–1071.
[21] N.B. Karayiannis and M.M. Randolph-Gips, Soft learning vector quantization and clustering algorithms based on non-Euclidean norms: Single-norm algorithms, in: Proceedings of Fourth International Conference on Neural Networks and Expert Systems in Medicine and Health Care, Milos Island, Greece, June 20–22, 2001, pp. 134–141.
[22] N.B. Karayiannis and M.M. Randolph-Gips, Soft learning vector quantization and clustering algorithms based on non-Euclidean norms: Multi-norm algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 89–102.
[23] N.B. Karayiannis and N. Zervos, Entropy-constrained learning vector quantization algorithms and their application in image compression, Journal of Electronic Imaging 9(4) (2000), 495–508.
[24] G.J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, NJ, 1995.
[25] R. Krishnapuram and J. Kim, A note on the Gustafson-Kessel and adaptive fuzzy clustering algorithms, IEEE Transactions on Fuzzy Systems 7(4) (1999), 453–461.
[26] J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA, 1974.
[27] E.C.-K. Tsao, J.C. Bezdek and N.R. Pal, Fuzzy Kohonen clustering networks, Pattern Recognition 27(5) (1994), 757–764.
