Mercer Kernel, Fuzzy C-Means Algorithm, and Prototypes of Clusters

Shangming Zhou and John Q. Gan

Department of Computer Science, University of Essex, Colchester CO4 3SQ, UK
{szhoup,jqgan}@essex.ac.uk
Abstract. In this paper, an unsupervised Mercer kernel based fuzzy c-means (MKFCM) clustering algorithm is proposed, in which the implicit assumptions about the shapes of clusters in the FCM algorithm are removed so that the new algorithm adapts well to the cluster structures present in the data samples. A new method for calculating the prototypes of clusters in input space is also proposed, which is essential for data clustering applications. Experimental results have demonstrated the promising performance of the MKFCM algorithm in different scenarios.
1 Introduction

Since Vapnik introduced support vector machines (SVM) [1], Mercer kernel based learning has received much attention in the machine learning community [2] [3] [4]. In [5], a fuzzy SVM was proposed, in which a certain membership degree is assigned to each input point of the SVM so that different points make different contributions to the learning of the decision surface. Chiang and Hao extended the support vector clustering (SVC) algorithm developed in [3] from the one-sphere scenario to multi-sphere scenarios and then constructed fuzzy membership functions of the clusters manually in terms of the spherical radius in feature space [6]. However, very few efforts have been made to introduce the kernel treatment into fuzzy clustering. The objective of this paper is to apply kernel techniques to the fuzzy c-means (FCM) algorithm [7], in order to improve its performance and remove some of its constraints.

The FCM algorithm has been applied in various areas; however, it makes implicit assumptions concerning the shape and size of the clusters, i.e., that the clusters are hyper-spherical and of approximately the same size. Some efforts have been made to avoid these assumptions. The Mahalanobis distance is used in the Gustafson-Kessel (GK) fuzzy clustering algorithm [7] [8] instead of the Euclidean distance, in which case the clusters are in fact assumed to be hyperelliptical. In order to detect non-hyperspherical structural subsets, Bezdek et al. [9] defined linear cluster structures of different dimensions in the fuzzy clustering process. In [10], Jerome et al. proposed a probabilistic relaxation fuzzy clustering algorithm based on the estimation of the probability density function, without assumptions about the size and shape of clusters.

In this paper, a Mercer kernel based FCM (MKFCM) clustering algorithm is proposed to identify naturally occurring clusters while preserving the associated information about the relations between the clusters, which removes the implicit assumption of hyper-spherical or ellipsoidal clusters within input data samples. However, one of the main problems in using kernel methods for unsupervised clustering is that it is usually difficult to obtain the prototypes of clusters in both feature space and input space [11]. A new method for calculating cluster prototypes in input space is developed in this paper.

The organization of this paper is as follows. Section 2 describes the MKFCM clustering algorithm. The proposed algorithm is experimentally evaluated in Section 3. Section 4 provides a conclusion.

Z.R. Yang et al. (Eds.): IDEAL 2004, LNCS 3177, pp. 613–618, 2004. © Springer-Verlag Berlin Heidelberg 2004
2 MKFCM: An Unsupervised Fuzzy Clustering Algorithm

2.1 MKFCM Clustering Algorithm

Using the Euclidean distance, the FCM algorithm tends to partition the data points into clusters of hyperspherical shape, with an equal number of data points in each cluster. By mapping the observed data points in input space $\mathbb{R}^P$ into a high dimensional feature space $\Gamma$ using a nonlinear function $\Phi(\cdot)$, the clustering process can be carried out in feature space rather than input space, and the restrictions in input space on the cluster shapes may be avoided. The criterion function used in feature space is defined as

$$J_m(U, V^\Phi) = \sum_{k=1}^{c} \sum_{j=1}^{N} (u_{kj})^m D_{kj}^2 = \sum_{k=1}^{c} \sum_{j=1}^{N} (u_{kj})^m \left\| \Phi(x_j) - V_k^\Phi \right\|^2 \tag{1}$$
where $U = (u_{kj})$ is a fuzzy c-partition of $\Phi(X) = \{\Phi(x_1), \ldots, \Phi(x_N)\} \subset \Gamma$, which corresponds to the partition matrix of $X \subset \mathbb{R}^P$, $V^\Phi = (V_1^\Phi, \ldots, V_c^\Phi)$ represents the cluster prototypes in feature space, and $m \in [1, \infty)$ is a weighting exponent. As $m \to 1$, (1) becomes hard and converges in theory to a "generalized" kernel version of the hard c-means solution as shown in [4], so $m$ is generally assumed to belong to $(1, \infty)$ in (1). The optimal partition is obtained by minimizing (1) subject to the constraints

$$\sum_{k=1}^{c} u_{kj} = 1, \quad \forall j.$$
Clearly, the cluster prototypes $V_k^\Phi$ should lie in the span of $\Phi(x_1), \ldots, \Phi(x_N)$. Furthermore, by setting the partial derivatives of $J_m$ w.r.t. $V_k^\Phi$ to zero, we obtain

$$V_k^\Phi = \sum_{j=1}^{N} \tilde{u}_{kj} \Phi(x_j) \tag{2}$$

where $\tilde{u}_{kj} = (u_{kj})^m / N_k$ with $N_k = \sum_{i=1}^{N} (u_{ki})^m$. Based on expression (2) for $V_k^\Phi$, through some manipulations the criterion function (1) can be expressed as follows:
$$J_m(U) = \sum_{k=1}^{c} \sum_{j=1}^{N} (u_{kj})^m K_{jj} - \sum_{k=1}^{c} \frac{1}{N_k} \sum_{j=1}^{N} \sum_{i=1}^{N} (u_{kj})^m (u_{ki})^m K_{ij} \tag{3}$$

where $K = (K_{ij})$ is an $N \times N$ symmetric kernel matrix defined by $K_{ij} = k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. From the above criterion function, it is clear that, based on the kernel expression, the clustering in feature space makes no implicit assumption concerning the shape of the clusters, such as a hyper-spherical or hyper-ellipsoidal structure in input space.
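As a small numerical check of the step from (1) to (3), the sketch below evaluates the criterion both ways using only a kernel matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, the membership matrix, and the choice of a Gaussian kernel with $\sigma = 2$ are all hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / sigma^2); the kernel choice here is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / sigma ** 2)

def kernel_matrix(X, kernel):
    """N x N symmetric matrix with entries K[i][j] = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

def criterion_eq3(U, K, m=2.0):
    """Criterion (3): uses only U and K, never the mapping Phi."""
    N = len(K)
    total = 0.0
    for u_k in U:                        # one row of U per cluster k
        w = [u ** m for u in u_k]        # (u_kj)^m
        N_k = sum(w)
        diag = sum(w[j] * K[j][j] for j in range(N))
        cross = sum(w[j] * w[i] * K[i][j] for j in range(N) for i in range(N))
        total += diag - cross / N_k
    return total

def criterion_eq1(U, K, m=2.0):
    """Criterion (1), with D_kj^2 expanded through kernels after substituting (2)."""
    N = len(K)
    total = 0.0
    for u_k in U:
        w = [u ** m for u in u_k]
        N_k = sum(w)
        ut = [wj / N_k for wj in w]      # normalized weights u~_kj from (2)
        vv = sum(ut[l] * ut[i] * K[l][i] for l in range(N) for i in range(N))
        for j in range(N):
            d2 = K[j][j] - 2.0 * sum(ut[i] * K[i][j] for i in range(N)) + vv
            total += w[j] * d2
    return total

# Toy data (hypothetical): two tight pairs of points, two clusters.
X = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
U = [[0.9, 0.8, 0.1, 0.2],
     [0.1, 0.2, 0.9, 0.8]]               # each column sums to 1
K = kernel_matrix(X, lambda a, b: gaussian_kernel(a, b, sigma=2.0))
J1 = criterion_eq1(U, K)
J3 = criterion_eq3(U, K)
```

Both functions should return the same value up to rounding, which mirrors the claim that the criterion can be evaluated without an explicit expression for $\Phi$.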
2.1.1 Optimal Partitioning Matrix

The first problem in developing the MKFCM algorithm is how to optimize the partitioning matrix $U$ by minimizing (1) subject to the constraints $\sum_{k=1}^{c} u_{kj} = 1, \ \forall j$. Defining the following Lagrangian:

$$L(U, \beta) = J_m + \sum_{j=1}^{N} \beta_j \left( \sum_{k=1}^{c} u_{kj} - 1 \right) \tag{4}$$
and setting the Lagrangian's gradients to zero, after some manipulation we obtain

$$u_{kj} = \frac{(\rho_{kj})^{\frac{1}{1-m}}}{\sum_{k'=1}^{c} (\rho_{k'j})^{\frac{1}{1-m}}} \tag{5}$$

where

$$\rho_{kj} := K_{jj} - \frac{2}{N_k} \sum_{i=1}^{N} (u_{ki})^m K_{ij} + \frac{1}{N_k^2} \sum_{l=1}^{N} \sum_{i=1}^{N} (u_{kl})^m (u_{ki})^m K_{li} \tag{6}$$

that is, $\rho_{kj} = \| \Phi(x_j) - V_k^\Phi \|^2 = D_{kj}^2$ written in terms of the kernel.

When $\rho_{kj} = 0$, i.e., the sample $x_j$ is located at the core center of the fuzzy set for the $k$th cluster, special care is needed. First, for each $j$ ($1 \le j \le N$) the cluster indices are divided into two groups $I_j$ and $\tilde{I}_j$, where $I_j := \{k \mid 1 \le k \le c;\ \rho_{kj} = 0\}$ and $\tilde{I}_j := \{1, \ldots, c\} - I_j$. If $I_j \neq \emptyset$, then set $u_{kj} = 0$ for $k \in \tilde{I}_j$ and choose the memberships for $k \in I_j$ so that $\sum_{k \in I_j} u_{kj} = 1$.

2.1.2 Cluster Prototypes in Input Space

Another problem in developing the MKFCM algorithm is how to obtain the cluster prototypes in input space. After an optimal partition $U$ that minimizes criterion function (1) is obtained, the cluster prototypes $V_k^\Phi$ can only be expressed as expansions of mapped patterns. However, it is difficult to obtain explicit expressions for the mapped patterns, and even if explicit expressions are available it is not guaranteed that there exists a preimage pattern $v_k$ in input space such that $\Phi(v_k) = V_k^\Phi$, since the mapping
function $\Phi$ is nonlinear. The problem of cluster prototypes in kernel based clustering methods is far from being solved in the literature [11]. This paper proposes a new method to calculate cluster prototypes in input space in terms of the kernel function rather than the mapping function. The basic idea is to approximate these prototypes in input space by minimizing the functional $W(v_k) = \| \Phi(v_k) - V_k^\Phi \|^2$. Substituting (2) for $V_k^\Phi$, $W(v_k)$ can be expressed as follows:

$$W(v_k) = \sum_{l=1}^{N} \sum_{i=1}^{N} \tilde{u}_{kl} \tilde{u}_{ki} k(x_l, x_i) - 2 \sum_{j=1}^{N} \tilde{u}_{kj} k(x_j, v_k) + k(v_k, v_k) \tag{7}$$
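To make (7) concrete, the following sketch evaluates $W(v_k)$ for a candidate prototype using kernel evaluations only. The function name `preimage_cost`, the toy points, the weights, and the Gaussian kernel with $\sigma = 1.5$ are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / sigma^2); the kernel choice here is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / sigma ** 2)

def preimage_cost(v, X, u_tilde, kernel):
    """W(v_k) from (7) for one cluster; u_tilde holds the normalized weights u~_kj."""
    N = len(X)
    const = sum(u_tilde[l] * u_tilde[i] * kernel(X[l], X[i])
                for l in range(N) for i in range(N))       # does not depend on v
    cross = sum(u_tilde[j] * kernel(X[j], v) for j in range(N))
    return const - 2.0 * cross + kernel(v, v)

X = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]
k = lambda a, b: gaussian_kernel(a, b, sigma=1.5)

# All weight on x_1: V_k^Phi equals Phi(x_1), so v = x_1 is an exact preimage.
w_exact = preimage_cost(X[0], X, [1.0, 0.0, 0.0], k)

# Spread weights: a candidate near the heavily weighted points costs less
# than one placed on the lightly weighted outlier.
u_tilde = [0.6, 0.3, 0.1]
w_near = preimage_cost((0.3, 0.0), X, u_tilde, k)
w_far = preimage_cost((4.0, 4.0), X, u_tilde, k)
```

Because the double sum does not depend on $v_k$, only the last two terms matter when searching for a minimizer.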
As a matter of fact, the minimization of (7) corresponds to minimizing the distance between $V_k^\Phi$ and the orthogonal projection of $V_k^\Phi$ onto $\mathrm{span}(\Phi(v_k))$, which is equivalent to maximizing the term $(V_k^\Phi \cdot \Phi(v_k))^2 / (\Phi(v_k) \cdot \Phi(v_k))$. Schölkopf et al. approximated a preimage pattern in input space for a given mapped pattern in feature space in terms of this term [11]. This paper uses the minimization of (7) since it avoids complicated calculations of derivatives and the use of an explicit expression of the mapping function $\Phi$.

In the following, Gaussian kernels $k(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$ and polynomial kernels $k(x, y) = ((x \cdot y) + \theta)^d$ are discussed, where $\sigma$ is the width of the Gaussian kernel, and $\theta$ and $d$ are the offset and exponent parameters of the polynomial kernel, respectively.

For Gaussian kernels, $k(v_k, v_k) = 1$ is constant, and setting the gradient of (7) w.r.t. $v_k$ to zero gives the approximated cluster prototypes in input space as follows:

$$v_k = \frac{\sum_{j=1}^{N} (u_{kj})^m k(x_j, v_k)\, x_j}{\sum_{j=1}^{N} (u_{kj})^m k(x_j, v_k)} \tag{8}$$

For polynomial kernels, by setting the gradient of (7) w.r.t. $v_k$ to zero we have

$$v_k = \frac{\sum_{j=1}^{N} (u_{kj})^m ((x_j \cdot v_k) + \theta)^{d-1}\, x_j}{N_k\, ((v_k \cdot v_k) + \theta)^{d-1}} \tag{9}$$
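The membership update (5)-(6) and the Gaussian prototype equation (8) both have the unknown on each side, so they are naturally applied by repeated substitution. The sketch below shows one possible way to do this; it is not the authors' procedure. The function names, the tolerance `eps` used to detect $\rho_{kj} = 0$, the equal split of membership among the clusters in $I_j$, the weighted-mean starting point, the stopping rule, and the toy data are all assumptions.

```python
import math

def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / sigma ** 2)

def update_memberships(U, K, m=2.0, eps=1e-12):
    """One sweep of (5), with rho_kj computed from the kernel matrix as in (6)."""
    c, N = len(U), len(K)
    rho = []
    for u_k in U:
        w = [u ** m for u in u_k]
        N_k = sum(w)
        quad = sum(w[l] * w[i] * K[l][i] for l in range(N) for i in range(N)) / N_k ** 2
        rho.append([K[j][j] - 2.0 / N_k * sum(w[i] * K[i][j] for i in range(N)) + quad
                    for j in range(N)])
    new_U = [[0.0] * N for _ in range(c)]
    for j in range(N):
        zero = [k for k in range(c) if rho[k][j] <= eps]    # index set I_j
        if zero:
            # Special case: the equal split among I_j is an assumption; the
            # text only requires these memberships to sum to 1.
            for k in zero:
                new_U[k][j] = 1.0 / len(zero)
        else:
            p = [rho[k][j] ** (1.0 / (1.0 - m)) for k in range(c)]
            s = sum(p)
            for k in range(c):
                new_U[k][j] = p[k] / s
    return new_U

def gaussian_prototype(X, u_k, sigma, m=2.0, tol=1e-10, max_iter=200):
    """Iterate (8) for one cluster, starting from the weighted mean (assumed start)."""
    w = [u ** m for u in u_k]
    dim = len(X[0])
    v = [sum(wj * x[d] for wj, x in zip(w, X)) / sum(w) for d in range(dim)]
    for _ in range(max_iter):
        g = [wj * gaussian_kernel(x, v, sigma) for wj, x in zip(w, X)]
        v_new = [sum(gj * x[d] for gj, x in zip(g, X)) / sum(g) for d in range(dim)]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v

# Toy data (hypothetical): two well separated pairs of points.
X = [(0.0, 0.0), (0.4, 0.0), (5.0, 5.0), (5.4, 5.0)]
sigma = 2.0
K = [[gaussian_kernel(a, b, sigma) for b in X] for a in X]
U = [[0.6, 0.6, 0.4, 0.4],
     [0.4, 0.4, 0.6, 0.6]]
for _ in range(20):
    U = update_memberships(U, K)
V = [gaussian_prototype(X, u_k, sigma) for u_k in U]
```

On this toy set the memberships sharpen toward the two pairs and each prototype settles near the midpoint of its pair. With a small $\sigma$ relative to the spread of a cluster, (8) can have several fixed points, so the starting point matters.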
2.2 Picard Iteration in the MKFCM Algorithm

It should be noted that the solutions given by (5), (8), and (9) are implicit: the unknowns appear on both sides, so they must be computed iteratively. The Picard iteration is adopted in the MKFCM algorithm, which includes the following steps:

Step 1. Set the number of clusters c (1