SCIENCE CHINA Information Sciences
RESEARCH PAPER
July 2015, Vol. 58 072104:1–072104:15 doi: 10.1007/s11432-014-5267-5
A probabilistic framework for optimizing projected clusters with categorical attributes

CHEN LiFei

School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117, China

Received September 30, 2014; accepted November 17, 2014; published online January 30, 2015
Abstract  The ability to discover projected clusters in high-dimensional data is essential for many machine learning applications. Projective clustering of categorical data is currently a challenge due to the difficulties in learning adaptive weights for categorical attributes in coordination with cluster optimization. In this paper, a probability-based learning framework is proposed, which allows both the attribute weights and the center-based clusters to be optimized by kernel density estimation on categorical attributes. A novel algorithm is then derived for projective clustering of categorical data, based on the new learning approach to the kernel bandwidth selection problem. We show that the attribute weight is closely connected to the kernel bandwidth, while the optimized cluster center corresponds to a normalized frequency estimator of the categorical attributes. Experimental results on synthetic and real-world data show the outstanding performance of the proposed method, which significantly outperforms state-of-the-art algorithms.

Keywords  projective clustering, projected cluster, categorical data, probabilistic framework, kernel density estimation, attribute weighting
Citation Chen L F. A probabilistic framework for optimizing projected clusters with categorical attributes. Sci China Inf Sci, 2015, 58: 072104(15), doi: 10.1007/s11432-014-5267-5
1 Introduction
Projective clustering is an unsupervised learning technique aimed at grouping data objects into clusters projected in some subspaces. Because of the built-in mechanism for feature selection, projective clustering has sparked wide interest in many real-world applications [1–3]. For example, in botany applications, such a method can be used to identify the different plant species (clusters) and to assess how interesting they are according to the plant characteristics spanning the projected subspaces.

Technically, projective clustering is performed by embedding an automated attribute-weighting procedure in the clustering process [2,4,5]. In this paper, we are interested in the soft projective clustering method, which assigns each attribute a continuous weight, indicating to what extent the attribute is relevant to the cluster.

Owing to its strength in cluster representation and its computational efficiency, K-means-type clustering is one of the mainstream methods for soft projective clustering. Examples include MPC [3], FWKM [4] and many others. Note that they are inherently designed for numeric data; therefore, it is difficult to use them directly for categorical data, due to the substantial difference between the two data types.
The challenges arise due to the difficulties in the formulation of a cluster center for categorical clusters and in the adaptive attribute-weighting scheme used to measure the contribution of categorical attributes in forming the clusters, during an unsupervised learning process. Since the set mean1) of a categorical data set is undefined [6,7], most of the existing methods resort to the mode [8] for representing the "center" of a categorical cluster. Accordingly, the attribute weights are computed based on the mode category [9–11]. From a statistical perspective, such a "center" is inadequate to represent the data objects in a cluster; consequently, the weights usually yield a biased indication for the attributes because the non-mode categories are eliminated from the learning process.

Recently, a few alternative solutions have been suggested. In K-centers [12], a statistical center is defined and the attribute weights are computed in terms of the deviation of categories, in order to maximize the entropy of the weight distribution. The weighting scheme of [13] is based on the complement entropy; however, the mode is still used for the center. It can be seen that both methods optimize the weights and the cluster centers by different mechanisms. Since the two issues are closely related to each other, ideally, they should be addressed in a consistent way. In fact, virtually all of the existing algorithms designed for numeric data weight the attributes according to the variance [2,4], which is defined precisely on the mean (the cluster center for numeric data). However, such a measure is currently undefined for categorical data.

In this paper, we solve these problems in a unified framework by kernel density estimation (KDE) for categorical data. Our first contribution is the proposal of the probabilistic framework, in which the attribute weight is formulated as being inversely related to the kernel bandwidth of a categorical attribute, while the cluster center corresponds to a smoothed, normalized frequency estimator for the categories. The second contribution is the derivation of a KDE-based algorithm built on the probabilistic framework, called KPC, for K-means-type projective clustering of categorical data. The algorithm is defined on the weighted distance between categorical objects and the cluster center, which in fact measures the dissimilarity between their probability distributions. A series of experiments on synthetic and real-world data sets is conducted to support the conclusions.

The remainder of this paper is organized as follows: Section 2 describes some preliminaries and related work. Section 3 defines the probabilistic framework. In Section 4, the new clustering algorithm is presented. Experimental results are presented in Section 5. Section 6 gives our conclusion.
2 Preliminaries and related work
In this section, some preliminaries for categorical data and related work on projective clustering are described. We will begin by introducing the notation used throughout the paper.

2.1 Basic notation and preliminaries
The data set to be clustered is denoted by $DB = \{x_1, x_2, \ldots, x_N\}$. Here $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ for $i = 1, 2, \ldots, N$ are data objects. For the $d$th attribute, where $d = 1, 2, \ldots, D$, we denote the set of categories by $O_d$; i.e., attribute $d$ takes $c_d = |O_d|$ discrete values. The $l$th category in $O_d$ is denoted by $o_{dl} \in O_d$.

Unlike the numeric case, computation of the distance between $x_i$ and $x_j$ cannot be done straightforwardly, because each attribute of $x_i$ or $x_j$ can only take a discrete value. The commonly used measures for categorical data objects include the chi-square distance [11] and the simple matching coefficient (SMC) distance [14], given by $\sum_{d=1}^{D}\mathrm{SMC}(x_{id}, x_{jd})$ with
$$\mathrm{SMC}(x_{id}, x_{jd}) = I(x_{id} \neq x_{jd}), \tag{1}$$
where $I(\cdot)$ is an indicator function with $I(\mathrm{true}) = 1$ and $I(\mathrm{false}) = 0$.

1) In this paper, the term set mean is used to refer to the central value of a finite set of data objects. For a numeric data set, the set mean is precisely the average value of the numbers in the set. As the objects in a categorical data set are encoded in discrete symbols, the concept of set mean is meaningless.
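To make the distance concrete, here is a minimal sketch of the SMC distance in (1); it assumes objects are given as equal-length sequences of category labels, and the function name is illustrative.

```python
import numpy as np

def smc_distance(xi, xj):
    """SMC distance of Eq. (1): the number of attributes on which two
    categorical objects take different values."""
    xi, xj = np.asarray(xi, dtype=object), np.asarray(xj, dtype=object)
    return int(np.sum(xi != xj))

# toy usage: two objects described by four categorical attributes
print(smc_distance(("a", "b", "c", "d"), ("a", "x", "c", "y")))  # -> 2
```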
Given $DB$ and the number of clusters $K$, a hard clustering algorithm aims to group $DB$ into $K$ disjoint subsets, denoted by $\pi_1, \ldots, \pi_k, \ldots, \pi_K$, such that $DB = \cup_{k=1}^{K}\pi_k$ and $N = \sum_{k=1}^{K} n_k$, where $n_k$ is the number of data objects in $\pi_k$. We denote the set of $K$ subsets by $\Pi = \{\pi_k \mid k = 1, 2, \ldots, K\}$.

Moreover, a projected cluster is an ensemble of data objects, each of which is associated with a subset of attributes spanning the projected subspace where the cluster exists. The goal of a projective clustering algorithm is to find such projected clusters in the given data set, typically based on attribute weighting [15]. Formally, the cluster $\pi_k$ is associated with a weight vector $w_k = \langle w_{k1}, w_{k2}, \ldots, w_{kD}\rangle$, satisfying
$$\begin{cases} \sum_{d=1}^{D} w_{kd} = 1, & k = 1, 2, \ldots, K,\\ 0 \leqslant w_{kd} \leqslant 1, & k = 1, 2, \ldots, K;\ d = 1, 2, \ldots, D. \end{cases} \tag{2}$$
Here the weight $w_{kd}$ is defined to measure the relevance of the $d$th attribute to $\pi_k$. The greater the relevance, the larger the weight. The set of attribute weights is denoted by $W = \{w_{kd}\}_{K\times D}$. Based on these preliminaries, we define a projected cluster of $DB$ as follows:

Definition 1 (Projected Cluster). The $k$th projected cluster of $DB$ is denoted by $\mathrm{PC}_k = (\pi_k, w_k)$, where $k = 1, 2, \ldots, K$.

2.2 Projective clustering of categorical data
Based on the way the weights are determined, projective clustering algorithms can be divided into two groups: hard projective clustering and soft projective clustering algorithms.

In the first group, the attributes of each cluster are assigned weighting values of either 0 or 1, i.e., $\forall k, d: w_{kd} \in \{0, 1\}$. SUBCAD [16] is one of the representatives in this group, aimed at seeking a suboptimal solution of the clustering objective function, which combines compactness and separation measures for both object relocation and subspace determination. A Monte-Carlo type optimization method is used in this algorithm, and thus a large number of iterations are needed for cluster optimization and subspace optimization, which results in a high time complexity.

The algorithms in the second group assign weights in the range [0, 1], which are typically optimized in a K-means-type clustering process. Currently, such methods have become popular due to their strengths in geometrical interpretability and clustering efficiency, where each cluster is represented by its center. To perform K-means-type projective clustering on categorical data, the following two issues must be addressed: the difficulty of defining the center for categorical clusters, and the need to weight the categorical attributes according to the statistics of the attributes.

The first issue arises because, as discussed previously, the concept of set mean is undefined for categorical data [6,12]. In K-modes [8] and its numerous variants, the cluster center is defined by the mode category of each attribute. A few extensions to the mode include the frequency center [7] and the weighted frequency center [17]. However, they do not perform automated attribute-weighting in their clustering process.

The performance of a K-means-type projective clustering algorithm also depends on the attribute-weighting scheme used to automatically identify the individual relevance of attributes (the second issue mentioned above). Existing weighting schemes fall into two groups. The algorithms in the first group, such as [9,10], compute the weights according to the frequency of the mode category on each attribute. For example, in [9], the weight of the $d$th attribute of $\pi_k$ is calculated as
$$w_{kd} \propto \Big(\sum_{x\in\pi_k}\big(1 - \mathrm{SMC}(x_d, \mathrm{mode}_{kd})\big)\Big)^{-\frac{1}{\beta-1}} \propto \Big(\frac{1}{1 - f_k(\mathrm{mode}_{kd})}\Big)^{\frac{1}{\beta-1}}$$
with $\mathrm{mode}_{kd}$ being the mode category on the attribute, $\beta > 1$ the weighting exponent and $f_k(\cdot)$ the frequency estimator defined by
$$f_k(o_{dl}) = \frac{1}{n_k}\sum_{x\in\pi_k} I(x_d = o_{dl}). \tag{3}$$
Thus, in these algorithms, the non-mode categories are ignored in the weighting process. The algorithms in the second group, including the recently published K-centers [12] and the complement-entropy-weighting K-modes (CWKM) [13], assign the weights based on the overall distribution of the categories, rather than the mode category alone. For example, CWKM [13] computes the weight $w_{kd}$ as
$$w_{kd} \propto \exp\Big(-\sum_{o_{dl}\in O_d} f_k(o_{dl})[1 - f_k(o_{dl})]\Big) = \exp(-s_{kd}^2),$$
where
$$s_{kd}^2 = 1 - \sum_{o_{dl}\in O_d}[f_k(o_{dl})]^2 \tag{4}$$
is a measure of the sample deviation of the attribute, called the Gini diversity index of a categorical attribute [6,18]. However, since the weights are optimized on the mode-based clusters in CWKM [13], the weighting scheme is, to some extent, independent of the cluster center optimization.

In this paper, we propose a novel method to optimize the non-mode cluster center and the adaptive attribute weights in a unified learning framework. The framework, described in the next section, is built on kernel density estimation for categorical attributes. The probability-based modeling process is adopted because of its additional advantages, which give the learning framework the capacity to describe the cluster structures in the data [3,19]. Moreover, with a probabilistic framework in which only numeric values are involved, computing the distance between categorical objects with well-defined measures also becomes feasible, which in turn facilitates K-means-type clustering on categorical data.
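As a small illustration of the quantities used above, the sketch below computes the frequency estimator of (3) and the Gini diversity index of (4) for one attribute within a cluster, together with the resulting CWKM-style (unnormalized) weight; all helper names are illustrative.

```python
from collections import Counter
import math

def category_frequencies(column):
    """Frequency estimator f_k(o_dl) of Eq. (3) for one attribute of a cluster."""
    n = len(column)
    return {cat: cnt / n for cat, cnt in Counter(column).items()}

def gini_index(column):
    """Gini diversity index s_kd^2 of Eq. (4): 1 minus the sum of squared frequencies."""
    return 1.0 - sum(f * f for f in category_frequencies(column).values())

# toy usage: one categorical attribute observed within a cluster
col = ["red", "red", "blue", "red", "green"]
print(category_frequencies(col))   # {'red': 0.6, 'blue': 0.2, 'green': 0.2}
print(gini_index(col))             # 1 - (0.36 + 0.04 + 0.04) = 0.56
print(math.exp(-gini_index(col)))  # unnormalized CWKM-style weight exp(-s^2)
```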
3 The probabilistic framework for projective clustering
The aim of this section is to propose a probability-based framework that allows the non-mode cluster center and the attribute weights of categorical attributes to be learned in a consistent way. For this purpose, we first represent each categorical object in a probability space; then, the cluster center and the attribute weight are formulated in the new space. We will begin by proposing the probability-based representation for categorical objects.

3.1 Probability-based representation for categorical objects
Traditionally, the probability density of a discrete random variable is measured by the probability mass function, based on frequency estimates of the categories that the variable takes. Such an estimator has the least sample bias; however, it may also have a large estimation variance for a finite-sample set [20,21]. We thus turn to the kernel smoothing method for the probability estimation. Letting $X_d$ be a random variable associated with the observations for attribute $d$ of $\pi_k$, we define its probability density on the kernel function $\kappa(X_d, o_{dl}, \lambda_{kd})$, with $o_{dl} \in O_d$ and $\lambda_{kd}$ being the smoothing parameter called the bandwidth. Here, a variation on Aitchison and Aitken's kernel [22] is used, given by
$$\kappa(X_d, o_{dl}, \lambda_{kd}) = \begin{cases} 1 - (c_d - 1)\lambda_{kd}, & X_d = o_{dl},\\ \lambda_{kd}, & X_d \neq o_{dl}, \end{cases} \tag{5}$$
with $\lambda_{kd} \in [0, 1/c_d]$ being the bandwidth. Note that (5) is a probability estimator for $X_d$, since it sums to 1 over all the categories in $O_d$, i.e., $\sum_{X_d\in O_d}\kappa(X_d, o_{dl}, \lambda_{kd}) = (c_d - 1)\lambda_{kd} + 1 - (c_d - 1)\lambda_{kd} = 1$.

Based on (5), we represent each data object $x_i = (x_{i1}, \ldots, x_{id}, \ldots, x_{iD}) \in \pi_k$ as $D$ probability distributions, each for the distribution of $X_d$ given the observation $x_{id}$. Denoting the $d$th distribution by $p(X_d \mid x_{id})$, we define
$$p(X_d \mid x_{id}) \stackrel{\mathrm{def}}{=} \frac{p(X_d, x_{id})}{p(x_{id})} = \frac{\kappa(X_d, x_{id}, \lambda_{kd})}{p(x_{id})}.$$
Here, $p(x_{id})$ can be viewed as a normalization factor and can thus be canceled out since $\sum_{o_{dl}\in O_d}\kappa(X_d, o_{dl}, \lambda_{kd}) = 1$. As such, it can be obtained that
$$p(X_d \mid x_{id}) = \lambda_{kd} + (1 - c_d\lambda_{kd})I(X_d = x_{id}). \tag{6}$$
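As an illustration, the sketch below evaluates the smoothed representation (6) of an observed category; the function name and the toy bandwidth are illustrative.

```python
def kernel_distribution(x_id, categories, lam):
    """p(X_d | x_id) of Eq. (6): a smoothed one-hot distribution over the
    categories of attribute d, given observation x_id and bandwidth lam,
    with lam in [0, 1/c_d]."""
    c_d = len(categories)
    assert 0.0 <= lam <= 1.0 / c_d
    p = {o: lam + (1.0 - c_d * lam) * (o == x_id) for o in categories}
    assert abs(sum(p.values()) - 1.0) < 1e-12   # the estimator sums to 1
    return p

# toy usage: an attribute with categories {a, b, c} observed as "b"
print(kernel_distribution("b", ["a", "b", "c"], lam=0.1))
# approximately {'a': 0.1, 'b': 0.8, 'c': 0.1}
```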
3.2 Formulation of the attribute weight
Using the new representation, the distance between two objects $x_i$ and $x_j$ can be numerically computed by measuring the dissimilarity of their probability distributions $p(X_d \mid x_{id})$ and $p(X_d \mid x_{jd})$ on each attribute $d$. Such measures include the well-known Hellinger distance, which is given by $\sqrt{1 - \mathrm{BC}(p_1, p_2)}$ with $\mathrm{BC}(p_1, p_2)$ being the Bhattacharyya coefficient of the discrete probability distributions $p_1$ and $p_2$. For our case, the coefficient is $\mathrm{BC}(p_1, p_2) = \sum_{o_{dl}\in O_d}\sqrt{p(o_{dl}\mid x_{id})\,p(o_{dl}\mid x_{jd})}$; then, the squared Hellinger distance between $x_{id}$ and $x_{jd}$ can be derived:
$$H(x_{id}, x_{jd}) = 1 - \sum_{o_{dl}\in O_d}\sqrt{p(o_{dl}\mid x_{id})\,p(o_{dl}\mid x_{jd})} = \frac{1}{2}\sum_{o_{dl}\in O_d}\Big(\sqrt{p(o_{dl}\mid x_{id})} - \sqrt{p(o_{dl}\mid x_{jd})}\Big)^2.$$
Furthermore, according to (6), we have
$$\sqrt{p(o_{dl}\mid x_{id})} = \sqrt{\lambda_{kd}} + \Big(\sqrt{1 - (c_d - 1)\lambda_{kd}} - \sqrt{\lambda_{kd}}\Big)I(o_{dl} = x_{id}). \tag{7}$$
Thus, based on the fact that $\sum_{o_{dl}\in O_d}[I(o_{dl} = x_{id}) - I(o_{dl} = x_{jd})]^2 = 2I(x_{id} \neq x_{jd})$, the squared distance of $x_i, x_j \in \pi_k$ is measured as
$$\mathrm{Dist}_k(x_i, x_j) = \sum_{d=1}^{D} H(x_{id}, x_{jd}) = \sum_{d=1}^{D} w_{kd}\times I(x_{id} \neq x_{jd})$$
with
$$w_{kd} = \Big(\sqrt{1 - (c_d - 1)\lambda_{kd}} - \sqrt{\lambda_{kd}}\Big)^2. \tag{8}$$
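The identity underlying (8) can be checked numerically: the squared Hellinger distance between the smoothed representations (6) of two observed categories equals $w_{kd}\,I(x_{id}\neq x_{jd})$. The sketch below reuses the kernel_distribution helper from the sketch after (6); all names are illustrative.

```python
import math

def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions
    (dicts over the same categories)."""
    return 0.5 * sum((math.sqrt(p[o]) - math.sqrt(q[o])) ** 2 for o in p)

def attribute_weight(c_d, lam):
    """w_kd of Eq. (8), determined by the bandwidth lam and the number of
    categories c_d."""
    return (math.sqrt(1.0 - (c_d - 1) * lam) - math.sqrt(lam)) ** 2

categories, lam = ["a", "b", "c"], 0.1
p_b = kernel_distribution("b", categories, lam)   # helper from the sketch after (6)
p_c = kernel_distribution("c", categories, lam)
print(squared_hellinger(p_b, p_c))                # approximately 0.3343
print(attribute_weight(len(categories), lam))     # same value, as Eq. (8) predicts
print(squared_hellinger(p_b, p_b))                # 0.0 when the categories match
```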
It is important to remark that $\mathrm{Dist}_k(\cdot, \cdot)$ is a weighted version of the SMC distance defined in (1), which has been widely used in the literature to measure the distance between categorical objects [8–10]. According to this view, $w_{kd}$ here serves as an attribute weight assigned to attribute $d$ of $\pi_k$. Note that $w_{kd} \in [0, 1]$ since $\lambda_{kd} \in [0, 1/c_d]$. The smaller the bandwidth $\lambda_{kd}$, the larger the weighting value $w_{kd}$.

3.3 Formulation of the cluster center
As the set mean is undefined for categorical data, we formulate the cluster center by probability distributions, in the same way as the data objects in the cluster are represented in (6) and subsequently (7). Denoting the cluster center of $\pi_k$ on attribute $d$ by $m_{kd} \stackrel{\mathrm{def}}{=} p(X_d \mid m_{kd})$ and referring to (7), we formulate the center as
$$\sqrt{p(X_d \mid m_{kd})} = \sqrt{\lambda_{kd}} + \Big(\sqrt{1 - (c_d - 1)\lambda_{kd}} - \sqrt{\lambda_{kd}}\Big)v_k(X_d) \tag{9}$$
subject to
$$\sum_{o_{dl}\in O_d} p(o_{dl}\mid m_{kd}) = 1 \tag{10}$$
with $v_k(X_d) \geqslant 0$ defining the $c_d$ values of the center associated with each category in $O_d$. The formal definition of the cluster center of $\pi_k$ is given in the following Definition 2.

Definition 2. The cluster center of $\pi_k$ is the vector $m_k = \langle m_{k1}, \ldots, m_{kd}, \ldots, m_{kD}\rangle$ that minimizes the objective function
$$\mathrm{OBJ}(m_k) = \sum_{x_i\in\pi_k}\sum_{d=1}^{D} H(x_{id}, m_{kd}),$$
subject to (10) for $d = 1, 2, \ldots, D$.

Replacing $p(X_d \mid x_{id})$ and $p(X_d \mid m_{kd})$ in $H(x_{id}, m_{kd})$ according to (7) and (9), respectively, the objective function in Definition 2 becomes
$$\mathrm{OBJ}_1(m_k) = \sum_{x_i\in\pi_k}\sum_{d=1}^{D} w_{kd}\sum_{o_{dl}\in O_d}[I(x_{id} = o_{dl}) - v_k(o_{dl})]^2 + \sum_{d=1}^{D}\xi_d\Big[\sum_{o_{dl}\in O_d} p(o_{dl}\mid m_{kd}) - 1\Big],$$
where $\xi_d$ is the Lagrange multiplier enforcing the constraints defined in (10). Thus, for a given data set $\pi_k$, its optimal cluster center, denoted by $\tilde{m}_k = \langle \tilde{m}_{k1}, \ldots, \tilde{m}_{kd}, \ldots, \tilde{m}_{kD}\rangle$, can be solved by setting the gradients with respect to $v_k(o_{dl})$ and $\xi_d$ for $\forall o_{dl}\in O_d$ and $\forall d$ to zero, yielding
$$\tilde{m}_{kd} \stackrel{\mathrm{def}}{=} p(X_d \mid \tilde{m}_{kd}) = \frac{[u_k(X_d)]^2}{\sum_{o_{dl}\in O_d}[u_k(o_{dl})]^2} \tag{11}$$
with
$$u_k(o_{dl}) = \sqrt{\lambda_{kd}} + \Big(\sqrt{1 - (c_d - 1)\lambda_{kd}} - \sqrt{\lambda_{kd}}\Big)f_k(o_{dl}),$$
where $f_k(o_{dl})$ is the frequency estimator of $o_{dl}$ with regard to $\pi_k$, as shown in (3).

We remark that the cluster center (11) is a smoothed, normalized frequency estimator for the categories distributed on each attribute. In fact, when the bandwidth $\lambda_{kd}\to 0$, the center becomes $p(X_d \mid \tilde{m}_{kd}) = [f_k(X_d)]^2/\sum_{o_{dl}\in O_d}[f_k(o_{dl})]^2$, which is precisely the normalized squared frequencies of the categories on attribute $d$. According to this view, the center defined in [7] as well as [12] is an unnormalized implementation of (11).
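Below is a minimal sketch of the optimized center (11) for a single attribute, built from the frequency estimator (3) and the bandwidth; with λ → 0 it reduces to the normalized squared frequencies, as remarked above. The function name is illustrative.

```python
import numpy as np
from collections import Counter

def cluster_center_attribute(column, categories, lam):
    """p(X_d | m_kd) of Eq. (11): u_k(o) = sqrt(lam) + (sqrt(1-(c_d-1)lam) - sqrt(lam)) f_k(o),
    normalized so that the squared u_k values sum to one."""
    n, c_d = len(column), len(categories)
    counts = Counter(column)
    f = np.array([counts.get(o, 0) / n for o in categories])              # Eq. (3)
    u = np.sqrt(lam) + (np.sqrt(1 - (c_d - 1) * lam) - np.sqrt(lam)) * f
    return dict(zip(categories, u ** 2 / np.sum(u ** 2)))                 # Eq. (11)

# toy usage: with lam = 0 the center is exactly the normalized squared frequencies
col = list("aaabbc")
print(cluster_center_attribute(col, ["a", "b", "c"], lam=0.0))
# f = (1/2, 1/3, 1/6) -> approximately (0.643, 0.286, 0.071)
```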
4 Projective clustering of categorical data
In this section, a K-means-type clustering algorithm, named KPC (Kernel-based Projective Clustering), is proposed to discover $K$ projected clusters from the given data set $DB$, based on the probability-based learning framework presented in the previous section. We will begin by proposing a data-driven method for the kernel bandwidth selection problem, which in effect amounts to attribute-weight optimization.

4.1 Kernel bandwidth optimization
By the formulation of the attribute weights in (8), the weights are dependent on the kernel bandwidths. Thus the problem of assigning a set of attribute weights to the categorical attributes of $\pi_k$ can be transformed into the new problem of optimizing $\lambda_{kd}$ for each attribute $d$ of $\pi_k$. The latter is exactly the bandwidth selection problem in a KDE method [21]. In this subsection, we solve it by a data-driven method, aimed at learning an optimal bandwidth that minimizes the mean squared error (MSE) of the kernel estimate.

Letting $p(o_{dl})$ be the (unknown) population probability of $o_{dl}\in O_d$ with regard to the data subset $\pi_k$, and $\hat{p}(o_{dl}\mid\lambda_{kd})$ be the kernel estimator of $p(o_{dl})$ based on (5), we have
$$\hat{p}(o_{dl}\mid\lambda_{kd}) = \frac{1}{n_k}\sum_{x\in\pi_k}\kappa(o_{dl}, x_d, \lambda_{kd}) = \lambda_{kd} + (1 - c_d\lambda_{kd})f_k(o_{dl}) \tag{12}$$
with $f_k(o_{dl})$ being the frequency estimator defined in (3). The MSE of the estimate is then computed as $\mathrm{MSE}(o_{dl}, \lambda_{kd}) = E[(\hat{p}(o_{dl}\mid\lambda_{kd}) - p(o_{dl}))^2]$, where $E[\cdot]$ denotes the expectation of a random variable. The resulting errors for all $o_{dl}\in O_d$ are accumulated to derive the following objective function that needs to be minimized for bandwidth optimization:
$$\Phi(\lambda_{kd}) = \sum_{o_{dl}\in O_d}\mathrm{MSE}(o_{dl}, \lambda_{kd}). \tag{13}$$
Theorem 1. For attribute $d$ of $\pi_k$, the optimal bandwidth $\lambda_{kd}$ that minimizes (13) is
$$\lambda_{kd} = \frac{\sigma_{kd}^2}{(c_d - 1)n_k - c_d(n_k - 1)\sigma_{kd}^2} \tag{14}$$
with
$$\sigma_{kd}^2 = 1 - \sum_{o_{dl}\in O_d}[p(o_{dl})]^2$$
being the squared standard deviation of the attribute.

Proof. First, we state two preliminary consequences without proof (see [21] for details): $E[f_k(o_{dl})] = p(o_{dl})$ and $\mathrm{var}[f_k(o_{dl})] = \frac{1}{n_k}p(o_{dl})(1 - p(o_{dl}))$, where $\mathrm{var}[\cdot]$ denotes the variance of a random variable. Then, using the identity $\mathrm{var}[\cdot] = E[(\cdot)^2] - (E[\cdot])^2$, we have $\mathrm{MSE}(o_{dl}, \lambda_{kd}) = E[\hat{p}(o_{dl}\mid\lambda_{kd})^2] - (E[\hat{p}(o_{dl}\mid\lambda_{kd})])^2 + \lambda_{kd}^2(1 - c_d p(o_{dl}))^2$. Substituting $\hat{p}(o_{dl}\mid\lambda_{kd})$ according to (12) and making use of the two consequences, (13) becomes $\Phi(\lambda_{kd}) = \big[c_d^2\lambda_{kd}^2 - (1 - c_d\lambda_{kd})^2\frac{1}{n_k}\big]\sum_{o_{dl}\in O_d}[p(o_{dl})]^2 + (1 - c_d\lambda_{kd})^2\frac{1}{n_k} - c_d\lambda_{kd}^2$. Setting the gradient with respect to $\lambda_{kd}$ to 0 yields $(c_d - 1)n_k\lambda_{kd} - c_d n_k\lambda_{kd}\sigma_{kd}^2 = (1 - c_d\lambda_{kd})\sigma_{kd}^2$. Then, Eq. (14) follows.
Algorithm 1 The bandwidth-learning algorithm BLA
Input: $\Pi = \{\pi_1, \ldots, \pi_k, \ldots, \pi_K\}$
Output: $\Lambda = \{\lambda_{kd} \mid k = 1, 2, \ldots, K;\ d = 1, 2, \ldots, D\}$
for $k = 1$ to $K$ and $d = 1$ to $D$ do
    Compute $s_{kd}^2$ using (4);
    Compute $\lambda_{kd}$ according to (14) with $\sigma_{kd}^2$ replaced by $s_{kd}^2$.
end for
However, $\sigma_{kd}^2$ is uncomputable in practice because $p(o_{dl})$ remains unknown. In the plug-in method developed for numeric data estimation [21], this problem is addressed by instead using the sample standard deviation. Following this method, we estimate $\sigma_{kd}^2$ by the Gini diversity index $s_{kd}^2$ as defined in (4), which in fact is obtained by replacing $p(o_{dl})$ in $\sigma_{kd}^2$ with the frequency estimator $f_k(o_{dl})$. The BLA algorithm, as outlined in Algorithm 1, aims at learning a set of bandwidths $\Lambda = \{\lambda_{kd} \mid k = 1, 2, \ldots, K;\ d = 1, 2, \ldots, D\}$ for the $K$ data subsets in $\Pi$. Once the bandwidths are learned by calling $\mathrm{BLA}(\Pi)$, the set of attribute weights $W$ can be obtained by computing $w_{kd}$ using (8) and the known $\lambda_{kd}$.

Here are some comments on the resulting bandwidths and attribute weights: (i) An attribute with low dispersion of the category distribution has a large attribute weight. In the extreme case where the attribute takes a unique categorical value, $s_{kd}^2 = 0$, so that the weight reaches its maximum, i.e., $w_{kd} = 1$; in the opposite case where the categories are uniformly distributed, $w_{kd} = 0$, since in this case $s_{kd}^2 = 1 - 1/c_d$. (ii) The bandwidth $\lambda_{kd}\to 0$ as $n_k\to\infty$. This is consistent with the asymptotic properties that a kernel density estimate should satisfy [23]. Moreover, the rate of convergence of the bandwidth can be given as $\lambda_{kd} = O_p(n_k^{-1})$, which is the same as that obtained by the leave-one-out cross-validation method [21].
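For one attribute of one cluster, the sketch below carries out the BLA step and the subsequent weight computation: the Gini index (4) is plugged into (14) to obtain the bandwidth, and (8) turns it into a weight. The final grid search is only a numerical sanity check of Theorem 1 (with $\sum_o p(o)^2$ replaced by $1 - s^2$), not part of the algorithm; all names are illustrative.

```python
import numpy as np
from collections import Counter

def gini_index(column):
    """Gini diversity index s_kd^2 of Eq. (4)."""
    n = len(column)
    f = np.array([c / n for c in Counter(column).values()])
    return 1.0 - np.sum(f ** 2)

def bla_bandwidth(column, c_d):
    """Plug-in bandwidth of Eq. (14) with sigma^2 estimated by s^2 (Algorithm 1)."""
    n, s2 = len(column), gini_index(column)
    return s2 / ((c_d - 1) * n - c_d * (n - 1) * s2)

def attribute_weight(c_d, lam):
    """w_kd of Eq. (8)."""
    return (np.sqrt(1.0 - (c_d - 1) * lam) - np.sqrt(lam)) ** 2

col, c_d = list("aaabbbcc"), 3          # a toy attribute observed in one cluster
lam = bla_bandwidth(col, c_d)
print(lam, attribute_weight(c_d, lam))  # bandwidth and the weight it induces

# sanity check: the closed form (14) matches a grid search over Phi of Eq. (13)
n, s2 = len(col), gini_index(col)
grid = np.linspace(0.0, 1.0 / c_d, 100001)
phi = (c_d ** 2 * grid ** 2 - (1 - c_d * grid) ** 2 / n) * (1 - s2) \
      + (1 - c_d * grid) ** 2 / n - c_d * grid ** 2
print(grid[np.argmin(phi)])             # close to the closed-form bandwidth above
```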
4.2 Clustering algorithm
Given $DB$ and $K$, the goal of data clustering is generally to obtain an optimal set of $K$ clusters such that the scatter of each cluster is minimized [24,25]. In the well-known K-means designed for numeric data, the within-cluster scatter is measured by the summed distances of the data objects to their center. As the center of a categorical cluster has been formulated as $m_k$ in the probabilistic framework, the scatter of $\pi_k$ can be defined as $\sum_{x_i\in\pi_k}\sum_{d=1}^{D} H(x_{id}, m_{kd})$, which implicitly involves the weight vector $w_k$. To enforce the constraints on the weights (2), i.e., the weight vector associated with each cluster should have the same length, we assign each cluster $k$ a class-weight $1/\omega_k$ with
$$\omega_k = \sum_{d=1}^{D} w_{kd}.$$
In this way, the objective function to be minimized by the projective clustering algorithm is obtained:
$$J(\Pi, M) = \sum_{k=1}^{K}\frac{1}{\omega_k}\times\sum_{d=1}^{D}\sum_{x_i\in\pi_k} H(x_{id}, m_{kd}),$$
subject to the constraints for cluster centers (10), where $M = \{m_{kd}\mid k = 1, 2, \ldots, K;\ d = 1, 2, \ldots, D\}$ is the set of centers that needs to be optimized along with $\Pi$. Here the set of weights $W$ is not considered as a parameter because, given $\Pi$, the weights are determined by the bandwidths in the sense of (8).

The usual method of achieving the minimization of $J(\cdot, \cdot)$ is to use partial optimization for each parameter, i.e., to optimize $\Pi$ and $M$ in a sequential structure analogous to K-means clustering [25]. In each iteration, we first set $M = \hat{M}$ and solve $\Pi$ as $\hat{\Pi}$ to minimize $J(\Pi, \hat{M})$. Next, $\Pi = \hat{\Pi}$ is set and the optimal $M$, say $\hat{M}$, is solved to minimize $J(\hat{\Pi}, M)$. The resulting bandwidths are then used to generate the weight set $W$. The first problem can be solved by assigning each input $x_t$ to its most similar center $y$ according to
$$y = \mathop{\mathrm{argmin}}_{\forall k}\ \frac{1}{\omega_k}\sum_{d=1}^{D} H(x_{td}, \hat{m}_{kd}). \tag{15}$$
The second problem is solved in two consecutive steps. The first step is designed to learn the optimal bandwidths for each data subset $\hat{\pi}_k\in\hat{\Pi}$ by calling the BLA algorithm (see Algorithm 1). Then, in the second step, we solve the optimization problem according to the following Theorem 2.

Theorem 2. Set $\Pi = \hat{\Pi}$. $J(\hat{\Pi}, M)$ is minimized iff $\hat{m}_{kd} = \tilde{m}_{kd}$ as formulated in (11), for $k = 1, 2, \ldots, K$ and $d = 1, 2, \ldots, D$.

Proof. When $\Pi = \hat{\Pi}$ is fixed, the problem of minimizing $J(\hat{\Pi}, M)$ can be decomposed into $K$ independent optimization problems: $\min J_k(m_k) = \frac{1}{\omega_k}\sum_{x_i\in\pi_k}\sum_{d=1}^{D} H(x_{id}, m_{kd})$ subject to (10) for $k = 1, 2, \ldots, K$ and $d = 1, 2, \ldots, D$. As $\omega_k$ is irrelevant to $m_k$, it can be seen that $\mathop{\mathrm{argmin}}_{m_k} J_k(m_k) = \mathop{\mathrm{argmin}}_{m_k}\mathrm{OBJ}(m_k)$ subject to (10); see the definition of $\mathrm{OBJ}(\cdot)$ in Definition 2. Since $\tilde{m}_{kd}$ is precisely the optimal solution minimizing $\mathrm{OBJ}(\cdot)$, we have $\hat{m}_{kd} = \tilde{m}_{kd}$.

The KPC algorithm is outlined in Algorithm 2. In terms of algorithmic structure, KPC can be viewed as an extension of the K-means clustering algorithm [25], obtained by adding Step (1) to compute the attribute weights and Step (3) to learn the bandwidths for each attribute according to Theorem 1. Hence, like the K-means algorithm, KPC is able to converge in a finite number of iterations. The time complexity of KPC is $O(KND)$, which is the same as that of K-means.

Algorithm 2 The outline of the KPC algorithm
Input: $DB$ and $K$
Output: $K$ projected clusters $\{\mathrm{PC}_k = (\pi_k, w_k)\}_{k=1}^{K}$
Let $t$ be the number of iterations, $t = 1$; Set all the bandwidths of $\Lambda$ to 0;
Randomly choose $K$ objects as the initial centers, and compute the initial $M$ (denoted as $M^{(1)}$) using (11);
repeat
    (1) Obtain $W$ using (8) and $\Lambda$;
    (2) Letting $\hat{M} = M^{(t)}$, assign all the data objects according to (15) and obtain $\Pi^{(t+1)}$;
    (3) Letting $\hat{\Pi} = \Pi^{(t+1)}$, update the bandwidths by $\Lambda = \mathrm{BLA}(\hat{\Pi})$;
    (4) Update the cluster centers using (11) for $\forall k, d$ and $\forall o_{dl}\in O_d$, and obtain $M^{(t+1)}$;
    (5) $t = t + 1$;
until $\Pi^{(t-1)} = \Pi^{(t)}$
Output $\{(\pi_k, w_k^{(t)})\}_{k=1}^{K}$.
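To make the overall procedure concrete, here is a compact, illustrative Python sketch of a KPC-style loop following Algorithm 2: objects are assigned by (15), bandwidths are re-learned with BLA, and centers are updated with (11). It is a simplified reading of the algorithm (e.g., empty clusters are re-seeded with a random object), not the author's implementation; all names are illustrative.

```python
import numpy as np
from collections import Counter

def smoothed(x_val, cats, lam):
    """p(X_d | x_id) of Eq. (6) as a vector over the categories of attribute d."""
    c = len(cats)
    return np.array([lam + (1.0 - c * lam) * (o == x_val) for o in cats])

def center_attr(col, cats, lam):
    """p(X_d | m_kd) of Eq. (11) for one attribute of one cluster."""
    n, c = len(col), len(cats)
    f = np.array([Counter(col).get(o, 0) / n for o in cats])
    u = np.sqrt(lam) + (np.sqrt(1 - (c - 1) * lam) - np.sqrt(lam)) * f
    return u ** 2 / np.sum(u ** 2)

def bla(col, c):
    """Bandwidth of Eq. (14) with sigma^2 estimated by the Gini index (Algorithm 1)."""
    n = len(col)
    s2 = 1.0 - np.sum(np.array([v / n for v in Counter(col).values()]) ** 2)
    return s2 / ((c - 1) * n - c * (n - 1) * s2)

def hell2(p, q):
    """Squared Hellinger distance between two discrete distributions."""
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def kpc(X, K, max_iter=50, seed=0):
    """Illustrative KPC-style loop (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    X = [tuple(x) for x in X]
    N, D = len(X), len(X[0])
    cats = [sorted({x[d] for x in X}) for d in range(D)]
    lam = np.zeros((K, D))                       # all bandwidths start at 0
    init = rng.choice(N, size=K, replace=False)  # K random objects as initial centers
    centers = [[smoothed(X[i][d], cats[d], 0.0) for d in range(D)] for i in init]
    labels = -np.ones(N, dtype=int)
    for _ in range(max_iter):
        # step (1): attribute weights from the bandwidths, Eq. (8)
        w = np.array([[(np.sqrt(1 - (len(cats[d]) - 1) * lam[k, d]) -
                        np.sqrt(lam[k, d])) ** 2 for d in range(D)] for k in range(K)])
        omega = w.sum(axis=1) + 1e-12            # guard against all-uniform clusters
        # step (2): assignment by Eq. (15)
        new = np.array([int(np.argmin([
            sum(hell2(smoothed(x[d], cats[d], lam[k, d]), centers[k][d])
                for d in range(D)) / omega[k] for k in range(K)])) for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
        # steps (3)-(4): update bandwidths (BLA) and centers (Eq. (11))
        for k in range(K):
            members = [X[i] for i in range(N) if labels[i] == k]
            if not members:                      # re-seed an empty cluster
                members = [X[rng.integers(N)]]
            for d in range(D):
                col = [x[d] for x in members]
                lam[k, d] = bla(col, len(cats[d]))
                centers[k][d] = center_attr(col, cats[d], lam[k, d])
    return labels, w                             # cluster labels and attribute weights

# toy usage:
# labels, weights = kpc([("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")], K=2)
```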
5 Experimental evaluation
Below, we evaluate the performance of KPC on synthetic and real-world data sets. We also experimentally compare KPC with some mainstream clustering algorithms for categorical data.

5.1 Experimental setup and evaluation measures
For the comparison, three projective clustering algorithms for categorical data were chosen: the distance-based-weighting K-modes DWKM [9], the mixed-attributes-weighting K-modes MWKM [10] and the recently published complement-entropy-weighting K-modes CWKM [13]. Since the three algorithms are extensions of K-modes (KM) [8], KM was also used to provide a reference point for comparison. Additionally, because all five algorithms (including KPC) are K-means-type, we also used EBC [26], which performs non-central clustering on categorical data by Monte-Carlo optimization. In the experiments, the convergence of EBC was asserted if the value of the criterion was not reduced during 100 successive movements of the data objects. The weighting exponents $\beta$ of DWKM and MWKM were set to the author-recommended value 2. Both KPC and CWKM are parameter-free for automated attribute-weighting.

Two approaches are adopted to measure the quality of the clustering results. One is based on the use of category utility (CU), which evaluates the quality independently of object labeling, given by [10]
$$\mathrm{CU} = \frac{1}{K}\sum_{k=1}^{K}\frac{n_k}{N}\sum_{d=1}^{D}\sum_{o_{dl}\in O_d}\Big([f_k(o_{dl})]^2 - [f(o_{dl})]^2\Big).$$
Figure 1 Synthetic data set with two clusters (plotted with different markers for π1 and π2) in the three-dimensional subspaces of A1, A3, A4 and A1, A2, A3. The numeric values are binned to create the categories for each attribute. (a) The subspace of A1, A2; (b) the subspace of A1, A3; (c) the subspace of A1, A4; (d) the subspace of A2, A3; (e) the subspace of A2, A4; (f) the subspace of A3, A4.
Here, $f(o_{dl})$ is the frequency of $o_{dl}$ with regard to the entire data set. The other approach is a classification-based one, which depends on the "class" label of each cluster (the ground truth of the data sets). We have made use of the FScore, which is defined as follows [3]:
$$\mathrm{FScore} = \sum_{k=1}^{K}\frac{N_k}{N}\max_{1\leqslant l\leqslant K}\frac{2\times R(\mathrm{class}_k, \pi_l)\times P(\mathrm{class}_k, \pi_l)}{R(\mathrm{class}_k, \pi_l) + P(\mathrm{class}_k, \pi_l)},$$
where $R(\mathrm{class}_k, \pi_l)$ and $P(\mathrm{class}_k, \pi_l)$ are the recall and precision values of the $k$th class $\mathrm{class}_k$ with respect to the $l$th cluster $\pi_l$, and $N_k$ is the size of $\mathrm{class}_k$. $R(\mathrm{class}_k, \pi_l)$ is defined as $N_{kl}/N_k$, and $P(\mathrm{class}_k, \pi_l)$ is defined as $N_{kl}/N_l$, where $N_l$ is the size of $\pi_l$ and $N_{kl}$ the number of data points occurring in both class $\mathrm{class}_k$ and cluster $\pi_l$. The larger the values of both CU and FScore, the higher the clustering quality. In the tables reporting the average quality, we use the format average ± 1 standard deviation.
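For completeness, here is an illustrative sketch of the two measures, assuming objects are given as an N × D array of category labels and cluster/class labels as integer vectors; the function names are illustrative.

```python
import numpy as np
from collections import Counter

def category_utility(X, labels):
    """Category utility (CU) as displayed above: average over clusters of
    n_k/N times the gain in summed squared category frequencies."""
    X, labels = np.asarray(X, dtype=object), np.asarray(labels)
    N, D = X.shape
    global_sq = sum(sum((c / N) ** 2 for c in Counter(X[:, d]).values()) for d in range(D))
    cu, clusters = 0.0, sorted(set(labels.tolist()))
    for k in clusters:
        Xk = X[labels == k]
        nk = len(Xk)
        within_sq = sum(sum((c / nk) ** 2 for c in Counter(Xk[:, d]).values()) for d in range(D))
        cu += (nk / N) * (within_sq - global_sq)
    return cu / len(clusters)

def fscore(classes, labels):
    """FScore as displayed above: for each true class, the best F1 over all
    clusters, weighted by the class size."""
    classes, labels = np.asarray(classes), np.asarray(labels)
    N, total = len(classes), 0.0
    for cls in set(classes.tolist()):
        in_cls = classes == cls
        best = 0.0
        for l in set(labels.tolist()):
            in_l = labels == l
            n_kl = np.sum(in_cls & in_l)
            if n_kl > 0:
                r, p = n_kl / np.sum(in_cls), n_kl / np.sum(in_l)
                best = max(best, 2 * r * p / (r + p))
        total += np.sum(in_cls) / N * best
    return total
```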
5.2 Experiment on synthetic data
This set of experiments was designed to evaluate KPC on synthetic data, which are often used to validate a clustering algorithm [4]. In our experiments, the synthetic data set was constructed to simulate categorical clusters located in different subspaces.

5.2.1 Synthetic data generation method

The synthetic data set contains 300 objects, each consisting of four attributes, named A1, A2, A3 and A4, respectively. The objects are divided into two clusters π1 and π2, with the numbers of objects n1 = 130 and n2 = 170. The first cluster π1 is associated with the subspace A1, A3 and A4, while π2 exists in the subspace consisting of A1, A2 and A3. Figure 1 plots the two clusters in different two-dimensional subspaces. As the figure shows, we generated each categorical attribute by binning an originally numeric attribute. In the generation process for an irrelevant attribute (i.e., A2 for π1 and A4 for π2), the numeric coordinates of the objects were uniformly distributed in the range [0,1]. On a relevant attribute, they were generated according to a normal distribution with the mean randomly chosen from [0,1]. Then all the numeric coordinates were converted to categories using an equal-width binning method. The number of bins was set to 8, i.e., each attribute takes 8 categories, denoted by the integers in the range [1,8].
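The recipe above can be reproduced with a short script; the sketch below follows Subsection 5.2.1, with the standard deviation of the normal distributions chosen arbitrarily (it is not stated in the paper), so the resulting data only mimic, rather than replicate, the original set.

```python
import numpy as np

def make_synthetic(n1=130, n2=170, n_bins=8, seed=0):
    """Two clusters over four attributes: relevant attributes are normal around
    a random mean in [0,1], irrelevant ones (A2 for cluster 1, A4 for cluster 2)
    are uniform on [0,1]; values are equal-width binned into categories 1..n_bins.
    The scale 0.05 of the normals is an assumption."""
    rng = np.random.default_rng(seed)
    relevant = [{0, 2, 3}, {0, 1, 2}]          # A1,A3,A4 for pi_1; A1,A2,A3 for pi_2
    blocks = []
    for size, rel in zip((n1, n2), relevant):
        cols = []
        for d in range(4):
            if d in rel:
                cols.append(rng.normal(loc=rng.uniform(0, 1), scale=0.05, size=size))
            else:
                cols.append(rng.uniform(0, 1, size=size))
        blocks.append(np.column_stack(cols))
    numeric = np.vstack(blocks)
    # equal-width binning of each attribute into integer categories 1..n_bins
    lo, hi = numeric.min(axis=0), numeric.max(axis=0)
    cats = np.clip(((numeric - lo) / (hi - lo) * n_bins).astype(int) + 1, 1, n_bins)
    labels = np.array([0] * n1 + [1] * n2)
    return cats, labels

X, y = make_synthetic()
print(X.shape, np.bincount(y))   # (300, 4) [130 170]
```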
Table 1 Average FScore and category utility (CU) of different algorithms on the synthetic data. The highest results are marked in bold typeface.

               | KM          | DWKM        | MWKM        | CWKM        | EBC         | KPC
Average FScore | 0.694±0.079 | 0.736±0.095 | 0.773±0.119 | 0.780±0.116 | 0.893±0.119 | 0.937±0.109
Average CU     | 0.130±0.033 | 0.171±0.043 | 0.169±0.039 | 0.167±0.043 | 0.189±0.029 | 0.201±0.027
Table 2 Detailed clustering results of KPC on the synthetic data. The categories in the center (mk for k = 1, 2) with the highest probability are marked in bold typeface. The "Mean" is calculated by $\sum_{X_d\in O_d} X_d\times p(X_d\mid m_{kd})$, and "the true mean" is the mean computed on the numeric values which were converted to the categories in the data generation process.

Cluster | Attribute | Cluster center                                           | Mean (the true mean) | Weight | Bandwidth
π1      | A1        | ⟨0.010, 0.025, 0.117, 0.397, 0.383, 0.047, 0.010, 0.010⟩ | 4.35 (4.27)          | 0.271  | 0.004043
π1      | A2        | ⟨0.168, 0.086, 0.114, 0.091, 0.134, 0.139, 0.095, 0.174⟩ | 4.61 (4.65)          | 0.156  | 0.032379
π1      | A3        | ⟨0.009, 0.007, 0.011, 0.007, 0.031, 0.314, 0.550, 0.071⟩ | 6.54 (6.63)          | 0.278  | 0.002984
π1      | A4        | ⟨0.002, 0.070, 0.879, 0.042, 0.002, 0.002, 0.002, 0.002⟩ | 3.00 (2.94)          | 0.295  | 0.001096
π2      | A1        | ⟨0.034, 0.047, 0.159, 0.265, 0.224, 0.173, 0.071, 0.028⟩ | 4.54 (4.55)          | 0.259  | 0.008511
π2      | A2        | ⟨0.177, 0.777, 0.035, 0.002, 0.002, 0.002, 0.002, 0.002⟩ | 1.90 (1.82)          | 0.311  | 0.001098
π2      | A3        | ⟨0.083, 0.878, 0.028, 0.002, 0.003, 0.003, 0.002, 0.002⟩ | 1.99 (1.91)          | 0.314  | 0.000874
π2      | A4        | ⟨0.114, 0.128, 0.096, 0.150, 0.126, 0.131, 0.137, 0.117⟩ | 4.57 (4.67)          | 0.115  | 0.049368
5.2.2 Evaluation results

In applying the clustering algorithms to a given data set, the number of clusters K must be specified in advance. Generally, K can be estimated by a cluster validity method, such as [24,27]. However, the existing methods are typically designed for numeric data clustering. In fact, estimation of K for a categorical data set currently remains an unresolved problem, due to the nature of the discrete space where the categorical clusters exist. Therefore, in our experiments, the parameter K for all six algorithms is simply set to the true number of clusters in the data set. Clearly, such a method requires that the ground truth of the data set be known, which is the case in this set of experiments.

The synthetic data set was clustered by each algorithm 100 times, and the average performances are summarized in Table 1. From the table, we can see that KPC is significantly more accurate than all the competing algorithms. The three mode-based projective clustering algorithms DWKM, MWKM and CWKM achieve better clustering quality than the original KM algorithm, while EBC outperforms them in terms of FScore and CU. The results of EBC are largely due to the small number of data objects in the data set (300 objects in 2 classes). For such data, the Monte-Carlo optimization method can easily reach the optimal solution through the repeated movements of objects from one cluster to another. To further understand the reasons for this performance, we use the clustering results produced by each algorithm in the following analysis.

Table 2 shows the detailed results of KPC for its best resulting clusters. In KPC, the cluster center is represented by a set of probability distributions for the categories on each attribute; see (11). Thus, all the categories, not only the mode category, are taken into consideration in the cluster optimization. Actually, the mode on each attribute corresponds to the category that has the highest probability in the center. Such a probability-based representation also allows KPC to recover the "mean" of the ordinal attributes (the attributes in this synthetic data set are actually of ordinal type, see Subsection 5.2.1), as the column "Mean" in the table shows. The results indicate that KPC can approximate the true means of the attributes, which consequently enables KPC to generate high-quality clusters.

However, this is not the case for the competing algorithms, as shown in Table 3, which illustrates the cluster "center" in the best clustering results generated by the different algorithms except KPC. Since EBC is a partition-based algorithm, we calculate the category frequencies in the resulting clusters to represent their centers.
Table 3 Cluster centers yielded by different algorithms (except KPC) on the synthetic data. The category having the highest frequency in the frequency center of EBC is marked in bold typeface.

Cluster | Attribute | Frequency center (EBC)                                   | Mode (KM) | Mode (DWKM) | Mode (MWKM) | Mode (CWKM)
π1      | A1        | ⟨0.000, 0.038, 0.167, 0.371, 0.348, 0.076, 0.000, 0.000⟩ | 4         | 4           | 4           | 4
π1      | A2        | ⟨0.197, 0.068, 0.121, 0.068, 0.136, 0.144, 0.076, 0.189⟩ | 8         | 1           | 1           | 1
π1      | A3        | ⟨0.015, 0.030, 0.015, 0.000, 0.061, 0.318, 0.439, 0.121⟩ | 7         | 7           | 7           | 7
π1      | A4        | ⟨0.000, 0.167, 0.712, 0.121, 0.000, 0.000, 0.000, 0.000⟩ | 3         | 3           | 3           | 3
π2      | A1        | ⟨0.024, 0.048, 0.173, 0.244, 0.226, 0.190, 0.083, 0.012⟩ | 5         | 4           | 4           | 4
π2      | A2        | ⟨0.274, 0.631, 0.095, 0.000, 0.000, 0.000, 0.000, 0.000⟩ | 2         | 2           | 2           | 2
π2      | A3        | ⟨0.190, 0.685, 0.101, 0.000, 0.012, 0.012, 0.000, 0.000⟩ | 2         | 2           | 2           | 2
π2      | A4        | ⟨0.107, 0.137, 0.036, 0.179, 0.131, 0.143, 0.155, 0.113⟩ | 2         | 4           | 4           | 4
Figure 2 Attribute weights generated by different algorithms for the two clusters in the synthetic data. KPC yields obviously small weights for the noise attributes A2 (for π1) and A4 (for π2) compared with those for the other attributes in the same cluster. (a) π1; (b) π2.
It can be seen that such a frequency-based center encounters difficulty on noise attributes. For example, on A4 of π2, one can see an obvious difference in the frequencies of the different categories, which is an unexpected outcome because there is typically no dominating category on a noise attribute. For the same attribute, KPC obtains a large bandwidth approaching 0.05, with which the difference is smoothed to a certain extent, as Table 2 shows. The results in Table 1 show that the kernel learning method used in KPC is indeed useful for discovering clusters in subspaces. Table 3 also shows that KM tends to yield incorrect modes for the noise attributes (see the results for A2 of π1 and A4 of π2), while the three mode-based projective clustering algorithms DWKM, MWKM and CWKM obtain correct results by incorporating automated attribute-weighting approaches.

Figure 2 shows the attribute weights generated by DWKM, MWKM, CWKM and our KPC. MWKM computes two weights for each attribute, one of which is proportional to the mode frequency and thus is similar to that computed by DWKM. For ease of comparison, only this kind of weight is drawn in the figure for MWKM. From the figure, one can see the different behaviors of the algorithms in identifying the noise attributes. In fact, only KPC assigns distinguishingly small weights to them, indicating that A2 of π1 and A4 of π2 are less important in forming the clusters. As shown in Table 2, in KPC the weight is measured based on the kernel bandwidth, which essentially connects to the optimized cluster center (see (8)). Because KPC optimizes the center and the projected subspace for a categorical cluster in the unified probabilistic framework, more precise results can thus be obtained for categorical data involving noise attributes.

5.3 Experiment on real data
The second set of experiments was designed to evaluate KPC in real applications, in comparison with other mainstream algorithms. The comparison of performance was done on a number of real-world data sets, which include various numbers of attributes, data objects and clusters.
Table 4 Details of the real-world data sets

Data set     | #Attributes | True number of clusters | Size
Balance      | 4           | 3                       | 625
Car          | 6           | 4                       | 1728
Breastcancer | 9           | 2                       | 699
Vote         | 16          | 2                       | 435
Mushroom     | 21          | 2                       | 8124
Dermatology  | 33          | 6                       | 366
USCensus10k  | 68          | N/A                     | 10000
Coil2000     | 86          | N/A                     | 5822
KPC
KM
CWKM
MWKM
DWKM
EBC
Balance
0.104±0.016
0.082±0.005
0.087±0.004
0.088±0.006
0.091±0.012
0.004±0.001
Car
0.183±0.014
0.114±0.004
0.131±0.006
0.131±0.005
0.175±0.014
0.121±0.017
Breastcancer
0.589±0.106
0.421±0.179
0.423±0.154
0.434±0.169
0.285±0.053
0.554±0.032
Vote
1.471±0.000
1.447±0.005
1.459±0.000
1.458±0.002
1.251±0.268
1.384±0.058
Mushroom
0.806±0.096
0.689±0.151
0.728±0.147
0.739±0.129
0.362±0.227
0.500±0.119
Dermatology
0.765±0.028
0.642±0.104
0.737±0.055
0.675±0.096
0.400±0.116
0.658±0.041
USCensus5.000
2.970±0.218
2.872±0.198
2.893±0.377
2.904±0.066
0.861±1.206
2.233±0.333
Coil2000
0.670±0.013
0.533±0.037
0.560±0.033
0.548±0.033
0.009±0.019
0.481±0.048
5.3.1 Real data sets

The experiments were conducted on eight widely used UCI data sets (available at ftp.ics.uci.edu: pub/machine-learning-databases). Table 4 lists the details. We created the USCensus10k data set by randomly choosing 10,000 records from the original USCensus1990 database. Note that the true numbers of clusters in USCensus10k and Coil2000 are unknown. For the reason described in Subsection 5.2.2, we used the common settings [10], i.e., K = 2 and K = 3 for these two data sets, and set K to the true numbers of clusters shown in the table for the other data sets. For all the data sets, the missing value in each attribute was treated as a special category in our experiments. For example, in the Vote data set, the 4th attribute takes its value from {0, 1, 2}; however, 4 samples have missing values in this attribute. In this case, we inserted an additional category denoted "?" into the original set and obtained a new category set {0, 1, 2, ?}.

5.3.2 Experimental results

Each data set in Table 4 was clustered by each algorithm for 100 executions and the average performances are reported. For fair comparison, the same initial cluster centers were used for KPC and the four K-modes-type algorithms. Tables 5 and 6 illustrate the clustering results. According to the results, KPC is able to achieve high-quality overall results and outperforms the competing algorithms in most cases. Since the K-modes-type algorithms, including KM, CWKM, MWKM and DWKM, use the mode category to represent the cluster, they easily fall into local minima of the clustering objective [25]. EBC obtains high accuracy on the data sets having relatively low dimensionality (for example, the Balance and Car data sets); however, its performance drops when the dimensionality increases. The tables show similar results for the KM algorithm. This is because they lack an adaptive attribute-weighting scheme to distinguish the different contributions of the attributes to the clusters.

KPC owes its good performance to the optimization of the projected clusters in the unified kernel learning framework.
Table 6 Comparison of average clustering accuracies in terms of FScore. The highest FScore is marked in bold typeface. The USCensus10k and Coil2000 data sets are omitted because the ground truth of the object labels is unknown.

Data set     | KPC         | KM          | CWKM        | MWKM        | DWKM        | EBC
Balance      | 0.477±0.040 | 0.486±0.032 | 0.470±0.048 | 0.471±0.046 | 0.477±0.042 | 0.592±0.000
Car          | 0.464±0.067 | 0.437±0.039 | 0.427±0.036 | 0.425±0.038 | 0.499±0.065 | 0.481±0.055
Breastcancer | 0.954±0.061 | 0.842±0.136 | 0.856±0.113 | 0.855±0.134 | 0.768±0.034 | 0.943±0.017
Vote         | 0.881±0.000 | 0.863±0.007 | 0.868±0.004 | 0.870±0.005 | 0.826±0.077 | 0.861±0.017
Mushroom     | 0.779±0.143 | 0.725±0.129 | 0.746±0.138 | 0.745±0.138 | 0.662±0.059 | 0.658±0.060
Dermatology  | 0.750±0.104 | 0.626±0.096 | 0.728±0.090 | 0.660±0.094 | 0.577±0.088 | 0.638±0.066
Figure 3 Attribute weights generated by different algorithms for the two clusters in the Mushroom data set. KPC identifies that A10 (i.e., the 10th attribute, named "stalk-shape") is a noise attribute for both clusters. (a) KPC; (b) DWKM; (c) MWKM; (d) CWKM.
To show this, we take the Mushroom data set as an example, which contains 21 nominal attributes and 2 classes: Edible and Poisonous. Figure 3 illustrates the attribute weights generated by KPC, DWKM, MWKM and CWKM on this data set. We observe from the figure that DWKM, MWKM and CWKM tend to identify the relevant attributes for the clusters, while KPC puts more emphasis on identifying the noise attributes of each cluster. Note that identifying those noise attributes and subsequently eliminating them from the cluster construction are critical for a projective clustering algorithm [4,15]. According to the figure, KPC suggests that the 10th attribute is a noise attribute in the Mushroom data set. In fact, KPC optimizes the cluster center on this attribute as (0.420, 0.580) and (0.367, 0.633) for the two clusters, respectively. With the close probabilities of the two categories (say, "e" and "t") distributed on the attribute, KPC obtains two large bandwidths, 0.014519 and 0.004074, which in turn result in the obviously small weights assigned to the attribute, as Figure 3 shows.

To further examine the effectiveness of the attribute-weighting method used in KPC, we created a reduced data set from the original Mushroom data by removing A10. Figure 4 shows the change in the FScore values of the different algorithms on this data set. As expected, all the algorithms, including the non-central clustering algorithm EBC, obtain better clustering results to varying degrees when A10 is absent. This shows that KPC can identify more precisely the projected subspaces for categorical clusters, yielding its good performance in projective clustering for real applications.
Figure 4 Clustering accuracy of the algorithms on the Mushroom data set with various attribute sets (the original Mushroom data vs. the reduced Mushroom data with the 10th attribute removed).
6 Conclusion and perspectives
In this paper, we have proposed a probabilistic learning framework for soft projective clustering on categorical data sets. We derived a weighted distance measure and demonstrated that the attribute weights, designating the projected subspace for the clusters, depend on the kernel bandwidths of the categorical attributes. We proposed a data-driven method to learn the weights by bandwidth optimization for the kernel estimate. We also proposed a K-means-type clustering algorithm, called KPC, in which the cluster centers are formulated as kernel-smoothed frequency estimators. Experiments were conducted on a synthetic data set and eight real-world data sets, and the results showed the outstanding effectiveness of KPC compared with existing methods.

There are many directions that are clearly of interest for future exploration. One avenue of further study is to extend the method to general kernel functions, such that the differences between categorical values of the same attribute can be modeled in the learning framework. Our further efforts will also be directed toward developing techniques to build a robust initial condition for the algorithm.
Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No. 61175123) and Fujian Normal University Innovative Research Team (IRTL1207). The author is grateful to the anonymous reviewers for their helpful comments.
References
1 Aggarwal C C, Procopiuc C, Wolf J L, et al. Fast algorithm for projected clustering. ACM SIGMOD Rec, 1999, 28: 61–72
2 Moise G, Sander J, Ester M. Robust projected clustering. Knowl Inf Syst, 2008, 14: 273–298
3 Chen L, Jiang Q, Wang S. Model-based method for projective clustering. IEEE Trans Knowl Data Eng, 2012, 24: 1291–1305
4 Huang J Z, Ng M K, Rong H, et al. Automated variable weighting in k-means type clustering. IEEE Trans Patt Anal Mach Intell, 2005, 27: 657–668
5 Poon L, Zhang N, Chen T, et al. Variable selection in model-based clustering: to do or to facilitate. In: Proceedings of the 27th International Conference on Machine Learning, Haifa, 2010. 887–894
6 Light R J, Marglin B H. An analysis of variance for categorical data. J Am Stat Assoc, 1971, 66: 534–544
7 San O M, Huynh V N, Nakamori Y. An alternative extension of the k-means algorithm for clustering categorical data. Int J Appl Math Comput Sci, 2004, 14: 241–247
8 Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical value. Data Min Knowl Discov, 1998, 2: 283–304
9 Chan E Y, Ching W K, Ng M K, et al. An optimization algorithm for clustering using weighted dissimilarity measures. Patt Recogn, 2004, 37: 943–952
10 Bai L, Liang J, Dang C, et al. A novel attribute weighting algorithm for clustering high-dimensional categorical data. Patt Recogn, 2011, 44: 2843–2861
11 Xiong T, Wang S, Mayers A, et al. DHCC: divisive hierarchical clustering of categorical data. Data Min Knowl Discov, 2012, 24: 103–135
12 Chen L, Wang S. Central clustering of categorical data with automated feature weighting. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, 2013. 1260–1266
13 Cao F, Liang J, Li D, et al. A weighting k-modes algorithm for subspace clustering of categorical data. Neurocomputing, 2013, 108: 23–30
14 Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, Atlanta, 2008. 243–254
15 Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newslett, 2004, 6: 90–105
16 Gan G, Wu J. Subspace clustering for high dimensional categorical data. ACM SIGKDD Explor Newslett, 2004, 6: 87–94
17 Bai L, Liang J, Dang C, et al. The impact of cluster representatives on the convergence of the k-modes type clustering. IEEE Trans Patt Anal Mach Intell, 2013, 35: 1509–1522
18 Sen P K. Gini diversity index, Hamming distance and curse of dimensionality. Metron Int J Stat, 2005, LXIII: 329–349
19 Tao J, Chung F, Wang S. A kernel learning framework for domain adaptation learning. Sci China Inf Sci, 2012, 55: 1983–2007
20 Ouyang D, Li Q, Racine J. Cross-validation and the estimation of probability distributions with categorical data. Nonparametr Stat, 2006, 18: 69–100
21 Li Q, Racine J S. Nonparametric Econometrics: Theory and Practice. Princeton: Princeton University Press, 2007
22 Aitchison J, Aitken C. Multivariate binary discrimination by the kernel method. Biometrika, 1976, 63: 413–420
23 Hofmann T, Scholkopf B, Smola A J. Kernel methods in machine learning. Ann Stat, 2008, 36: 1171–1220
24 Zhou K, Fu C, Yang S. Fuzziness parameter selection in fuzzy c-means: the perspective of cluster validation. Sci China Inf Sci, 2014, 57: 112206
25 Jain A K, Murty M N, Flynn P J. Data clustering: a review. ACM Comput Surv, 1999, 31: 264–323
26 Li T, Ma S, Ogihara M. Entropy-based criterion in categorical clustering. In: Proceedings of the 21st International Conference on Machine Learning, Alberta, 2004. 536–543
27 Wang K, Yan X, Chen L. Geometric double-entity model for recognizing far-near relations of clusters. Sci China Inf Sci, 2011, 54: 2040–2050