DCf: A Double Clustering framework for fuzzy information granulation

Giovanna Castellano
Department of Informatics, University of Bari, Bari, Italy 70126
Email: [email protected]

Anna M. Fanelli
Department of Informatics, University of Bari, Bari, Italy 70126
Email: [email protected]

Corrado Mencar
Department of Informatics, University of Bari, Bari, Italy 70126
Email: [email protected]

0-7803-9017-2/05/$20.00 ©2005 IEEE

Abstract- In this paper we present a framework for extracting well-defined and semantically sound information granules. The framework is centered on a double clustering process, hence it is called DCf (Double Clustering framework). A first clustering process identifies cluster prototypes in the multidimensional data space, then the projections of these prototypes are further clustered along each dimension to provide a granulation of data. Finally, the extracted granules are described in terms of fuzzy sets that meet interpretability constraints, so as to provide a qualitative description of the information granules. Different implementations of DCf are presented and compared on a medical diagnosis problem to show the utility of the proposed framework.

I. INTRODUCTION

The key feature of Information Granulation [1] consists of forming meaningful "chunks" of information (the so-called information granules) that exhibit a functional and descriptive representation of some observed data. Such a representation could take different forms, depending on the nature of data as well as on applicative needs. Through qualitative representation of information granules, data are described in terms of linguistic properties such as "small", "low", etc. This form of representation provides a way for human-centric information processing, one of the key issues of the Granular Computing agenda [2].

Behind the linguistic description of data, semantically sound fuzzy sets need to be defined for describing qualities on single attributes. Semantic soundness can be achieved by imposing a number of interpretability constraints, which depend on the nature of data as well as on applicative needs. For numerical data, a common set of general-purpose interpretability constraints is: normality, convexity, continuity, justifiable number of elements, distinguishability, $\varepsilon$-coverage, leftmost/rightmost fuzzy sets [3].

The process of automatically extracting qualitative information granules from data implies two major problems: (1) the discovery of hidden relationships among multi-dimensional data, and (2) the representation of such relationships in a linguistically interpretable form. Usually, the problem of discovering hidden relationships among data is tackled with cluster analysis techniques, which however may fail to provide a qualitative description of data in terms of interpretable fuzzy information granules. On the other hand, the a-priori definition of interpretable fuzzy sets is relatively simple, but it is rather hard to guarantee that information granules based on such a definition adequately represent the relationships underlying the distribution of data [4].

In this paper, a framework is proposed, called DCf (Double Clustering framework), whose key feature is the extraction of qualitatively described information granules from numerical data through efficient clustering techniques. The methodology underlying the framework is centered on two clustering steps: the first, applied to the multi-dimensional data, is aimed at discovering the hidden relationships among data, while the second, which operates at the level of each dimension, enables the definition of interpretable fuzzy sets. Differently from standard clustering schemes, DCf is able to extract fuzzy information granules that satisfy all the above mentioned interpretability constraints, so that linguistic interpretation of granules is immediate. Moreover, the framework is quite general and does not depend on specific clustering algorithms, which could be chosen according to different applicative needs.

The paper is organized as follows. The Double Clustering framework is described in detail in Section II, while some implementations of the framework that include specific clustering algorithms are presented in Section III. Then, an illustrative example is portrayed in Section IV and conclusive remarks are drawn in Section V.

II. THE DOUBLE CLUSTERING FRAMEWORK
DCf is a framework that enables the extraction of fuzzy information granules representable in terms of qualitative linguistic labels. The framework tries to exploit the features of both multi-dimensional and one-dimensional clustering. The multi-dimensional clustering captures the granularity of the data in the multi-dimensional space, but the fuzzification of the resulting granules may result in fuzzy sets that cannot be associated with qualitative linguistic labels. Conversely, the one-dimensional clustering provides interpretable fuzzy sets but may lose information about the granularity of the multidimensional data. The integration of one-dimensional and multi-dimensional clustering enables an interpretable granulation of information. Specifically, DCf is made of three main steps:

1) Data Clustering. Clustering is performed in the multi-dimensional space of numerical data to embrace similar data into granules. At this stage, the information granules are described by multi-dimensional cluster prototypes (see fig. 1). Generally speaking, multi-dimensional prototypes are here intended as elements of the Universe of Discourse which characterize the hidden relationships discovered through clustering;
2) Prototype Clustering. Multi-dimensional prototypes obtained from the first clustering step are projected on each attribute dimension. Such projections are then clustered so as to obtain a number of one-dimensional prototypes on each dimension (see fig. 2);
3) Granule Fuzzification. Multi-dimensional and one-dimensional prototypes provide useful information to derive information granules that can be conveniently represented by fuzzy sets. Moreover, such fuzzy sets are built in accordance with the interpretability constraints that allow a qualitative description of the information granules (see fig. 3).

Fig. 1. The first step of DCf. The available data (grey ellipses) are clustered providing multi-dimensional prototypes.

Fig. 2. The second step of DCf. Multi-dimensional prototypes are projected onto each dimension and then clustered.

Fig. 3. The third step of DCf. Derivation of fuzzy sets for each dimension. Stars represent combinations of fuzzy sets that form multidimensional granules.

Formally, the Double Clustering framework can be described as follows. Let

$$X = [m_1, M_1] \times \cdots \times [m_n, M_n] \subset \mathbb{R}^n \tag{1}$$

be the n-dimensional Universe of Discourse in which a set of numerical data $D = \{x_i \in X : i = 1, 2, \ldots, N\}$ is available.

The first step of DCf performs a multi-dimensional clustering on the dataset $D$, providing a collection of $p$ multi-dimensional prototypes:

$$c_1, c_2, \ldots, c_p \in X \tag{2}$$

being $c_i = \left(c_i^{(1)}, c_i^{(2)}, \ldots, c_i^{(n)}\right)$, $i = 1, 2, \ldots, p$. The multidimensional prototypes are then projected onto each dimension, resulting in $n$ sets:

$$C^{(j)} = \left\{ c_i^{(j)} \in [m_j, M_j] : i = 1, 2, \ldots, p \right\} \tag{3}$$

for $j = 1, 2, \ldots, n$.

In the second step of DCf, the points of each $C^{(j)}$ are subject to one-dimensional clustering, yielding $n$ sets of one-dimensional prototypes:

$$P^{(j)} = \left\{ p_1^{(j)}, p_2^{(j)}, \ldots, p_{K_j}^{(j)} \right\} \tag{4}$$

being $K_j$ the number of clusters in the $j$-th dimension, $j = 1, 2, \ldots, n$. Such number should be fixed so as to satisfy the "justifiable number of elements" interpretability constraint.

The last step of DCf involves the derivation of fuzzy information granules. This is achieved by first fuzzifying the one-dimensional granules defined by the prototypes in each $P^{(j)}$ and then by aggregating one-dimensional fuzzy sets to form multi-dimensional fuzzy information granules.

The fuzzification of one-dimensional information granules is attained by defining Gaussian fuzzy sets that satisfy the above mentioned interpretability constraints. Precisely, for each dimension $j = 1, 2, \ldots, n$, $K_j$ fuzzy sets are defined with the following membership functions:

$$\mu_k^{(j)}(x) = \exp\left( -\frac{\left(x - \omega_k^{(j)}\right)^2}{2 \left(\sigma_k^{(j)}\right)^2} \right) \tag{5}$$

for $k = 1, 2, \ldots, K_j$. The definition of the centers $\omega_k^{(j)}$ and the widths $\sigma_k^{(j)}$ should take into account the information provided by the clustering stages and, at the same time, should meet the required interpretability constraints. To satisfy both requirements, the following "cut points" are defined:

$$t_k^{(j)} = \begin{cases} 2 m_j - t_1^{(j)} & \text{for } k = 0 \\ \left( p_k^{(j)} + p_{k+1}^{(j)} \right) / 2 & \text{for } 0 < k < K_j \\ 2 M_j - t_{K_j - 1}^{(j)} & \text{for } k = K_j \end{cases} \tag{6}$$
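As an illustration of the second step and of the cut points in Eq. (6), the following is a minimal Python sketch, not the authors' implementation. It assumes that the multi-dimensional prototypes are already available from some first-step clustering, that the one-dimensional prototypes are sorted in ascending order, and that $K_j \geq 2$. The function names (`project`, `cluster_1d`, `cut_points`) and the merge rule (repeatedly joining the two adjacent clusters with the closest means) are illustrative assumptions; DCf itself does not prescribe a specific one-dimensional clustering algorithm.

```python
from typing import List, Sequence


def project(prototypes: Sequence[Sequence[float]], j: int) -> List[float]:
    """Eq. (3): collect the j-th coordinate of every multi-dimensional prototype."""
    return [c[j] for c in prototypes]


def cluster_1d(points: Sequence[float], k: int) -> List[float]:
    """Assumed agglomerative scheme for sorted 1-D data: merge the two adjacent
    clusters with the closest means until k clusters remain.
    Returns the cluster means in ascending order (one-dimensional prototypes)."""
    clusters = [[x] for x in sorted(points)]
    while len(clusters) > k:
        means = [sum(c) / len(c) for c in clusters]
        gaps = [means[i + 1] - means[i] for i in range(len(means) - 1)]
        i = gaps.index(min(gaps))  # first closest adjacent pair
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [sum(c) / len(c) for c in clusters]


def cut_points(protos: Sequence[float], m: float, M: float) -> List[float]:
    """Eq. (6): midpoints between adjacent prototypes, plus two outer cut points
    obtained by mirroring the first and last midpoint about the bounds m and M.
    Requires at least two prototypes, sorted in ascending order."""
    inner = [(a + b) / 2 for a, b in zip(protos, protos[1:])]
    return [2 * m - inner[0]] + inner + [2 * M - inner[-1]]


if __name__ == "__main__":
    # Hypothetical 2-D prototypes in the unit square [0, 1] x [0, 1].
    prototypes = [(0.125, 0.25), (0.25, 0.75), (0.625, 0.25), (0.875, 0.875)]
    c0 = project(prototypes, 0)
    p0 = cluster_1d(c0, 3)
    print(p0, cut_points(p0, 0.0, 1.0))
```

Mirroring the outermost cut points about $m_j$ and $M_j$ places the centers of the first and last fuzzy sets (the midpoints of consecutive cut points) exactly on the bounds of the domain, which is in line with the leftmost/rightmost fuzzy sets constraint.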
Cut points are used to define centers and widths of the Gaussian membership functions according to the following relations:

$$\omega_k^{(j)} = \frac{t_{k-1}^{(j)} + t_k^{(j)}}{2} \tag{7}$$

and

$$\sigma_k^{(j)} = \frac{t_k^{(j)} - t_{k-1}^{(j)}}{2\sqrt{-2 \ln \varepsilon}} \tag{8}$$

where $\varepsilon$ is the maximum allowed overlap between two adjacent fuzzy sets.

Multi-dimensional fuzzy information granules can then be formed by combining one-dimensional fuzzy sets, one for each dimension. Among all possible combinations of one-dimensional fuzzy sets, only those that best represent the clusters discovered in the first step are selected. In this way, the selected combinations of fuzzy sets represent meaningful relations among data and combinatorial explosion of information granules is avoided. The selection of such granules is accomplished on each dimension by considering, for each cluster $i = 1, 2, \ldots, p$, the fuzzy set in the $j$-th dimension with highest membership value on the $j$-th projection $c_i^{(j)}$ of the $i$-th multidimensional cluster prototype. The final linguistic representation of the derived information granule is a conjunction of soft constraints (one for each attribute) like:

G: attribute_1 is low AND attribute_2 is medium AND ... AND attribute_n is high

The semantics of each information granule is defined by the t-norm composition of all its component one-dimensional fuzzy sets, such as:

$$\mu_G = \mu_{\text{attribute}_1} \wedge \mu_{\text{attribute}_2} \wedge \cdots \wedge \mu_{\text{attribute}_n} \tag{9}$$

When the granulation process is completed, a fuzzy rule-based model can be built on the basis of the derived fuzzy granules. This is aimed at verifying how useful the fuzzy granules identified from data are in providing good mapping properties or classification capabilities. To derive a rule base, the granulation process is carried out on the input/output product space as Universe of Discourse, and then each extracted information granule is split into an antecedent part and a consequent part.

III. SOME IMPLEMENTATIONS OF DCf

DCf is a general framework that can be customized by choosing appropriate clustering algorithms for both the first and the second step. The sole requirement for such algorithms is to produce prototypes in conformity with the granulation process specified above. The choice of specific clustering algorithms defines a particular implementation of DCf. Here, three possible implementations of DCf are briefly described:

* Fuzzy Double Clustering. This implementation integrates the Fuzzy C-means algorithm for the multidimensional clustering (first step), and a hierarchical clustering scheme for the prototype clustering (second step). The hierarchical clustering was chosen for its simplicity and the additional property of being very efficient for one-dimensional numerical data, provided that such data are sorted. This type of implementation, which we call FDC (Fuzzy Double Clustering), is particularly suited to enhance existing fuzzy clustering algorithms in order to perform interpretable fuzzy information granulation. FDC has been previously presented by the authors in [5].
* Crisp Double Clustering. To reduce the computational effort due to the calculation of the partition matrix in the first stage, it is more convenient to use a vector quantization technique in place of the fuzzy clustering algorithm in the multi-dimensional data clustering stage of DCf. This leads to another implementation of DCf, called CDC (Crisp Double Clustering), in which the Linde-Buzo-Gray (LBG) [6] vector quantization scheme is used to accomplish the first clustering step and, as in FDC, a hierarchical clustering algorithm is used for the second step. Details about CDC can be found in our previous paper [7].
* Double Clustering for Classification. In case of classification problems, the class information can be effectively exploited to improve the granulation process. In this case, the first step is implemented through the LVQ1 (Learning Vector Quantization, version 1) [8] algorithm, so that each multi-dimensional prototype is associated with a class label that is used in the second step of DCf to automatically determine the number of fuzzy sets on each dimension. This leads to an implementation of DCf that is particularly appropriate for classification problems, called DCClass (Double Clustering for Classification). The detailed description of DCClass can be found in [9].
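Whatever clustering algorithms are chosen, the fuzzification step of Section II is the same. A minimal Python sketch of that step is given here under stated assumptions: the cut points of each dimension are already computed and sorted, $0 < \varepsilon < 1$, and the minimum is used as t-norm (any other t-norm could be passed instead). The function names and the representation of a fuzzy set as a `(centre, width)` pair are illustrative choices, not part of the original framework.

```python
import math
from typing import Callable, List, Sequence, Tuple

FuzzySet = Tuple[float, float]  # assumed representation: (centre, width)


def gaussian_params(cuts: Sequence[float], eps: float) -> List[FuzzySet]:
    """Eqs. (7)-(8): centre and width of each Gaussian fuzzy set,
    computed from consecutive cut points; eps must lie in (0, 1)."""
    denom = 2.0 * math.sqrt(-2.0 * math.log(eps))
    return [((a + b) / 2.0, (b - a) / denom) for a, b in zip(cuts, cuts[1:])]


def membership(x: float, centre: float, width: float) -> float:
    """Eq. (5): Gaussian membership degree of x."""
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


def select_granule(prototype: Sequence[float],
                   sets_per_dim: Sequence[Sequence[FuzzySet]]) -> List[int]:
    """For each dimension, pick the index of the fuzzy set with the highest
    membership on the corresponding projection of the prototype."""
    return [max(range(len(sets)), key=lambda k: membership(x, *sets[k]))
            for x, sets in zip(prototype, sets_per_dim)]


def granule_membership(x: Sequence[float], granule: Sequence[int],
                       sets_per_dim: Sequence[Sequence[FuzzySet]],
                       tnorm: Callable[[List[float]], float] = min) -> float:
    """Eq. (9): t-norm combination of the selected one-dimensional fuzzy sets."""
    degrees = [membership(xj, *sets[k])
               for xj, k, sets in zip(x, granule, sets_per_dim)]
    return tnorm(degrees)


if __name__ == "__main__":
    cuts = [-0.40625, 0.40625, 0.75, 1.25]   # hypothetical cut points on [0, 1]
    sets = gaussian_params(cuts, eps=0.5)
    granule = select_granule((0.125, 0.875), [sets, sets])
    print(granule, granule_membership((0.125, 0.875), granule, [sets, sets]))
```

With the width of Eq. (8), the distance between a center and either of its cut points is $\left(t_k^{(j)} - t_{k-1}^{(j)}\right)/2 = \sigma_k^{(j)} \sqrt{-2 \ln \varepsilon}$, so Eq. (5) evaluates to $\exp(\ln \varepsilon) = \varepsilon$ there. Two adjacent fuzzy sets therefore intersect at their shared cut point with membership exactly $\varepsilon$.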
IV. ILLUSTRATIVE EXAMPLE

To illustrate the applicability of the Double Clustering framework in real-world problems, the well-known Wisconsin Breast Cancer (WBC) classification problem has been considered. After removing tuples with missing values, the employed dataset consists of 683 examples. Full details about the dataset can be found in [10]. The dataset has been split according to the stratified 10-fold cross-validation scheme. For each fold, the multi-dimensional granules derived by DCf have been used to define a fuzzy rule-based model aimed at inferring breast cancer diagnosis. All the above mentioned implementations of DCf have been considered.

Simulations with FDC and CDC have been performed by varying both the number of multi-dimensional prototypes and the number of fuzzy sets per input (equal for each dimension). In particular, the number of prototypes ranges from 2 to 10, while the number of fuzzy sets per dimension varies from 2 to 4. In each trial, the coverage level $\varepsilon$ was fixed to 0.5. Classification results of the fuzzy models based on granules derived by FDC and CDC are reported in table I. Such results are comparable with those obtained by other techniques proposed in literature (see, e.g. [11]), with the additional feature that the derived rules are expressed in a nicely human-interpretable way. To show this feature, the derived one-dimensional fuzzy sets for two input features are illustrated in fig. 4. In the following, an example of generated rule is portrayed:

If ClumpThickness is HIGH AND UniformityOfCellSize is MED-LOW AND UniformityOfCellShape is MED-LOW AND MarginalAdhesion is MED-HIGH AND SingleEpithelialCellSize is MED-LOW AND BareNuclei is HIGH AND BlandChromatin is MED-LOW AND NormalNucleoli is MED-HIGH AND Mitoses is MED-LOW
Then cancer is BENIGN

TABLE I
MEAN CLASSIFICATION ERROR ON THE TEST SET OF FDC AND CDC

FDC    CDC
p                         2      4      6      8      10
Fuzzy sets per input 2    3.97%  3.68%  4.68%  3.53%  3.68%
Fuzzy sets per input 2    3.97%  3.83%  4.42%  4.41%  5.44%
Fuzzy sets per input 4 3  -      5.00%  4.41%  4.71%  3.97%
                                 4.42%  6.62%  3.68%  5.00%
Fuzzy sets per input 4 3  -      4.56%  3.83%  3.53%  3.39%
                                 4.56%  4.71%  4.12%  3.97%

Fig. 4. Fuzzy sets with corresponding labels derived by CDC

The same classification problem was tackled through the application of DCClass. In this case, however, the number of fuzzy sets per input was not specified in advance, as it is automatically defined within the granulation process. By fixing the number of multi-dimensional prototypes to six, the application of DCClass provides ten rule bases with an average classification error on the test sets of 3.975%, while the mean number of rules is 3.6. The achieved results are comparable with those reported in [12], thus showing that our framework is also a valid tool to extract accurate knowledge from data. To show the interpretability of the information granules derived by DCClass, the one-dimensional fuzzy sets are illustrated in fig. 5 for two input dimensions. It is worth noting that for the second feature two fuzzy sets are enough to granulate data. On the other hand, for the first feature DCClass found a finer granulation for an appropriate description of data.

Fig. 5. Fuzzy sets with corresponding labels derived by DCClass

V. FINAL REMARKS

In this paper, a framework for fuzzy information granulation has been presented. Its key feature is the ability to generate information granules from data which can be represented in terms of well-defined and semantically sound fuzzy sets. As supported by the considered experimental example, this feature proves very useful in all applicative contexts, like fuzzy diagnosis, where an intelligible description of a phenomenon (like symptoms/diagnosis relationships) is of great importance. The framework is general enough to be extended with more sophisticated techniques that may support a greater number of interpretability constraints, which could lead to the generation of knowledge structures that are more comprehensible to human users.

REFERENCES

[1] L.A. Zadeh, Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic, Fuzzy Sets and Systems, vol. 90, pp. 111-117, 1997.
[2] A. Bargiela and W. Pedrycz, Granular Computing. An introduction, Boston: Kluwer Academic Publisher, 2003.
[3] J. Valente de Oliveira, Towards Neuro-Linguistic Modeling: Constraints for Optimization of Membership Functions, Fuzzy Sets & Sys., vol. 106, pp. 357-380, 1999.
[4] S. Guillaume, Designing Fuzzy Inference Systems from Data: An Interpretability-Oriented Review, IEEE Trans. on Fuzzy Sys., vol. 9(3), pp. 426-443, 2001.
[5] G. Castellano, A. M. Fanelli, and C. Mencar, A double-clustering approach for interpretable granulation of data, in Proc. of IEEE SMC'02, Tunisia, 2002.
[6] Y. Linde, A. Buzo, and R.M. Gray, An Algorithm for Vector Quantizer Design, IEEE Transactions on Communications, vol. 28, pp. 84-95, 1980.
[7] G. Castellano, A. M. Fanelli, and C. Mencar, Fuzzy granulation of multidimensional data by a crisp double clustering algorithm, in Proc. of SCI 2003, Orlando, FL, 2003, pp. 372-377.
[8] T. Kohonen, Self-Organizing Maps, 2nd ed., Berlin: Springer-Verlag, 1997.
[9] G. Castellano, A. M. Fanelli, and C. Mencar, DCClass: A tool to extract human understandable fuzzy information granules for classification, in Proc. of SCIS-ISIS 2003, South Korea, 2003, pp. 376-379.
[10] C. Blake and C. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA, 1998.
[11] J. Abonyi, R. Babuska, and F. Szeifert, Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models, IEEE Trans. on Sys., Man and Cyb.(B), vol. 32(5), pp. 612-621, 2002.
[12] D. Nauck and R. Kruse, Obtaining interpretable fuzzy classification rules from medical data, Art. Int. in Med., vol. 16, pp. 149-169, 1999.