International Journal of Database Theory and Application Vol. 3, No. 1, March, 2010
Rough Set Approach for Categorical Data Clustering¹

Tutut Herawan*¹, Rozaida Ghazali², Iwan Tri Riyadi Yanto³, and Mustafa Mat Deris²

¹ Department of Mathematics Education, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
² Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
³ Department of Mathematics, Universitas Ahmad Dahlan, Yogyakarta, Indonesia

[email protected]* (corresponding author), [email protected], [email protected], [email protected]
Abstract

Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we focus our discussion on rough set theory for categorical data clustering. We propose MADE (Maximal Attributes DEpendency), an alternative technique for categorical data clustering using rough set theory that takes into account the maximum attribute dependency degree in categorical-valued information systems. Experimental results on two benchmark UCI datasets show that the MADE technique improves on the baseline categorical data clustering technique with respect to both computational complexity and cluster purity.

Keywords: Clustering; Categorical data; Information system; Rough set theory; Attribute dependencies.
1. Introduction

Clustering a set of objects into homogeneous classes is a fundamental operation in data mining. The operation is required in a number of data analysis tasks, such as unsupervised classification and data summarization, as well as in the segmentation of large heterogeneous datasets into smaller homogeneous subsets that can be easily managed, modeled separately and analyzed. Recently, much attention has been paid to categorical data clustering [1,2], where data objects are made up of non-numerical attributes. For categorical data clustering, several new trends have emerged in techniques for handling uncertainty in the clustering process. One of the popular approaches for handling uncertainty is based on rough set theory [3]. The main idea of rough clustering is that the dataset to be clustered is mapped to a decision table. This can be done by introducing a decision attribute; consequently, a divide-and-conquer method can be used to partition/cluster the objects. The first attempt at a rough set-based technique for selecting a clustering attribute was proposed by Mazlack et al. [4]. They proposed two techniques, Bi-Clustering (BC) and TR, which are based on bi-valued attributes and the maximum total roughness of each attribute, respectively. One of the most successful pioneering rough clustering techniques is
¹ An early version of this paper appeared in the Proceedings of the International Conference on Database Theory and Application, DTA 2009, held as part of the Future Generation Information Technology Conference, FGIT 2009, Jeju Island, Korea, December 10-12, 2009, CCIS 64, Springer-Verlag, pp. 179–186, 2009.
Minimum-Minimum Roughness (MMR), proposed by Parmar et al. [5]. The technique is based on the lower and upper approximations and the quality of approximation of a set [6]. However, since the application of rough set theory to categorical data clustering is relatively new, the focus of MMR is still on evaluating its performance. Computational complexity and cluster purity remain outstanding issues, since all attributes are considered for selection and objects from different classes may appear in the same cluster, respectively.

In this paper, we propose MADE (Maximal Attributes DEpendency), an alternative technique for categorical data clustering. The technique differs from the baseline method in that rough attribute dependency in categorical-valued information systems is used to select the clustering attribute based on the maximum dependency degree. Further, we use a divide-and-conquer method to partition/cluster the objects. We show that the proposed technique achieves lower computational complexity and higher purity as compared to MMR.

The rest of this paper is organized as follows. Section 2 describes rough set theory. Section 3 analyzes and compares Mazlack's TR technique and the MMR technique. Section 4 describes the Maximum Attributes Dependency (MADE) technique. Comparison tests of the MADE and MMR techniques on the Soybean and Zoo datasets are described in Section 5. Finally, the conclusion of this work is given in Section 6.
2. Rough Set Theory

The syntax of information systems is very similar to that of relations in relational databases. Entities in relational databases are also represented by tuples of attribute values. An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f: U \times A \to V$ is an information (knowledge) function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$. An information system is also called a knowledge representation system or an attribute-valued system and can be intuitively expressed in terms of an information table (see Table 1).

Table 1. An information system
U       | a_1            | a_2            | … | a_k            | … | a_|A|
u_1     | f(u_1, a_1)    | f(u_1, a_2)    | … | f(u_1, a_k)    | … | f(u_1, a_|A|)
u_2     | f(u_2, a_1)    | f(u_2, a_2)    | … | f(u_2, a_k)    | … | f(u_2, a_|A|)
⋮       | ⋮              | ⋮              |   | ⋮              |   | ⋮
u_|U|   | f(u_|U|, a_1)  | f(u_|U|, a_2)  | … | f(u_|U|, a_k)  | … | f(u_|U|, a_|A|)
The time complexity of computing an information system $S = (U, A, V, f)$ is $|U| \times |A|$, since there are $|U| \times |A|$ values $f(u_i, a_j)$ to be computed, where $i = 1, 2, 3, \ldots, |U|$ and $j = 1, 2, 3, \ldots, |A|$. Note that $f$ induces a set of tuples: each object $u_i$ is mapped to the tuple $t_i = (f(u_i, a_1), f(u_i, a_2), f(u_i, a_3), \ldots, f(u_i, a_{|A|}))$, where $i = 1, 2, 3, \ldots, |U|$. Note that a tuple is not necessarily associated with a unique entity (see Table 7). In an information table, two distinct entities could have the same tuple representation
(a duplicated/redundant tuple), which is not permissible in relational databases. Thus, the concept of an information system is a generalization of the concept of a relational database.

Definition 1. Two elements $x, y \in U$ are said to be B-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that an indiscernibility relation induced by a set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

Definition 2. (See [6].) The B-lower approximation of $X$, denoted by $\underline{B}(X)$, and the B-upper approximation of $X$, denoted by $\overline{B}(X)$, are defined by

$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}$ and $\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}$, respectively.
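To make the constructions concrete, the following is a minimal Python sketch (ours, not part of the original paper) of Definitions 1 and 2; an information system is assumed to be stored as a dictionary mapping each object to a dictionary of attribute values.

```python
# Minimal sketch (ours) of Definitions 1-2. An information system is a
# dict: object -> {attribute: value}; U is the set of objects.

def partition(U, table, B):
    """Partition U/B: x and y fall in the same block iff
    f(x, a) == f(y, a) for every attribute a in B (Definition 1)."""
    blocks = {}
    for x in U:
        key = tuple(table[x][a] for a in B)
        blocks.setdefault(key, set()).add(x)
    return list(blocks.values())

def lower_upper(U, table, B, X):
    """B-lower and B-upper approximations of X (Definition 2)."""
    lower, upper = set(), set()
    for block in partition(U, table, B):
        if block <= X:       # [x]_B is contained in X
            lower |= block
        if block & X:        # [x]_B intersects X
            upper |= block
    return lower, upper
```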
It is easily seen that the upper approximation of a subset $X \subseteq U$ can be expressed using the set complement and the lower approximation by $\overline{B}(X) = U - \underline{B}(-X)$, where $-X$ denotes the complement of $X$ relative to $U$. The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted by $\alpha_B(X)$, is measured by

$\alpha_B(X) = |\underline{B}(X)| \,/\, |\overline{B}(X)|$,   (1)

where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \leq \alpha_B(X) \leq 1$. If $X$ is a union of some equivalence classes, then $\alpha_B(X) = 1$; thus, the set $X$ is crisp with respect to $B$. Otherwise, if $\alpha_B(X) < 1$, $X$ is rough with respect to $B$.
The accuracy of roughness in equation (1) can also be interpreted using the well-known Marczewski-Steinhaus (MZ) metric [7]. Applying the MZ metric to the lower and upper approximations of a subset $X \subseteq U$ in an information system $S$, we have

$D(\underline{B}(X), \overline{B}(X)) = 1 - \frac{|\underline{B}(X) \cap \overline{B}(X)|}{|\underline{B}(X) \cup \overline{B}(X)|} = 1 - \frac{|\underline{B}(X)|}{|\overline{B}(X)|} = 1 - \alpha_B(X)$.   (2)
The notion of the dependency of attributes in information systems is given in the following definition.

Definition 3. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. $D$ depends totally on $C$, denoted $C \Rightarrow D$, if all values of attributes from $D$ are uniquely determined by values of attributes from $C$.
In other words, $D$ depends totally on $C$ if there exists a functional dependency between the values of $D$ and $C$. The notion of a generalized attribute dependency is given in the following definition.

Definition 4. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. The degree of dependency of $D$ on $C$, denoted $C \Rightarrow_k D$, is defined by

$k = \frac{\sum_{X \in U/D} |\underline{C}(X)|}{|U|}$.   (3)

Obviously, $0 \leq k \leq 1$. $D$ is said to depend totally (in a degree $k$) on $C$ if $k = 1$; otherwise, $D$ depends partially on $C$. Thus, $D$ depends totally (partially) on $C$ if all (some) elements of the universe $U$ can be uniquely classified to equivalence classes of the partition $U/D$ employing $C$.
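Reusing the `partition` and `lower_upper` sketches from above, equation (3) can be computed directly; this is our illustration, not code from the paper.

```python
def dependency_degree(U, table, C, D):
    """Degree k of the dependency C =>_k D, per equation (3):
    k = (sum over X in U/D of |C-lower approximation of X|) / |U|."""
    total = sum(len(lower_upper(U, table, C, X)[0])
                for X in partition(U, table, D))
    return total / len(U)
```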
In the following section, we analyze and compare the Total Roughness (TR) and Min-Min Roughness (MMR) techniques for selecting a clustering attribute.
3. TR and MMR Techniques

3.1. The TR Technique

The TR technique builds on the notion of an information system as stated in Section 2. Suppose that attribute $a_i \in A$ has $n$ different values $\beta_k$, $k = 1, 2, \ldots, n$. Let $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, be the subset of objects having value $\beta_k$ of attribute $a_i$. The roughness in the TR technique of the set $X(a_i = \beta_k)$ with respect to $a_j$, where $i \neq j$, denoted by $R_{a_j}(X(a_i = \beta_k))$, is defined by

$R_{a_j}(X(a_i = \beta_k)) = \frac{|\underline{a_j}(X(a_i = \beta_k))|}{|\overline{a_j}(X(a_i = \beta_k))|}$, $k = 1, 2, \ldots, n$.   (4)
In the TR technique, the mean roughness of attribute $a_i \in A$ with respect to attribute $a_j \in A$, where $i \neq j$, denoted $Rough_{a_j}(a_i)$, is evaluated as follows:

$Rough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} R_{a_j}(X(a_i = \beta_k))}{|V(a_i)|}$,   (5)
where $V(a_i)$ is the set of values of attribute $a_i \in A$. The total roughness of attribute $a_i \in A$ with respect to the attributes $a_j \in A$, where $i \neq j$, denoted $TR(a_i)$, is obtained by the following formula:

$TR(a_i) = \frac{\sum_{j=1,\, j \neq i}^{|A|} Rough_{a_j}(a_i)}{|A| - 1}$.   (6)

As stated in Mazlack et al. [4], the attribute with the highest TR value is the best choice of clustering attribute.
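To make equations (4)-(6) concrete, here is a small Python sketch (ours, not the authors' MATLAB implementation), reusing the `lower_upper` helper from Section 2:

```python
def total_roughness(U, table, ai, attrs):
    """TR(ai): mean over all other attributes aj of the mean roughness
    Rough_aj(ai), per equations (4)-(6)."""
    others = [aj for aj in attrs if aj != ai]
    total = 0.0
    for aj in others:
        values = {table[x][ai] for x in U}
        mean = 0.0
        for v in values:                       # one X(ai = v) per value
            X = {x for x in U if table[x][ai] == v}
            lower, upper = lower_upper(U, table, [aj], X)
            mean += len(lower) / len(upper)    # equation (4)
        total += mean / len(values)            # equation (5)
    return total / len(others)                 # equation (6)
```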
3.2. The MMR Technique
As in the TR technique, suppose that attribute $a_i \in A$ has $n$ different values $\beta_k$, $k = 1, 2, \ldots, n$, and let $X(a_i = \beta_k)$ be the subset of objects having value $\beta_k$ of attribute $a_i$. The roughness in the MMR technique of the set $X(a_i = \beta_k)$ with respect to $a_j$, where $i \neq j$, denoted by $MMR_{a_j}(X(a_i = \beta_k))$, is defined by

$MMR_{a_j}(X(a_i = \beta_k)) = 1 - \frac{|\underline{a_j}(X(a_i = \beta_k))|}{|\overline{a_j}(X(a_i = \beta_k))|}$, $k = 1, 2, \ldots, n$.   (7)
It is clear that the MMR technique uses the MZ metric to measure the roughness of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$. The mean roughness in the MMR technique is defined by

$MMRough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} MMR_{a_j}(X(a_i = \beta_k))}{|V(a_i)|}$.   (8)
According to Parmar et al. [5], the attribute with the least mean roughness is the best choice of clustering attribute.

3.3. Comparison of the TR and MMR techniques

Proposition 5. The roughness value in the MMR technique is the complement of that in the TR technique.

Proof. Since the MMR technique uses the MZ metric to measure the roughness of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$, i.e.,

$MMR_{a_j}(X(a_i = \beta_k)) = 1 - \frac{|\underline{a_j}(X(a_i = \beta_k))|}{|\overline{a_j}(X(a_i = \beta_k))|}$,

then from (7) and (4), we have

$MMR_{a_j}(X(a_i = \beta_k)) = 1 - R_{a_j}(X(a_i = \beta_k))$.   (9)
Thus, the mean roughness in the MMR technique is also the complement of that in the TR technique (5), i.e.,

$MMRough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} MMR_{a_j}(X(a_i = \beta_k))}{|V(a_i)|}
= \frac{\sum_{k=1}^{|V(a_i)|} \left(1 - R_{a_j}(X(a_i = \beta_k))\right)}{|V(a_i)|}
= \frac{|V(a_i)| - \sum_{k=1}^{|V(a_i)|} R_{a_j}(X(a_i = \beta_k))}{|V(a_i)|}
= 1 - Rough_{a_j}(a_i)$, for $i \neq j$.   (10)
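Proposition 5 can be checked numerically with a one-line variant of the TR sketch above (again our illustration): the MMR mean roughness is simply one minus the TR mean roughness.

```python
def mmr_mean_roughness(U, table, ai, aj):
    """Mean MMR roughness of ai w.r.t. aj, per equations (7)-(8).
    By Proposition 5 this equals 1 - Rough_aj(ai)."""
    values = {table[x][ai] for x in U}
    s = 0.0
    for v in values:
        X = {x for x in U if table[x][ai] == v}
        lower, upper = lower_upper(U, table, [aj], X)
        s += 1 - len(lower) / len(upper)       # equation (7)
    return s / len(values)                     # equation (8)
```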
The MMR technique is based on the minimum value of the mean roughness in (10), without calculating the total roughness (6). This analysis and comparison shows that the TR and MMR techniques provide the same result when used to determine the clustering attribute. To illustrate that the MMR and Mazlack's techniques provide the same results, we consider the following example.

Example 6. We consider the dataset in the illustrative example of Table 2 in [5].

Table 2. An information system in [5]
U  | a1     | a2     | a3       | a4         | a5      | a6
1  | Big    | Blue   | Hard     | Indefinite | Plastic | Negative
2  | Medium | Red    | Moderate | Smooth     | Wood    | Neutral
3  | Small  | Yellow | Soft     | Fuzzy      | Plush   | Positive
4  | Medium | Blue   | Moderate | Fuzzy      | Plastic | Negative
5  | Small  | Yellow | Soft     | Indefinite | Plastic | Neutral
6  | Big    | Green  | Hard     | Smooth     | Wood    | Positive
7  | Small  | Yellow | Hard     | Indefinite | Metal   | Positive
8  | Small  | Yellow | Soft     | Indefinite | Plastic | Positive
9  | Big    | Green  | Hard     | Smooth     | Wood    | Neutral
10 | Medium | Green  | Moderate | Smooth     | Plastic | Neutral
In Table 2, there are ten objects ($|U| = 10$) with six categorical-valued attributes: $a_1, a_2, a_3, a_4, a_5$ and $a_6$. Each attribute has more than two values ($|V(a_i)| > 2$, $i = 1, 2, 3, 4, 5, 6$). Since in this case there are no bi-valued attributes, we cannot employ Mazlack's BC technique. The TR and MMR calculations must be applied to all of the attribute values to obtain the clustering attribute. The calculation of the TR value is based on formulas (4), (5) and (6). The TR and MMR techniques are implemented in MATLAB version 7.6.0.324 (R2008a). They are executed sequentially on an Intel Core 2 Duo processor with 1 GB of main memory, running Windows XP Professional SP3. The results of TR and MMR are given in Tables 3 and 4, respectively.

Table 3. The TR of all attributes of Table 2
Attribute | Rough_a1 | Rough_a2 | Rough_a3 | Rough_a4 | Rough_a5 | Rough_a6
a1        | —        | 0.3889   | 0.4762   | 0        | 0.0476   | 0
a2        | 0.2500   | —        | 0.1071   | 0        | 0.0357   | 0.2500
a3        | 0.4762   | 0.0556   | —        | 0        | 0.0333   | 0
a4        | 0        | 0.3333   | 0        | —        | 0.1587   | 0
a5        | 0        | 0.1574   | 0.1000   | 0.0667   | —        | 0.0667
a6        | 0        | 0.3750   | 0        | 0        | 0.0333   | —

Each row $a_i$ gives the TR mean roughness $Rough_{a_j}(a_i)$ with respect to each other attribute $a_j$.
Table 4. The MMR of all attributes of Table 2

Attribute | Rough_a1 | Rough_a2 | Rough_a3 | Rough_a4 | Rough_a5 | Rough_a6
a1        | —        | 0.6111   | 0.5238   | 1        | 0.9048   | 1
a2        | 0.7500   | —        | 0.8929   | 1        | 0.9286   | 0.7500
a3        | 0.5238   | 0.9444   | —        | 1        | 0.9074   | 1
a4        | 1        | 0.6667   | 1        | —        | 0.7639   | 1
a5        | 1        | 0.8820   | 1        | 1        | —        | 0.9500
a6        | 1        | 0.6250   | 1        | 1        | 0.9333   | —

Each row $a_i$ gives the MMR mean roughness $Rough_{a_j}(a_i)$ with respect to each other attribute $a_j$.
Based on Figure 1, attribute $a_1$ has the highest TR value (0.1825) compared to $a_i$, $i = 2, 3, 4, 5, 6$. Thus, attribute $a_1$ is selected as the clustering attribute. Meanwhile, based on Figure 2, two attributes attain the minimum MMR ($a_1$ and $a_3$, both 0.5238). However, the second-lowest value corresponding to attribute $a_1$ (0.6111) is lower than that of $a_3$ (0.9074). Therefore, attribute $a_1$ is again selected as the clustering attribute.
Figure 1. The TR value of all attributes of Table 2
Figure 2. The MMR value of all attributes of Table 2

Table 5. The computation and response time of TR and MMR

Technique | Computation | Response time (sec)
TR        | 237         | 0.047
MMR       | 237         | 0.047
Figure 3. The computation of TR and MMR
Figure 4. The response time of TR and MMR
Based on the results of selecting the clustering attribute in Figures 1, 2, 3 and 4, it is easily seen that the decision, computational complexity and processing time of the TR and MMR techniques are exactly the same. Thus, based on Proposition 5, the statement in the comparison example of [5] that MMR is an extension of the approach proposed by Mazlack et al. is incorrect. On the other hand, to achieve lower computational complexity in selecting the partitioning attribute using MMR, Parmar et al. suggested that the roughness be measured based on the relationship between an attribute $a_i \in A$ and the set $A - \{a_i\}$, instead of calculating the roughness with respect to each $a_j$ with $a_i \neq a_j$ [5]. As we have observed, this technique can only be applied to very special datasets. To illustrate this problem, we consider the following example.
Example 7. For Table 2, if we measure the roughness of each attribute $a_i \in A$ with respect to the set of attributes $A - \{a_i\}$, then we get the modified MMR values in Table 6.

Table 6. The modified MMR of all attributes of the dataset in [5]
Attribute | Mean roughness          | MMR
a1        | Rough_{A-{a1}}(a1) = 0  | 0
a2        | Rough_{A-{a2}}(a2) = 0  | 0
a3        | Rough_{A-{a3}}(a3) = 0  | 0
a4        | Rough_{A-{a4}}(a4) = 0  | 0
a5        | Rough_{A-{a5}}(a5) = 0  | 0
a6        | Rough_{A-{a6}}(a6) = 0  | 0
Based on Table 6, we are not able to select a clustering attribute. Thus, the suggested technique leads to a problem: after calculating the mean roughness of attribute $a_i \in A$ with respect to the set of attributes $A - \{a_i\}$, the MMR value often cannot preserve the original decision. This modified technique is therefore not applicable to all types of datasets. To overcome the computational complexity problem of MMR, in Section 4 we introduce the Maximum Attributes Dependency (MADE) technique for categorical data clustering.
4. Maximum Attributes DEpendency (MADE) Technique

4.1. MADE technique

The MADE technique for selecting the partitioning attribute is based on the maximum degree of dependency of attributes. The justification that a higher degree of dependency of attributes implies higher accuracy for selecting the partitioning attribute is stated in Proposition 8.

Proposition 8. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. If $D$ depends totally on $C$, then $\alpha_D(X) \leq \alpha_C(X)$, for every $X \subseteq U$.

Proof. Let $D$ and $C$ be any subsets of $A$ in an information system $S = (U, A, V, f)$. From the hypothesis, we have $IND(C) \subseteq IND(D)$. Furthermore, the partition $U/C$ is finer than $U/D$; thus, it is clear that any equivalence class induced by $IND(D)$ is a union of some equivalence classes induced by $IND(C)$. Therefore, for every $x \in X \subseteq U$, we have $[x]_C \subseteq [x]_D$. Hence, for every $X \subseteq U$, we have $\underline{D}(X) \subseteq \underline{C}(X) \subseteq X \subseteq \overline{C}(X) \subseteq \overline{D}(X)$. Consequently,

$\alpha_D(X) = \frac{|\underline{D}(X)|}{|\overline{D}(X)|} \leq \frac{|\underline{C}(X)|}{|\overline{C}(X)|} = \alpha_C(X)$. □
4.2. Complexity

Suppose that an information system $S = (U, A, V, f)$ has $|A|$ attributes. MADE computes the dependency degree of each attribute $a_i$ on each attribute $a_j$, where $i \neq j$, which requires $|A|(|A| - 1)$ computations. Thus, the computational complexity of the MADE technique is polynomial, $O(|A|(|A| - 1))$.
The MADE algorithm for selecting the clustering attribute is given in Figure 5.

Algorithm: MADE
Input: Dataset without clustering attribute
Output: Clustering attribute
Begin
  Step 1. Compute the equivalence classes using the indiscernibility relation on each attribute.
  Step 2. Determine the dependency degree of each attribute a_i with respect to all a_j, where i ≠ j.
  Step 3. Select the maximum dependency degree of each attribute.
  Step 4. Select the clustering attribute based on the maximum degree of dependency of attributes.
End

Figure 5. The MADE algorithm
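A compact Python rendering of Figure 5 (our sketch, building on `dependency_degree` from Section 2; the function name is ours):

```python
def made_select(U, table, attrs):
    """MADE (Figure 5): for each attribute ai, take the maximum of its
    dependency degrees on every other single attribute aj, then select
    the attribute whose maximum is largest."""
    best_attr, best_k = None, -1.0
    for ai in attrs:
        # max over j != i of the degree to which ai depends on aj
        k = max(dependency_degree(U, table, [aj], [ai])
                for aj in attrs if aj != ai)
        if k > best_k:
            best_attr, best_k = ai, k
    return best_attr, best_k
```

A full implementation would also apply the tie-breaking rule described next.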
As with the procedure for selecting the clustering attribute in MMR, when using the MADE technique it is recommended, in case of ties, to look at the next highest dependency degree among the tied attributes, and so on until the tie is broken.

4.3. Example

The dataset is an animal dataset from Hu [8]. In Table 7, there are nine animals ($|U| = 9$) with nine categorical-valued attributes ($|A| = 9$): Hair, Teeth, Eye, Feather, Feet, Eat, Milk, Fly and Swim. The attributes Hair, Eye, Feather, Milk, Fly and Swim have two values, attribute Teeth has three values, and attributes Feet and Eat have four values.
a. To obtain the dependency degrees of all attributes, the first step of the technique is to obtain the equivalence classes induced by the indiscernibility relation of each singleton attribute, i.e., disjoint classes containing indiscernible objects.
b. By collecting the equivalence classes, a partition of the objects is obtained. The partitions are shown in Figure 6.
c. The dependency degrees of the attributes are then obtained using formula (3). For the dependency of attribute Hair on attributes Teeth, Eye, Feather, Feet, Eat, Milk, Fly and Swim, we have the degrees shown in Figure 7.

Table 7. Animal world dataset from [8]
Animal    | Hair | Teeth   | Eye     | Feather | Feet | Eat   | Milk | Fly | Swim
Tiger     | Y    | Pointed | Forward | N       | Claw | Meat  | Y    | N   | Y
Cheetah   | Y    | Pointed | Forward | N       | Claw | Meat  | Y    | N   | Y
Giraffe   | Y    | Blunt   | Side    | N       | Hoof | Grass | Y    | N   | N
Zebra     | Y    | Blunt   | Side    | N       | Hoof | Grass | Y    | N   | N
Ostrich   | N    | N       | Side    | Y       | Claw | Grain | N    | N   | N
Penguin   | N    | N       | Side    | Y       | Web  | Fish  | N    | N   | Y
Albatross | N    | N       | Side    | Y       | Claw | Grain | N    | Y   | Y
Eagle     | N    | N       | Forward | Y       | Claw | Meat  | N    | Y   | N
Viper     | N    | Pointed | Forward | N       | N    | Meat  | N    | N   | N
a. X(Hair = Y) = {1,2,3,4}, X(Hair = N) = {5,6,7,8,9}, U/Hair = {{1,2,3,4}, {5,6,7,8,9}}.
b. X(Teeth = Pointed) = {1,2,9}, X(Teeth = Blunt) = {3,4}, X(Teeth = N) = {5,6,7,8}, U/Teeth = {{1,2,9}, {3,4}, {5,6,7,8}}.
c. X(Eye = Forward) = {1,2,8,9}, X(Eye = Side) = {3,4,5,6,7}, U/Eye = {{1,2,8,9}, {3,4,5,6,7}}.
d. X(Feather = N) = {1,2,3,4,9}, X(Feather = Y) = {5,6,7,8}, U/Feather = {{1,2,3,4,9}, {5,6,7,8}}.
e. X(Feet = Claw) = {1,2,5,7,8}, X(Feet = Hoof) = {3,4}, X(Feet = Web) = {6}, X(Feet = N) = {9}, U/Feet = {{1,2,5,7,8}, {3,4}, {6}, {9}}.
f. X(Eat = Meat) = {1,2,8,9}, X(Eat = Grass) = {3,4}, X(Eat = Grain) = {5,7}, X(Eat = Fish) = {6}, U/Eat = {{1,2,8,9}, {3,4}, {5,7}, {6}}.
g. X(Milk = Y) = {1,2,3,4}, X(Milk = N) = {5,6,7,8,9}, U/Milk = {{1,2,3,4}, {5,6,7,8,9}}.
h. X(Fly = N) = {1,2,3,4,5,6,9}, X(Fly = Y) = {7,8}, U/Fly = {{1,2,3,4,5,6,9}, {7,8}}.
i. X(Swim = Y) = {1,2,6,7}, X(Swim = N) = {3,4,5,8,9}, U/Swim = {{1,2,6,7}, {3,4,5,8,9}}.

Figure 6. The partitions using singleton attributes
Teeth $\Rightarrow_k$ Hair, where $k = \frac{\sum_{X \in U/Hair} |\underline{Teeth}(X)|}{|U|} = \frac{|\{3,4\}| + |\{5,6,7,8\}|}{9} = \frac{6}{9}$.

Eye $\Rightarrow_k$ Hair, where $k = \frac{\sum_{X \in U/Hair} |\underline{Eye}(X)|}{|U|} = 0$.

Feather $\Rightarrow_k$ Hair, where $k = \frac{|\{5,6,7,8\}|}{9} = \frac{4}{9}$.

Feet $\Rightarrow_k$ Hair, where $k = \frac{|\{3,4\}| + |\{6\}| + |\{9\}|}{9} = \frac{4}{9}$.

Eat $\Rightarrow_k$ Hair, where $k = \frac{|\{3,4\}| + |\{5,7\}| + |\{6\}|}{9} = \frac{5}{9}$.

Milk $\Rightarrow_k$ Hair, where $k = \frac{|\{1,2,3,4\}| + |\{5,6,7,8,9\}|}{9} = 1$.

Fly $\Rightarrow_k$ Hair, where $k = \frac{|\{7,8\}|}{9} = \frac{2}{9}$.

Swim $\Rightarrow_k$ Hair, where $k = 0$.

Figure 7. The attributes dependencies
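The dependencies in Figure 7 can be reproduced with the earlier sketches; the following check (ours, not the authors' code) encodes Table 7 and prints the degree to which Hair depends on each other attribute.

```python
# Encode Table 7 and verify Figure 7 using the earlier helper functions.
A = ["Hair", "Teeth", "Eye", "Feather", "Feet", "Eat", "Milk", "Fly", "Swim"]
rows = [
    ("Tiger",     "Y", "Pointed", "Forward", "N", "Claw", "Meat",  "Y", "N", "Y"),
    ("Cheetah",   "Y", "Pointed", "Forward", "N", "Claw", "Meat",  "Y", "N", "Y"),
    ("Giraffe",   "Y", "Blunt",   "Side",    "N", "Hoof", "Grass", "Y", "N", "N"),
    ("Zebra",     "Y", "Blunt",   "Side",    "N", "Hoof", "Grass", "Y", "N", "N"),
    ("Ostrich",   "N", "N",       "Side",    "Y", "Claw", "Grain", "N", "N", "N"),
    ("Penguin",   "N", "N",       "Side",    "Y", "Web",  "Fish",  "N", "N", "Y"),
    ("Albatross", "N", "N",       "Side",    "Y", "Claw", "Grain", "N", "Y", "Y"),
    ("Eagle",     "N", "N",       "Forward", "Y", "Claw", "Meat",  "N", "Y", "N"),
    ("Viper",     "N", "Pointed", "Forward", "N", "N",    "Meat",  "N", "N", "N"),
]
table = {i + 1: dict(zip(A, r[1:])) for i, r in enumerate(rows)}
U = set(table)
for a in A[1:]:
    print(a, dependency_degree(U, table, [a], ["Hair"]))
# Expected (Figure 7): Teeth 6/9, Eye 0, Feather 4/9, Feet 4/9,
# Eat 5/9, Milk 1, Fly 2/9, Swim 0.
```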
Similar calculations are performed for all the attributes. These calculations are summarized in Table 8, where each row gives the degree to which the row attribute depends on each column attribute.

Table 8. The dependency degrees of all attributes from Table 7

Attribute | Hair  | Teeth | Eye   | Feather | Feet  | Eat   | Milk  | Fly   | Swim
Hair      | —     | 0.666 | 0     | 0.444   | 0.444 | 0.555 | 1     | 0.222 | 0
Teeth     | 0     | —     | 0     | 0.444   | 0.444 | 0.555 | 0     | 0.222 | 0
Eye       | 0     | 0.555 | —     | 0       | 0.444 | 1     | 0     | 0     | 0
Feather   | 0.444 | 1     | 0     | —       | 0.444 | 0.555 | 0.444 | 0.222 | 0
Feet      | 0     | 0.222 | 0     | 0       | —     | 0.555 | 0     | 0.222 | 0
Eat       | 0     | 0.555 | 0.444 | 0       | 0.333 | —     | 0     | 0     | 0
Milk      | 1     | 0.666 | 0     | 0.444   | 0.444 | 0.555 | —     | 0.222 | 0
Fly       | 0.444 | 0.555 | 0     | 0.555   | 0.444 | 0.333 | 0.444 | —     | 0
Swim      | 0     | 0.222 | 0     | 0       | 0.444 | 0.333 | 0     | 0     | —
Figure 8. The maximal attributes dependencies
With the MADE technique, the first maximum degree of dependency of attributes, i.e., 1, occurs for attributes Hair (with respect to Milk), Eye and Feather, as Figure 8 shows. The second maximum degree of dependency, 0.666, occurs for attribute Hair. Thus, based on Figure 8, attribute Hair is selected as the clustering attribute.

4.4. Objects splitting
For objects splitting, we use a divide-and-conquer method. For example, in Table 7 we can cluster (partition) the animals based on the selected clustering attribute, Hair (equivalently, Milk). Notice that the partition of the set of animals induced by attribute Hair/Milk is {{1,2,3,4}, {5,6,7,8,9}}. Thus, we can split the animals using the hierarchical tree in Figure 9.

{Tiger, Cheetah, Giraffe, Zebra, Ostrich, Penguin, Albatross, Eagle, Viper}   (the objects)
├── {Tiger, Cheetah, Giraffe, Zebra}                  (1st possible clusters)
│   ├── {Tiger, Cheetah}
│   └── {Giraffe, Zebra}
└── {Ostrich, Penguin, Albatross, Eagle, Viper}
    ├── {Ostrich, Penguin, Albatross, Eagle}          (2nd possible clusters)
    └── {Viper}

Figure 9. The objects splitting
The technique is applied recursively to obtain further clusters. At each subsequent iteration, the leaf node having the most objects is selected for further splitting. The algorithm terminates when it reaches a pre-defined number of clusters. This number is subjective and is pre-decided based either on user requirements or on domain knowledge.
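A possible rendering of this splitting loop in Python (our sketch; it reuses `made_select` and `partition` from earlier, and `n_clusters` is the user-supplied target). Note that it splits a leaf into all value classes of the selected attribute, which matches Figure 9 for the bi-valued Hair attribute.

```python
def cluster(U, table, attrs, n_clusters):
    """Divide-and-conquer splitting: repeatedly pick the largest leaf,
    select its clustering attribute with MADE, and split it by that
    attribute's values, until n_clusters leaves exist."""
    clusters = [set(U)]
    while len(clusters) < n_clusters:
        big = max(clusters, key=len)
        attr, _ = made_select(big, table, attrs)
        parts = partition(big, table, [attr])
        if len(parts) == 1:        # the leaf cannot be split further
            break
        clusters.remove(big)
        clusters.extend(parts)
    return clusters
```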
5. Comparison Tests

In order to test MADE and compare it with MMR, we use two datasets obtained from the benchmark UCI Machine Learning Repository: the Soybean and Zoo datasets, with 47 and 101 objects, respectively. The purity of clusters was used as a measure of the quality of the clusters [5]. The purity of a cluster and the overall purity are defined as

Purity(i) = (number of data occurring in both the i-th cluster and its corresponding class) / (number of data in the i-th cluster),

Overall Purity = (Σ_{i=1}^{#clusters} Purity(i)) / #clusters.

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 1 [5]. The MMR and MADE algorithms for the Soybean and Zoo datasets are implemented in MATLAB version 7.6.0.324 (R2008a). They are executed sequentially on an Intel Core 2 Duo processor with 1 GB of main memory, running Windows XP Professional SP3.
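For completeness, the purity measure can be sketched as follows (ours; `labels` is an assumed mapping from each object to its true class):

```python
from collections import Counter

def overall_purity(clusters, labels):
    """Score each cluster by its majority-class fraction, then average
    over clusters, per the purity formulas above."""
    purities = [max(Counter(labels[x] for x in c).values()) / len(c)
                for c in clusters]
    return sum(purities) / len(purities)
```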
5.1. Soybean dataset

The Soybean dataset contains 47 objects on diseases in soybeans. Each object is classified as one of four diseases, namely Diaporthe Stem Canker (D1), Charcoal Rot (D2), Rhizoctonia Root Rot (D3) and Phytophthora Rot (D4), and is described by 35 categorical attributes [9]. The dataset comprises 17 objects for Phytophthora Rot and 10 objects for each of the remaining diseases. Since there are four possible diseases, the objects are split into four clusters. The results are summarized in Table 9. All 47 objects belong to the majority class label of the cluster in which they are classified. Thus, the overall purity of the clusters is 100%.

Table 9. The purity of clusters

Cluster number | D1 | D2 | D3 | D4 | Purity
1              | 10 | 0  | 0  | 0  | 1
2              | 0  | 10 | 0  | 0  | 1
3              | 0  | 0  | 10 | 0  | 1
4              | 0  | 0  | 0  | 17 | 1
Overall Purity                     | 1
5.2. Zoo dataset

The Zoo dataset comprises 101 objects, where each data point represents an animal in terms of 18 categorical attributes [10]. Each animal is classified into one of seven classes. Therefore, for MADE, the number of clusters is set at seven. Table 10 summarizes the results of running the MADE algorithm on the Zoo dataset.
Table 10. The purity of clusters

Cluster number | C1 | C2 | C3 | C4 | C5 | C6 | C7 | Purity
1              | 41 | 0  | 0  | 0  | 0  | 0  | 0  | 1
2              | 0  | 20 | 0  | 0  | 0  | 0  | 0  | 1
3              | 0  | 0  | 5  | 0  | 0  | 0  | 0  | 1
4              | 0  | 0  | 0  | 13 | 0  | 0  | 0  | 1
5              | 0  | 0  | 0  | 0  | 4  | 0  | 0  | 1
6              | 0  | 0  | 0  | 0  | 0  | 8  | 0  | 1
7              | 0  | 0  | 0  | 0  | 0  | 0  | 10 | 1
Overall Purity                                   | 1
All 101 objects belong to the majority class label of the cluster in which they are classified. Thus, the overall purity of the clusters is 100%.

5.3. Comparison
The comparisons of overall purity, computation and response time of MADE and MMR on the Soybean and Zoo datasets are given in Figures 10, 11 and 12, respectively. Based on Table 11, the MADE technique provides a better solution than the MMR technique on both the Soybean and Zoo datasets.

Table 11. The overall improvement of MMR by MADE

Dataset | Clusters Purity | Computation | Response Time
Soybean | 17%             | 64%         | 63%
Zoo     | 9%              | 77%         | 67%
Figure 10. The comparison of overall purity
Figure 11. The comparison of computation
Figure 12. The comparison of response time
6. Conclusion

Categorical data clustering has emerged as a new trend in handling uncertainty in the clustering process. In this paper, we have proposed MADE, an alternative technique for categorical data clustering using rough set theory based on attribute dependencies. We have shown that the MADE technique is able to achieve lower computational complexity and higher cluster purity than the MMR technique. With this approach, we believe that applications of MADE such as decision making and clustering very large datasets will be feasible.
Acknowledgement

This work was supported by a grant from Universiti Tun Hussein Onn Malaysia.
References

[1] Huang, Z. "Extensions to the k-means algorithm for clustering large data sets with categorical values". Data Mining and Knowledge Discovery 2 (3), 1998, 283–304.
[2] Kim, D., Lee, K., Lee, D. "Fuzzy clustering of categorical data using fuzzy centroids". Pattern Recognition Letters 25 (11), 2004, 1263–1271.
[3] Pawlak, Z. "Rough sets". International Journal of Computer and Information Sciences 11, 1982, 341–356.
[4] Mazlack, L.J., He, A., Zhu, Y., Coppock, S. "A rough set approach in choosing partitioning attributes". Proceedings of the ISCA 13th International Conference, CAINE-2000, 2000, 1–6.
[5] Parmar, D., Wu, T., Blackhurst, J. "MMR: An algorithm for clustering categorical data using rough set theory". Data and Knowledge Engineering 63, 2007, 879–893.
[6] Pawlak, Z., Skowron, A. "Rudiments of rough sets". Information Sciences 177 (1), 2007, 3–27.
[7] Yao, Y.Y. "Two views of the theory of rough sets in finite universes". International Journal of Approximate Reasoning 15 (4), 1996, 291–317.
[8] Hu, X. "Knowledge discovery in databases: An attribute oriented rough set approach". PhD thesis, University of Regina, 1995.
[9] http://archive.ics.uci.edu/ml/datasets/Soybean+%28Small%29
[10] http://archive.ics.uci.edu/ml/datasets/Zoo
Authors
Tutut Herawan He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.
Rozaida Ghazali She received her B.Sc. (Hons) degree in Computer Science from Universiti Sains Malaysia, and her M.Sc. degree in Computer Science from Universiti Teknologi Malaysia. She obtained her Ph.D. degree in Higher Order Neural Networks from Liverpool John Moores University, UK. She is currently a member of the teaching staff at the Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia (UTHM). Her research areas include neural networks, fuzzy logic, financial time series prediction and physical time series forecasting.
Iwan Tri Riyadi Yanto He is a M.Sc. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.
Mustafa Mat Deris He received the B.Sc. from University Putra Malaysia, M.Sc. from University of Bradford, England and Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as one of editorial board members for International Journal of Information Technology, World Enformatika Society, a reviewer of a special issue on International Journal of Parallel and Distributed Databases, Elsevier, 2004, a special issue on International Journal of Cluster Computing, Kluwer, 2004, IEEE conference on Cluster and Grid Computing, held in Chicago, April, 2004, and Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences/workshops including Grid and Peer-to-Peer Computing, (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.