RECENT ADVANCES on DATA NETWORKS, COMMUNICATIONS, COMPUTERS

A modified C-means clustering algorithm

FARAJ A. EL-MOUADIB 1, ZAKARIA SULIMAN ZUBI 2, HALIMA S. TALHI 3

1 Computer Science Department, Faculty of Information Technology, Garyounis University, Benghazi, Libya
2 Computer Science Department, Faculty of Science, Altahadi University, Sirte, Libya
3 Great Man-Made River Project, Benghazi, Libya
< [email protected]>

Abstract: - Huge amounts of data have been collected over the past three decades or so. The availability of such data, and the imminent need to transform it into useful knowledge, is the concern of the field of Knowledge Discovery in Databases (KDD). The most essential step in KDD is the Data Mining (DM) step, which is the engine for finding the implicit knowledge in the data. DM tasks can be classified into two types, namely predictive and descriptive, according to the sought functionality. One of the old and well-studied concepts in data mining is cluster analysis. In general, clustering methods can be either hierarchical or partitioning. One of the best-known clustering methods is C-means. In this paper, the focus is on cluster analysis in general and on the partitioning method C-means in particular. Our attention is on modifying the way the C-means algorithm calculates the means of the clusters: we take the mean of a cluster to be an actual object instead of an imaginary point in the cluster. Our modified C-means algorithm is implemented in a system developed in the Visual Basic.NET programming language, and the system is tested in a number of experiments on several data sets. Our study concludes with an analysis and discussion of the experiments' results on the basis of several criteria.
Key words: C-means, Cluster analysis, Data Mining (DM), Knowledge Discovery in Database (KDD), Clustering algorithm, Fuzzy clustering.

1 Introduction

Advances in computer technology and in data acquisition tools and techniques have resulted in the generation of terabytes of data. These huge amounts of data are stored in databases or other repository systems, and their massive volume has exceeded the capacity of humans and of traditional data analysis tools to transform them into useful knowledge. The knowledge implicit in these volumes of data necessitates the invention of new automatic data analysis tools and techniques to discover it in one form or another. These tools and techniques have been developed in a new emerging field known as Knowledge Discovery in Databases (KDD). KDD is also known by other names, such as data mining, knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging [4]. Many people treat the term Data Mining (DM) as a synonym for KDD, while others view DM as one step in the KDD process, as depicted in fig.1. DM is the extraction of useful, interesting knowledge (in the form of patterns or regularities) from large data sets, databases or other repository systems [5].

ISSN: 1790-5109


This figure is borrowed from [3].

ISBN: 978-960-474-134-2


[Fig.1: the KDD process pipeline: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Evaluation → Knowledge.]
Fig.1: The KDD process.

According to [4], the KDD process consists of:
1. Selection of the data relevant to the current mining task, making it ready for analysis.
2. Pre-processing, to clean the data from noise, outliers or irrelevant data.
3. Data mining, to apply intelligent methods in order to extract patterns or regularities from the data.
4. Pattern evaluation, to identify the truly interesting patterns and/or regularities via some interestingness measures.
5. Knowledge presentation, to present the mined knowledge to the user.
KDD is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [10]. The DM field is a confluence of many disciplines, such as database systems, data warehousing, statistics, machine learning and other areas. Data mining can be applied to a wide variety of problems. Recently, KDD and DM have become one of the most active research areas and have attracted the attention of many researchers. Many successful applications have been reported from different sectors such as marketing, finance, banking, manufacturing, security, medicine, multimedia, education, telecommunications, etc. [10].

2 Data mining tasks

Generally, DM has many tasks, which are classified into two main categories, namely predictive and descriptive. Predictive mining tasks formulate a model described by the current data set; the model is then used to make predictions for new data objects. Classification and prediction are examples of predictive mining tasks. Descriptive mining tasks characterize the general properties of the current data set. Characterization, discrimination, cluster analysis, outlier analysis and association analysis are examples of descriptive mining tasks.

3 Cluster analysis

Cluster analysis is the process of partitioning data objects into groups or clusters, so that objects in one group are as similar to each other as possible, while objects in different clusters are as dissimilar as possible. Cluster analysis has long been in use in many fields, such as biology, chemistry and botany. In the past, clustering was conducted in a subjective manner and researchers relied on their perception and judgment in interpreting the results; this was quite easy due to the low dimensionality (at most 2-D) and small size of the data. Nowadays, the old clustering methods are no longer suitable, due to the increase in the extent and/or dimensionality of the data to be clustered. Automatic classification of data is a very young scientific discipline in vigorous development [6]. It has established itself as an independent scientific discipline, as witnessed by the thousands of articles scattered over many periodicals, such as journals of statistics, biology, psychometrics, computer science and marketing. Since the mid 80's, cluster analysis has been recognized as an independent discipline with dedicated scientific periodicals and bodies, such as the Journal of Classification and the International Federation of Classification Societies. Computer scientists consider cluster analysis a branch of pattern recognition and artificial intelligence.

3.1 Cluster analysis methods
Generally speaking, most clustering methods can be classified into one of two


categories: partitioning methods or hierarchical methods.

3.2 Partitioning clustering methods
A partitioning clustering method constructs k mutually exclusive clusters out of n data objects. The partitioning method must satisfy the following conditions:
1. Each cluster must contain at least one data object.
2. Each data object must belong to exactly one cluster.
The first condition implies that there are at most as many clusters as there are objects: k ≤ n. The second condition states that two different clusters cannot have any objects in common. Partitioning algorithms create clusters of data objects in such a way that objects within a cluster have high similarity (intra-cluster similarity) and objects in different clusters have high dissimilarity (inter-cluster dissimilarity), as depicted in fig.2.

[Fig.2: two groups of points, with intra-cluster links drawn within each group and inter-cluster links drawn between the groups.]
Fig.2: Intra-cluster and inter-cluster similarities.

The relocation algorithms of the partitioning technique start by picking representative objects (the number of representative objects depends on the number of clusters) as seeds of clusters, and then assign each data object to its nearest cluster. The density-based algorithms of the partitioning technique begin with an arbitrary selection of an object from the data set and retrieve all density-reachable objects from it. An object of the data set is called directly density-reachable from the selected object if it satisfies the following conditions:
1. The distance between the object and the selected object is less than or equal to E-neighborhood (the maximum distance between the selected object and any other object within the set of objects).
2. The number of density-reachable objects from the selected object (the core object) is greater than or equal to MinPts (the minimum number of objects that satisfy condition 1 above).
The parameters E-neighborhood and MinPts are specified by the user. When the above conditions are satisfied, a cluster is formed.

3.3 Hierarchical clustering methods
Hierarchical methods differ from partitioning methods in that all values of k (the number of clusters) are present in the constructed dendrogram (tree-like hierarchy): between the partition with k = 1 (all objects together in one cluster) and the partition with k = n (singleton clusters), all values k = 2, 3, ..., n − 1 occur. There are two types of hierarchical techniques, namely agglomerative and divisive, as depicted in fig.3. These two techniques construct the tree-like hierarchy in opposite directions.

3.3.1 The agglomerative technique
Starting with n singleton clusters, in each step two of the clusters are combined into one cluster, until all objects are merged into one cluster with n data objects.


3.3.2 The divisive technique
Starting with one cluster containing all n data objects, in each step one of the clusters is split into two clusters, until there are n singleton clusters.
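The two conditions for direct density-reachability given in section 3.2 can be sketched in a few lines of Python. This is an illustrative sketch only: `eps` (the E-neighborhood radius), `min_pts` and the function names are our own choices, not from the paper.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (its E-neighborhood)."""
    px = points[i]
    return [j for j, q in enumerate(points)
            if j != i and math.dist(px, q) <= eps]

def is_core_object(points, i, eps, min_pts):
    """Conditions 1 and 2: points[i] is a core object when at least min_pts
    other objects lie within radius eps of it."""
    return len(neighbors(points, i, eps)) >= min_pts
```

A density-based algorithm would grow a cluster outward from each core object found this way.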

[Fig.3: dendrogram construction; agglomerative: step 0 (5 clusters) through step 4 (1 cluster); divisive: step 0 (1 cluster) through step 4 (5 clusters).]
Fig.3: Agglomerative and divisive techniques.

4 K-means methods

Here, we give some details of the K-means methods, which are of the relocation partitioning type. This type of clustering depends on the calculation of the mean of each cluster in the process of assigning objects to clusters. We focus on the K-means methods because they are the main concern of this work. The K-means methods can be of the hard clustering type (called Crisp C-means) or of the soft clustering type (called Fuzzy C-means).

The Crisp C-means clustering algorithm is a widely used and effective method for finding clusters in data, where each data object belongs to one and only one cluster. According to [7, 8, 11], the Crisp C-means clustering algorithm was first proposed by MacQueen in 1967. The steps of the Crisp C-means clustering algorithm are as follows:
Step 1. The user provides the number of clusters k to be sought.
Step 2. Randomly choose k objects as seeds (representatives) of the k clusters.
Step 3. Assign each of the remaining objects in the data set to the nearest representative object. Repeat this process until all objects are assigned to clusters, so that we end up with k clusters: C1, C2, ..., Ck. These k clusters are called the initial clustering.
Step 4. Calculate the mean of each cluster and update the location of each object: each object is reassigned to the cluster whose mean is at minimum distance from it.
Step 5. Repeat step 4 until some termination condition is met.

The creation of the initial clustering depends on the concept of nearness, which is measured by the Euclidean distance:

d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²)

This formula corresponds to the true geometrical distance between the objects i and j with coordinates (xi1, ..., xip) and (xj1, ..., xjp), where p is the number of variables (or levels) describing the objects.

The standard Crisp C-means clustering algorithm can be written as follows:
Input: A data set, k: number of clusters, n: number of data objects, k ≤ n.
Output: Clusters

While (true) do: [Outer loop]
  if y < 2 then (y is the iteration number) [Check]
    Randomize() [Initialize]
    for I ← 1 to k [Loop]
      Rnd() ∗ Val(n − 1) (choose seeds)
    End for
    Clustering() (create initial clustering)
  Else
    Means() [Calculate the means]
  End if
  Clustering() (relocate objects) [Create the clusters]
  if BCV/WCV (true) then [Test clustering quality]
    Exit while (best clustering for this data set found)
  Else [Repeat]
    Set y ← y + 1 [Iterations counter]
  End if
End while
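The standard algorithm above can be sketched in a few lines of Python. This is our sketch under the steps described in this section, not the paper's Visual Basic.NET implementation; all names are ours.

```python
import math
import random

def euclidean(a, b):
    # d(i, j) = sqrt(sum over p of (x_ip - x_jp)^2), as in the formula above
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def crisp_c_means(objects, k, max_passes=100, seed=0):
    """Steps 1-5 of the standard Crisp C-means: random seeds, nearest-mean
    assignment, mean recalculation, repeated until no object changes cluster."""
    rng = random.Random(seed)
    means = [list(o) for o in rng.sample(objects, k)]   # Step 2: k random seeds
    labels = [0] * len(objects)
    for _ in range(max_passes):
        # Steps 3-4: assign each object to the cluster with the nearest mean
        new_labels = [min(range(k), key=lambda c: euclidean(o, means[c]))
                      for o in objects]
        if new_labels == labels:
            break                # Step 5: terminate when assignments stabilise
        labels = new_labels
        # Recalculate the mean of each cluster (in general an imaginary point)
        for c in range(k):
            members = [o for o, l in zip(objects, labels) if l == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means
```

On two well-separated groups of points, the sketch recovers the expected two-cluster partition regardless of which seeds are drawn.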

5 Modified C-means algorithm

In the standard Crisp C-means clustering algorithm, after the creation of the initial clustering, the mean of each cluster is calculated and some changes take place in the location of objects in the clusters. The mean of a cluster can be an actual object in the cluster or an imaginary point in the cluster. Our modification concerns how the mean of a cluster is determined: after calculating the mean as in the standard C-means algorithm, the object closest to that mean is taken to be the mean of the cluster (the medoid). Thus, the mean of a cluster is an actual object instead of an imaginary point, which yields a substantial reduction in calculation and time in subsequent iterations of the algorithm. The modification is accomplished by adding one step (designated by *) to the standard C-means algorithm, which assigns the closest object (the medoid) to be the mean of the cluster. Our modified Crisp C-means clustering algorithm is as follows:
Input: A data set, k: number of clusters, n: number of data objects, k ≤ n.
Output: Clusters

While (true) do: [Outer loop]
  if y < 2 then (y is the iteration number) [Check]
    Randomize() [Initialize]
    for I ← 1 to k [Loop]
      Rnd() ∗ Val(n − 1) (choose seeds)
    End for
    Clustering() (create initial clustering)
  Else
    Means() [Calculate the means]
  End if
  Clustering() (relocate objects) [Create the clusters]
* Medoids() (find the closest object to the calculated mean) [Find the medoid]
  Clustering() (considers the mean of a cluster to be an object) [Create the clusters]
  if BCV/WCV (true) then [Test clustering quality]
    Exit while (best clustering for this data set found)
  Else [Repeat]
    Set y ← y + 1 [Iterations counter]
  End if
End while
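The added (*) step amounts to snapping each calculated mean to the nearest actual cluster member. A minimal sketch, with our own naming rather than the paper's implementation:

```python
import math

def medoid(members, mean):
    """The added (*) step: replace the calculated, possibly imaginary mean
    with the actual cluster member closest to it."""
    return min(members, key=lambda o: math.dist(o, mean))
```

Working with a stored object rather than freshly computed coordinates is what the paper credits for the reduced calculation in subsequent iterations.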

6 Design and implementation

Here, we demonstrate the design and implementation of our clustering system, the Clustering Software System (CSS), which consists of two sub-systems: the Crisp Clustering Software (CCS) and the Modified Crisp Clustering Software (MCCS). These sub-systems are independent of each other, so each can be executed on its own. We have used the Unified Modeling Language (UML) [3, 9, 12, 14] in the analysis and design phases, and the Visual Basic.NET 2005 programming language in the actual implementation of the system.

We show only the activity diagram, the class diagram and the sequence diagram of the MCCS sub-system, because of their importance and their ability to give a clear view of the system. Fig.4 depicts the activity diagram for the MCCS sub-system, which shows the system processes and helps to define the sequence of tasks, the needed conditions and the concurrency of the system.

Fig.4: Activity diagram for the MCCS sub-system.

The class diagram gives an overview of a system by describing the attributes and operations of each of its classes, and the various types of static relationships among those classes. Fig.5 depicts the class diagram of the MCCS sub-system, which includes the classes and the relationships between them.

One method is common to all classes: New (the constructor), which initializes an object's parameters when it is created. The Data class establishes the appropriate connection with the database, reads its content, executes SQL commands and displays the data. The Main window class shows which sub-system to execute. The Logo class is the starting point for launching the desired sub-system. The Crisp and MCrisp classes have the same methods but perform different tasks depending on the chosen sub-system: they read the actual data from the database, update it after some operations have been carried out, and display, save and print the results once the sought clustering has been performed.

[Fig.5: class diagram showing the Logo, Main window, Crisp, MCrisp and Data classes and the relationships between them.]
Fig.5: The class diagram of the MCCS sub-system.

The sequence diagram describes the behaviour of the system and depicts the messages passed between the objects during system execution. Fig.6 depicts the sequence diagram for the MCCS sub-system.

[Fig.6: sequence diagram over the Logo, Main window, MCrisp, Data and Main lifelines, with messages 1: New(), 2: Showdialog(), 3: New(), 4: Showdialog, 5: Dispose(), 6: New(), 7: Open(), 8: ReadDB(), 9: Return data, 10: ExecuteNonQuery(), 11: New(), 12: Count(), 13: Return(), 14: Display(), 15: New(), 16: Fill(), 17: Clustering(), 18: Update(), 19: Clus_ResDisy(), 20: Save_ResDisy(), 21: Print_ResDisy().]
Fig.6: The sequence diagram for the MCCS sub-system.

7 Experiments and results

Here, we present the results obtained from the experiments conducted on our system, together with their interpretation and comparison. A number of real data sets that are commonly used to test data mining systems and algorithms are used to test our system's performance. All the experiments were performed on a laptop computer with a 1.73GHz Pentium IV processor, 1GB of RAM, a 100MB hard disk and the Microsoft Windows XP Home Edition operating system.

7.1 First experiment
This experiment is concerned with a well-known data set called the Wisconsin breast cancer database [13, 15], which represents a reasonably complex problem with 9 continuous input attributes and two possible output classes. William H. Wolberg of the University of Wisconsin hospitals donated


this data set. The data set consists of 683 instances, and the problem is to determine whether the evidence indicates a benign or a malignant neoplasm. This experiment is set to form two clusters (k = 2): one for benign and one for malignant. Since the representative objects are chosen randomly, the experiment is run fifteen times to get the best possible results. The results of the fifteen runs for the CCS sub-system are depicted in table-1, and the results of the fifteen runs for the MCCS sub-system are depicted in table-2.

Table-1: Fifteen runs for the CCS sub-system.

Test run | Number of passes | Time (ns) | Number of misclassified objects
1 | 4 | 223491 | 29
2 | 4 | 213208 | 27
3 | 5 | 232676 | 26
4 | 3 | 214954 | 26
5 | 6 | 245931 | 26
6 | 5 | 233921 | 26
7 | 5 | 233947 | 26
8 | 5 | 214965 | 29
9 | 4 | 213526 | 30
10 | 4 | 234369 | 26
11 | 3 | 205239 | 32
12 | 3 | 205012 | 27
13 | 4 | 212980 | 28
14 | 8 | 266060 | 26
15 | 6 | 243232 | 26

Table-2: Fifteen runs for the MCCS sub-system.

Test run | Number of passes | Time (ns) | Number of misclassified objects
1 | 2 | 204653 | 28
2 | 3 | 215955 | 28
3 | 2 | 205469 | 28
4 | 4 | 208089 | 35
5 | 3 | 213460 | 28
6 | 3 | 212530 | 28
7 | 3 | 198910 | 33
8 | 3 | 220886 | 28
9 | 4 | 210707 | 33
10 | 2 | 204507 | 28
11 | 3 | 213027 | 28
12 | 4 | 223210 | 28
13 | 3 | 216900 | 28
14 | 2 | 203736 | 28
15 | 2 | 214216 | 28

From the obtained results, and as depicted graphically in Fig.7 and Fig.8, the performance of the C-means algorithm after our modification improved over the original version. On average, the MCCS algorithm was better than the CCS algorithm by 6.69% as far as time is concerned, and by 37.68% as far as the number of passes is concerned.

[Fig.7: bar chart of the average number of passes per sub-system: CCS 4.6 vs MCCS 2.866.]
Fig.7: Average number of passes of sub-systems.


[Fig.8: bar chart of the average clustering time per sub-system in nanoseconds: CCS 226,234 vs MCCS 211,084.]
Fig.8: Time of sub-systems in nanoseconds.

7.2 Second experiment
This experiment is concerned with a well-known data set called the BUPA liver disorders data set [1]. This database was created by BUPA Medical Research Ltd. and donated by Richard S. Forsyth. The data set represents blood tests that are thought to measure sensitivity to liver disorders that might arise from excessive alcohol consumption. The database consists of 345 objects, each of which is described by seven variables. This experiment is set to form two clusters (k = 2). Since the representative objects are chosen randomly, the experiment is run fifteen times to get the best possible results. Table-3 lists the final results of the fifteen runs of the CCS sub-system in detail, and table-4 depicts the results of the fifteen runs for the MCCS sub-system.

Table-3: Results of the CCS sub-system with the BUPA liver disorders data set experiment.

Test run | Number of passes | Time (ns) | Number of misclassified objects
1 | 10 | 58385 | 188
2 | 10 | 57695 | 190
3 | 10 | 58733 | 190
4 | 8 | 47710 | 190
5 | 8 | 47710 | 190
6 | 9 | 52693 | 190
7 | 10 | 58683 | 190
8 | 10 | 58602 | 190
9 | 9 | 52928 | 190
10 | 9 | 53211 | 190
11 | 11 | 63722 | 190
12 | 10 | 58729 | 190
13 | 4 | 30711 | 190
14 | 2 | 31516 | 138
15 | 10 | 59296 | 190
Average | 8.67 | 52688.27 | 186.4

Table-4: Results of the MCCS sub-system with the BUPA liver disorders data set experiment.

Test run | Number of passes | Time (ns) | Number of misclassified objects
1 | 2 | 24965 | 183
2 | 7 | 46217 | 183
3 | 6 | 42086 | 183
4 | 4 | 33620 | 183
5 | 3 | 29423 | 183
6 | 2 | 18548 | 196
7 | 3 | 29505 | 183
8 | 2 | 25865 | 183
9 | 2 | 32201 | 135
10 | 2 | 29358 | 171
11 | 5 | 37243 | 167
12 | 5 | 37995 | 183
13 | 3 | 26792 | 191
14 | 2 | 26907 | 155
15 | 4 | 33378 | 183

From the results in table-3 and table-4, and as depicted graphically in Fig.9 and Fig.10, the performance of the modified C-means algorithm improved over the original version. On average, the MCCS algorithm outperformed the CCS algorithm by about 40.00% as far as time is concerned, and by 60.00% as far as the number of passes is concerned.

[Fig.9: bar chart of the average number of passes per sub-system: CCS 8.67 vs MCCS 3.47.]
Fig.9: Average number of passes for the BUPA liver disorders data set.
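As a sanity check, the reported improvements in the number of passes can be reproduced from the per-run values transcribed above; the short Python sketch below uses our own names.

```python
def improvement(ccs, mccs):
    """Percentage by which the MCCS average falls below the CCS average."""
    avg = lambda xs: sum(xs) / len(xs)
    return (1 - avg(mccs) / avg(ccs)) * 100

# Experiment 1 (Wisconsin breast cancer): passes per run from tables 1 and 2
p1_ccs  = [4, 4, 5, 3, 6, 5, 5, 5, 4, 4, 3, 3, 4, 8, 6]
p1_mccs = [2, 3, 2, 4, 3, 3, 3, 3, 4, 2, 3, 4, 3, 2, 2]
print(round(improvement(p1_ccs, p1_mccs), 2))   # ≈ 37.68, as reported

# Experiment 2 (BUPA): average passes 8.67 (CCS) vs 3.47 (MCCS)
print(round(improvement([8.67], [3.47]), 2))    # ≈ 59.98, i.e. about 60.00
```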

[Fig.10: bar chart of the average clustering time per sub-system in nanoseconds for the BUPA liver disorders data set: CCS 52,688 vs MCCS 31,607.]
Fig.10: Time of sub-systems for the BUPA liver disorders data set in nanoseconds.

8 Conclusion

The objectives of this work were to implement a modified version of the C-means algorithm in a system developed in the Visual Basic.NET programming language, and to test the modified version on two well-known databases using criteria such as time and number of passes. The results obtained from the experiments are summarized in the following points:
1. The experimental results of the two sub-systems were identical to the original results that came with the data.
2. The MCCS sub-system outperformed the CCS sub-system on the comparison criteria.
From the experience gained while carrying out this work, the authors would like to make the following recommendations for further research:
1. More experiments with different data sizes are needed because of the nature of cluster analysis.
2. Other comparison criteria should be sought to certify our findings.

References:

1. Asuncion, A. and Newman, D. J., UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, School of Information and Computer Science, 2007.
2. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., From Data Mining to Knowledge Discovery in Databases, American Association for Artificial Intelligence, 1996, Pages 1 – 54.
3. Fowler, M. and Scott, K., UML Distilled: A Brief Guide to the Standard Object Modeling Language, Addison Wesley, USA, 2000, Pages 1 – 339.
4. Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Canada, 2000, Pages 6 – 21.
5. Hand, D., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, Cambridge, Massachusetts, 2001, Pages 2 – 21.
6. Kaufman, L. and Rousseeuw, P., Finding Groups in Data, John Wiley & Sons, Inc., USA, 1990, Pages 1 – 66.
7. Kogan, J., Introduction to Clustering Large and High-Dimensional Data, Cambridge University Press, USA, 2007, Pages 9 – 37.
8. Larose, D., Discovering Knowledge in Data, John Wiley & Sons, Inc., New Jersey, 2005, Pages 2 – 23.
9. Loton, T., McNeish, K., Schoellmann, B., Slater, J. and Wu, C., Professional UML with Visual Studio .NET: Unmasking Visio for Enterprise Architects, Wrox Press Ltd., United Kingdom, 2002, Pages 1 – 343.
10. Mitra, S. and Acharya, T., Data Mining Multimedia, John Wiley & Sons, Inc., New Jersey, 2003, Pages 1 – 28.
11. Miyamoto, S., Ichihashi, H. and Honda, K., Algorithms for Fuzzy Clustering Methods, Springer-Verlag, Berlin Heidelberg, 2008, Pages 16 – 39.
12. Pender, T., UML Weekend Crash Course, Wiley Publishing Inc., Indianapolis, Indiana, 2002, Pages 1 – 224.
13. Taha, I. and Ghosh, J., Characterization of the Wisconsin Breast Cancer Database Using a Hybrid Symbolic-Connectionist System, Tech. Rep. UT-CVIS-TR-97-007, University of Texas, Austin, 1997.


14. Weilkiens, T., Systems Engineering with SysML/UML: Modeling, Analysis, Design, Morgan Kaufmann Publishers, USA, 2007, Pages 1 – 320.
15. Wolberg, W. H. and Mangasarian, O. L., Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proceedings of the National Academy of Sciences, U.S.A., 87, 1990, Pages 9193 – 9196.
