MATLAB GUI Package for Comparing Data Clustering Algorithms
Anirban Mukhopadhyay1, Sudip Poddar2
1 Department of Computer Science and Engineering, University of Kalyani, Kalyani-741235, [email protected]
2 Advanced Computing and Microelectronics Unit, Indian Statistical Institute, Kolkata 700108, [email protected]
Abstract. The result of one clustering algorithm can differ substantially from that of another on the same input dataset, because the other input parameters of an algorithm can strongly affect its behavior and execution. Cluster validity indices measure the goodness of a clustering solution. Cluster validation is a very important issue in cluster analysis because the result of clustering needs to be validated in most applications. In most clustering algorithms, the number of clusters is set as a user parameter, and there are a number of approaches for finding the best number of clusters. Validity measures can be used to find the partitioning that best fits the underlying data, that is, to determine how good the clustering is. This paper describes an application (CLUSTER), developed in the MATLAB GUI environment, that serves as an interface between the user and the results of various clustering algorithms. The user selects the algorithm, internal validity index, external validity index, number of clusters, number of iterations, etc. from the active windows. The package compares the results of k-means, fuzzy c-means, hierarchical clustering, and multiobjective clustering with support vector machine (MocSvm). The interface allows the user to easily assess the goodness of a clustering solution and immediately see the differences between the algorithms graphically. The package is implemented using the MATLAB (R2008a) Graphical User Interface.
Keywords. Clustering; Validity Index; MATLAB; Graphical User Interface; CAD; Interface
1 Introduction
A graphical user interface provides the user with a familiar environment for an application. This environment contains pushbuttons, toggle buttons, lists, menus, text boxes, and so forth, all of which are already familiar to the user, so that he or she can concentrate on using the application rather than on the mechanics involved in doing things. However, GUIs are harder for the programmer, because a GUI-based program must be prepared for mouse clicks (or possibly keyboard input) on any GUI element at any time. Such inputs are known as events, and a program that responds to events is said to be event driven. The three principal elements required to create a MATLAB Graphical User Interface are:
1. Components: Each item on a MATLAB GUI (pushbuttons, labels, edit boxes, etc.) is a graphical component. The types of components include graphical controls (pushbuttons, edit boxes, lists, sliders, etc.), static elements (frames and text strings), menus, and axes.
2. Figures: The components of a GUI must be arranged within a figure, which is a window on the computer screen.
3. Callbacks: Finally, there must be some way to perform an action when a user clicks on a button with the mouse or types information on the keyboard. A mouse click or a key press is an event, and the MATLAB program must respond to each event if the program is to perform its function. For example, if a user clicks on a button, that event must cause the MATLAB code that implements the function of the button to be executed. The code executed in response to an event is known as a callback. There must be a callback to implement the function of each graphical component on the GUI.
Clustering is a popular unsupervised pattern classification technique that partitions the input space of n objects into K regions based on some similarity/dissimilarity measure. The value of K may or may not be known a priori. The output of a clustering technique is a K × n matrix $U = [U_{ki}]$, where $U_{ki}$ denotes the membership degree of the ith object in the kth cluster. For crisp clustering, $U_{ki} \in \{0, 1\}$, and for fuzzy clustering, $0 \le U_{ki} \le 1$.
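The distinction between crisp and fuzzy membership matrices can be illustrated with a small sketch. The package itself is written in MATLAB; the Python below is only an illustration, and the helper names (`crisp_membership`, `is_crisp`, `is_fuzzy`, `column_sums`) and the sample values are hypothetical. Labels are 0-based here, whereas the text indexes clusters from 1. The check that each object's memberships sum to 1 is a common convention for partition matrices and is an assumption, not something stated above.

```python
def crisp_membership(labels, k):
    """Build a K x n crisp membership matrix from n labels in 0..k-1."""
    n = len(labels)
    u = [[0] * n for _ in range(k)]
    for i, label in enumerate(labels):
        u[label][i] = 1  # object i belongs to exactly one cluster
    return u


def is_crisp(u):
    """True if every entry is exactly 0 or 1."""
    return all(v in (0, 1) for row in u for v in row)


def is_fuzzy(u):
    """True if every entry lies in the closed interval [0, 1]."""
    return all(0.0 <= v <= 1.0 for row in u for v in row)


def column_sums(u):
    """Total membership of each object across all clusters."""
    return [sum(col) for col in zip(*u)]


# Four objects assigned to three clusters (hypothetical data).
labels = [0, 0, 1, 2]
U_crisp = crisp_membership(labels, 3)

# A fuzzy partition of the same four objects (hypothetical degrees).
U_fuzzy = [
    [0.7, 0.6, 0.1, 0.2],
    [0.2, 0.3, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

print(U_crisp)
print(is_crisp(U_crisp), is_crisp(U_fuzzy), is_fuzzy(U_fuzzy))
```

Each row of the matrix corresponds to a cluster and each column to an object, matching the K × n layout described above. A crisp matrix also satisfies the fuzzy range condition, so crisp clustering can be viewed as a special case of fuzzy clustering.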
In this article we present a MATLAB GUI package called CLUSTER that implements different clustering algorithms and computes the values of different cluster validity indices. The results are presented to the user in graphical and tabular forms. The package implements the k-means, fuzzy c-means, hierarchical clustering, and multiobjective clustering with support vector machine (MocSvm) algorithms. The rest of the article is organized as follows. The next section gives short descriptions of the clustering algorithms included in the package. Cluster validity indices are described in Section 3. The use of the application package (CLUSTER) is demonstrated in Section 4. Finally, Section 5 concludes the article.
2 Clustering Algorithms
In this paper, the following clustering algorithms have been implemented.
2.1 K-Means
K-means [1][2] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The algorithm aims at minimizing an objective function, in this case a squared error function:
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| X_i^{(j)} - C_j \right\|^2 \tag{2.1.1}$$
where $\| X_i^{(j)} - C_j \|$ is a chosen distance measure between a data point $X_i^{(j)}$ and the cluster center $C_j$. The objective $J$ is an indicator of the distance of the n data points from their respective cluster centers. K-means minimizes the global cluster variance $J$ to maximize the compactness of the clusters. It has been shown that the k-means algorithm may converge to solutions that are not optimal.
2.2 Fuzzy C-Means
Fuzzy C-means (FCM) [3][4] is a clustering method that allows one data point to belong to two or more clusters with different membership degrees. This method is frequently used in pattern recognition. It is based on minimization of the following objective function:
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| X_i - C_j \right\|^2,$$
1≤m