Quality and Performance Evaluation of the Algorithms KMART and FCM for Fuzzy Clustering and Categorization

Michal Rojček*, Igor Černák* and Róbert Janiga*
*
Department of Informatics, Faculty of Education, Catholic University in Ružomberok, Hrabovská cesta 1A, 034 01 Ružomberok, Slovakia
[email protected] [email protected] [email protected]

Abstract — In this paper we present a comparison of two fuzzy algorithms for clustering text documents. The first, KMART, is a neural network based on the Fuzzy ART network; the second, Fuzzy C-Means, is based on the K-Means algorithm. The paper compares the quality and performance of both algorithms on a set of real, contextually similar text documents that fall into several categories at the same time.
I. INTRODUCTION

With the evolution of computers and the spread of informatization, various and diverse documents have been converted into electronic form. The Internet brought new challenges to this area: information is available at much lower cost and is distributed very quickly. With the growth of information it is becoming increasingly difficult to find relevant information on the web, which is a dynamic, growing and changing space. It is therefore important to look for mechanisms for the rapid retrieval of relevant information, its categorization and filtering.

Fuzzification has brought new possibilities to many technical areas. It has also affected artificial neural networks based on Adaptive Resonance Theory (ART). It has been shown that modifications of the original Fuzzy ART neural network can achieve excellent results in clustering and categorization of text documents [1]–[5]. One of these modifications was selected and experimentally compared with a conventional fuzzy clustering algorithm, in order to highlight the qualitative as well as the performance potential of these networks in clustering and categorizing text documents.

II. FUZZY C-MEANS ALGORITHM

One of the most commonly used fuzzy clustering algorithms is Fuzzy C-Means (FCM), designed by Dunn in 1973 [6] and improved by Bezdek in 1981 [7]. Given a finite set of data, the FCM algorithm tries to divide the collection of n elements X = {x_1, ..., x_n} into a collection of c fuzzy clusters while preserving given criteria. The algorithm returns a list of c cluster centers C = {c_1, ..., c_c} and a partition matrix W = w_{ij} ∈ [0,1], i = 1, ..., n, j = 1, ..., c, where w_{ij} expresses the extent to which the element x_i belongs to the cluster c_j. Like the k-means algorithm, FCM aims to minimize the objective function:

    \arg\min_{C} \sum_{i=1}^{n} \sum_{j=1}^{c} w_{ij}^{m} \| x_i - c_j \|^2 ,    (1)

where

    w_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}}} .    (2)
The objective function differs from that of k-means through the membership degrees w_{ij} and the added weighting exponent (fuzzifier) m ∈ R, m ≥ 1. The weighting exponent m determines the level of fuzzification of the clusters. For m = 1 the membership values w_{ij} converge to 0 or 1, i.e. to a crisp partition. If no experiments with the given domain have been carried out, or no knowledge of the domain is available, the value of m is normally set to 2. Each point x has a set of coefficients w_k(x) giving its degree of membership in the k-th cluster. In fuzzy c-means the cluster centroid is computed as the mean of all points, weighted by their degree of membership in the cluster:
    c_k = \frac{\sum_{x} w_k(x)^m \, x}{\sum_{x} w_k(x)^m} .    (3)
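As an illustration, formulas (1)–(3) can be combined into a minimal FCM sketch in Python with NumPy (the function and parameter names are ours, not from the paper):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal Fuzzy C-Means following Eqs. (1)-(3).

    X: (n, d) data matrix, c: number of clusters,
    m: fuzzifier (m > 1), eps: convergence threshold on W.
    Returns centroids C (c, d) and membership matrix W (n, c).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial membership matrix; each row sums to 1.
    W = rng.random((n, c))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Eq. (3): centroids as membership-weighted means.
        Wm = W ** m
        C = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Eq. (2): memberships from distances to centroids.
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        inv = np.fmax(d, 1e-12) ** (-2.0 / (m - 1.0))
        W_new = inv / inv.sum(axis=1, keepdims=True)
        # Stop when memberships change less than eps between iterations.
        if np.abs(W_new - W).max() < eps:
            W = W_new
            break
        W = W_new
    return C, W
```

For two well-separated groups of points the memberships converge close to 0 or 1, as the text above notes for crisp-like partitions.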
The fuzzy c-means algorithm is very similar to the k-means algorithm [8]. FCM has the following steps:
1. Select the number of clusters.
2. Assign cluster-membership coefficients to each point at random.
3. Repeat until the algorithm converges (i.e. the change in the coefficients between two iterations does not exceed the sensitivity threshold ε):
a. Compute the centroid of each cluster using formula (3).
b. For each point, compute the cluster-membership coefficients using formula (2).

One disadvantage of FCM is that the initial centroids are chosen at random, so with the same parameters we always obtain differently arranged clusters. This also affects the final quality of the clustering, which differs significantly with each run of the algorithm. Another disadvantage is the need to specify the number of clusters at the start, which can be a problem if the input data set is unknown. However, FCM is a computationally fast algorithm (see the experiment below) and in practice it is also applied to fuzzy clustering of text documents [9], [10]. Other recent applications of FCM can be found in [11]–[13].

III. KMART NEURAL NETWORK

In [2], a variation of the Fuzzy ART neural network algorithm was proposed that permits fuzzy clustering. The system is called KMART (Kondadadi & Kozma Modified ART).

A. Creating of clusters in the KMART system

To create clusters, KMART uses a modified version of the Fuzzy ART network. Instead of choosing the category with maximum similarity and then using the vigilance test to verify that the category is close enough to the input pattern, every category in the F2 layer is checked with the vigilance test. If a category passes the vigilance test, the input document is placed into that category. The similarity measure computed in the vigilance test defines the degree of membership of the input pattern in the current cluster. This allows a document to belong to several clusters with different degrees of membership. All prototypes that pass the vigilance test are updated by the learning rule (7). This modification provides additional benefits of the fuzzy clustering already mentioned:
Fuzzy ART is usually time-demanding, since it requires an iterative search for the winning category that satisfies the vigilance test. In the described modification this search is not necessary, because every F2 node is checked anyway. This makes the model less computationally demanding.

Another advantage is that the category-selection step was removed, avoiding the choice parameter α. This reduces the number of user-defined parameters of the system. The modification does not violate the basic principle of ART networks, i.e. it avoids the stability–plasticity dilemma. KMART is still an incremental clustering algorithm: before learning a new input it checks the input pattern and learns it only if it matches one of the stored patterns within a certain tolerance.
B. Algorithm of KMART network learning

1. Read the new input vector (document) I containing binary or analog components. Let I := [next input vector].
2. Calculate the output value y(j) for all neurons (the degree of membership of the document in category j):

    y(j) := \frac{| I \wedge w_j |}{| I |} ,    (4)

where ∧ is the fuzzy AND operator, defined as (x ∧ y)_i = min(x_i, y_i).
3. Store the calculated value y(j) in the membership map at the position of the currently processed category j (j ≥ 1) and document doc (doc ≥ 1):

    map(j, doc) := y(j) .    (5)

4. Vigilance test: if

    y(j) ≥ ρ ,    (6)

go to step 5, else go to step 6.
5. Update the winner neuron (learning rule):

    w_j^{(new)} := \beta \left( I \wedge w_j^{(old)} \right) + (1 - \beta) \, w_j^{(old)} .    (7)

6. Return: go back to step 2 while j ≤ the maximum number of categories; otherwise go back to step 1. If there is no further vector (document) in the sequence, or w^{(new)} = w^{(old)}, stop.
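The learning procedure above can be sketched in Python as follows. Note two assumptions of ours that the paper does not spell out: the rule for committing a new category when no prototype passes the vigilance test (following standard Fuzzy ART practice), and all function and variable names:

```python
import numpy as np

def kmart(docs, rho=0.61, beta=1.0, max_categories=50):
    """Sketch of KMART learning, steps 1-6, Eqs. (4)-(7).

    docs: iterable of non-negative feature vectors (e.g. TF-IDF rows);
    rho: vigilance parameter; beta: learning rate.
    Returns the list of prototypes w_j and the membership map
    {(category j, document index): y(j)}, cf. Eq. (5).
    """
    prototypes = []   # F2 weight vectors w_j
    membership = {}   # map(j, doc) := y(j)
    for doc_idx, I in enumerate(docs):
        I = np.asarray(I, dtype=float)
        matched = False
        for j, w in enumerate(prototypes):
            # Eq. (4): y(j) = |I AND w_j| / |I|, fuzzy AND = elementwise min.
            y = np.minimum(I, w).sum() / max(I.sum(), 1e-12)
            membership[(j, doc_idx)] = y
            if y >= rho:                                   # vigilance test, Eq. (6)
                matched = True
                # Eq. (7): update every prototype that passes the test,
                # so a document may join several clusters.
                prototypes[j] = beta * np.minimum(I, w) + (1 - beta) * w
        # Assumption: commit a new category if no prototype passed the test.
        if not matched and len(prototypes) < max_categories:
            prototypes.append(I.copy())
            membership[(len(prototypes) - 1, doc_idx)] = 1.0
    return prototypes, membership
```

Because every F2 prototype is checked against each document, there is no iterative search for a single winner, which is the source of the speed-up described above.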
Compared with the Fuzzy ART network, the KMART system achieved the best results in the quality of document sorting. Quality was measured by comparing the generated clusters with the original ones. The execution time was also measured; in the KMART system the execution time grew linearly with the number of documents [2].

IV. EXPERIMENTS

The experiment was aimed at comparing the quality and performance of the KMART algorithm with the standard fuzzy clustering algorithm Fuzzy C-Means (FCM) on a set of real, contextually similar text documents that fall into several categories at the same time. Context here means the same information area of the documents, i.e. the documents have several keywords in common.

A. Description of the input matrix

The text documents were chosen from the 20 Newsgroups corpus1, a corpus of English texts from mailing lists. The corpus includes a total of 20 topics (categories) in areas such as sports, computers, religion, politics, science, electronics, medicine and so on.

1 Available at: http://qwone.com/~jason/20Newsgroups/

The training matrix contains 500 selected preprocessed text documents from the 20 Newsgroups corpus, each with 118 terms. The documents are divided into five categories, each with 2 × 50 = 100 documents. The documents are duplicated (each repeated twice in its category) in order to create more precise clusters. Since the documents are intended for an unsupervised algorithm, the set contains no information (description) about which category a given document belongs to. It was therefore necessary to repeat the 50 documents of each category; this results in even sharper clusters. Otherwise the KMART network produced a wrong category structure. The training document-term matrix was prepared using the TF-IDF method. The input matrix consists of the following categories:
1. Hockey
2. Christianity
3. PC Hardware
4. Atheism
5. MAC Hardware

The testing matrix contains 100 preprocessed documents from the same corpus as the training one, each with 118 terms. The documents are divided into two categories of 50 documents, and each document falls into two categories simultaneously. The matrix is again prepared using the TF-IDF method. It therefore contains two different double combinations:
6. Windows (expected context with the 3rd and 5th categories)
7. Religion (expected context with the 2nd and 4th categories).

B. Preprocessing of text documents

Preprocessing of the text documents was carried out in RapidMiner Studio2 and consists of the following steps (see Fig. 1):
1. transforming capital letters in the documents to lowercase,
2. tokenization: dividing the consolidated text of individual documents into a sequence of tokens (in this case, words),
3. filtering tokens by length: tokens shorter than 4 characters were removed,
4. stop-word filtering: removing words without full meaning,
5. transformation of words to their root form by the Porter algorithm.
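The preprocessing steps 1–4 and the TF-IDF weighting can be sketched as follows (Porter stemming, step 5, is only indicated in a comment; the stop-word list and all names are illustrative, not from the paper):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"this", "that", "with", "from", "have"}  # illustrative subset

def preprocess(text):
    """Steps 1-4: lowercase, tokenize into words, drop tokens shorter
    than 4 characters, drop stop words. (Step 5, Porter stemming, would
    be applied here, e.g. via nltk.stem.PorterStemmer.)"""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 4 and t not in STOP_WORDS]

def tfidf_matrix(documents):
    """Build a document-term TF-IDF matrix from raw text documents."""
    token_lists = [preprocess(d) for d in documents]
    vocab = sorted({t for ts in token_lists for t in ts})
    n = len(documents)
    df = Counter(t for ts in token_lists for t in set(ts))  # document frequency
    rows = []
    for ts in token_lists:
        tf = Counter(ts)
        rows.append([tf[t] / max(len(ts), 1) * math.log(n / df[t])
                     for t in vocab])
    return vocab, rows
```

The resulting rows are the non-negative feature vectors that both FCM and the KMART network take as input.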
Figure 1. Preprocessing of text documents in RapidMiner Studio.
With the optimal parameter values ρ = 0.61 and β = 1 (determined experimentally), the network reached a maximum F-measure of 0.927 on the training set. The test set of documents was then presented to the network. Here the network reached an F-measure of 0.667 and a CPU time of 0.260 s (see Tab. I).

TABLE I. ALGORITHM RESULTS WITH TRAINING SET - REAL DOCUMENTS

Algorithm and input set | β | ρ    | F-measure | CPU time [s] | Nr. of iter. | Nr. of created categories
KMART TRAIN             | 1 | 0.61 | 0.927     | 1.547        | 3            | 5
KMART TRAIN             | 1 | 0.61 | 0.927     | 0.567        | 1            | 0
KMART TEST              | 1 | 0.4  | 0.667     | 0.260        | 2            | 0
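The F-measure reported above can be computed for clusterings, for example, as follows. The paper does not give its exact formula, so this sketch follows a common definition (best-matching-cluster F1 per class, weighted by class size), which is our assumption:

```python
from collections import Counter

def f_measure(true_labels, cluster_labels):
    """Clustering F-measure: for each reference class take the F1 score
    of its best-matching generated cluster, then average the scores
    weighted by class size. (A common definition; the exact variant
    used in the paper is an assumption.)"""
    n = len(true_labels)
    classes = Counter(true_labels)
    clusters = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for ci, n_i in classes.items():
        best = 0.0
        for cj, n_j in clusters.items():
            n_ij = joint[(ci, cj)]
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i  # precision and recall of the match
            best = max(best, 2 * p * r / (p + r))
        total += (n_i / n) * best
    return total
```

A perfect clustering yields 1.0; merging distinct classes into one cluster lowers the precision term and therefore the score.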
The parameter β of the KMART network was selected experimentally, after a series of tests varying this parameter in combination with the parameter ρ. The optimal value of β is equal to 1. Slower learning (β