CINTI 2014 • 15th IEEE International Symposium on Computational Intelligence and Informatics • 19–21 November, 2014 • Budapest, Hungary

SYSTEM FOR FUZZY DOCUMENT CLUSTERING AND FAST FUZZY CLASSIFICATION

Michal Rojček*

* Department of Informatics, Faculty of Education, Catholic University in Ružomberok, Hrabovská cesta 1A, 034 01 Ružomberok, Slovakia [email protected]

B. Another advantage is that the category choice step is removed, so the choice parameter α is no longer needed. This lowers the number of user-defined parameters in the system. The modification does not break the basic principle by which the ART network resolves the stability-plasticity dilemma. KMART remains an incremental clustering algorithm: before learning a new input, it checks the input pattern against the stored patterns and learns only when the input matches one of them within some tolerance.

Abstract – The paper introduces a system for unsupervised fuzzy document clustering and fast fuzzy classification. The system is based on the KMART neural network, which performs the clustering, and on an original Fuzzy classification algorithm derived from the Fuzzy ART network, which performs the classification. Both algorithms share their weights. The unsupervised system has two separate flows: the first one shapes the structure of categories (plasticity), and the second one classifies documents without being able to alter the defined structure (stability). The paper demonstrates the legitimacy of this approach with regard to the quality and speed of classification.

III. MODEL OF SYSTEM FOR FUZZY DOCUMENT CLUSTERING AND FAST FUZZY CLASSIFICATION

The system consists of two algorithms that are connected to each other through the network weights. The first algorithm (Alg. 1) is the KMART system, based on the Fuzzy ART network [12]. Together with the stabilization strategy named Conceptual Duplication [14] for the ART1 network, it is one of the two published approaches to fuzzy clustering on ART networks. The Conceptual Duplication strategy introduces a new evidence parameter. With lower values of this parameter the algorithm has great memory demands, and with higher values the network behaves unstably. For this reason the KMART algorithm was chosen, which involves only two parameters: the learning rate β and the vigilance parameter ρ. In contrast to the original Fuzzy ART algorithm, KMART does not involve the choice parameter α, nor any cycle for choosing the winning neuron, because the similarity measure of every document is computed against every category.

The role of the KMART algorithm is first to create the basic (initial) category structure from the training matrix, in which the categories are defined by an appropriate term structure. Each document category is represented by the terms that appear in it. It is not appropriate to build the training matrix from terms that appear in several categories, because the network would then create an unsharp (not precisely bounded) cluster structure. The second algorithm (Alg. 2) is original and is proposed for the fuzzy classification of documents into the created categories without disrupting this hard structure – that

I. INTRODUCTION

In many application areas, documents need to be classified quickly and with high quality on the basis of their keywords [1], [2]. Neural networks based on Adaptive Resonance Theory (ART) are a fast and efficient tool for clustering [3]–[6] and classification [7]–[11].

II. KMART SYSTEM

In [12], a modification of the existing Fuzzy ART algorithm [13] was proposed that makes fuzzy clustering possible. This system is called KMART (Kondadadi & Kozma Modified ART). To create clusters, KMART uses a modified version of the Fuzzy ART network. Instead of choosing the category with maximum similarity and then using the vigilance test to verify whether that category is close enough to the input pattern, every category in the F2 layer is checked with the vigilance test. If a category passes the vigilance test, the input document is assigned to that category. The similarity measure computed in the vigilance test defines the degree of relevance of the input pattern to the given cluster. This allows a document to belong to several clusters with different degrees of relevance. The modification also has other advantages that follow from the fuzzy clustering described above:

A. Fuzzy ART is time-consuming because it iteratively searches for a winning category that passes the vigilance test. In the described modification this search is not needed, because every F2 node is checked. As a result, the model is less computationally complex.
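The KMART presentation step described in Section II can be sketched as follows. This is a minimal illustration, not the implementation from [12]: the function name `kmart_present`, the use of the sum of components as the norm, the Fuzzy ART learning rule w_j := β(I ∧ w_j) + (1 − β)w_j, and the creation of a new category when no category passes the vigilance test are all assumptions taken from the general Fuzzy ART scheme [13].

```python
import numpy as np


def kmart_present(x, weights, rho, beta=1.0):
    """Present one document vector to a KMART-style network (hypothetical sketch).

    x       -- term vector with components in [0, 1]
    weights -- list of 1-D arrays, one prototype per category (modified in place)
    rho     -- vigilance parameter
    beta    -- learning rate (1.0 means fast learning)

    Returns a dict {category index: relevance} for every category that
    passed the vigilance test.
    """
    x = np.asarray(x, dtype=float)
    norm_x = x.sum()
    if norm_x == 0:
        raise ValueError("input vector must contain at least one non-zero term")

    memberships = {}
    # Every category is checked; there is no search for a single winner.
    for j, w in enumerate(weights):
        overlap = np.minimum(x, w)           # fuzzy AND, component-wise minimum
        relevance = overlap.sum() / norm_x   # similarity used in the vigilance test
        if relevance >= rho:
            memberships[j] = relevance
            # Assumed Fuzzy ART learning rule for each accepted category.
            weights[j] = beta * overlap + (1.0 - beta) * w

    # Assumption: if no stored category is close enough, a new one is created.
    if not memberships:
        weights.append(x.copy())
        memberships[len(weights) - 1] = 1.0
    return memberships


weights = []
kmart_present([1, 1, 0, 0], weights, rho=0.5)   # creates category 0
kmart_present([0, 0, 1, 1], weights, rho=0.5)   # creates category 1
print(kmart_present([0, 1, 1, 0], weights, rho=0.5))  # member of both categories
```

Because the loop visits each category exactly once, the cost of one presentation is linear in the number of categories, which is the source of the lower computational complexity mentioned in point A.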

978-1-4799-5338-7/14/$31.00 ©2014 IEEE

M. Rojček • System for Fuzzy Document Clustering and Fast Fuzzy Classification

means, this algorithm does not create new categories; it only assigns documents to the predefined structure of categories on the basis of the trained shared weights w_j. The proposed Fuzzy classification algorithm, derived from the Fuzzy ART network [13], has these steps:

1. Load a new input vector (document) I that contains binary or analog components. Let I := [next input vector].

2. Compute the output value y(j) for all neurons (the relevance measure of the document to the j-th category) using the formula:

$$ y(j) = \frac{|I \wedge w_j|}{|I|} \tag{1} $$

to set up the parameter ρ. The experiments described below justify this approach.

IV. EXPERIMENTS

The aim of the experiments is to demonstrate the higher precision and speed of classifying documents from overlapping categories when the Fuzzy classification algorithm (Alg. 2) is used instead of the KMART system (Alg. 1) alone. If the KMART network has learned the given categories and a document that belongs to exactly one of the existing categories is presented at the input, it is assigned to that category with the highest similarity measure, as expected and as follows from the principle of the network. When a document that belongs to two or more existing categories is presented to the KMART network, we expect the document to have higher relevance measures in these categories than in the others, and no new category to be created. If the experiment shows that KMART achieves this with the same precision as Alg. 2, then the proposed Fuzzy classification algorithm brings no substantial contribution.

The training matrix contains 300 documents with 30 terms. The documents are divided into five categories with 60 documents in each. For the category terms, the term frequency is given in each document. This frequency is normalized to a real-valued interval. The testing matrix contains 600 documents, each with 30 terms. The documents are divided among the five categories (from the training matrix) in different combinations, and each document belongs to two categories simultaneously. The matrix thus contains 10 different combinations. The division of documents is visible in a reduced view of the matrices, shown in Fig. 2.

In formula (1), ∧ is the fuzzy AND operator, defined component-wise as (x ∧ y)_i = min(x_i, y_i), and |·| denotes the norm of a vector (in Fuzzy ART [13], the sum of its components), so |I| in the denominator is the norm of the input vector.

3. Assign the computed value y(j) to the map matrix at the position of the currently processed category j and document:

$$ map(j, doc) = -y(j) \tag{2} $$

The negative value -y(j) identifies, in the common map matrix, which algorithm computed the value; the KMART algorithm [12] uses positive values.

4. Return to step 2 while j ≤ max. number of categories; otherwise go to step 1. If there is no next vector (document), end; in the streaming case, wait for a new document.

The architecture of the system with both algorithms is shown in Figure 1.

[Figure 1 diagram: Training set (doc/term), input to Alg. 1: KMART (clustering); Testing set (doc/term), input to Alg. 2: Fuzzy Classifier based on Fuzzy ART; shared entities: matrix map, weights w_ij and some variables.]

Figure 1. Model of Fuzzy classification system.
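The four steps of the Fuzzy classification algorithm (Alg. 2) can be sketched as below. This is an illustrative sketch under stated assumptions, not the author's code: the function name `fuzzy_classify`, the array layout (one row per document or category), and the use of the sum of components as the norm are assumptions. The weights are only read, never updated, and no vigilance test is applied, which is why no parameter ρ appears.

```python
import numpy as np


def fuzzy_classify(docs, weights):
    """Compute the map matrix of Alg. 2 for frozen, shared weights (hypothetical sketch).

    docs    -- array of shape (n_docs, n_terms), components in [0, 1]
    weights -- array of shape (n_categories, n_terms), trained beforehand by KMART

    Returns an array of shape (n_categories, n_docs) with
    map[j, doc] = -y(j), where y(j) = |I AND w_j| / |I|.
    """
    docs = np.asarray(docs, dtype=float)
    weights = np.asarray(weights, dtype=float)

    norms = docs.sum(axis=1)  # |I| for each document
    if np.any(norms == 0):
        raise ValueError("every document needs at least one non-zero term")

    # Fuzzy AND (component-wise minimum) for all category/document pairs,
    # summed over the term axis: shape (n_categories, n_docs).
    overlap = np.minimum(weights[:, None, :], docs[None, :, :]).sum(axis=2)
    y = overlap / norms[None, :]

    # The negative sign marks values written by the classifier, as in formula (2).
    return -y


weights = [[1, 1, 0, 0], [0, 0, 1, 1]]
docs = [[1, 1, 0, 0], [0, 1, 1, 0]]
print(fuzzy_classify(docs, weights))
```

In this toy example the first document matches only the first category, while the second document receives the same relevance in both categories, which is the overlapping case the experiments focus on.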

Algorithm 1 (KMART), with the initialization set that we can call the training set, is the equivalent of ARTb from the ARTMAP architecture [15]. With this set we determine how the second, classification algorithm categorizes the documents from the test matrix. This second algorithm cannot create new categories or disrupt the given structure. Moreover, it is not needed

Figure 2. Preview distribution of documents into categories: left – Training set, right – Test set (part).
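The figure of 10 combinations in the test set described in Section IV follows from choosing 2 of the 5 categories. The short check below enumerates them; the category labels 1 to 5 and the even split of the 600 test documents across the combinations are assumptions made only for illustration.

```python
from itertools import combinations

N_CATEGORIES = 5
N_TEST_DOCS = 600

# All unordered pairs of categories a test document can belong to.
category_pairs = list(combinations(range(1, N_CATEGORIES + 1), 2))

# Assumption: the test documents are spread evenly over the pairs.
docs_per_pair = N_TEST_DOCS // len(category_pairs)

print(len(category_pairs))  # 10
print(docs_per_pair)        # 60
```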

In the experiment, the 300×30 training set was first presented at the input of KMART. The


network created 5 categories, as expected. In all KMART runs the parameter ρ was chosen experimentally, as the value giving the highest F-measure. The learning rate β was set to 1 (fast learning). In the next step the same training set was presented again to verify that the network assigns all categories correctly. Table I shows that KMART classifies the documents with the same F-measure as the Fuzzy classification algorithm. The processor time differs by only about 38 ms in favor of the Fuzzy classification algorithm, which is caused by the vigilance test in the KMART network. In this phase, when only documents belonging to a single category were presented at the input, both algorithms performed almost equally.

TABLE I. CLUSTERING AND CLASSIFICATION OF THE TRAINING SET

| Algorithm and Input Matrix | ρ | F Measure | CPU time [s] | Number of iterations | Number of created categories |
|---|---|---|---|---|---|
| KMART TRAIN | 0.8 | 1 | 0.250 | 2 | 5 |
| KMART TRAIN | 0.8 | 1 | 0.141 | 1 | 0 |
| Fuzzy Class TRAIN | - | 1 | 0.103 | 1 | 0 |

TABLE III. SELECTION OF THE PARAMETER ρ IN THE KMART NETWORK FOR THE TEST SET

| ρ | F Measure | Stabilisation of network |
|---|---|---|
| 0.1 | 0.193 | NO |
| 0.2 | 0.184 | YES |
| 0.3 | 0.344 | NO |
| 0.4 | 0.751 | YES |
| 0.5 | 0.19 | YES |
| 0.6 | 0.184 | NO |
| 0.7 | 0.026 | NO |
| 0.8 | 0.006 | NO |
| 0.9 | 0.003 | NO |

A different situation arises when documents that belong to several categories have to be classified. The testing matrix described above contains such documents. In the first run the KMART network creates the basic category structure from the training matrix. Then KMART is used again, now with the testing matrix. With the best parameter setting, ρ = 0.4, an F-measure of 0.75 was obtained. The network created one new (sixth) category, which lowered the F-measure, specifically its recall (coverage) component, which equalled 0.6 (precision was 1). The KMART network stabilized in two iterations and the processor time was 0.459 s. In the second case the proposed Fuzzy classification algorithm was used, with the test set presented again. The classification results show that the Fuzzy classification algorithm had substantially better values in all observed parameters (see Table II). The F-measure reached its maximal value of 1, and the processor time was shorter because the algorithm has no mode for creating new categories or adapting weights.
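The reported F-measure for KMART on the test set can be checked against the stated precision and recall (called coverage in the text). The paper does not state which F-measure is used; the worked equation below assumes the balanced F-measure, the harmonic mean of precision P and recall R.

```latex
F = \frac{2PR}{P + R} = \frac{2 \cdot 1 \cdot 0.6}{1 + 0.6} = \frac{1.2}{1.6} = 0.75
```

Under that assumption the result agrees with the reported value of about 0.75, and it shows that the drop comes entirely from recall, since precision stays at 1.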

TABLE II. CLUSTERING OF TRAINING SET AND CLASSIFICATION OF TEST SET

| Algorithm and Input Matrix | ρ | F Measure | CPU time [s] | Number of iterations | Number of created categories |
|---|---|---|---|---|---|
| KMART TRAIN | 0.8 | 1 | 0.260 | 2 | 5 |
| KMART TEST | 0.4 | 0.751 | 0.459 | 2 | 1 |
| Fuzzy Class TEST | - | 1 | 0.211 | 1 | 0 |

Figure 3. Chart of F-measure depending on the parameter ρ in the KMART network for the test set.

V. CONCLUSIONS

The proposed system can be used in applications that require fast fuzzy classification of unlabeled documents into several categories with some degree of membership, without disrupting the category structure created from the training set (category stability). The system allows a new category to be added at any time by presenting KMART with a sample document (or documents) that characterizes the category by suitable terms (category plasticity). The experiments showed that if only KMART were used, without the original Fuzzy classification algorithm based on the Fuzzy ART network, the system would be too plastic and slow on a set of overlapping documents. Moreover, the system does not need the parameter ρ to be set for classification, which is a great advantage in terms of time.


ACKNOWLEDGMENT

This work was supported by Grant Agency of Faculty of Education, Catholic University in Ružomberok GAPF 2/11/2014.

REFERENCES

[1] S. Büttcher, C. L. A. Clarke, and G. V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines. Cambridge: The MIT Press, 2010, p. 606.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, vol. 82. 2011, p. 944.
[3] L. Massey, “Real-world text clustering with adaptive resonance theory neural networks,” Proc. 2005 IEEE Int. Jt. Conf. Neural Networks, vol. 5, 2005.
[4] S. Kim and D. C. Wunsch, “A GPU based Parallel Hierarchical Fuzzy ART clustering,” 2011 Int. Jt. Conf. Neural Networks, pp. 2778–2782, 2011.
[5] P. Pardhasaradhi, R. P. Kartheek, and C. Srinivas, “A evolutionary fuzzy art computation for the document clustering,” in International Conference on Computing and Control Engineering, 2012.
[6] R. Xu and D. C. Wunsch, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., vol. 16, pp. 645–678, 2005.
[7] R. Xu, J. Xu, and D. C. Wunsch, “Using default ARTMAP for cancer classification with MicroRNA expression signatures,” 2009 Int. Jt. Conf. Neural Networks, 2009.
[8] Y. R. Asfour, G. A. Carpenter, S. Grossberg, and G. W. Lesher, “Fusion ARTMAP: an adaptive fuzzy network for multi-channel classification,” Third Int. Conf. Ind. Fuzzy Control Intell. Syst., 1993.
[9] X. H. Song, P. K. Hopke, M. Bruns, D. A. Bossio, and K. M. Scow, “A fuzzy adaptive resonance theory - Supervised predictive mapping neural network applied to the classification of multivariate chemical data,” Chemom. Intell. Lab. Syst., vol. 41, pp. 161–170, 1998.
[10] G. P. Amis and G. A. Carpenter, “Self-supervised ARTMAP,” Neural Networks, vol. 23, pp. 265–282, 2010.
[11] M. Georgiopoulos, H. Fernlund, G. Bebis, and G. L. Heileman, “Order of search in Fuzzy ART and Fuzzy ARTMAP: Effect of the choice parameter,” Neural Networks, vol. 9, pp. 1541–1559, 1996.
[12] R. Kondadadi and R. Kozma, “A modified fuzzy ART for soft document clustering,” Proc. 2002 Int. Jt. Conf. Neural Networks. IJCNN’02 (Cat. No.02CH37290), vol. 3, 2002.
[13] G. A. Carpenter, S. Grossberg, and D. B. Rosen, “Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system,” Neural Networks, vol. 4, pp. 759–771, 1991.
[14] L. Massey, “Conceptual duplication,” Soft Comput., vol. 12, no. 7, pp. 657–665, Sep. 2007.
[15] G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen, “Fuzzy ARTMAP: an adaptive resonance architecture for incremental learning of analog maps,” Proc. 1992 IJCNN Int. Jt. Conf. Neural Networks, vol. 3, 1992.