An Extended Chameleon Algorithm for Document Clustering
G. Veena and N.K. Lekha
Abstract. A great deal of research has been done in the area of concept mining and document similarity in the past few years, but most of this work was based on the statistical analysis of keywords. The major challenge in this area is preserving the semantics of the terms or phrases. Our paper proposes a graph model to represent concepts at the sentence level, using a triplet representation. A modified DB scan algorithm is used to cluster the extracted concepts, and the resulting clusters form a belief network, or probabilistic network, which we use to extract the most probable concepts in the document. In this paper we also propose a new algorithm for document similarity, and an extended Chameleon algorithm is proposed for belief network comparison.
1 Introduction
Artificial Intelligence has a successful history in the area of concept mining. The introduction of the semantic web, followed by ontologies, has increased its efficiency and widened its range of applications. Earlier work on concept mining was based on a statistical approach, i.e., statistical analysis of term frequency was taken as the basis for concept mining. This strategy provided only keyword-based analysis and gave no importance to the semantics of the term or phrase, i.e., the semantics of the retrieved concept were not preserved. This paper provides a novel approach to concept mining. The work also extends to concept comparison and concept clustering. For the purpose of concept representation a new model, called a semantic net, is constructed, and a new concept-based comparison model is also proposed. In a nutshell, the goal of our paper is to propose a new concept representation on which comparison and clustering are based. We achieve our goal by
using different phases: a preprocessing phase, a concept extraction phase, and a comparison and clustering phase, which are explained in the following sections. In this paper, Section 2 describes related work, Section 3 describes the proposed solution, followed by experimental results in Section 4 and the conclusion in Section 5.
2 Related Works
A great deal of work has been done in the area of concept mining over the past few years. The paper introduced in [1] gave an introduction to a concept-based mining model which included two modules: a concept-based mining model and a concept-based similarity measure. The concept-based mining model retrieved concepts precisely but did not preserve the semantics of the concept, which is essential. In [2], emphasis was given to concept mining, which was further extended to the area of information retrieval. A Conceptual Ontological Graph (COG) was introduced as part of the paper. The comparison method described there was based on the word length of the concept, which again did not preserve the semantics of the concept. In [3], a conceptual ontological graph (COG) was included, and along with it a new module called concept-based weighting analysis was introduced, which assigns weights to the concepts; the concept with the highest weight is taken as the main concept of the sentence. The most widely used text similarity measures are based on the Vector Space Model (VSM) [5, 6, 7], where similarity is measured on feature vectors. In [8], semantic matching of concepts was done based on a match operator; this work did not focus on the overlaying relations. Paper [9] described semantic matching based on comparison of the labels of the nodes, i.e., on the basis of relations. Paper [10] proposed a Fuzzy Similarity based Concept Mining Model (FSCMM) built on three levels: sentence level, document level and integrated corpus level; a Fuzzy Feature Category Similarity Analysis was also done for the similarity analysis. Paper [11] proposed a similarity measure based on information distance and Kolmogorov complexity, where the similarity between every pair of objects is measured according to the most dominant shared feature.
3 Proposed Solutions
An efficient concept-based mining model is proposed here, and this model is then extended for the purpose of similarity analysis and clustering. The similarity analysis is done at both the sentence and the document level. The core part of our paper is the introduction of an efficient concept mining model, designed around a set of algorithms and graphs. The proposed solution answers the following questions:
1) How does our model lead to efficient concept extraction? 2) How does concept extraction lead to an efficient similarity analysis and clustering? The answers to these questions come from three modules: i) the preprocessing module, ii) the concept extraction module, and iii) the comparison and clustering module, described in Sections 3.1, 3.2 and 3.3 respectively.
3.1 Preprocessing
This module describes an efficient way to generate the verb-argument structures which finally lead to concept extraction (Fig. 1). The pre-processing step includes document cleaning, part-of-speech tagging (POS tagging) and phrase structure tree generation. Document cleaning includes removal of unwanted characters such as numbers and punctuation, and tokenizing (i.e., breaking the sentence into tokens with the help of delimiters). After this, the Stanford Parser is used for POS tagging. The Stanford Parser is a statistical parser which can parse input written in several languages. The tagged sentences are then given to shallow semantic parsing for the generation of verb-argument structures.
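To make the pre-processing step concrete, the following is a minimal sketch of the cleaning, tokenizing and tagging stages. It uses NLTK's tokenizer and POS tagger as a stand-in for the Stanford Parser; the cleaning regex and function names are our own illustration rather than the authors' implementation.

import re
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are installed

def clean_document(text):
    # Document cleaning: drop digits and punctuation, but keep the full stop
    # so that sentence splitting still works.
    return re.sub(r"[^A-Za-z.\s]", " ", text)

def preprocess(text):
    # Tokenize into sentences and words, then POS-tag each token.
    tagged = []
    for sentence in nltk.sent_tokenize(clean_document(text)):
        tokens = nltk.word_tokenize(sentence)
        tagged.append(nltk.pos_tag(tokens))
    return tagged

sample = ("Researchers found nanomaterials and made nanofluids. "
          "Nanofluids contains many nanometer-sized particle.")
for tagged_sentence in preprocess(sample):
    print(tagged_sentence)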
Fig. 1 Key Concept Extraction
For the formation of the verb-argument structure, a phrase tree is generated from the tagged sentences and then semantic role labeling is done based on the PropBank notation. PropBank uses predicate-independent labels such as ARG0, ARG1, etc. as
the labels. A sentence can contain more than one verb-argument structure, according to the number of verbs contained in it. The formation of the verb-argument structure after labeling is done based on path analysis. Example 1: Researchers found nanomaterials and made nanofluids. Nanofluids contains many nanometer-sized particle. These fluids are supplied by several methods. Nanofluids and nanomaterials creates a greater evolution in the research area. In the above example there are four sentences; their verbs are found, made, contains, supplied and creates, and their corresponding arguments (subjects, objects) are Researchers, nanomaterials, nanofluids, nanometer-sized particles, several methods and greater evolution in the research area. The verb-argument structures generated for this example are listed in Table 1.
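For illustration, the verb-argument structures extracted from Example 1 (listed in Table 1 below) could be held in a simple structure such as the following; the field names are ours, not the authors' data model.

# Verb-argument structures of Example 1, one entry per root verb
# (ARG0 = subject, ARG1 = object), mirroring Table 1.
verb_argument_structures = [
    {"verb": "found",    "arg0": "Researchers",   "arg1": "nanomaterials"},
    {"verb": "made",     "arg0": "Researchers",   "arg1": "nanofluids"},
    {"verb": "contains", "arg0": "nanofluids",    "arg1": "nanometer-sized particles"},
    {"verb": "supplied", "arg0": "nanofluids",    "arg1": "several methods"},
    {"verb": "creates",  "arg0": "nanomaterials", "arg1": "greater evolution in the research area"},
    {"verb": "creates",  "arg0": "nanofluids",    "arg1": "greater evolution in the research area"},
]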
3.2 Concept Extraction
From each verb-argument structure the root verb is extracted, together with its corresponding arguments, i.e., the left and right noun phrases. For each of these a weight analysis is done. A TF-IDF measure is used to calculate the weight of each verb, subject and object, using a term frequency (tf) and an inverse document frequency (idf).
Table 1 Verb-Argument Structures

        Verb       ARG0            ARG1
VERB1   Found      Researchers     nanomaterials
VERB2   Made       Researchers     nanofluids
VERB3   Contains   nanofluids      nanometer-sized particles
VERB4   Supplied   nanofluids      several methods
VERB5   Creates    nanomaterials   greater evolution in the research area
VERB6   Creates    nanofluids      greater evolution in the research area
The weight of a term t_i or a phrase p_i in the j-th document, W(p_i, j), is calculated as

    W(p_i, j) = tf(p_i, j) × idf(p_i)                                  (1)

    idf(p_i) = log( |S| / |{d_j : p_i ∈ d_j}| )                        (2)
where tf(p_i, j) is the number of times p_i occurs in the j-th document and |S| is the total number of sentences in the document. A feasibility analysis is done to retrieve the most important concepts, based on the term frequency of each concept at the document and corpus level. The term frequency is calculated for each term or phrase; after this analysis, the concepts with higher term frequency are taken as the most feasible concepts and their triplet representations are generated. The root verb together with its left and right entities gives the triplet form for that particular root verb, represented as <V, S, O>, where V is the root verb, S is the subject and O is the object. In this way the triplet forms are generated for all verb-argument structures in a document. The triplet generation procedure is described in Algorithm 1.
Algorithm 1. Triplet Generation Algorithm
  S is a new sentence
  Declare Lv, an empty list of verbs
  Declare Sub, an empty list of subjects
  Declare Obj, an empty list of objects
  for each verb-argument structure do
      Add verb to Lv
      Add subject to Sub
      Add object to Obj
  end for
  for each Sub_i and Obj_i do
      Calculate the feasibility
      if feasibility ≤ threshold (0.1) then
          remove it from the Sub and Obj lists
      end if
  end for
  for each feasible term do
      retrieve its corresponding arguments (from the Sub and Obj lists) and create a link between them
  end for
Each triplet generated is associated with its corresponding weight, which is the calculated TF-IDF value. For example, the triplet generated for VERB1, ARG0 and ARG1 from Table 1 and Algorithm 1 is <Found, Researchers, nanomaterials>.
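A minimal sketch of the weighting of Eqs. (1) and (2) applied to such triplets is given below; it treats each sentence as the counting unit for |S| and uses illustrative data, which reflects our reading of the text rather than the authors' code.

import math

def tf_idf(phrase, j, units):
    # W(p_i, j) = tf(p_i, j) * idf(p_i), with idf = log(|S| / |{d : p_i in d}|).
    # `units` is a list of token lists; following our reading of the paper,
    # each sentence is one unit and |S| is the number of units.
    tf = units[j].count(phrase)
    containing = sum(1 for unit in units if phrase in unit)
    idf = math.log(len(units) / containing) if containing else 0.0
    return tf * idf

sentences = [
    ["researchers", "found", "nanomaterials", "made", "nanofluids"],
    ["nanofluids", "contains", "nanometer-sized", "particles"],
    ["fluids", "supplied", "several", "methods"],
    ["nanofluids", "nanomaterials", "creates", "greater", "evolution"],
]
# The weight of the triplet <Found, Researchers, nanomaterials> could then be
# taken, for example, as the sum of the weights of its three components.
print(tf_idf("nanofluids", 0, sentences))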
3.3 Comparison and Clustering
The novel contribution of this paper is the introduction of comparison and clustering of concepts. A similarity analysis is done on the concepts, calculated on the basis of a cosine similarity measure.

3.3.1 Cosine Similarity (CS)
Here the concepts in a document are compared according to a single-term similarity measure: a cosine correlation similarity measure is adopted along with term frequency/inverse document frequency (TF-IDF) term weighting [17]. The cosine measure calculates the cosine of the angle between the two concept vectors. The similarity measure (sim_s) is:

    sim_s(c1, c2) = cos(c1, c2) = (c1 · c2) / (||c1|| ||c2||)          (3)
The vectors c1 and c2 contain single-term weights calculated using the TF-IDF weighting scheme. The clustering is done using a proposed concept-based DB scan algorithm which preserves the semantics of the document and, in our experiments, outperforms the other methods considered. The concept-based clustering algorithm, based on DB scan, is given as Algorithm 2.

Algorithm 2. Extended DB scan Algorithm
  M is a set of triplets T
  Q is an empty queue
  Pick T from M such that
  if T is not yet classified then
      if T has the highest weight (TF-IDF) then
          Compare with all other concepts in M based on the CS (cosine similarity) value
          if a match is found then
              Add the matched concepts to Q and assign them to a new cluster clstr_i
              for each C ∈ Q do
                  Repeat steps (6) to (9)
              end for
          end if
      end if
  end if
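The sketch below illustrates Eq. (3) and the grouping step of Algorithm 2 under some simplifying assumptions: concepts are represented as dictionaries of TF-IDF term weights, and the similarity threshold value is ours, not a value given in the paper.

import math

def cosine_similarity(v1, v2):
    # Eq. (3): cosine of the angle between two concept vectors (term -> weight maps).
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def extended_dbscan(vectors, weights, sim_threshold=0.5):
    # Greedy reading of Algorithm 2: seed a cluster with the highest-weight
    # unclassified concept, then pull in every concept whose cosine similarity
    # to the seed exceeds the threshold.
    unclassified = set(vectors)
    clusters = []
    while unclassified:
        seed = max(unclassified, key=lambda c: weights[c])
        cluster = {c for c in unclassified
                   if cosine_similarity(vectors[seed], vectors[c]) >= sim_threshold}
        cluster.add(seed)
        unclassified -= cluster
        clusters.append(cluster)
    return clusters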
Proper comparison and clustering lead to the semantic net construction. The semantic net is constructed based on a similarity comparison between subject-subject and subject-object pairs from the constructed clusters. A semantic net is a knowledge representation model: a directed graph consisting of nodes and
their corresponding links. The nodes in the semantic graph represent concepts and the links represent their relations. The basic idea of semantic nets is that they provide a graph-theoretic structure for the concepts. The semantic net constructed here can be considered as a mathematical model called a belief network.
Belief Network
A belief network is a probabilistic graphical model that shows a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). Here the nodes of the belief network are the subjects and objects, and the conditional dependencies express the relationships between them. For the construction of the belief network a noun phrase comparison is done inside each cluster, based on subject-subject and subject-object pairs. The similarity analysis also uses WordNet, a lexical database which gives the synonyms of each word; the set of synonyms of a word is called a synset. If there is no exact word match, the corresponding synsets are checked for a match, so that the original semantics are preserved. The belief network generated for Example 1 is given in Fig. 2.
Fig. 2 Belief Network
Algorithm 3 explains the Belief network constructor algorithm:
Algorithm 3. Belief Network Constructor Algorithm
  Clstr is a set of clusters
  for each cluster clstr_i do
      Compare the concepts in clstr_i (noun phrase comparison)
      if a match is found then
          Add a link between the similar nodes
      else
          Remove the dissimilar concepts from clstr_i
      end if
      Create a belief network (BN) corresponding to each clstr_i
  end for
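A sketch of the noun-phrase comparison in Algorithm 3 is given below, using NLTK's WordNet interface for the synset fallback; treating any shared synset as a match is our simplification of the matching rule.

from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def noun_phrases_match(a, b):
    # Exact word match first, then fall back to the synsets of the two words.
    if a.lower() == b.lower():
        return True
    return bool(set(wn.synsets(a)) & set(wn.synsets(b)))

# Inside a cluster, a link is added between two triplets when their subjects
# (or a subject and an object) match under this test.
print(noun_phrases_match("fluid", "liquid"))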
A conditional probability measure is calculated over this belief network. The conditional probability calculation gives us the most feasible concept of the belief network, i.e., when conditional probability is applied, the most probable concept of the document can be retrieved, which can be further used for purposes such as indexing and information retrieval. The calculation of conditional probability over a belief network to find the most feasible concept is a novel approach proposed in this paper. The belief network generated for a set of phrases and terms is shown in Fig. 3.
Fig. 3 Belief Network for a set of terms and phrases
After calculating, for each subject and object in the belief network, its probability with respect to all other subjects and objects in the network, the triplet with the highest probability is taken as the most feasible, or most important, concept of that network. The probability is calculated on the basis of the links from a subject to an object in that particular document.
Conditional probability analysis
Fig. 2 consists of a set of triplets; from these the most probable triplet at the document level can be calculated by applying conditional probability:

    P(Researchers | nanomaterials) = P(nanomaterials | Researchers) P(Researchers) / P(nanomaterials)
    P(Researchers | nanofluids) = P(nanofluids | Researchers) P(Researchers) / P(nanofluids)
    P(Fluids | Several methods) = P(Several methods | Fluids) P(Fluids) / P(Several methods)
    P(Nanofluids | nano-sized particles) = P(nano-sized particles | Nanofluids) P(Nanofluids) / P(nano-sized particles)
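The following sketch estimates conditional probabilities of this kind from link counts in a small belief network; representing the network as (subject, object) links and estimating the probabilities from counts is our assumption about how these values are obtained.

from collections import Counter

# Hypothetical links of the belief network in Fig. 2, as (subject, object) pairs.
links = [
    ("Researchers", "nanomaterials"),
    ("Researchers", "nanofluids"),
    ("nanofluids", "nanometer-sized particles"),
    ("nanofluids", "several methods"),
    ("nanomaterials", "greater evolution"),
    ("nanofluids", "greater evolution"),
]

pair_counts = Counter(frozenset(link) for link in links)
node_counts = Counter(node for link in links for node in link)
total = len(links)

def p(node):
    # Marginal probability: the node's share of all link endpoints.
    return node_counts[node] / (2 * total)

def p_cond(a, b):
    # P(a | b): probability of a link touching a, given a link touching b.
    joint = pair_counts[frozenset((a, b))] / total
    return joint / p(b) if p(b) else 0.0

# Bayes' rule as written in the text:
print(p_cond("nanomaterials", "Researchers") * p("Researchers") / p("nanomaterials"))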
Chameleon: A Two-Phase Clustering Algorithm for Belief Network Comparison
This is a graph-partition-based algorithm which operates on a graph in which nodes represent the data items and edges represent the relations between them [15]. The algorithm is used here for the comparison between belief networks, for more efficient clustering and to find more accurate concepts. After applying this algorithm, conditional probability is applied again to pick the most important concept. This leads to a more efficient inter-document clustering. The method is based on two measures: i) relative interconnectivity and ii) relative closeness.
i) Relative interconnectivity
Consider bel_i and bel_{i+1}, two belief networks of different documents; their relative interconnectivity is measured using Eq. (4):

    RI(bel_i, bel_{i+1}) = 2 |EC(bel_i, bel_{i+1})| / (|EC(bel_i)| + |EC(bel_{i+1})|)        (4)

where EC(bel_i, bel_{i+1}) is the sum of the weights of the edges that connect bel_i with bel_{i+1}, and EC(bel_i) is the weighted sum of the edges that partition the cluster into roughly equal parts.
ii) Relative closeness
Consider bel_i and bel_{i+1}, two belief networks of different documents; their relative closeness is measured using Eq. (5):

    RC(bel_i, bel_{i+1}) = S(bel_i, bel_{i+1}) / (|bel_i| S(bel_i) + |bel_{i+1}| S(bel_{i+1}))        (5)
A Chameleon algorithm is explained below:
Algorithm 4. Concept-based Chameleon Algorithm
  belf is a set of belief networks
  for each belief network bel_i do
      Compare bel_i with the adjacent belief network bel_{i+1}
      if bel_i = bel_{i+1} then
          add an edge between the equal nodes of bel_i and bel_{i+1}
          Calculate RI and RC
      end if
      if RI(bel_i) * RC(bel_{i+1}) > TR (TR = threshold) then
          Combine bel_i and bel_{i+1}
      end if
  end for
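A sketch of the merge test in Algorithm 4, based on Eqs. (4) and (5), is given below. Two simplifications are ours rather than the paper's: the internal edge-cut EC(bel_i) is approximated by half the total internal edge weight instead of a true min-bisection, and S(.) is taken as the average weight of the relevant edges.

def crossing_weight(edges, a, b):
    # EC(bel_i, bel_j): total weight of edges running between the two node sets.
    return sum(w for u, v, w in edges
               if (u in a and v in b) or (u in b and v in a))

def internal_cut(edges, nodes):
    # EC(bel_i), approximated here by half of the internal edge weight (assumption).
    return sum(w for u, v, w in edges if u in nodes and v in nodes) / 2 or 1e-9

def avg_internal_weight(edges, nodes):
    internal = [w for u, v, w in edges if u in nodes and v in nodes]
    return sum(internal) / len(internal) if internal else 1e-9

def should_merge(edges, bel_i, bel_j, tr=1.0):
    # Combine bel_i and bel_j when RI * RC exceeds the threshold TR (Algorithm 4).
    ec_ij = crossing_weight(edges, bel_i, bel_j)
    ri = 2 * ec_ij / (internal_cut(edges, bel_i) + internal_cut(edges, bel_j))   # Eq. (4)
    connecting = [w for u, v, w in edges
                  if (u in bel_i and v in bel_j) or (u in bel_j and v in bel_i)]
    s_ij = sum(connecting) / len(connecting) if connecting else 0.0
    rc = s_ij / (len(bel_i) * avg_internal_weight(edges, bel_i) +
                 len(bel_j) * avg_internal_weight(edges, bel_j))                 # Eq. (5)
    return ri * rc > tr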
Example: Consider three belief networks bel_i, bel_j and bel_k, named A, B and C respectively. The edge weight between clusters, Wc_li, is given by:

    Wc_li = (weight of RI) / (sum of the total number of RI in B)        (6)

where RI is a concept in cluster A. The inner edge weight is assigned by:

    Ic_li = (weight of RI) / (total number of outgoing edges from RI)        (7)

where RI is a concept in cluster A. From the above equations we calculate EC, from which the RI and RC values between the three belief networks are calculated; the pair whose value exceeds the threshold is selected and combined into a single cluster. In this example A and B have higher RC and RI values than A,C and B,C, i.e., from Fig. 4 and Fig. 5 we can conclude that A and B form a cluster, but A and C do not.
Fig. 4 Graph Representing more RI and RC
Fig. 5 Graph Representing less RI and RC
4 Experiments and Evaluation
The experimental setup consisted of two data sets. The first data set contains 200 ACM abstract articles collected from the ACM digital library. The ACM articles are classified according to the ACM computing classification system into five main
categories: general literature, hardware, computer systems organization, software and data. The second data set contains 50 PubMed abstract articles collected from http://www.ncbi.nlm.nih.gov/pubmed. PubMed is an online bibliographic database of life sciences and biomedical information. It includes articles based on bibliographic information from academic journals covering medicine, nursing, etc., as well as further fields such as biology, biochemistry and molecular evolution.
• The sample data set, Example 1: Researchers found nanomaterials and made nanofluids. Nanofluids contains many nanometer-sized particle. These fluids are supplied by several methods. Nanofluids and nanomaterials creates a greater evolution in the research area.
The runtime of the concept-based DB scan algorithm was analyzed and found to be O(n log n), where n is the number of concepts. The result analysis is based on two measures:
1) Concept-based clustering efficiency
The concept-based clustering efficiency is evaluated using two measures, F-measure and entropy. The F-measure combines precision P and recall R. For class i and cluster j they are given by:

    P = M_{i,j} / M_j        (8)

    R = M_{i,j} / M_i        (9)
where M_{i,j} is the number of members of class i in cluster j, M_j is the number of members of cluster j, and M_i is the number of members of class i. The F-measure is calculated as:

    F(i) = 2PR / (P + R)        (10)
When evaluating class i, the cluster with the maximum F-measure is taken as the cluster that maps to class i. The entropy is measured to evaluate the consistency of the clusters: the higher the consistency, the lower the entropy, and vice versa. The entropy is calculated as:

    E_c = Σ_{j=1}^{n} (M_j / M) × E_j        (11)
where M_j is the number of concepts in cluster j and M is the total number of concepts. Table 2 gives the clustering efficiency according to the calculated F-measure and entropy values.
Table 2 Clustering efficiency

                                     F-measure   Entropy
Concept-based DB scan clustering     1.00        0.30
K-Nearest Neighbor (KNN)             0.75        0.89
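A sketch of how the F-measure and entropy reported in Table 2 can be computed from a class/cluster contingency table is given below; the example counts are made up for illustration only.

import math

def f_measure(m_ij, m_i, m_j):
    # Eqs. (8)-(10): precision, recall and F for class i versus cluster j.
    p = m_ij / m_j
    r = m_ij / m_i
    return 2 * p * r / (p + r) if (p + r) else 0.0

def cluster_entropy(class_counts):
    # Entropy E_j of one cluster, from its per-class member counts.
    total = sum(class_counts)
    return -sum((c / total) * math.log(c / total) for c in class_counts if c)

def total_entropy(clusters):
    # Eq. (11): E_c = sum_j (M_j / M) * E_j.
    m = sum(sum(c) for c in clusters)
    return sum((sum(c) / m) * cluster_entropy(c) for c in clusters)

# Illustrative contingency counts: two clusters, entries are per-class counts.
clusters = [[8, 2], [1, 9]]
print(f_measure(m_ij=8, m_i=9, m_j=10))   # F-measure of class 0 in cluster 0
print(total_entropy(clusters))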
Fig. 6 Clustering efficiency
2) Accuracy measure
An accuracy measure is calculated to compare concept mining by different methods. Here the comparison is between two concept-based mining models: i) the keyword-based concept mining model (Vector Space Model) and ii) the proposed concept-based mining model. The accuracy measure is calculated as:

    Accuracy Measure = Σ_{i=1}^{n} C_i
Table 3 Analysis Results

Mining Type              Percentage Accuracy
Keyword based mining     0.75
Concept based mining     0.89
Table 3 shows the calculated percentage accuracy of the concepts retrieved from a document using the two concept mining models, and Fig. 7 plots the comparison. The accuracy rate is higher for the proposed concept-based mining model than for the keyword-based concept mining model.
Fig. 7 Analysis Graph
5 Conclusion
A new concept-based mining model, composed of a triplet representation and a belief network construction of the concepts, is introduced in this paper, followed by a concept-based comparison and clustering model. The triplet representation of the concepts, followed by the belief network construction, provides an efficient method for concept mining compared to other methods, and this representation captures the structure of the sentence semantics effectively. The new concept-based similarity measure analyzes the similarity between concepts based on the belief network structure. A concept-based clustering is also applied at the sentence level, which increases the accuracy of the clusters.
Future work in this area includes the introduction of concept-based indexing, which would further improve the use of the clustered concepts in information retrieval applications. Many concept-based information retrieval systems already exist,
but applying this model to such an application poses a challenge in the area of information retrieval.
Acknowledgements. We are thankful to Dr. M. R. Kaimal, Chairman, Department of Computer Science, Amrita University, for his valuable feedback and suggestions.
References
[1] Shehata, S., Karray, F., Kamel, M.S.: Enhancing Text Clustering Using Concept-based Mining Model. In: ICDM 2006, pp. 1043–1048 (2006)
[2] Shehata, S., Karray, F., Kamel, M.S.: Enhancing Text Retrieval Performance using Conceptual Ontological Graph. In: ICDM Workshops 2006, pp. 39–44 (2006)
[3] Shehata, S., Karray, F., Kamel, M.S.: An efficient concept-based retrieval model for enhancing text retrieval quality. Knowl. Inf. Syst. 35(2), 411–434 (2013)
[4] Aas, K., Eikvil, L.: Text categorisation: a survey. Technical report 941, Norwegian Computing Center (1999)
[5] Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
[6] Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 112–117 (1975)
[7] Giunchiglia, F., Yatskevich, M., Shvaiko, P.: Semantic Matching: Algorithms and Implementation
[8] Yatskevich, M., Giunchiglia, F.: Element level semantic matching using WordNet
[9] Puri, S.: A Fuzzy Similarity Based Concept Mining Model for Text Classification
[10] Cilibrasi, R.L., Vitanyi, P.M.B.: The Google Similarity Distance
[11] Fillmore, C.: The case for case. In: Universals in Linguistic Theory. Holt, Rinehart and Winston, New York
[12] Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall, Upper Saddle River (2000)
[13] Kingsbury, P., Palmer, M.: PropBank: the next level of treebank. In: Proceedings of Treebanks and Lexical Theories (2003)
[14] Ramos, J.: Using TF-IDF to Determine Word Relevance in Document Queries
[15] Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques