A Survey on Semantic Document Clustering

Published by IEEE. The published version is available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7226036 (DOI: 10.1109/ICECCT.2015.7226036).

Maitri P. Naik

Harshadkumar B. Prajapati

Vipul K. Dabhi

Department of Information Technology, Dharmsinh Desai University, Nadiad, India. [email protected]

Department of Information Technology, Dharmsinh Desai University, Nadiad, India. [email protected]

Department of Information Technology, Dharmsinh Desai University, Nadiad, India. [email protected]

Abstract— Clustering is the process of partitioning a set of data objects into subsets. It is a commonly used technique in data mining, information retrieval, and knowledge discovery for finding hidden patterns or objects in data of different categories. The text clustering process deals with grouping an unstructured collection of documents into semantically related groups. In traditional document clustering methods, a document is treated as a bag of words, and the semantic meaning of words is not considered. More informative features such as concept weight are therefore important for accurate document clustering, and this can be achieved through semantic document clustering because it takes meaningful relationships into account. This paper highlights and briefly discusses the major challenges in traditional and semantic document clustering. It identifies five major areas under semantic clustering and presents a survey of 17 studied papers, covering the major significant works. Moreover, the paper also surveys tools, ontology databases, and algorithms that help in applying and evaluating document clustering. The presented survey is used in preparing the proposed work in the same direction: a text clustering system based on concept weight, to be developed using Hierarchical Agglomerative Clustering, the bisecting k-means algorithm, and a Self Organizing Map Neural Network, with the WordNet ontology as background knowledge. Keywords—clustering, semantic clustering, ontology, HAC, bisecting k-means, SOM-NN, clustering algorithms, evaluation measures.

I. INTRODUCTION
Clustering is considered one of the most important unsupervised learning problems. In the clustering process, objects are organized into groups of similar members; hence, a cluster is a collection of objects that are similar to each other but dissimilar to the objects of other clusters [1]. Text clustering divides a collection of text documents into different category groups so that documents in the same category group describe the same topic; it is an elemental function in the process of text mining [2]. Automated document processing can include operations such as document comparison, document categorization, and document selection [3]. Document clustering has very significant uses in many areas of data mining and information retrieval, and clusters of documents are generated automatically from the collection of documents [4].

In the traditional method of document clustering, single, unique, or compound words of the document set are used as features. However, the traditional method does not take semantic relationships into account. Problems such as the synonym problem and the polysemy problem [5] exist in the traditional method; therefore, a bag of original words cannot represent the exact content of a document and cannot produce meaningful clusters. To improve document clustering, there is thus a need for clustering techniques that also bring the meaning of words into the clustering process. One way to solve this problem is to enhance the document representation with background knowledge represented by an ontology [6,7]. Another way is to use the Latent Semantic Analysis (LSA) technique [8]. The polysemy and synonym problems [5] are fundamental problems in unsupervised learning techniques: synonymous terms map to the same concept through different words in the document, while a polysemous term has multiple, disjoint meanings. These two problems can be solved by using LSA in traditional keyword based retrieval. The main use of the LSA technique is to expose the core semantic structure of a document by representing it in a high dimensional space; a variant of LSA is used to reduce the number of dimensions of the word vectors. Another advantage of LSA is that it exploits semantic relationships among concepts to find relevant documents. Semantic document clustering has the important benefit of being able to remove irrelevant documents by recognizing conceptual mismatches. Word Sense Disambiguation (WSD) [9,10] is also used to resolve ambiguity by determining which concept is represented by a word or a phrase in a context. The use of an ontology makes it easier to identify related concepts and their linguistic representatives given a key concept, whereas LSA tries to uncover the hidden conceptual relationships among words and phrases from their linguistic usage patterns [8].
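The LSA idea above can be made concrete with a toy sketch. The vocabulary and the four "documents" below are invented for illustration, and the rank-2 latent space is computed with plain power iteration rather than a full SVD; the point is that the synonyms "car" and "automobile" never co-occur, so their raw bag-of-words similarity is zero, yet their shared context ("engine") pulls them together in the latent space:

```python
import math

# four tiny "documents"; vocabulary invented for illustration
docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["banana", "fruit"],
    ["fruit", "salad"],
]
terms = sorted({t for d in docs for t in d})
X = [[d.count(t) for d in docs] for t in terms]        # term-document matrix

# C = X X^T: term-term co-occurrence matrix
n = len(terms)
C = [[sum(X[i][k] * X[j][k] for k in range(len(docs))) for j in range(n)]
     for i in range(n)]

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def top_eigvec(M, start, iters=100):
    """Dominant eigenvector by power iteration; `start` picks the seed direction."""
    v = [0.1] * len(M)
    v[start] = 1.0
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

# two latent dimensions: the second eigenvector is found after deflating the first
v1 = top_eigvec(C, start=0)
lam1 = sum(matvec(C, v1)[i] * v1[i] for i in range(n))
C2 = [[C[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
v2 = top_eigvec(C2, start=1)

def latent(term):
    i = terms.index(term)
    return (v1[i], v2[i])

# raw bag-of-words vectors of "car" and "automobile" have cosine exactly 0,
# while their latent-space vectors are nearly identical
```

Real LSA implementations compute a truncated SVD of the (usually tf-idf weighted) term-document matrix rather than this hand-rolled two-eigenvector deflation, but the geometric intuition is the same.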
WordNet 2.0 [11] is a lexical database (a collection of words) that can be used to obtain information about words or phrases. It is used as a knowledge base in automatic text analysis and artificial intelligence. WordNet 2.0 contains many sets of synonymous words (synsets), each representing one concept, together with the relationships among different synsets. A few surveys exist on document clustering. The work in [12] presents a survey of clustering algorithms for text data; it also surveys the problem of text clustering, the key challenges of the clustering problem, and the key methods used for text clustering with their relative advantages. The work in [13] provides a survey of document structural similarity algorithms, including various approximation algorithms, the optimal tree edit distance algorithm, and several algorithms for measuring document structure similarity; it also briefly compares their accuracy and performance. The work in [14] gives a brief overview of clustering techniques, their approaches, advantages, and disadvantages. The work in [15] presents an overview of widely used document clustering techniques, and also compares a few of those techniques as well as dimensionality reduction techniques. This paper presents a survey of semantic document clustering with a focus on clustering approaches, algorithms, and tools. It also discusses problems that occur in traditional document clustering and how to solve them, the fundamentals and challenges of document clustering, and the approaches to document clustering together with the differences between them. Moreover, this paper identifies five major areas under semantic clustering and presents a survey of 17 studied papers, covering the major significant works. Tools used for the preprocessing steps, a description of ontology, a discussion of different algorithms, similarity measures for finding the distance between two clusters, and evaluation methods used to improve the accuracy of clusters are also discussed. This paper is structured as follows. Section 2 addresses fundamental concepts, including traditional document clustering, semantic analysis and clustering, their advantages, and challenges in document clustering. Section 3 presents the different approaches to clustering documents and compares them. Section 4 presents a survey of different semantic document clustering works. Section 5 focuses on how semantic document clustering can be achieved.
Section 6 describes the proposed work that we intend to carry out in the near future. Finally, Section 7 concludes the paper and outlines future work.

II. FUNDAMENTALS OF DOCUMENT CLUSTERING AND CHALLENGES IN DOCUMENT CLUSTERING

In this section, traditional and semantic document clustering are described. Advantages and challenges of document clustering are also presented.

A. Traditional Document Clustering
The traditional document clustering model uses the 'bag of words' (BOW) model for document representation. Unfortunately, this model has a critical disadvantage: it ignores the semantic relations between words. It is practically observed that few algorithms significantly improve the performance of document clustering; a drawback of traditional clustering algorithms is that they are unable to distinguish two dissimilar clusters. Traditional document clustering methods use features such as words, phrases, and sequences to create clusters, and they represent documents with the vector space model. In this model, a document is represented as a vector using a term frequency based weighting scheme. However, a term frequency based weighting scheme can only capture the number of occurrences of the terms in a document; therefore, this model cannot fully exploit the semantic correlations between document contents.

B. Semantic Document Clustering
1) Semantics and Semantic Analysis
Semantics is concerned with the study of meaning. It focuses on the relation between signifiers such as words, phrases, signs, and symbols. Semantics relates to meaning in language or logic: it tries to recognize meaning as an element of language and how it is constructed by language. Semantics looks at meaning in language in isolation, in the language itself, and examines the different ways in which the meanings of words can relate to each other. Sentences can also semantically relate to one another in different ways.
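The vector space model of Section II.A can be sketched briefly. The following minimal example (corpus invented for illustration) builds tf-idf weighted document vectors; note how a function word such as "the", which occurs in every document, receives zero weight, while topical terms keep a positive weight:

```python
import math
from collections import Counter

# toy corpus, invented for illustration
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply the same day".split(),
]

def tf_idf_vectors(corpus):
    n_docs = len(corpus)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

vectors = tf_idf_vectors(corpus)
# "the" occurs in all three documents, so idf = log(3/3) = 0 and its weight vanishes
```

This is only the weighting step; documents are then compared with a similarity measure such as cosine, which is where the synonym and polysemy problems described above arise.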

Fig. 1. Document Clustering. Image source [16]

Semantic analysis is the process of relating syntactic structures, from the levels of phrases, clauses, sentences, and paragraphs, to their language-independent meanings. It involves removing the features specific to particular linguistic and cultural contexts. 2) What is Semantic Clustering? Semantic clustering is a technique for developing relevant keywords by concentrating mainly on keywords and keyword phrases that are closely related and associative. Semantic clustering concerns partitioning the points of a data set into distinct groups (clusters) in such a way that two points from one cluster are semantically similar to each other while two points from distinct clusters are dissimilar. C. Advantages of Semantic Clustering Important advantages of semantic clustering are given below: 1) The Latent Semantic Indexing (LSI) technique can achieve dynamic clustering on the basis of the conceptual contents of documents [17]. 2) Clustering groups documents on the basis of their conceptual similarity; it therefore makes the task easier when working with an unknown collection of unstructured text. 3) LSI can carry out example based categorization as well as cross-linguistic concept searching. 4) LSI can also process arbitrary character strings; the technique is not limited to working only with words.

5) LSI has proven to be a good solution for a number of conceptual matching problems; the technique can capture key relationship information, including causal, goal-oriented, and taxonomic information [17]. 6) Information and relationship discovery. 7) Semantic information retrieval methods exploit the advantages of the semantic web to retrieve relevant data. 8) Common text clustering methods have poor capabilities in explaining to their users why a certain result is achieved: they cannot relate semantically nearby terms, and they cannot explain how the resulting clusters are related to one another. Semantic clustering addresses this shortcoming. 9) Semantic information is used to improve evaluation measures such as precision or recall in information retrieval systems and in the clustering process. 10) Useful in data concept construction. 11) Allows for categorical attributes. D. Challenges in Document Clustering 1) General Challenges in Document Clustering Document clustering has been studied for many decades, but it is still far from being a trivial, solved problem. The challenges are [18]: 1) Selection of proper features of the documents to be used in clustering. 2) Selection of a suitable similarity measure among the documents. 3) Selection of a proper clustering method that utilizes the above similarity measure. 4) Implementation of the clustering algorithm in the most efficient way to optimize memory and CPU usage. 5) Actual clustering of objects. 6) Data abstraction. 7) Evaluation. 8) Finding ways to evaluate the quality of the performed clustering. 9) Problem representation, including feature extraction, selection, or both. 10) Definition of a proximity measure suitable to the domain. 2) Semantic Challenges in Document Clustering Several challenges exist for increasing clustering quality. 1) The majority of existing document clustering algorithms do not consider semantic relationships, which generates poor clustering results.
2) Many domain specific ontologies are not available, so mapping concepts to those domains is not possible. 3) When a word sense disambiguation procedure is used, the quality of the clusters is highly dependent on the correctness of that procedure [5].

4) The feature selection step does not consider the effect of the selected features on clustering [5]. 5) Unlike WordNet, Wikipedia is not a structured thesaurus and thus cannot easily be used to handle synonymy and polysemy issues [5]. 6) Lack of uniformity, especially in terms of the benchmark data and baseline algorithms used [5].

III. APPROACHES OF DOCUMENT CLUSTERING

This section describes the two approaches that are used for clustering documents. The traditional approach as well as the semantic approach of document clustering are discussed briefly. A. Traditional Approach of Document Clustering The architecture consists of the following components: syntactic analysis, semantic analysis, and document clustering based on semantics. As explained in [16], this approach describes steps for mining documents based on a semantic understanding of the text, and it relies on analyzing the text in documents as illustrated in the following figure. The text analysis steps consist of syntactic analysis and semantic analysis: syntactic analysis extracts syntactic structural descriptions, and semantic analysis produces a formal knowledge representation of the document contents.

Fig. 2. System Architecture of traditional approach of document clustering. Image source [16]

In this approach, a morphological parser, a syntactic parser, and a semantic parser are used for different purposes. A data structure is needed to capture the semantic relations among the concepts of the input documents; it gives output in the form of a tree or a complete graph. This architecture also consists of three major components: a text parser, a similarity estimator, and the mining processes. These three resources are needed to build a semantic based document clustering system. In most cases, all of these components are connected to each other: the input of one component might be the output of another component and vice versa. This makes the system modular and highly integrated.

The task of the text parser is to read the input text and convert it into a symbolic and canonical knowledge representation. An automatic syntactic analysis would be the first step in a complete parsing procedure. The similarity estimator then takes two parsed texts and determines the semantic distance between them. In this approach, when the parser properly transforms the texts into their semantic representations and the similarity estimator then recognizes their closeness with respect to meaning, the distance measuring operations together with the parsing process form a homomorphism with the human perception of document similarity. Document clustering, document classification, information retrieval, information extraction, and information filtering can all vary in their specifications and requirements; nonetheless, all of these document mining processes require representing documents in some formal way. B. Semantic Approach of Document Clustering In recent years, the word 'ontology' has become very popular in computer science research as well as in the application of computer science techniques to managing scientific and other information. The word 'ontology' here denotes a standardized terminological framework whose terms imply how the information is organized. In this approach, an ontology is used for document clustering. An ontology belongs to a specific domain of knowledge; the domain can be a research field, an enterprise, an industry domain, or any other restricted set of knowledge.
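The similarity-estimator idea can be sketched very simply. The concept sets below are invented stand-ins for parser output; a Jaccard overlap is used here as one possible semantic distance, not as the measure prescribed by [16]:

```python
def concept_similarity(concepts_a, concepts_b):
    """Jaccard overlap between two concept sets produced by a text parser."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if a or b else 1.0

# hypothetical parser output for three short texts
doc1 = {"vehicle", "engine", "repair"}
doc2 = {"vehicle", "engine", "insurance"}
doc3 = {"fruit", "recipe"}
# concept_similarity(doc1, doc2) = 2/4 = 0.5; concept_similarity(doc1, doc3) = 0.0
```

Because the comparison happens over concepts rather than surface words, two documents using different vocabulary for the same concepts can still score as similar.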

As explained in [19], this system includes three main modules: document preprocessing, concept weight calculation based on an ontology, and document clustering with concept weights. The architecture performs the clustering process based on the concept weights supported by the ontology. Using a domain specific ontology, a feature-represented document can be transformed into a concept-represented document. As a result, the target document corpus is clustered according to the concepts representing the individual documents; hence, document clustering proceeds at the conceptual level. Text documents are used for the clustering process in this system.

Fig. 3. Architecture of semantic approach using concept weight. Image source [19]

C. Comparison of Document Clustering Approaches
Table 1 shows the comparison between the two approaches: the traditional approach and the semantic approach. The comparison shows that the semantic approach is better suited than the traditional approach for clustering documents.

TABLE I. COMPARISON BETWEEN THE TWO APPROACHES

1) Data representation method
   Traditional approach: Bag-of-Words (BOW) representation, vector space model.
   Semantic approach: Universal Networking Language, semantic graph.
2) Features
   Traditional approach: Frequency of words, bags of words.
   Semantic approach: Concepts of words, concepts based on background domain knowledge, concept weight, frequent concepts, sense of a word.
3) Background knowledge
   Traditional approach: Ontology is not used as background knowledge.
   Semantic approach: An ontology is taken as background knowledge to improve performance.
4) Function
   Traditional approach: Traditional document clustering concentrates on the syntax in a document, which produces poor clustering results because it ignores the conceptual similarity of terms that do not actually co-occur; semantic understanding of text is therefore necessary to improve the efficiency and accuracy of clustering.
   Semantic approach: Semantic-based text document clustering focuses on the computational semantics of the grammar in the document and hence provides higher accuracy and quality of the resulting clusters. Moreover, this approach improves the performance of clustering.
5) Problems occurred and solved
   Traditional approach: Problems such as polysemy, synonymy, ambiguity, and semantic similarities may occur and may not be captured by traditional mining techniques based on word frequencies in text.
   Semantic approach: Problems such as polysemy and synonymy are resolved by using the Latent Semantic Indexing (LSI) technique, while the ambiguity problem is resolved by the word sense disambiguation method.
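The concept-weight idea from [19] can be sketched as follows. The term-to-concept table here is a hypothetical mini-ontology invented for illustration, not taken from [19]; the weight of a concept is simply the summed frequency of the terms that map onto it:

```python
from collections import Counter

# hypothetical term-to-concept table standing in for a domain ontology
ONTOLOGY = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "apple": "fruit", "banana": "fruit",
}

def concept_weights(tokens):
    """Sum the frequencies of all terms that map to the same ontology concept."""
    weights = Counter()
    for term, count in Counter(tokens).items():
        concept = ONTOLOGY.get(term)
        if concept is not None:
            weights[concept] += count
    return dict(weights)

# "car" and "automobile" collapse into a single concept dimension:
# concept_weights("car automobile apple".split()) == {"vehicle": 2, "fruit": 1}
```

Clustering then operates on these concept vectors instead of raw term vectors, which is what allows documents using different surface vocabulary to land in the same cluster.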

Document clustering based on the semantic approach provides better results, accuracy, and cluster quality than the traditional approach. The performance of text document clustering is also increased by using the semantic approach. IV. SURVEY OF SEMANTIC CLUSTERING APPROACHES Table 2 presents a survey of semantic document clustering approaches. The survey is divided into topics: ontology based clustering, semantic network or graph, frequent concept, latent semantic indexing, and WordNet ontology. For each surveyed paper, criteria such as data representation method, feature vector, similarity measure, clustering algorithm, evaluation method, strength, and weakness are listed.

TABLE II. LITERATURE REVIEW AND COMPARISON OF SEMANTIC DOCUMENT CLUSTERING

Topic: Ontology based clustering

1) Anbarasi M. et al., Feb. 2014 [6]
   Dataset/representation method: Medical dataset
   Feature vector: Text files, keywords
   Similarity measure: Cosine
   Clustering algorithm: K-means
   Evaluation method: Recall, precision
   Strength: Efficient method for retrieving information from a very large database; concept based clustering exhibits better performance
   Weakness: Video and image document retrieval is not supported

2) Jan Paralic, Ivan Kostial, 2003 [20]
   Dataset/representation method: Cystic Fibrosis collection, a subset extracted from a large MEDLINE collection
   Feature vector: Weight of concepts, document vector, query vector
   Similarity measure: Cosine
   Clustering algorithm: TF-IDF approach, LSI approach, Webocrat-like approach based on an ontology
   Evaluation method: Recall, precision
   Strength: The Webocrat-like approach based on an ontology is very promising and provides better retrieval efficiency than LSI or a standard full-text approach
   Weakness: Manual assignment of concepts to the query has been used

3) Andreas Hotho et al., Apr. 2002 [7]
   Dataset/representation method: COSA (Concept Selection and Aggregation): ontology-based heuristics
   Feature vector: Term vector
   Similarity measure: Not mentioned
   Clustering algorithm: K-means
   Evaluation method: Silhouette coefficient
   Strength: COSA is better for the practical purpose of clustering a large document dataset
   Weakness: Does not give different accurate results for different purposes

4) Lei Zhang, Zhichao Wang, 2010 [21]
   Dataset/representation method: Mobile value-added business database
   Feature vector: Ontology semantic node
   Similarity measure: Euclidean distance
   Clustering algorithm: OFW-Clustering
   Evaluation method: Feature weight calculation method
   Strength: The clustering result with a feature weight is more accurate
   Weakness: Dataset is not based on text documents

Topic: Semantic network/graph

5) Yong Wang, Julia Hodges, 2006 [9]
   Dataset/representation method: WordNet semantic network
   Feature vector: Sense of a word
   Similarity measure: Edge-based, node-based
   Clustering algorithm: Word sense disambiguation method; semantic relatedness measures among senses (senseno method and offset method); K-means, Buckshot, HAC, Bisecting K-means
   Evaluation method: Entropy, recall, precision, F-measure
   Strength: For a large dataset the Bisecting K-means method is best; for a small dataset the HAC method is best
   Weakness: Dataset is small

6) Sunita Sarkar et al., 2014 [22]
   Dataset/representation method: Vector space model using UNL graph/link
   Feature vector: Each document
   Similarity measure: Cosine correlation
   Clustering algorithm: Hybrid PSO+K-means algorithm
   Evaluation method: Inter-cluster and intra-cluster distance
   Strength: The UNL link method provides better performance than the term frequency method
   Weakness: Not mentioned

7) B. Choudhary, P. Bhattacharyya, 2002 [23]
   Dataset/representation method: Universal Networking Language (UNL): semantic graph
   Feature vector: Document cluster
   Similarity measure: Not mentioned
   Clustering algorithm: Kohonen Self Organizing Maps
   Evaluation method: Comparison of TF (term frequency), TF-IDF, UNL link, and UNL relation representations
   Strength: The semantic approach performs better than the methods based only on frequency
   Weakness: Dataset too small

Topic: Frequent concept

8) Rekha Baghel, Renu Dhir, Jul. 2010 [24]
   Dataset/representation method: Frequent concepts: semantic relationships between words
   Feature vector: Frequent concept
   Similarity measure: Cosine, inter-cluster
   Clustering algorithm: Frequent Concepts based Document Clustering (FCDC)
   Evaluation method: F-measure, recall, precision
   Strength: FCDC has a better F-measure and provides better accuracy than other algorithms
   Weakness: Users have to provide the number of clusters

Topic: Latent semantic indexing

9) Rifat Ozcan, Y. Alp Aslandogan, 2004 [8]
   Dataset/representation method: SENSEVAL-2 English All task data, image captions dataset
   Feature vector: Concept, word
   Similarity measure: Cosine, word-word similarity pairs
   Clustering algorithm: Page Ranking Soft WSD algorithm
   Evaluation method: Recall, precision
   Strength: Performs well on larger-scale data with more queries
   Weakness: WSD is more difficult in short queries

10) Wei Song, Soon Cheol Park, 2009 [25]
   Dataset/representation method: Reuters-21578
   Feature vector: Unique term
   Similarity measure: Cosine
   Clustering algorithm: Genetic algorithm based on the latent semantic indexing model (GAL)
   Evaluation method: F-measure
   Strength: GAL automatically evolves the proper number of clusters and provides near-optimal text clustering
   Weakness: Not mentioned

11) Eisa Hasanzadeh et al., Jan. 2012 [26]
   Dataset/representation method: Reuters dataset and Hamshahri dataset
   Feature vector: Term, weight
   Similarity measure: Distance between the current position and pbest; distance between the current position and gbest
   Clustering algorithm: PSO+LSI
   Evaluation method: F-measure
   Strength: Obtains the best performance
   Weakness: PSO+K-means algorithms are less effective

12) Chih-Ping Wei et al., 2008 [27]
   Dataset/representation method: Parallel English and Chinese corpus
   Feature vector: Keywords, sentence
   Similarity measure: Cosine
   Clustering algorithm: LSI-based MLDC technique, HAC algorithm
   Evaluation method: Recall, precision
   Strength: LSI provides a better result in reducing the number of dimensions
   Weakness: Not mentioned

Topic: WordNet ontology

13) Mrs. Leena H. Patil, Dr. Mohammed Atique, 2013 [29]
   Dataset/representation method: Reuters corpus, Classic30, and 20 Newsgroups
   Feature vector: Term frequency, document frequency
   Similarity measure: Not mentioned
   Clustering algorithm: tf-idf, tf-df, and tf2 weighting approaches
   Evaluation method: Entropy
   Strength: Using tf-idf, the dimensionality of documents gets reduced without data loss
   Weakness: The percentage of terms removed is higher for tf2 and tf-df than for tf-idf, so there is a higher possibility of data loss

14) Tarek F. Gharib et al., Jan. 2012 [28]
   Dataset/representation method: Three text document datasets (EMail1200, SCOTS, and Reuters text corpora); WordNet
   Similarity measure: Cosine
   Clustering algorithm: K-means, Bisecting K-means, Self Organizing Map neural network
   Evaluation method: Silhouette coefficient
   Strength: SOM-NN improves overall clustering quality compared with the other two algorithms
   Weakness: Not mentioned

15) Andreas Hotho et al., 2003 [10]
   Dataset/representation method: Reuters newsfeeds dataset
   Feature vector: Feature weighting
   Similarity measure: Cosine
   Clustering algorithm: Bisecting K-means with word sense disambiguation
   Evaluation method: Precision, purity, inverse purity
   Strength: Bisecting K-means with WSD and feature weighting improves the effectiveness of text clustering
   Weakness: Text documents obtain a good similarity rating whenever they are related via WordNet synsets or hyponyms

16) Julian Sedding, Dimitar Kazakov, 2004 [30]
   Dataset/representation method: Reuters-21578 news corpus
   Feature vector: Bag of words
   Similarity measure: Cosine
   Clustering algorithm: Bisecting K-means
   Evaluation method: Precision, purity, entropy
   Strength: The automated approach of only using the most common sense seems more realistic yet beneficial
   Weakness: Not all 21578 documents are used; the experiments are restricted to 12344 documents

17) Sunita Sarkar et al., Jun. 2014 [31]
   Dataset/representation method: Nepali text dataset
   Feature vector: Texts represented in terms of the synsets corresponding to a word
   Similarity measure: Cosine
   Clustering algorithm: K-means, Particle Swarm Optimization, PSO+K-means algorithm
   Evaluation method: Intra-cluster and inter-cluster distance
   Strength: The hybrid PSO+K-means performs better than the other two algorithms
   Weakness: Not mentioned

In this survey, most of the works use existing datasets or corpora; crawling sentences from a particular website through a web crawler has not been attempted yet.
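Several of the evaluation measures that recur in Table 2, purity and entropy in particular, can be sketched concisely (cluster contents and gold labels below are invented for illustration):

```python
import math
from collections import Counter

def purity(clusters, labels):
    """clusters: lists of item ids; labels: item id -> gold class."""
    total = sum(len(c) for c in clusters)
    hit = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return hit / total

def entropy(clusters, labels):
    """Weighted average of the class entropy inside each cluster (in bits)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        for count in Counter(labels[i] for i in c).values():
            p = count / len(c)
            h -= (len(c) / total) * p * math.log2(p)
    return h

labels = {1: "sport", 2: "sport", 3: "finance", 4: "finance"}
# a clustering that matches the gold classes: purity 1.0, entropy 0.0
# a maximally mixed clustering of the same items: purity 0.5, entropy 1.0
```

Higher purity and lower entropy indicate clusters that agree better with the gold classes; precision, recall, and F-measure play the analogous role when clusters are matched one-to-one against classes.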

Generally, bags of words are used in traditional document clustering, from which meaningful information cannot be obtained; meaningful clusters are obtained by computing tf-idf weights for the bags of words as features and then mapping them to an ontology. This is an improvement in semantic document clustering. Most of the surveyed works use bisecting k-means, HAC, hybrid PSO+k-means, or SOM according to their advantages and disadvantages. However, the bisecting k-means algorithm, a variant of k-means, used together with HAC or SOM may provide better results, because bisecting k-means has low cost and its clustering quality is very satisfactory: during each split, k-means is run multiple times to find the best clustering among several candidates, which reduces the effect of randomness. Cosine similarity is the measure most commonly used in the literature for measuring the distance between clusters. For measuring accuracy, various common evaluation methods have been used.

V. APPLYING DOCUMENT CLUSTERING

This section discusses the tools used for the preprocessing steps and gives a short description of ontology. It also discusses different types of document clustering algorithms that are used to cluster documents.

A. Tools
Table 3 provides information about different tools, APIs, and algorithms used for preprocessing steps such as tokenization, stop word removal, and stemming.

TABLE III. SURVEY OF DIFFERENT TOOLS USED FOR DOCUMENT PROCESSING

1) Weka [32]
   Purpose: Data preprocessing, classification, clustering, association rules, regression, and visualization; also applies to big data
   Phase of document clustering: Document preprocessing
   Language: Java
   Open source: Yes
   Support for document clustering: Yes

2) RapidMiner [33]
   Purpose: Supports all steps of the data mining process, including results, visualization, validation, and optimization
   Phase of document clustering: Document preprocessing
   Language: Java
   Open source: Yes
   Support for document clustering: Yes

3) MatLab [34]
   Purpose: Data analysis, exploration, and visualization
   Phase of document clustering: Document preprocessing
   Language: MatLab
   Open source: No
   Support for document clustering: Yes

4) Stanford Tokenizer [35]
   Purpose: Text categorization/tokenization; it is deterministic, fast, and efficient, and it provides API access
   Phase of document clustering: Tokenization
   Language: Java
   Open source: Yes
   Support for document clustering: Not available

5) Apache OpenNLP [36]
   Purpose: A machine learning based toolkit for processing natural language text; it includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser, and it has very good APIs
   Phase of document clustering: Tokenization
   Language: Java
   Open source: Yes
   Support for document clustering: Not available

6) Lucene API [37]
   Purpose: Document preprocessing, text indexing and searching
   Phase of document clustering: Stop word removal
   Language: Java
   Open source: Yes
   Support for document clustering: Not available

7) Snowball Stemming API [38]
   Purpose: To reduce the different grammatical forms / word forms of a word (noun, adjective, verb, adverb, etc.) to its root form
   Phase of document clustering: Stemming
   Language: Snowball, Java
   Open source: Yes
   Support for document clustering: Not available

8) The Lancaster Stemming Algorithm (Paice/Husk stemmer) [39]
   Purpose: A standard set of rules providing a 'strong' or 'heavy' stemmer that is quite aggressive in the conflation of words
   Phase of document clustering: Stemming
   Language: Pascal, ANSI C, Perl, and Java
   Open source: Yes
   Support for document clustering: Not available

9) Porter Stemmer [40]
   Purpose: Removes the commoner morphological and inflexional endings from English words; mainly used as part of a normalization process
   Phase of document clustering: Stemming
   Language: ANSI C, Java, Perl, and many other platforms
   Open source: Yes
   Support for document clustering: Not available

B. Ontology Ontology determines what is of interest in a domain and how information about the domain is structured. An ontology is a collection of classes, properties, relationships between classes, and individuals. Many ontology editors are available for developing ontologies, such as Protégé, SWOOP, OntoEdit, Altova, OntoStudio, and SemanticWorks, but Protégé is the most widely used by researchers, professionals, and programmers. Protégé is an open source, freely available ontology editor and knowledge base framework for building intelligent systems [41]. An ontology is a specification of a conceptualization: a body of knowledge that describes some domain, normally a common sense knowledge domain. An ontology is developed, used, reused, and related to other ontologies, and it also needs to be maintained. Ontologies play a crucial role in defining standardized concepts. In computer science and information science, ontologies are used to represent knowledge within a domain. Common components of ontologies include individuals, classes, attributes, relations, function terms, restrictions, rules, axioms, and events [42]. An ontology is a way of specifying relationships among the concepts, objects, and other entities belonging to a particular area of human experience or knowledge. All of these components can be used for document representation and clustering. Categorizations of ontology include application ontology, domain ontology, task ontology, and top-level ontology. C. Algorithms for Clustering Documents Clustering algorithms used for document clustering may be categorized by how they form groups of clusters. There are five main categories of clustering methods: partitioning

algorithms, hierarchical algorithms, density-based, grid-based and model-based. Partitioning algorithms include k-means and k-medoids. Hierarchical algorithms work on either consecutive splitting (divisive) or merging (agglomerative) of groups to form a hierarchy of clusters based on a specified measure of distance between objects. Density-based includes TABLE IV. Clustering Algorithm K-means

• •

DBSCAN algorithm. Model-based clustering methods include COBWEB algorithm. Neural network based Kohonen’s SelfOrganizing Map is also used as clustering algorithm. Particle Swarm Optimization technique is also used for grouping the documents. Table 4 shows advantages and disadvantages of certain clustering algorithms.

TABLE IV. ADVANTAGES AND DISADVANTAGES OF WIDELY USED CLUSTERING ALGORITHMS

K-means
Advantages:
• If the number of variables is large and the value of k is kept small, K-means is usually computationally fast.
• K-means produces tighter clusters, especially if the clusters are globular.
Disadvantages:
• It is difficult to predict the value of k.
• It does not work well with non-globular clusters, nor with clusters of different sizes and densities in the original data.
• Different initial partitions can result in different final clusters.
• The clusters are non-hierarchical and do not overlap.
• It is difficult to compare the quality of the clusters produced.

Bisecting k-means
Advantages:
• Low cost, and its clustering quality is very satisfactory.
• Almost always achieves at least as good quality as HAC, and is more stable, because during each split k-means is run multiple times to find the best clustering among several candidates, which reduces the effect of randomness.
Disadvantages:
• No particular disadvantage.

Hierarchical (agglomerative) clustering
Advantages:
• No a priori information about the number of clusters is required.
• Easy to implement and gives the best results in some cases.
• Can produce an ordering of the objects, which may be informative for data display.
• Smaller clusters are generated, which may be helpful for discovery.
• The quality of clustering obtained from HAC is satisfactory, usually better than regular K-means.
Disadvantages:
• The algorithm can never undo what was done previously.
• Computation is costly; if the initial document collection is large, clustering can be very time-consuming.
• No objective function is directly minimized.
• It is sometimes difficult to identify the correct number of clusters from the dendrogram.

Particle Swarm Optimization
Advantages:
• Over classical approaches: does not require derivatives, is more stable, allows smaller time steps, and is easily parallelizable.
• Over genetic algorithms: simpler to understand and implement, fewer parameters to adjust, and lower computational cost.
Disadvantages:
• Still slow compared to classical approaches.
• Not exactly repeatable in terms of computational cost, which makes comparison hard.

Self-Organizing Map neural network
Advantages:
• Data mapping is easily interpreted.
• Capable of organizing large, complex data sets.
Disadvantages:
• Difficult to determine what input weights to use.
• The mapping can result in divided clusters.
• Requires that nearby points behave similarly.

VI. PROPOSED WORK ON SEMANTIC DOCUMENT CLUSTERING
This section discusses the proposed work, including the architecture used for document clustering, the ontology to be used, the clustering algorithms best suited to this work, and the similarity measures and evaluation methods to be applied. The work is planned to be implemented in the NetBeans IDE.

A. Architecture
The main components of the architecture are document preprocessing, WordNet category mapping, and document clustering. Data preprocessing is a crucial stage in effective document clustering; the preprocessing steps needed in this work are tokenization, stop word removal, and stemming. News articles from a newspaper website will be used as the input documents. For tokenization, the Stanford Tokenizer will be used because it is deterministic, fast, and efficient, and it provides API access. For stop word removal, the Lucene API will be used, and for stemming, the Porter algorithm.
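The preprocessing pipeline described above can be sketched as follows. This is a simplified plain-Python stand-in: the stop list is illustrative and the suffix stripper is a naive placeholder for the Porter stemmer, whereas the actual work plans to use the Stanford Tokenizer and the Lucene API.

```python
import re

# Illustrative stop list; the real system would draw on Lucene's stop word set.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(text):
    # Split on non-letter characters and lowercase
    # (a crude stand-in for the Stanford Tokenizer).
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Naive suffix stripping, NOT the full Porter algorithm:
    # drop a common suffix if a reasonable-length root remains.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

terms = preprocess("The clusters are merged during clustering")
```

For the sentence above, `preprocess` yields the stemmed terms with stop words removed; note that "clusters" and "clustering" collapse to the same root, which is exactly why stemming precedes term weighting.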

Fig. 4. Proposed Architecture

B. WordNet Ontology Mapping
In this work, the WordNet ontology will be used because it is not restricted to any particular domain. WordNet is a large lexical database of English in which nouns, adjectives, verbs, and adverbs are grouped into sets of synonyms, each expressing a distinct concept. After preprocessing, the frequency of each keyword in a document is found and its weight is computed using the tf-idf equation. tf-idf (term frequency - inverse document frequency), denoted tf-idf_ij, measures the importance of term t_j within document d_i:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents containing term t)

tf-idf(t) = TF(t) * IDF(t)

After generating weights for all terms, the semantic weight of each keyword is mapped to the WordNet ontology and different concepts are formed. Concepts are also defined through synonym matching by mapping keywords dynamically against the online WordNet dictionary.

C. Clustering Algorithm
Many clustering algorithms are available for grouping similar documents, such as K-means, Bisecting K-means, and Hierarchical Agglomerative Clustering (HAC); neural-network-based Self-Organizing Maps and swarm optimization algorithms are also used for clustering documents. In this work, Bisecting K-means, the HAC algorithm, and a SOM neural network will be used. Bisecting K-means starts with all objects in a single cluster. It has low cost compared to HAC, yet its clustering quality is very satisfactory: it almost always achieves at least as good quality as HAC and is more stable, because during each split k-means is run multiple times to find the best clustering among several candidates, which reduces the effect of randomness. The HAC algorithm is easy to implement and gives the best results in some cases; it can create an ordering of the objects, which may be informative for data display, and it generates smaller clusters, which may be helpful for discovery. The SOM neural network is suited to organizing large and complex data sets, and its mapping of the data is easily interpreted.

D. Similarity Measure
In cluster analysis, the similarity between two documents needs to be computed. Several similarity measures are available for this, such as Euclidean distance, Manhattan distance, and cosine similarity. Among these, the cosine similarity measure will be used:

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · denotes the dot product and ||d|| the length of a vector.

E. Evaluation Methods
Many evaluation measures are available for assessing the accuracy, efficiency, and quality of clusters, such as entropy, recall, precision, F-measure, silhouette coefficient, purity, and inverse purity.

Entropy:

CE = −∑_t (|D(w) ∩ D(t)| / |D(w)|) log(|D(w) ∩ D(t)| / |D(w)|)

where CE is the cluster entropy and D(w) is the document set.

Recall:

Recall = A / (A + B)

where A is the number of true positives and B the number of false negatives.

Precision:

Precision = A / (A + C)

where C is the number of false positives.

F-measure:

F-measure = (2 * Precision * Recall) / (Precision + Recall)

Purity:

purity(Ω, C) = (1/N) ∑_k max_j |w_k ∩ c_j|

where Ω = {w_1, w_2, ..., w_K} is the set of clusters, C = {c_1, c_2, ..., c_J} is the set of classes, and N is the number of documents.

Silhouette coefficient:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where a(i) is the average dissimilarity of point i to the other members of its own cluster and b(i) is the lowest average dissimilarity of i to any other cluster.

VII. CONCLUSION
In this paper, a detailed survey of semantic document clustering is presented. It includes a survey of the different challenges in traditional and semantic document clustering, the tools available for preprocessing, ontologies, the various algorithms used for clustering documents, and the performance measures available for evaluating clusters. From this survey, it is concluded that document clustering is possible in two ways: traditional and semantic. Preprocessing steps are used for

tokenization, stop word removal, and stemming in both approaches. Moreover, it is observed that the semantic approach to document clustering provides better accuracy, results, and cluster quality than the traditional approach, because an ontology and concept weights are used. In the proposed work, the semantic document clustering approach will be used, with the WordNet ontology employed to obtain meaningful clusters. Hierarchical agglomerative clustering, the bisecting k-means algorithm, and a self-organizing map neural network will be used for clustering the text documents. The proposed approach aims to achieve better accuracy, efficiency, and cluster quality than the traditional approach.

REFERENCES
[1] Clustering: An Introduction, Available on [URL:http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/], Accessed on: 20 Sept., 2014.
[2] H. Tar and T. Nyaunt, "Enhancing Traditional Text Documents Clustering based on Ontology," Int. J. of Comput. Applicat., vol. 33, no. 10, 2011, pp. 38-42.
[3] H. B. Prajapati and V. K. Dabhi, "XML based architectures for documents comparison, categorisation, and scrutinisation," Int. J. of Data Anal. Techniques and Strategies, vol. 2, no. 4, 2010, pp. 385-410.
[4] B. Krishna, P. Satheesh, and S. Kumar, "Comparative Study of K-means and Bisecting K-means Techniques in Wordnet Based Document Clustering," Int. J. of Eng. and Advanced Technology (IJEAT), vol. 1, 2012, pp. 229-234.
[5] S. Fodeh, B. Punch, and P. Tan, "On ontology-driven document clustering using core semantic features," Knowledge and Inform. Syst., vol. 28, no. 2, 2011, pp. 395-421.
[6] M. Anbarasi, V. Iswarya, M. Sindhuja, and S. Yogabindiya, "Ontology Oriented Concept Based Clustering," Int. J. of Research in Eng. and Technology (IJRET), vol. 3, issue 2, 2014.
[7] A. Hotho, A. Maedche, and S. Staab, "Ontology-based text document clustering," KI, vol. 16, no. 4, 2002, pp. 48-54.
[8] R. Ozcan and Y. A. Aslangdogan, "Concept based information access using ontologies and latent semantic analysis," Dept. of Comput. Sci. and Eng., vol. 8, 2004.
[9] Y. Wang and J. Hodges, "Document clustering with semantic analysis," Proc. of the 39th Annu. Hawaii Int. Conf. on Syst. Sci. (HICSS'06), vol. 3, 2006.
[10] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document clustering," 3rd IEEE Int. Conf. on Data Mining (ICDM 2003), 2003, pp. 541-544.
[11] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An on-line lexical database," Int. J. of Lexicography, vol. 3, no. 4, 1990, pp. 235-244.
[12] C. C. Aggarwal and C. Zhai, "A survey of text clustering algorithms," Mining Text Data, Springer US, 2012, pp. 77-128.
[13] D. Buttler, "A short survey of document structure similarity algorithms," Int. Conf. on Internet Computing, 2004, pp. 3-9.
[14] H. Gaudani, K. Lakhani, and R. Chhatrala, "Survey of Document Clustering," Int. J. of Comput. Sci. and Mobile Computing, vol. 3, issue 5, 2014, pp. 871-874.
[15] Y. Xiao, "A Survey of Document Clustering Techniques & Comparison of LDA and moVMF," 2010.
[16] K. Shaban, "A Semantic Approach for Document Clustering," J. of Software, vol. 4, no. 5, 2009, pp. 391-404.
[17] Latent semantic indexing, Available on [URL:http://en.wikipedia.org/wiki/Latent_semantic_indexing], Accessed on: 25th Sept., 2014.
[18] P. Jajoo, "Document clustering," M.Tech thesis, IIT Kharagpur, 2008.
[19] G. Bharathi and D. Venkatesan, "Study of ontology or thesaurus based document clustering and information retrieval," J. of Eng. and Appl. Sci., vol. 7, no. 4, 2012, pp. 342-347.
[20] J. Paralic and I. Kostial, "Ontology-based Information Retrieval," Proc. of the 14th Int. Conf. on Inform. and Intelligent Syst. (IIS 2003), Varazdin, Croatia, 2003, pp. 23-28.
[21] L. Zhang and Z. Wang, "Ontology-based clustering algorithm with feature weights," J. of Computational Inform. Syst., vol. 6, no. 9, 2010, pp. 2959-2966.
[22] S. Sarkar, A. Roy, and B. S. Purkayastha, "Clustering of Documents using Particle Swarm Optimization and Semantics Information," Int. J. of Comput. Sci. & Inform. Technologies, vol. 5, no. 3, 2014.
[23] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics," The 11th Int. World Wide Web Conf. (WWW2002), Honolulu, Hawaii, USA, 2002.
[24] R. Baghel and R. Dhir, "A Frequent Concepts Based Document Clustering Algorithm," Int. J. of Comput. Applicat., vol. 4, no. 5, 2010, pp. 6-12.
[25] W. Song and S. C. Park, "Genetic algorithm for text clustering based on latent semantic indexing," Comput. & Math. with Applicat., vol. 57, no. 11, 2009, pp. 1901-1907.
[26] E. Hasanzadeh, M. P. Rad, and H. A. Rokny, "Text clustering on latent semantic indexing with particle swarm optimization (PSO) algorithm," Int. J. of Physical Sci., vol. 7, no. 1, 2012, pp. 116-120.
[27] C. Wei, C. C. Yang, and C. Lin, "A Latent Semantic Indexing-based approach to multilingual document clustering," Decision Support Syst., vol. 45, no. 3, 2008, pp. 606-620.
[28] T. F. Gharib, M. M. Fouad, A. Mashat, and I. Bidawi, "Self Organizing Map-based Document Clustering Using WordNet Ontologies," Int. J. of Comput. Sci. Issues (IJCSI), vol. 9, no. 1, 2012.
[29] L. H. Patil and M. Atique, "A Semantic approach for effective document clustering using WordNet," arXiv preprint arXiv:1303.0489, 2013.
[30] J. Sedding and D. Kazakov, "WordNet-based text document clustering," Proc. of the 3rd Workshop on RObust Methods in Anal. of Natural Language Data, Assoc. for Computational Linguistics, 2004, pp. 104-113.
[31] S. Sarkar, A. Roy, and B. S. Purkayastha, "A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet," Int. J. on Natural Language Computing (IJNLC), vol. 3, no. 3, 2014.
[32] Weka: Data mining software in Java, Available on [URL:http://www.cs.waikato.ac.nz/ml/weka/], Accessed on: 30 Sept., 2014.
[33] RapidMiner Documentation, Available on [URL:http://rapidminer.com/documentation/], Accessed on: 30 Sept., 2014.
[34] Overview of MATLAB, Available on [URL:http://www.tutorialspoint.com/matlab/matlab_overview.htm], Accessed on: 30 Sept., 2014.
[35] Description of Stanford Tokenizer, Available on [URL:http://nlp.stanford.edu/software/tokenizer.shtml], Accessed on: 30 Sept., 2014.
[36] Description of Apache OpenNLP, Available on [URL:https://opennlp.apache.org/], Accessed on: 2nd Oct., 2014.
[37] Overview of Apache Lucene, Available on [URL:http://lucene.apache.org/], Accessed on: 2nd Oct., 2014.
[38] Introduction of Snowball stemmer, Available on [URL:http://preciselyconcise.com/apis_and_installations/snowball_stemmer.php], Accessed on: 2nd Oct., 2014.
[39] The Paice/Husk stemmer, Available on [URL:http://www.comp.lancs.ac.uk/computing/research/stemming/general/paice.htm], Accessed on: 2nd Oct., 2014.
[40] The Porter stemming algorithm, Available on [URL:http://tartarus.org/martin/PorterStemmer/], Accessed on: 2nd Oct., 2014.
[41] Introduction of Protégé editor, Available on [URL:http://protege.stanford.edu/], Accessed on: 2nd Oct., 2014.
[42] Ontology components, Available on [URL:http://en.wikipedia.org/wiki/Ontology_components], Accessed on: 2nd Oct., 2014.
