A SURVEY ON VARIOUS CLUSTERING TECHNIQUES IN DATA MINING

VANITHA.L1, RAJARAM.P2, PRAKASAM.P3
1 Second Year, M.E., Computer Science and Engineering, Tagore Institute of Engineering and Technology, Salem, Tamil Nadu.
2 Assistant Professor, Department of Computer Science and Engineering, Tagore Institute of Engineering and Technology, Salem, Tamil Nadu.
3 Principal, Department of Electronics and Communication Engineering, Tagore Institute of Engineering and Technology, Salem, Tamil Nadu.

ABSTRACT
A data warehouse holds huge volumes of historical and real-time data on which various operations are performed. As the volume of data grows and the data come from multiple sources, text summarization has become a tedious process. Summarizing such data effectively requires a better clustering methodology. Traditional clustering algorithms such as k-means, k-medoids and CLARA require the number of clusters to be specified in advance. To overcome this constraint, a new algorithm called the hierarchical fuzzy clustering algorithm is proposed. It clusters hierarchical data in natural language, which may contain any number of clusters, and prevents overlapping efficiently.

1 INTRODUCTION
Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It is a computational process at the intersection of machine learning, artificial intelligence, statistics and database systems for discovering patterns in large data sets. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records and dependencies. Database techniques such as spatial indices are used in data mining. The extracted patterns can be viewed as a kind of summary of the input data and may be used in further analysis. Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. While large-scale information technology has evolved separate transaction and analytical systems, data mining provides the link between the two. Mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Clustering is one technique used to perform this analysis. Clustering is the division of data into groups of similar objects.
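The constraint noted in the abstract, that algorithms such as k-means need the number of clusters up front, can be illustrated with a minimal sketch. The function name `kmeans_1d`, the evenly spaced initialization and the toy data are assumptions made for this illustration; they are not taken from any of the surveyed papers.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1-D data. The caller must supply k (1 <= k <= len(points))."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic start (an assumption of this sketch): spread the k
    # initial centers evenly over the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers, clusters


# Hypothetical toy data with three visible groups (around 1, 5 and 9).
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2]
centers3, clusters3 = kmeans_1d(data, k=3)
centers2, clusters2 = kmeans_1d(data, k=2)
print(clusters3)
print(clusters2)
```

With k=3 the sketch recovers the three visible groups. With k=2 the same data is forced into two clusters, which is why a poor choice of k distorts the result.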
Clustering neglects some details in exchange for data simplification. Informally, clustering summarizes the data in a crisp, short form and can be viewed as data modeling. It is therefore related to many disciplines, from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications deal with large datasets that have many attributes, and data mining is the exploration of such data. This survey focuses on clustering and clustering algorithms from a data mining perspective. Clustering is the subject of active research in several fields. Very large datasets with very many attributes of different types complicate clustering and impose unique computational requirements on the relevant algorithms. A variety of algorithms that meet these requirements have recently emerged and have been successfully applied to real-life data mining problems.

2 LITERATURE REVIEW
Hatzivassiloglou et al. proposed SIMFINDER (similarity finder), a tool for statistical similarity measurement and clustering. The main problem for a clustering tool in summarization is to organize small pieces of text from multiple documents into tight clusters. To perform this clustering and capture fine-grained similarity, the authors proposed SIMFINDER, which incorporates an algorithm that groups highly similar sentences or paragraphs for summarization. It demonstrated a quantitative improvement in performance. [1]

Zha H. presented a novel method that uses the mutual reinforcement principle for simultaneous key phrase extraction and text summarization. The main challenge in text summarization is extracting the overall information from a long document that covers a variety of topics. To address this challenge, the author proposed a method that exploits priors embedded in the linear ordering of a document, which improves the quality of sentence clustering. The work also develops a mutual reinforcement principle to compute key phrase and sentence saliency scores. [2]

Radev D.R. demonstrated MEAD, a multi-document summarizer. The problem addressed is generating a summary from the output of a topic detection and tracking system. MEAD generates summaries from the output of such a system, and it was found that MEAD produces summaries similar in quality to those produced by humans. [3]

Aliguyev R.M. proposed an algorithm to find the similarity between very short texts. Existing methods for computing similarity are adapted from approaches used for long text documents; these methods are inefficient and require human input. The author therefore focused on computing the similarity between short texts using semantic information and corpus data. This yields similarity measures that are fairly consistent with human knowledge. [4]

Luxburg U.V. made a study of spectral clustering, one of the most popular clustering algorithms in recent years. Many traditional clustering algorithms work well only on small sets containing fewer than several hundred data objects. Modern clustering algorithms such as spectral clustering are used to overcome this limitation. [5]

Mihalcea R. explained a method that uses corpus-based and knowledge-based measures of similarity to calculate the semantic similarity of texts. Previous work focused mainly on either large documents (e.g. text classification, information retrieval) or individual words. Because a large fraction of the information present today on the Web and elsewhere consists of short text snippets (e.g. abstracts of scientific documents, image captions, product descriptions), this paper focuses on measuring the semantic similarity of short texts. Experiments on a paraphrase data set show that the semantic similarity method outperforms methods based on simple lexical matching, resulting in up to 13% error rate reduction with respect to the traditional vector-based similarity metric. [6]

Wang D. et al. aim to create a compressed summary while retaining the main characteristics of the original set of documents. Sentences are extracted from documents using statistical and machine learning techniques. A major issue in multi-document summarization is that the information contained in different documents often overlaps. It is therefore necessary to find an effective way to merge the documents while recognizing and removing redundancy to avoid repetition. The authors propose a new multi-document summarization framework based on sentence-level semantic analysis (SLSS) and symmetric non-negative matrix factorization (SNMF). SLSS captures the semantic relationships between sentences, and SNMF divides the sentences into groups for extraction. Experimental results on the DUC2005 and DUC2006 data sets demonstrate the improvement of the proposed framework over the existing summarization systems implemented for comparison. [7]

Ruspini E.H. presents new algorithms for fuzzy clustering of relational data. Object data refers to the situation where the objects to be clustered are represented by vectors. Relational data refers to the situation where only numerical values representing the degrees to which pairs of objects in the data set are related are available. Relational clustering is more general in the sense that it is applicable to situations in which the objects to be clustered cannot be represented by numerical features but dissimilarities between object pairs can be measured. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total dissimilarity within each cluster is minimized.
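The medoid-based objective described above can be illustrated with a short sketch. This is a crisp (hard-assignment) simplification written for illustration only: the fuzzy algorithms discussed in the text weight each object by its membership degree, which is omitted here. The function names, the exhaustive search and the dissimilarity matrix are all hypothetical.

```python
from itertools import combinations


def total_dissimilarity(dissim, medoids):
    """Sum, over all objects, of the dissimilarity to the closest medoid."""
    return sum(min(dissim[i][m] for m in medoids) for i in range(len(dissim)))


def best_medoids(dissim, c):
    """Try every set of c objects as medoids (feasible for toy sizes only)."""
    n = len(dissim)
    return min(combinations(range(n), c),
               key=lambda ms: total_dissimilarity(dissim, ms))


# Hypothetical relational data: only pairwise dissimilarities are known.
# Objects 0-2 are mutually close, and so are objects 3-4.
D = [
    [0, 1, 2, 8, 9],
    [1, 0, 1, 7, 8],
    [2, 1, 0, 6, 7],
    [8, 7, 6, 0, 1],
    [9, 8, 7, 1, 0],
]

medoids = best_medoids(D, c=2)
print(medoids, total_dissimilarity(D, medoids))
```

Note that the sketch never needs feature vectors for the objects. It works from the dissimilarity matrix alone, which is the property that makes relational clustering applicable when objects cannot be represented numerically.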
A comparison of the Relational Fuzzy c-Means algorithm (RFCM) with the fuzzy c-medoids algorithm (FCMdd) shows that FCMdd is much faster. The paper thus presents a new relational fuzzy clustering algorithm based on the idea of medoids. [8]

Shi J. and Malik J. proposed a novel approach to the perceptual grouping problem in vision. The approach aims at extracting the global impression of an image rather than focusing on local features and their consistencies in the image data. Wertheimer pointed out the importance of perceptual grouping and organization in vision. Similarity, proximity and good continuation are among the key factors that lead to visual grouping. However, many of the computational issues of perceptual grouping remain unresolved. The paper presents a general framework for this problem, focusing specifically on the case of image segmentation. It proposes a grouping algorithm based on the view that perceptual grouping should aim to extract global impressions of a scene and provide a hierarchical description of it. The normalized cut criterion for segmenting the graph is presented; normalized cut is an unbiased measure of disassociation between subgroups of a graph. [9]

Meila M. and Shi J. proposed a new view of image segmentation by pairwise similarities. The authors focus on pairwise clustering and segmentation. Pairwise clustering stands in contrast to statistical clustering methods, which assume a probabilistic model that generates the observed data points. A similarity function between pairs of points is defined, and a criterion that the clustering must optimize is then formulated. Existing spectral clustering methods are still incompletely understood. The main achievement of this work is to show that there is a simple probabilistic interpretation that serves as an analysis tool for all spectral methods. The normalized cut method arises naturally from this framework, which also provides a principled method for learning the similarity function as a combination of features. [10]

Frey B.J. and Dueck D. presented a method for clustering data by identifying a subset of representative examples (exemplars), which is important for processing sensory signals and detecting patterns in data. Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A usual approach is to use the data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. Affinity propagation takes as input measures of similarity between pairs of data points. It found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time. [11]
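The normalized cut criterion mentioned in the review of [9] and [10] can be made concrete with a small sketch. For a partition of a weighted graph into groups A and B, the criterion of Shi and Malik is Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), where cut(A, B) sums the edge weights between the groups and assoc(A, V) sums the weights from nodes in A to all nodes. The function name `ncut` and the toy graph are assumptions for illustration, and the sketch only evaluates a given partition; it does not perform the spectral optimization.

```python
def ncut(weights, group_a):
    """Normalized cut value of the bipartition (group_a, remaining nodes).

    weights is a symmetric matrix of non-negative edge weights. Both groups
    are assumed to be non-empty and to have at least one incident edge.
    """
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    # Total weight of the edges removed by separating A from B.
    cut = sum(weights[u][v] for u in a for v in b)
    # Total connection from each group to every node in the graph.
    assoc_a = sum(weights[u][t] for u in a for t in range(n))
    assoc_b = sum(weights[u][t] for u in b for t in range(n))
    return cut / assoc_a + cut / assoc_b


# Hypothetical 4-node graph: nodes 0-1 and 2-3 are strongly linked,
# while the two pairs are joined only by a weak edge between 1 and 2.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

balanced = ncut(W, {0, 1})   # cuts only the weak edge
isolated = ncut(W, {0})      # splits off a single node
print(balanced, isolated)
```

The balanced partition scores far lower than the one that isolates a single node, which reflects why the criterion is described as unbiased: normalizing by the association terms penalizes cutting off small, isolated groups.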

| AUTHOR(s)/YEAR | PAPER NAME | PROBLEM | SOLUTION / METHODS |
|---|---|---|---|
| Hatzivassiloglou et al (2001) | SIMFINDER, a flexible clustering tool for summarization | Organizing text from multiple documents | SIMFINDER, a clustering tool |
| H. Zha (2002) | Generic summarization and key phrase extraction using mutual reinforcement principle | Text summarization and key phrase extraction | Spectral graph clustering |
| Radev et al (2004) | Centroid based summarization of multiple documents | Multi-document summarization | Centroid based summarization |
| Aliguyev R.M. (2009) | A new sentence similarity measure and extraction | Text summarization in multiple documents | Text ranking |
| Luxburg U.V (2007) | A tutorial on spectral clustering | Standard clustering problem | Linear graph Laplacian method |
| Mihalcea R (2006) | Corpus based and knowledge based measure of text similarity | Summarizes long text | Simple lexical matching |
| Wang et al (2008) | Multi-document summarization via sentence level semantic analysis | Information contained in different documents often overlaps | Symmetric non-negative matrix factorization |
| Ruspini E.H (1969) | A new approach to clustering | Applicable only to clustering | Fuzzy relational clustering |
| Shi J and Malik J (2000) | Normalized cut and image segmentation | Perceptual grouping | Extraction of global impression |
| Meila M and Shi J (2001) | Learning segmentation by random walk | Probabilistic method to generate data points | Image segmentation by pairwise similarities |
| Frey B.J and Dueck D (2007) | Clustering by passing messages between data points | Exemplars, clustering data based on similarity | Affinity propagation |

3 PROPOSED WORK
The literature survey shows that text summarization is becoming a major challenge in documentation. Clustering plays an important role in summarization, and it can be performed efficiently with a hierarchical clustering method. The proposed method performs hierarchical clustering using a fuzzy technique, with the semantic measure and the correlation coefficient as its two main parameters. It finds and displays the hierarchical structure present in natural language and avoids overlaps efficiently.
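The paper does not spell out the proposed algorithm, so the following is only a rough sketch of the general idea of hierarchical clustering driven by a correlation coefficient. It uses the Pearson correlation between term-count vectors as a stand-in for the semantic measure and plain average-linkage agglomerative merging; fuzzy memberships are not modelled. All function names and the sentence vectors are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def agglomerate(vectors):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = [(i,) for i in range(len(vectors))]
    merges = []

    def linkage(c1, c2):
        # Mean pairwise correlation between members of the two clusters.
        total = sum(pearson(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average correlation.
        a, b = max(((c1, c2)
                    for idx, c1 in enumerate(clusters)
                    for c2 in clusters[idx + 1:]),
                   key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Hypothetical term counts for four sentences over a four-word vocabulary.
sentences = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]

merges = agglomerate(sentences)
print(merges)
```

The merge history is the hierarchy: sentences that correlate strongly are joined first, and the two resulting groups are joined last. No number of clusters is specified in advance, since any level of the hierarchy can be read off after the fact.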

Figure 1 Text Summarization (diagram labels: Raw data, Semantic measure, Level of abstraction, Summarized data)

4 CONCLUSION
From this comparative study it is found that the proposed method supports hierarchical clustering and extracts semantic relationships accurately. Various clustering techniques and algorithms for document summarization have been studied, and each has its own advantages and limitations. The existing techniques examined support only flat clustering and only some of the features needed for summarization. Many open problems remain, such as bilingual summarization, long text summarization, multi-document summarization and extraction of relevant text. The proposed methodology performs hierarchical clustering, overcomes the text overlapping problem and outperforms the other clustering algorithms.

REFERENCES
[1] V. Hatzivassiloglou, J.L. Klavans, M.L. Holcombe, R. Barzilay, M. Kan, and K.R. McKeown, “SIMFINDER: A Flexible Clustering Tool for Summarization,” Proc. NAACL Workshop Automatic Summarization, pp. 41-49, 2001.
[2] H. Zha, “Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering,” Proc. 25th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 113-120, 2002.
[3] D.R. Radev, H. Jing, M. Stys, and D. Tam, “Centroid-Based Summarization of Multiple Documents,” Information Processing and Management: An Int’l J., vol. 40, pp. 919-938, 2004.
[4] R.M. Aliguyev, “A New Sentence Similarity Measure and Sentence Based Extractive Technique for Automatic Text Summarization,” Expert Systems with Applications, vol. 36, pp. 7764-7772, 2009.
[5] R. Kosala and H. Blockeel, “Web Mining Research: A Survey,” ACM SIGKDD Explorations Newsletter, vol. 2, no. 1, pp. 1-15, 2000.
[6] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
[7] J.B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proc. Fifth Berkeley Symp. Math. Statistics and Probability, pp. 281-297, 1967.
[8] G. Ball and D. Hall, “A Clustering Technique for Summarizing Multivariate Data,” Behavioural Science, vol. 12, pp. 153-155, 1967.
[9] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well-Separated Clusters,” J. Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[10] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[11] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley & Sons, 2001.
[12] U.V. Luxburg, “A Tutorial on Spectral Clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[13] B.J. Frey and D. Dueck, “Clustering by Passing Messages between Data Points,” Science, vol. 315, pp. 972-976, 2007.
[14] S. Theodoridis and K. Koutroumbas, Pattern Recognition, fourth ed. Academic Press, 2008.
[15] C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge Univ. Press, 2008.
[16] Y. Li, D. McLean, Z.A. Bandar, J.D. O’Shea, and K. Crockett, “Sentence Similarity Based on Semantic Nets and Corpus Statistics,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 8, pp. 1138-1150, Aug. 2006.
[17] R. Mihalcea, C. Corley, and C. Strapparava, “Corpus-Based and Knowledge-Based Measures of Text Semantic Similarity,” Proc. 21st Nat’l Conf. Artificial Intelligence, pp. 775-780, 2006.
[18] D. Wang, T. Li, S. Zhu, and C. Ding, “Multi-Document Summarization via Sentence-Level Semantic Analysis and Symmetric Matrix Factorization,” Proc. 31st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 307-314, 2008.
[19] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998.
[20] E.H. Ruspini, “A New Approach to Clustering,” Information and Control, vol. 15, pp. 22-32, 1969.
[21] E.H. Ruspini, “Numerical Methods for Fuzzy Clustering,” Information Science, vol. 2, pp. 319-350, 1970.
[22] M. Roubens, “Pattern Classification Problems and Fuzzy Sets,” Fuzzy Sets and Systems, vol. 1, pp. 239-253, 1978.
[23] M.P. Windham, “Numerical Classification of Proximity Data with Assignment Measures,” J. Classification, vol. 2, pp. 157-172, 1985.