Journal of Sci., Engg. & Tech. Mgt. Vol 4 (1), Feb 2012
Semantic Based Approach for Document Clustering
1Sumit Mahalle and 2Dr. Ketan Shah
1M.Tech. Student, MPSTME and 2Associate Professor (IT), MPSTME
1[email protected] and 2[email protected]
Abstract
Traditional document clustering methods mainly use keyword matching for clustering. However, matching does not capture the meaning behind the words, which is the main drawback of traditional text mining methods. This paper presents a semantic based approach to document clustering that relies on semantic notations of the text in documents. The approach is suitable for document clustering because it does not perform simple matching but compares the graphs (parse trees) of two texts and calculates the semantic distance between them. When the parser properly converts text to semantic representations and the similarity estimator identifies their closeness with respect to meaning, documents are assigned to the appropriate cluster and give a proper response at the time of data mining, which traditional methods do not achieve accurately.
Keywords: Document clustering, semantic clustering, parse trees, similarity estimator.
1. Introduction
The World Wide Web is a tremendously rich knowledge base. The knowledge comes not only from the content of the pages themselves, but also from the unique characteristics of the Web, such as its hyperlink structure and its diversity of content and languages. A considerably large portion of the information present on the World Wide Web (WWW) today is in the form of unstructured or semi-structured text databases [2]. The web instantaneously delivers a huge number of these documents in response to a user query. However, due to the lack of structure, users are at a loss to manage the information contained in these documents efficiently. Over the last decade, information on the web has grown very fast, and the web has become a major source of information. Search engines such as Google and Yahoo! were introduced to help find relevant information on the web. However, search engines do not organize documents automatically; they just retrieve documents related to a certain query issued by the user [4]. While search engines are well recognized by the Information Retrieval community, they do not solve the problem of automatically organizing the documents they retrieve. One of the main problems in this rich knowledge base is the presence of a lot of irrelevant data in documents and the organisation of the data. As a result, a user who wants information on some topic gets a large amount of data that he does not intend to use, resulting in the information overload problem [4]. Some of the problems identified with conventional search engines are:
1. Conventional systems mainly use the presence or absence of keywords to mine texts.
2. They only deal with simple word counting and frequency distributions of term appearances.
3. Information overload problem.
4. They do not capture the meaning behind the words, which limits the ability to mine the texts.
Therefore there is a need to cluster the documents properly, so that search on the WWW becomes more effective [1]. In this paper we discuss a technique to automatically cluster these documents into related topics. Clustering is a proven technique for grouping and categorizing documents based on the similarity between them; the main aim here is to do it in a semantic way. The work focuses on the problem of mining useful information from collected web documents using a semantic based approach to document clustering, and on measuring the efficiency of the clustering.
2. Document Clustering
Clustering is one of the techniques used to improve the efficiency of information retrieval. It is a data mining tool for grouping objects into clusters. Clustering divides the objects (documents) into meaningful groups based on the similarity between them. Documents within one cluster have high similarity with each other, but low similarity with documents in other clusters [7].
2.1. Semantic Approach for Document Clustering
The proposed system implements a semantic based approach to document clustering in order to enhance web mining. The focus is on mining useful information from collected web documents by semantically clustering the text of the downloaded web documents [5]. The three main parts of this approach [1] are:
1. Text Parser.
2. Similarity Estimator.
3. Mining Process.
The text parser converts the text into a tree-like structure, called a parse tree, which is based on semantic notations. The similarity estimator measures the percentage of similarity between two trees, and this measure drives the clustering of the documents. The last step is mining the data in response to the query fired by the user [6].
2.2. Semantic Trees
Semantic trees are built from semantic notations that divide the text into seven different fields. The parser stores the text of the document set according to these seven fields [1], which are:
i) Name: Unique identification for the node.
ii) Type: Classification of the entity (e.g., Agent, Object, Action, and State).
iii) Text: The original text.
iv) Syntax: The part of speech tag.
v) Synonyms: Dictionary senses (synonyms) of the entity.
vi) Semantic: Disambiguated meaning of the entity.
vii) Relation: Relation to other nodes in the graph.
Every sentence in the document is categorized into these seven fields in order to differentiate the document sets. By dividing documents in this way we can group similar documents into one particular cluster. The seven-field division extracts the meaning of the sentence and the relation of each word to the other words in the sentence.
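As a minimal sketch of how the seven fields could be held in memory, the Python fragment below defines one record per node. The class name `SemanticNode`, the encoding of the Relation field as (label, target name) pairs, the part-of-speech tags, and the sample sentence are all illustrative assumptions and are not taken from the framework itself.

```python
from dataclasses import dataclass, field, fields
from typing import List, Tuple


@dataclass
class SemanticNode:
    """One node of a semantic tree, carrying the seven fields listed above."""
    name: str      # unique identification for the node
    type: str      # classification of the entity: Agent, Object, Action, State
    text: str      # the original text
    syntax: str    # part-of-speech tag
    synonyms: List[str] = field(default_factory=list)  # dictionary senses
    semantic: str = ""                                  # disambiguated meaning
    # Relation to other nodes, encoded here (an assumption) as
    # (relation label, target node name) pairs.
    relation: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical hand-built nodes for the sentence "The parser converts text".
agent = SemanticNode("n1", "Agent", "parser", "NN",
                     ["analyzer"], "program that analyses text",
                     [("agent_of", "n2")])
action = SemanticNode("n2", "Action", "converts", "VBZ",
                      ["transforms"], "change from one form to another")
obj = SemanticNode("n3", "Object", "text", "NN",
                   ["document"], "written words",
                   [("object_of", "n2")])

# A sentence is represented here simply as the list of its nodes.
sentence_tree = [agent, action, obj]
field_names = [f.name for f in fields(SemanticNode)]
print(field_names)
```

In a real system the parser would fill these records automatically; here they are written by hand only to show the shape of the data.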
[Figure 1: Document division. Semantic analysis divides Document X into paragraphs (Para 1…n), sentences (Sentence 1…n), clauses (Clause 1…n), and words (Word 1…n).]
Figure 1 shows how a document is parsed in a semantic way. Suppose we consider the abstract of an IEEE paper: the first word to the last word make up one clause, the first clause to the last clause make up one sentence, the first sentence to the last sentence make up one paragraph, and finally the first paragraph to the last paragraph make up the entire document [7]. In this way semantic clustering is applied from each word up to the entire document. Sets of such similar documents are put in one cluster.
2.3. Similarity of Trees
We can find the similarity of two trees by computing the percentage similarity of their nodes. Similar trees, i.e. similar texts, are put into one cluster [8]. Suppose we are matching two trees, i.e. two sentences; each sentence is divided into the 7 fields. If 5 of the 7 fields match, those fields are given more weightage, the two sentences are considered nearly similar, and they are put in the same cluster. In this way the parse trees are matched using semantic notations. We may set a threshold (e.g. 75%) and a match pointer such that documents whose similarity reaches the threshold are put in the same cluster.
3. Algorithm
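The field-by-field comparison and the threshold rule can be illustrated with a short Python sketch. Several points are assumptions made only for illustration: every field carries equal weight (the extra weightage for matching fields is not specified in the text), a field "matches" only when its values are exactly equal, and records are grouped greedily against the first member of each cluster. The names `field_similarity` and `threshold_cluster` and the sample records are hypothetical.

```python
from typing import Dict, List

# The seven fields of Section 2.2, used here as dictionary keys.
FIELDS = ("name", "type", "text", "syntax", "synonyms", "semantic", "relation")


def field_similarity(a: Dict[str, object], b: Dict[str, object]) -> float:
    """Return the fraction of the seven fields on which two records agree."""
    matches = sum(1 for f in FIELDS if a[f] == b[f])
    return matches / len(FIELDS)


def threshold_cluster(records: List[Dict[str, object]],
                      threshold: float = 0.75) -> List[List[int]]:
    """Greedy grouping (an assumed procedure): a record joins the first
    cluster whose first member is at least `threshold` similar to it,
    otherwise it starts a new cluster. Returns lists of record indices."""
    clusters: List[List[int]] = []
    for i, rec in enumerate(records):
        for members in clusters:
            representative = records[members[0]]
            if field_similarity(rec, representative) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


# Hypothetical sentence records; a and b differ only in name and text.
a = {"name": "s1", "type": "Action", "text": "parser converts text",
     "syntax": "NN VBZ NN", "synonyms": ["analyzer"],
     "semantic": "transform text", "relation": "agent-action-object"}
b = dict(a, name="s2", text="parser transforms text")
c = {"name": "s3", "type": "State", "text": "web is large",
     "syntax": "NN VBZ JJ", "synonyms": ["internet"],
     "semantic": "size of the web", "relation": "object-state"}

similarity_ab = field_similarity(a, b)          # 5 of 7 fields match
loose = threshold_cluster([a, b, c], 0.70)      # a and b share a cluster
strict = threshold_cluster([a, b, c], 0.75)     # every record stands alone
print(similarity_ab, loose, strict)
```

With equal weights, five matching fields out of seven give a similarity of about 71%, so whether such a pair lands in the same cluster depends on the chosen threshold and on how much extra weight the matching fields receive.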
Part 1: Clustering of stored documents
50 HTML pages of IEEE papers (abstracts) are stored as documents in the dataset. Undesired data (which takes no part in clustering) is removed from the HTML pages. A semantic graph model (SGM) is made for each abstract in the document and its related text in the paragraph. Each sentence is represented by a parse tree. All trees are compared, and clustering is done on the basis of the resulting similarity percentage.
Part 2: Query processing
When the user fires a query, the sentence or word is converted into a tree-like structure (parse tree). The parse tree of the user query and the stored parse trees of all documents are compared using the similarity estimator. The result returns all the similar documents that the user wants, since the clustering is semantic based.
In more detail, the system first stores the set of 50 HTML pages of IEEE papers as documents in a dataset directory. A semantic graph model (SGM) is made for each abstract in the document and its related text in the paragraph. Each sentence is represented by a parse tree, an attributed graph with concepts as nodes and relations between them. All graphs (essentially trees) of the produced sentence representations are connected to one node, which makes an inclusive rooted tree for the whole document. When the user fires a query, the sentence or word is converted into a tree-like structure (parse tree) [9]. The user's parse tree and the stored parse trees of all documents are compared using the similarity estimator. The result gives all similar documents by finding the text or keywords that the user wants, as the clustering is based on the semantic approach.
4. Summary & Discussion
The framework for semantic based document clustering in web text mining starts with the selection of an application domain and the collection of web pages related to that domain, followed by the conversion of the web pages to text documents. The documents are then preprocessed to remove frequently used stop words, and rules are generated according to the associations among the words of the query provided by the user. In the last step, a number of clusters are created from the processed text using semantic parsing based document clustering. The clusters are selected according to the user query, and the text is mined from the documents. This approach differs from traditional approaches in that it finds the meaning first and then clusters the documents. As a result, the user should receive fewer irrelevant results, and documents are clustered into appropriate groups because the semantic fields of one sentence differ from those of another.
5. Future Work
Currently, we are working on coding the framework discussed in the paper. An optimized version of the parse tree generator is being developed, which will be used by a clustering algorithm to group documents. The clustering algorithm will be given a
threshold, which will be the deciding criterion when matching the trees produced by the parse-tree generator.
6. References
[1] Khaled B. Shaban, “A Semantic Approach for Document Clustering”, Journal of Software, vol. 4, no. 5, July 2009.
[2] Andreas Hotho and Gerd Stumme, “Mining the World Wide Web - Methods, Application and Perceptivities”, in Künstliche Intelligenz, July 2007.
[3] Anupam Joshi and Raghu Krishnapuram, “Robust Fuzzy Clustering Methods to Support Web Mining”, in Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD, 1998.
[4] J. C. Bezdek, “Pattern Recognition with Fuzzy Objective Function Algorithms”, Plenum Press, New York, 1988.
[5] Hsinchun Chen and Michael Chau, “Web Mining: Machine Learning for Web Applications”, Annual Review of Information Science and Technology, 2003.
[6] Rizwan Ahmad and Aasia Khanum, “Document Topic Generation in Text Mining by Using Cluster Analysis with EROCK”, International Journal of Computer Science & Security (IJCSS), Volume (4): Issue (2), Aug 2008.
[7] M.E.S. Mendes Rodrigues and L. Sacks, “A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining”, Department of Electronic and Electrical Engineering, University College London, Torrington Place, London, WC1E 7JE, United Kingdom, 2004.
[8] Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, “An Efficient Concept-Based Mining Model for Enhancing Text Clustering”, IEEE Trans. on Knowledge and Data Engineering, Vol. 22, No. 10, Oct 2010.
[9] Jingwen Tian, Meijuan Gao, and Yang Sun, “Study on Web Classification Mining Method Based on Fuzzy Neural Network”, Proceedings of the IEEE International Conference on Automation and Logistics, Shenyang, China, August 2009.