International Journal of Computer Science And Technology (IJCST), Vol. 4, Issue 2, April - June 2013

ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)

Enhancement of Text Retrieval Methods by using Hybrid Approach in Data Mining

1Kirti Joshi, 2Navjot Kaur, 3Rajneet Kaur

Dept. of CSE, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India

Abstract
Data mining, a branch of computer science, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Research paper text retrieval refers to text retrieval techniques applied to research articles and to the literature available in different research papers. The volume of published research papers in different areas, especially in computer science, and therefore the underlying knowledge base, is expanding at an increasing rate. By discovering predictive relationships between different pieces of extracted data, data mining algorithms can be used to improve the accuracy of information. This paper presents a technique for using a soft clustering data mining algorithm to increase the accuracy of text extraction. In the proposed work, the accuracy of text extraction is increased by using KEA with the dictionary approach and by using the Artificial Bee Colony algorithm.

Keywords
Data Mining, Research Articles Text Extraction, KEA, Artificial Bee Colony, Fuzzy C-Means

I. Introduction
This paper aims to use enhanced data mining techniques to extract text from research papers with reasonably high recall and precision. In recent years, along with the development of information technology, the number of research papers has grown rapidly. This creates both a need and a challenge for data mining. Data mining is a process of knowledge discovery in databases, and its goal is to find hidden and interesting information [3]. The technology includes association rules, classification, clustering, and evolution analysis. Clustering algorithms are essential tools for grouping analogous patterns and separating outliers, on the principle that elements in the same cluster are more homogeneous while elements in different clusters are more dissimilar [1-2].
Furthermore, these algorithms do not need to rely on pre-defined classes or training examples and can produce good-quality clusters, so they are well suited to extracting biomedical text. A major challenge for information retrieval in the life science domain is coping with its complex and inconsistent terminology. In this paper we try to devise an algorithm which makes word-based retrieval more robust. We investigate how data mining algorithms based on keywords affect retrieval effectiveness in the biomedical domain. We try to answer the following research question: "How can the effectiveness of word-based research paper retrieval be improved using a data mining algorithm?"

II. Method
Text mining is defined as the automatic discovery of new, previously unknown information from unstructured textual data [3-4]. This process is done in three steps: information retrieval, information extraction and data mining. A primary reason for using data mining on research articles is to assist in the analysis of collections of the available text. The analysis in this paper will be augmented by an experiment-based approach. Before data mining algorithms can be used, a target data set will be assembled. As data mining can only uncover patterns already present in the data, the target data set must be large enough to contain these patterns. Pre-processing is essential to analyze the multivariate data sets before clustering or data mining. The target set is then cleaned. Cleaning removes the observations with noise and missing data.

The text data from the various research papers available to us is first put into a data warehouse. Before the data is put in the data warehouse, a keyword extraction algorithm is used to find the keywords in the full text [8]. This keyword extraction uses a partial parser to extract entity names. The parser uses linguistic rules and statistical disambiguation to achieve greater precision.

The data is then organized into clusters. Clustering is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data. The clusters will be created based on the keywords extracted from our text, using the fuzzy c-means algorithm. Fuzzy c-means (FCM) is one of the most widely used soft clustering algorithms. It is a variant of the standard k-means algorithm that uses a soft membership function.

III. Common Techniques of Data Mining
There are many data mining techniques. The most common techniques used in the field are the following.

A. Artificial Neural Networks
Artificial neural networks are non-linear predictive models that learn through training and resemble biological neural networks in structure. These predictive models are used to find patterns in large databases.

B. Decision Trees
Decision trees are tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a data set in a large database.
Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

C. Genetic Algorithms
Genetic algorithms are optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

D. Nearest Neighbor Method
A technique that classifies each record in a data set based on a combination of the classes of the k record(s) most similar to it in a historical data set (where k ≥ 1). This is sometimes called the k-nearest neighbor technique.

E. Rule Induction
Rule induction is the extraction of useful if-then rules from data based on statistical significance between different records of a database. Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms [8].

F. Artificial Bee Colony Algorithm
Artificial Bee Colony (ABC) optimization has been applied to the travelling salesman problem [9]. ABC optimization is a population-based search algorithm which applies the concept of social interaction to problem solving. When this biological phenomenon is applied to path planning problems for vehicles, it is found to excel in solution quality as well as in computation time. Simulations have been used to evaluate the fitness of the paths found by ABC optimization. The effectiveness of the paths has been evaluated with parameters such as tour length and bee travel time. The travelling salesman problem for the vehicle routing problem (VRP) is solved using the nearest neighbour method, and the evaluation results are then compared with those of the artificial bee colony algorithm. The pursued approach gives the best results for finding the shortest path in the shortest time for moving towards the goal.

G. Fuzzy C Means
Here, the proposed enhanced algorithm is responsible for extracting the keywords present in each full-text biomedical article and storing these keywords in a relation. Then the actual work of the algorithm begins: it starts clustering the keywords. The algorithm initially picks some of the extracted keywords and groups the full-text articles based on them, so that each cluster contains only those articles which contain that keyword. Then it uses fuzzy c-means clustering to combine the clusters on some similarity measure [7]. Two clusters are combined if their similarity measure is greater than or equal to a specified threshold value. The proposed algorithm repeats this process until no more changes are made to the clusters. Finally, the proposed algorithm stores all the clusters in an XML file.
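The keyword grouping and threshold-based merging described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not name its similarity measure, so Jaccard overlap between article sets is assumed here, and the function names, the article identifiers and the threshold value are all hypothetical. The sketch performs a hard merge of keyword clusters and does not compute fuzzy c-means membership degrees.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two article-id sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_keyword_clusters(articles):
    """Create one cluster per keyword, holding the ids of articles that contain it."""
    clusters = {}
    for article_id, keywords in articles.items():
        for keyword in keywords:
            clusters.setdefault(frozenset([keyword]), set()).add(article_id)
    return clusters


def merge_clusters(clusters, threshold):
    """Merge cluster pairs whose similarity is >= threshold until nothing changes."""
    clusters = dict(clusters)
    changed = True
    while changed:
        changed = False
        # Sort the keys so that the merge order is deterministic.
        for k1, k2 in combinations(sorted(clusters, key=sorted), 2):
            if jaccard(clusters[k1], clusters[k2]) >= threshold:
                merged_articles = clusters.pop(k1) | clusters.pop(k2)
                clusters[k1 | k2] = merged_articles
                changed = True
                break  # the cluster set changed, so restart the scan
    return clusters


# Hypothetical toy data: article id -> extracted keywords.
example_articles = {
    "A1": ["clustering", "fuzzy"],
    "A2": ["clustering", "fuzzy"],
    "A3": ["retrieval"],
}
example_clusters = merge_clusters(build_keyword_clusters(example_articles), threshold=0.5)
```

In this toy run the "clustering" and "fuzzy" clusters contain the same articles and are merged, while "retrieval" stays separate. The threshold of 0.5 is arbitrary and would need tuning on real data.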
Our aim is to extract all the full-text articles that may be relevant to the search string provided by the user; for this, out of all the clusters, the cluster with the largest number of articles is our target. The enhanced part of the algorithm is also integrated with dictionary words: if the user enters a wrong word, the enhanced model compares it with the words present in the dictionary, finds the exact word for the related keyword, and displays the correct result as before.

IV. Methodology
Algorithm
1. Collect the documents from the list of the research article database.
2. Extract the keywords from the document set using the KEA algorithm.
3. Refer to the research lexicon and discard the irrelevant keywords.
4. Put the data in following relation so that the full text can be retrieved later using keywords only.
5. Go to step 1 and repeat until all the documents in the list of the research data set are processed.
6. Use the fuzzy c-means algorithm to create clusters on keywords.
7. If the user enters a wrong keyword, the system corrects the keyword by comparing it with the dictionary database and returns the exact matching documents.
8. Enhance the algorithm with the dictionary database approach and the Artificial Bee Colony algorithm (an optimization algorithm which gives the best result for finding the shortest path in the shortest time).
9. Compare the results of the algorithm with other clustering algorithms in terms of efficiency.

V. Conclusion
Extraction of text from research papers is an essential operation. Although many text extraction methods have been developed, this paper presents a novel technique that employs keyword-based article clustering to further enhance the text extraction process. The development of the proposed algorithm is of practical significance; however, it is challenging to design a unified approach to text extraction that retrieves the relevant text articles more efficiently. The proposed algorithm, using a data mining algorithm, seems to extract the text with contextual completeness in overall, individual and collective forms, making it able to significantly enhance the text extraction process from various research paper literature.

References
[1] Venkatadri. M, Dr. Lokanath C. Reddy, "A Review on Data Mining from Past to the Future", International Journal of Computer Application, Vol. 15, No. 7, Feb 2011.
[2] Aparna S. Varde, "Challenging Research Issues in Data Mining, Databases and Information Retrieval", SIGKDD Explorations, Vol. 11, Issue 1.
[3] Sumit Vashishta, Dr. Yogendra Kumar Jain, "Efficient Retrieval of Text for Biomedical Domain using Data Mining Algorithm", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 4, 2011.
[4] Raymond J. Mooney, Un Yong Nahm, "Text Mining with Information Extraction", Multilingualism and Electronic Language Management: Proceedings of the 4th International MIDP Colloquium, September 2003, Bloemfontein, South Africa, Daelemans, W., Du Plessis, T., Snyman, C. and Teck, L. (Eds.), pp. 141-160, Van Schaik Pub., South Africa, 2005.
[5] Shakti Kundu, "An Intelligent Approach of Web Datamining", International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 05, May 2012.
[6] Shobha S. Raskar, D. M. Thakore, "Text Mining and Clustering Analysis", IJCSNS International Journal of Computer Science and Network Security, Vol. 11, No. 6, June 2011.
[7] Malathy K, "An Enhanced Fuzzy c-Means Clustering for Intelligent Prediction", IEEE Transactions on Fuzzy Systems, Vol. 18, No. 2, pp. 274-284, 2010.
[8] Jasmeen Kaur, Vishal Gupta, "Effective Approaches For Extraction of Keywords", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.
[9] Ashita S. Bhagade, Parag. V. Puranik, "Artificial Bee Colony (ABC) Algorithm for Vehicle Routing Optimization Problem", International Journal of Soft Computing and Engineering (IJSCE), Vol. 2, Issue-2, May 2012.
[10] Pavel Berkhin, "Survey of Clustering Data Mining Techniques".
[11] Michael Steinbach, George Karypis, Vipin Kumar, "A Comparison of Document Clustering Techniques", University of Minnesota Technical Report #00-034.
[12] Clifton, Christopher, "Encyclopedia Britannica: Definition of Data Mining", 2010.
[13] Han, J., Kamber, M., "Data Mining Concepts and Techniques", CA: Morgan Kaufmann, 2001.
[14] Badgett RG, "How to search for and evaluate medical evidence", Seminars in Medical Practice 1999, 2, pp. 8-14, 28.
[15] Richardson J, "Building CAM databases the challenges ahead", J Altern Complement Med 2002, 8, pp. 7-8.
[16] Kantardzic, Mehmed, "Data Mining: Concepts, Models, Methods, and Algorithms", John Wiley & Sons, 2003.
[17] Miller, H., Han, J. (eds.), "Geographic Data Mining and Knowledge Discovery", London: Taylor & Francis, 2001.
[18] Manu Aery, Naveen Ramamurthy, Y. Alp Aslandogan, "Topic identification of textual data", Technical report, The University of Texas at Arlington, 2003.
[19] Pavel Berkhin, "Survey of clustering data mining techniques", Technical report, Accrue Software, San Jose, CA, 2002.
[20] Cecil Chua, Roger H.L. Chiang, Ee-Peng Lim, "An integrated data mining system to automate discovery of measures of association", In Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
[21] [Online] Available: http://www.cs.utexas.edu/~ml/papers/discotex-melm-03.pdf
[22] [Online] Available: http://www.enggjournals.com/iSjcse/doc/IJCSE12-04-05-080.pdf


Navjot Kaur received her B.Tech and M.Tech in Computer Science and Engineering from Punjabi University, Patiala Campus, in 2006 and 2008 respectively. She is pursuing a Ph.D in Semantic Web and Web Mining at Punjabi University, Patiala, and is working as an Assistant Professor at Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India. Her research areas are Web Mining and NLP.

Rajneet Kaur received her B.Tech (Computer Science & Engineering) from Baba Banda Singh Bahadur Engineering College, Fatehgarh Sahib, in 2008 and her Master's in Engineering in Computer Science from U.I.E.T., Panjab University, Chandigarh, in 2011. She has one year of teaching experience as a lecturer at Baba Banda Singh Bahadur Polytechnic College, FGS. For the last two years she has been working as an Assistant Professor at Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India. Her main research areas are Data Mining and Digital Image Processing. She has published many research papers in international journals and conferences.

Kirti Joshi received her B.Tech (Information Technology) from Baba Banda Singh Bahadur Engineering College, Fatehgarh Sahib, Punjab, India. She is pursuing an M.Tech (Computer Science and Engineering) at Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India. Her research areas include Data Mining and Natural Language Processing.
