Volume 4, Issue 2, February 2014
ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue: Advanced Developments in Engineering and Technology Conference Held in Lord Krishna College of Engineering Ghaziabad, India
An Information Retrieval (IR) Techniques for Text Mining on Web for Unstructured Data
1Santosh Kumar Paul, 2Madhup Agrawal, 3Shyam Rajput, 3Sanjeev Kumar
1, 2, 3Department of Computer Science & Engineering, 4Department of Information Technology
1, 3, 4ABES Engineering College, 2Lord Krishna Engineering College
1, 2, 3, 4Ghaziabad, India
Abstract— Text mining on the Web is an essential step in the research and application of data mining. Information retrieval (IR), search engines, and outlier detection techniques are the main tools used to look up desired Web information efficiently. However, IR focuses on finding information that is explicitly present in a document, not the latent knowledge in it. A search engine can hardly adapt to the different needs of different customers and provide individual service, and it is very difficult to mine its results further. Outlier detection can find deviations in interestingness. There are two types of data on the Web: structured and unstructured. Various techniques exist to handle structured data; the main problem of text mining on the Web is handling unstructured data. This paper discusses an algorithm for following unstructured data on the Web and for using text mining techniques to extract unstructured data and express it as semi-structured data, so that a pseudo title or serial can be found to classify the information according to the text content of the Web page for the user. The process of Web text mining, the information extraction method, the mining algorithm, and the realization technique are discussed in detail.

Keywords— Web Text Mining; Information Retrieval; Outliers Detection; Unstructured Data

I. INTRODUCTION
Text mining refers to the discovery of non-trivial, previously unknown, and potentially useful knowledge from a collection of texts. Since its origin, text mining has been considered an analog of data mining (interpreted as Knowledge Discovery in Databases, or KDD) applied to text repositories. Text mining is very important because nowadays around 80% of the information stored in computers (not considering audio, video, and images) consists of text.
What gives text mining a clear distinction from data mining is that the latter deals with structured data, whereas text has special characteristics and its explicit appearance is basically unstructured. Following this vision of text mining as data mining on unstructured data, most approaches to text mining have been mainly concerned with obtaining structured datasets from text, called intermediate forms, on which the usual data mining techniques are applied. Table 1 shows the most usual intermediate forms that have been proposed and the minimal text units (atomic data) of which they are composed. For example, the bag of words represents a piece of text by means of a set of weighted terms, like that employed in Information Retrieval. Data mining techniques employed on the intermediate forms of Table 1 include clustering, dependence analysis, and association rules, among others.

Table 1. Minimal text units and their related intermediate forms

Text Unit     Intermediate Form
Words         Bag of Words, N-gram
Concept       Concept Hierarchy, Semantic Graph
Phrase        N-phrase
Paragraph     Paragraph, N-phrase, trends
Document      Document
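To make the word-level intermediate forms of Table 1 concrete, the following minimal sketch builds a bag of words and word n-grams from a short text. It is an illustration only: the function names, the regular-expression tokenizer, and the use of relative frequency as the term weight are assumptions of this sketch, not choices stated in the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Map each term to its relative frequency, a simple assumed term weight."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def word_ngrams(text, n):
    """Return the word n-grams of the text as tuples, in reading order."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "Text mining is data mining on text"
weights = bag_of_words(sample)   # term -> weight
bigrams = word_ngrams(sample, 2)  # list of 2-word tuples
```

Either structure can then be handed to a conventional data mining technique such as clustering; other weighting schemes (for example TF-IDF) could replace the relative frequency used here.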
© 2014, Lord Krishna College of Engineering Ghaziabad, India
Paul et al., International Journal of Advanced Research in Computer Science and Software Engineering 4(2), February- 2014, pp. 67-70

II. WEB TEXT MINING PROCESS
Compared with traditional data and the data warehouse, the information on the Web is semi-structured and/or unstructured, dynamic data [8]. It is therefore very difficult to carry out data mining directly on Web pages; the data on the Web must first go through the necessary processing. A typical Web mining process includes four steps, as shown in Figure 1.
Figure 1. Web text mining common process

1. Resource lookup: this step gets data from the target Web documents. Notably, information resources are not limited to on-line Web documents; they also include e-mail, electronic documents, newsgroups, website log data, and even information in transaction databases formed through the Web [9].
2. Pretreatment and information extraction: this step removes useless information from the acquired Web resources and sorts the information as needed. For example, it automatically cleans advertisement links, cleans surplus format markup, identifies paragraphs or fields, obtains characteristic items, and shapes the data into rules, some logical form, or even relational tables.
3. Mode (pattern) discovery: this step is automatic and can be carried out within one website or across several websites. It is the nontrivial process of discovering valid, novel, latent, usable, and understandable knowledge in forms such as concepts, patterns, rules, regularities, constraints, and visualizations. The techniques it adopts include classification, clustering, association rules, and sequence analysis.
4. Mode (pattern) analysis: this step verifies and explains the patterns produced in the previous step. It can be completed automatically by machine or interactively by analysts and machine.
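As a minimal sketch of the pretreatment step, the code below strips format markup from a Web page and identifies paragraph-level text using only Python's standard `html.parser` module. The class name `TextExtractor`, the helper `extract_paragraphs`, and the chosen sets of block-level and skipped tags are assumptions made for illustration; cleaning advertisement links would need additional site-specific rules that are not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping markup and script/style content."""

    SKIPPED_TAGS = {"script", "style"}                         # content to discard
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}  # paragraph breaks

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def paragraphs(self):
        """Return non-empty text blocks with whitespace normalized."""
        text = "".join(self._chunks)
        lines = (" ".join(line.split()) for line in text.split("\n"))
        return [line for line in lines if line]


def extract_paragraphs(html):
    """Parse an HTML string and return its paragraph-level text blocks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs()
```

The returned list of text blocks can serve as input to the later discovery step, for instance after conversion to one of the intermediate forms of Table 1.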
III. UNSTRUCTURED DATA
Most previous studies of data mining have focused on structured data, such as relational, transactional, and data warehouse data. However, a substantial portion of the available information is stored in text databases, which consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the World Wide Web [4]. Nowadays most of the information in government, industry, business, and other institutions is stored electronically, in the form of text databases. Data stored in most text databases are semi-structured in that they are neither completely unstructured nor completely structured [7]. For example, a document may contain a few structured fields, such as title, authors, publication date, and category, but also some largely unstructured text components, such as the abstract and contents [3]. There has been a great deal of work on the modeling and implementation of semi-structured data in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been developed to handle unstructured documents. Traditional information retrieval techniques [11, 12] become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of the documents, or find patterns and trends across multiple documents.
Thus, text mining has become an increasingly popular and essential theme in data mining [12]. Structured data is a term for data that can be organized into a defined structure. The most commonly used sources of structured data are databases such as SQL databases and Access. For example, Structured Query Language (SQL) allows the selection of information organized in columns (variables) and rows (records). The content of structured data can be organized according to data types, and the data is searchable [11]. Unstructured data refers to (usually computerized) information that either does not have a data model or has one that is not easily usable by a computer program [8]. The term distinguishes such information from data stored in fielded form in databases or annotated in documents. Probably the most common types of unstructured data are image files, PDF, Word and text files, text kept on the Web, and e-mail and log files [11]. Although e-mail can be organized into databases with tools such as Microsoft Outlook, it is still considered raw data rather than structured data.

IV. PROPOSED ALGORITHM
This section defines a few kinds of user intention, that is, what a user wants to do. Our system extracts the context information related to each user intention and measures the relatedness between them.

A. Motivation and Prerequisites
Figure 2. Typical architecture for context awareness

Fig. 2 shows a typical architecture for context awareness. This paper mainly deals with the fundamental data used by the context information interpreter, which processes context information and finally grasps the user intention. Context is divided into natural phenomena and user-intended objects. A natural phenomenon is information that cannot be manipulated by the user, such as weather, wind, and outside temperature. A user-intended object is directly related to the user intention: places such as an office, building, or car, and objects such as a door, handle, desk, chair, computer, monitor, keyboard, pen, paper, computer applications, and so on. In this research, we consider only the intended objects to grasp user intention. For example, if the context information interpreter has received data from the sensors of 'book', 'chair', and 'pen', the system can infer that the user intention is 'writing' or 'studying'. Based on the inferred result, the context-aware application can provide the user with appropriate services such as light adjustment and muting the mobile phone. Therefore, it is important to measure the relatedness between context (information) and user intention. For this purpose, we define a few prerequisites as follows:
R1. Context about natural phenomena is not considered in this research.
R2. The devices (context) include physical objects such as 'computer' and logical objects such as 'computer application'.
R3. Each device surrounding the user has its own sensor that senses user behavior such as access and use.
R4. Context information delivered from each sensor is 100% accurate.
This research builds fundamental data under the prerequisites above to help the context information interpreter grasp user intention. In the following section, we define user intention, extract context information according to the user intention, and measure the relatedness between them.
In addition to the prerequisites, we limit the domain of user intention to the office domain to make this work easier to understand.

B. User Intention and Context
User intentions in an office or laboratory are varied. We strongly believe this method is applicable to grasping diverse and specific intentions. We define 17 kinds of user intention (UI) as follows: UI = {reading, writing, talking, presenting, programming, scanning, printing, eating, opening, closing, searching, drinking, providing, sending, receiving, leaving, entering}. Each element of the UI set is influenced by diverse context information. For example, the intention of the authors in writing this paper can be 'writing', and its related context will be 'computer', 'keyboard', 'paper', 'monitor', 'chair', 'desk', and 'MS-Word'. The context may be more varied than this, depending on the user environment. Therefore, extracting all of the context information and matching it to user intention are important tasks for a pervasive computing environment. To meet this condition, we collect the context from Google 5-gram data extracted from Web documents. Web documents are published by people all over the world, which means that the documents reflect the relatedness between user intentions and real objects (context) well. We use co-occurrence to extract the context information of each intention. The following steps describe this method.
Step 1: Read a token (a 5-gram) from the 5-gram data set.
Step 2: If the token contains one or more elements of UI, the words in the token are considered context information of those elements, and the co-occurrence count of each word is accumulated into storage. Co-occurrence is a basic measure of the probabilistic relatedness between two words [10].
Step 3: Repeat Steps 1 and 2 until all of the 5-gram data is processed.

Through the above steps, we obtain a data set containing user intentions, context information, and co-occurrence counts. The context data can be considered sufficient to grasp real-world patterns. Table 2 shows statistics of the extracted context according to user intention.

Table 2. Statistics of the extracted context according to user intention

UI            Statistical Context    UI            Statistical Context
Reading       49,528                 Writing       51,573
Talking       34,518                 Presenting    41,438
Programming   54,910                 Scanning      13,546
Printing      38,631                 Eating        19,784
Opening       61,653                 Closing       35,642
Searching     58,952                 Drinking      21,621
Sending       67,004                 Receiving     66,813
Leaving       38,054                 Entering      46,907
Providing     27,418                 Total         727,892
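The three extraction steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_context`, the input layout of `(ngram_text, frequency)` pairs, and the choice to weight each co-occurrence by the n-gram's frequency are all hypothetical, and the three sample records are invented toy data rather than real Google 5-gram entries.

```python
from collections import Counter, defaultdict

# The 17 user intentions (UI) defined in Section IV.B.
USER_INTENTIONS = {
    "reading", "writing", "talking", "presenting", "programming",
    "scanning", "printing", "eating", "opening", "closing", "searching",
    "drinking", "providing", "sending", "receiving", "leaving", "entering",
}


def extract_context(ngram_records, intentions=USER_INTENTIONS):
    """Accumulate co-occurrence counts between intentions and context words.

    ngram_records: iterable of (ngram_text, frequency) pairs. This record
    layout and the frequency weighting are assumptions of the sketch.
    """
    cooccurrence = defaultdict(Counter)
    for ngram_text, frequency in ngram_records:        # Step 1: read a token
        words = ngram_text.lower().split()
        for intention in intentions.intersection(words):  # Step 2: UI element found
            for word in words:
                if word != intention:
                    cooccurrence[intention][word] += frequency
    return cooccurrence                                # Step 3: all data processed


# Toy records for illustration only (not real corpus data).
records = [
    ("writing on paper with pen", 120),
    ("a pen for writing notes", 80),
    ("reading a book while eating", 45),
]
context = extract_context(records)
```

Here `context["writing"]` maps each word that appeared alongside 'writing' to its accumulated count. A real run would probably also filter stop words such as 'a' and 'on', which the paper does not discuss.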
V. CONCLUSION
Because Web data are mainly semi-structured, unstructured, and heterogeneously structured, text mining on the Web is related to, but different from, data mining, information retrieval, and search engines. Using text mining techniques, it can follow a designated website or Web page according to the different needs of different customers and provide individual service. The traditional text classification technique has only two processes, training and categorizing. Its classification ability is fixed, and it has no ability to learn continuously. The algorithm in this paper is expanded as "Training → Categorizing → feedback judgment → feedback". This kind of method is closer to machine learning in its real sense, and it gives the algorithm a certain degree of cognitive autonomy. We present experiments on different data sets which demonstrate that our algorithm is more effective and accurate than the traditional algorithm. The approach has extensive applications and utility in Web mining.

REFERENCES
[1] Yang H C, Lee C H. A text mining approach on automatic generation of web directories and hierarchies [J]. Expert Systems with Applications, 2004, 27: 645-663
[2] International Ergonomics Association, http://www.iea.cc/index.cfm [EB/OL]
[3] Yang Y M. An evaluation of statistical approach to text categorization [R]. Technical Report CMU-CS-97-127. Computer Science Department, Carnegie Mellon University, 1997
[4] Xue Wei-min, Lu Yu-Chang. Research on text data mining. Journal of Beijing Union University (Natural Sciences), 2005, Vol. 19, No. 4, 12
[5] Han J, Kamber M. Data Mining: Concepts and Techniques [M]. San Francisco: Morgan Kaufmann Publishers, 2001
[6] C. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. http://informationretrieval.org/
[7] Bing Liu. Web Data Mining - Exploring Hyperlinks, Contents and Usage Data. Springer, 2007
[8] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002
[9] Srihari, R. K., Li, W., Niu, C. and Cornell, T. "InfoXtract: A Customizable Intermediate Level Information Extraction Engine." Journal of Natural Language Engineering, 2006, pp. 1-26
[10] Das-Neves, F., Fox, E. A. and Yu, X. Connecting Topics in Document Collections with Stepping Stones and Pathways. CIKM'05, ACM Press, 2005, pp. 91-98