Use of Search Engine Result Count for Similarity, Duplicate and Substitution Detection in Web Mining: A Survey

Sonal Deshmukh 1, Dr. R. R. Deshmukh 2, Sachin Deshmukh 3

1. [email protected], MCA Department, Jawaharlal Nehru Engineering College, Aurangabad, PIN 431002, INDIA
2. [email protected], Department of CS and IT, Dr. BAM University, Aurangabad, PIN 431003, INDIA
3. [email protected], Department of CS and IT, Dr. BAM University, Aurangabad, PIN 431003, INDIA
Abstract: With the huge amount of information available on the Web, it has become a fertile area for mining research. Search engines develop and adopt different mining algorithms for the different research communities: web content mining, web structure mining and web usage mining. In this paper we focus on duplicate web page detection and on word substitution used to hide the real information in a web page or an e-mail. These problems can be solved by several categories of algorithms; for this survey, we focus on the algorithms that use search engine result counts to solve them.

Keywords: Web mining, data mining, duplicate detection, text substitution, page similarity
1. INTRODUCTION

The World Wide Web is widely known as the largest heterogeneous, dynamic, publicly available database. Because of the huge amount of information available, we are drowning in information and facing information overload [1]. This overload presents problems such as finding relevant information, creating new knowledge out of the available information, personalizing information, and learning about consumers or individual users [2].

Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services. The task of web mining can be subdivided into resource finding, information selection and preprocessing, generalization, and analysis [3]. The major categories of web mining are web content mining, web structure mining and web usage mining [4]. Web content mining is the discovery of useful information from web data, web documents or web contents. Web structure mining [5] aims to discover the model underlying the link structure of the Web. The model is based on the topology of the hyperlinks, with or without descriptions of the links; it can be used to categorize web pages and is useful for deriving information such as the similarity of and relationships between web sites. Web usage mining tries to make sense of the data generated by web surfers' sessions or behaviors. Web usage data includes data from server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data generated by user interaction.

Today many search engines use search queries to fine-tune their search algorithms. If the results of the most requested queries are arranged for easy access, search time decreases drastically; this is why the popular searches of the day (such as special news items) are retrieved quickly by search engines.
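As a concrete illustration of that caching idea, the sketch below keeps the results of popular queries in a small least-recently-used (LRU) store, so repeated searches skip the expensive retrieval step. Nothing here comes from the surveyed systems: the class name, the default capacity and the fetch callable are hypothetical choices made only for this example.

```python
from collections import OrderedDict

class QueryResultCache:
    """Minimal LRU cache for popular search queries (illustrative sketch;
    real engines use far more elaborate, distributed result caches)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._cache = OrderedDict()  # query -> cached result page

    def get(self, query, fetch):
        # `fetch` is a hypothetical callable that runs the full (slow) search.
        if query in self._cache:
            self._cache.move_to_end(query)      # mark as recently used
            return self._cache[query]
        result = fetch(query)
        self._cache[query] = result
        if len(self._cache) > self.capacity:    # evict least recently used
            self._cache.popitem(last=False)
        return result

# Usage: wrap a stand-in search function behind the cache.
cache = QueryResultCache(capacity=2)
slow_search = lambda q: f"results for {q!r}"      # stand-in for a real engine
print(cache.get("news of the day", slow_search))  # computed, then cached
print(cache.get("news of the day", slow_search))  # served from the cache
```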
2. GOOGLE AND ITS PAGERANK

Although many factors determine Google's overall ranking of search engine results, Google maintains that the heart of its search engine software is PageRank [6]. A few quick searches on the Internet reveal that both the business and academic communities hold PageRank in high regard. The business community is mindful that Google remains the search engine of choice and that PageRank plays a substantial role in the order in which webpages are displayed; maximizing the PageRank score of a webpage has therefore become an important component of company marketing strategies. The academic community recognizes that PageRank has connections to numerous areas of mathematics and computer science, such as matrix theory, numerical analysis, information retrieval and graph theory. As a result, much research continues to be devoted to explaining and improving PageRank [7].

The PageRank algorithm assigns a PageRank score to each of more than 25 billion webpages [8]. The algorithm models the behavior of an idealized random Web surfer [9, 10]. This Internet user randomly chooses a webpage to view from the listing of available webpages, then randomly selects a link from that webpage to another webpage. The surfer continues selecting links at random from successive webpages until deciding to move to another webpage by some means other than selecting a link. The choice of which webpage to visit next does not depend on the previously visited webpages, and the idealized Web surfer never grows tired of visiting webpages. Thus, the PageRank score of a webpage represents the probability that a random Web surfer chooses to view that webpage.

To model the activity of the random Web surfer, the PageRank algorithm represents the link structure of the Web as a directed graph: webpages are the nodes of the graph, and links from webpages to other webpages are edges that show the direction of movement. The process for determining PageRank begins by expressing the directed Web graph as the n × n "hyperlink matrix" H, where n is the number of webpages. After adjusting H so that every row is a probability distribution (which yields the stochastic matrix S), Google models the overall behavior of a random Web surfer by forming the matrix G = αS + (1 − α)1vᵀ, where 0 ≤ α < 1 is a scalar, 1 is the column vector of ones, and vᵀ is a row probability vector called the personalization vector.
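To make the random-surfer model concrete, the sketch below builds H, S and G for a toy four-page graph and runs power iteration until the score vector stabilizes. The damping value α = 0.85 and the uniform personalization vector are common illustrative choices, not parameters taken from this survey, and the toy graph is invented for the example.

```python
import numpy as np

def pagerank(links, alpha=0.85, tol=1e-10):
    """Compute PageRank scores by power iteration on the Google matrix.

    links: dict mapping each page index to the list of pages it links to.
    alpha: damping factor, 0 <= alpha < 1 (0.85 is an illustrative choice).
    """
    n = len(links)

    # Hyperlink matrix H: row i spreads probability evenly over i's out-links.
    H = np.zeros((n, n))
    for i, outs in links.items():
        if outs:
            H[i, outs] = 1.0 / len(outs)

    # S: repair dangling pages (all-zero rows) with a uniform distribution,
    # so every row of S is a probability distribution.
    v = np.full(n, 1.0 / n)              # uniform personalization vector
    dangling = H.sum(axis=1) == 0
    S = H.copy()
    S[dangling] = v

    # Google matrix G = alpha*S + (1 - alpha)*1*v^T (dense; fine for toy graphs).
    G = alpha * S + (1 - alpha) * np.outer(np.ones(n), v)

    # Power iteration: pi <- pi G until the surfer's distribution stops changing.
    pi = v.copy()
    while True:
        new_pi = pi @ G
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi

# Toy Web graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; page 3 is dangling (no out-links).
scores = pagerank({0: [1, 2], 1: [2], 2: [0], 3: []})
print(scores, scores.sum())  # the scores form a probability distribution
```

The returned vector sums to one, matching the interpretation above of a PageRank score as the probability that the random surfer is viewing that page.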