
10th International Conference on Information Technology

Change Detection in Web Pages

Divakar Yadav, JIIT University, Noida (India), [email protected]

A.K. Sharma, YMCA, Faridabad (India), [email protected]

Instead of requiring users to track when to visit web pages of interest and to identify what has changed and how, information change monitoring services are becoming increasingly popular, enabling information to be delivered while it is still fresh. Several tools are available to assist users in tracking changes in web pages of interest; examples are Timely Web and URLy Warning [16, 17]. In this scenario it is natural that such tools are gaining popularity. The Web provides access to a wide variety of information, but much of this information is fluid: it changes, moves, and occasionally disappears. Bookmarks, paths over Web pages, and catalogs like Yahoo! are examples of page collections that can become out of date as continuous changes are made to their components. Maintaining these collections requires that they be updated continuously, and tools to help in this maintenance require an understanding of which changes are important and which are not. Keeping these aspects of the Web in mind, in this paper we propose a scheme to detect changes in web documents. In section 3.1, we identify and discuss the types of changes that can take place in web pages. Sections 3.2 and 3.3 discuss the proposed solutions to the problems identified in the previous section. We make use of the tree structure of web documents and traverse down the tree in order to detect changes in both the structure and the content of the document. Instead of traversing the whole tree, we use the levels of the tree structure to detect the addition or deletion of nodes at a specific level or levels. To detect changes in content, we use the R.M.S. value of the ASCII values of the characters of the text extracted from the web document, whether in the form of paragraphs, headers or any other text.

Abstract

A large amount of new information is posted on the Web every day. Consider news portals, which change not only every day but every hour. How important a given piece of information is depends on the perception of the specific user. The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper, we describe an approach for building a system for efficiently monitoring changes to Web documents. We discuss the mechanism that our proposed algorithm uses to discover and detect changes to Web pages efficiently, and our solution for finding new information in a web page by tracking changes in the document's structure. In the methodology section, we present the algorithm and techniques for detecting web pages that have changed, extracting changes from different versions of a web page, and evaluating the significance of web changes. Our algorithm for extracting web changes consists of three steps: document tree construction, document tree encoding, and tree matching (based on the concept of the R.M.S. value of the content), and detects two basic types of changes: structural changes and content changes. It has linear time complexity and effectively extracts the changed content from different versions of a web page.

1. Introduction

The World Wide Web (the Web), as one of the most popular applications on the Internet, continues to grow at an astounding speed. Not only does the size of the static web increase by approximately 15% per month, but the number of dynamic pages generated by programs has been growing exponentially. The rapid growth of the Web has affected the ways in which fresh information is delivered and disseminated.

0-7695-3068-0/07 $25.00 © 2007 IEEE DOI 10.1109/ICIT.2007.37

J.P. Gupta, JIIT University, Noida (India), [email protected]

2. Related work

The question that arises as soon as changes in web pages become the centre of concern is: what makes retrieving and managing web changes an effective method for retrieving new information from the web? Studies show that although the Web is growing and changing fast, the absolute amount of changed content on existing web pages during a short period is significantly smaller than the total amount of content on the Web. Two recent studies, summarized below, show the need for management of changes in web pages.

Ntoulas [18] collected a historical database for the web by downloading 154 popular Web sites (e.g., acm.org, hp.com and oreilly.com) every week from October 2002 until October 2003, for a total of 51 weeks. The average number of web pages downloaded weekly was 4.4 million. The experiments show that a significant fraction (around 50%) of web pages remained completely unchanged during the entire period studied. To measure the degree of change, they compute the shingles of each document and measure the difference in shingles between different versions of web documents. They show that many of the pages that do change undergo only minor changes in their content: even after a whole year, 50% of the changed pages are less than 5% different from their initial version.

Fetterly [19] performed a large crawl that downloaded 151 million HTML pages, then attempted to fetch each of these pages ten more times over a span of ten weeks from Dec. 2002 to Mar. 2003. For each version of each document, they compute a checksum and shingles to measure the degree of change. The degree of change is categorized into six groups: complete change (no common shingles), large change (less than 30% common shingles), medium change (30%-70% common shingles), small change (70%-99% common shingles), no text change (100% common shingles), and no change (same checksum). Experiments show that about 76% of all pages fall into the groups of no text change and no change. The percentage for the group of small change is around 16%, while the percentage for the groups of complete change and large change is only 3%.

The above results support our approach. They suggest that incremental methods may be very effective in updating web indexes, and that searching for new information on the web by retrieving changes requires a small amount of data processing compared to the huge size of the Web.

It has been discussed in [6] how to estimate the change frequency of a web page by revisiting the page periodically. A study [1] examines how often a crawler should visit a page when it knows how often the page changes. Their initial experiment tries to answer the following questions about the evolving web: 1) How often does a web page change? 2) What is the lifespan of a page? 3) How long does it take for 50% of the web to change? 4) Can we describe changes of web pages by a mathematical model? References [1] and [2] experimentally study how often web pages change. References [1] and [5] study the relationship between the desirability of a page and its lifespan. Some papers investigate page changes to improve web caching policies.

Compared with research in active databases, the WebCQ [11] system differs in several ways: first, WebCQ targets monitoring and tracking changes to arbitrary web pages; second, WebCQ monitors data provided by content providers on remote servers, and its monitoring and tracking service requires neither control over the data it monitors nor structural information about the data it is tracking.

A study [8] by Rocco, Buttler and Liu has contributed to this field by developing a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation produces many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Using the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document compared to using a standard Document Object Model implementation. The experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag-name redundancy found in Web documents, allowing a 30% to 50% reduction in document size. They used depth-first traversal to parse the document tree.

Various online tools to detect and notify users of web page changes are available on the internet; examples are Timely Web [16] and URLy Warning [17]. There has been much research in this field, as it is a relatively new topic in the field of hypertext. Researchers are trying to improve the efficiency of change detection algorithms in order to improve the efficiency of web crawlers.
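Both studies above quantify change through shingles: a document is reduced to the set of its word w-grams, and the degree of change between two versions is the fraction of shingles they share. The following is an illustrative sketch of that measure (our own class and method names, not the code used in [18] or [19]):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative shingle-based change measure: reduce each version of a
// document to its set of w-word shingles, then report the fraction of
// shingles common to both versions (1.0 = identical, 0.0 = complete change).
public class ShingleChange {

    // Build the set of w-word shingles of a document.
    static Set<String> shingles(String text, int w) {
        String[] words = text.trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + w; j++) {
                if (j > i) sb.append(' ');
                sb.append(words[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Fraction of shingles shared by the two versions (Jaccard similarity).
    static double commonShingleRatio(String oldDoc, String newDoc, int w) {
        Set<String> a = shingles(oldDoc, w);
        Set<String> b = shingles(newDoc, w);
        if (a.isEmpty() && b.isEmpty()) return 1.0;  // both empty: no change
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // One word differs: 3 of 5 distinct 2-word shingles are shared.
        double r = commonShingleRatio("the quick brown fox jumps",
                                      "the quick brown fox sleeps", 2);
        System.out.println(r);  // 0.6
    }
}
```

A version falling in the "small change" group of [19] would score between 0.70 and 0.99 under this measure.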

3. Proposed Methods

3.1 Classification of changes: Broadly speaking, these changes can be classified into four major categories:
1. Content/Semantic changes refer to modifications of the page contents from the reader's point of view.

For example, a page created for a soccer tournament might be continuously updated as the tournament progresses. After the tournament has ended, the page might change to a presentation about the tournament results and sports injuries.
2. Presentation/Cosmetic changes are changes related to the document representation that do not reflect changes in the topic presented in the document. For instance, changes to HTML tags can modify the appearance of a Web page while it otherwise remains the same.
3. Structural changes refer to the underlying connection of the document to other documents. As an example, consider changes in the link destinations of a "Weekly Hot Links" Web page. While this page might be conceptually the same, the fact that the destination of the links has changed might be relevant, even if the text of the links has not. Structural changes are also important to detect, as they often might not be visually perceptible.
4. Behavioral changes refer to modifications to the active components of a document. For Web pages, this includes scripts, plug-ins and applets. The consequences of these changes are harder to predict, especially since many pages hide the script code in other files.

Fig.1 Types of changes in the web pages

Structural changes can occur in the HTML or XML document (structured or semi-structured) of a specific web page. These changes can take the form of updated content (addition of a new node to the tree structure, deletion of nodes from the structure), image or link insertion/deletion, etc. Techniques like page checksums, digital fingerprinting and page mirroring are used for detecting content or semantic changes. We are using an improved version of Page Digest [8] to detect structural changes in the web page structure, i.e. in the HTML tag structure. The suggested algorithm is based on level order search, which is another form of breadth-first search: we move through the tree level by level. The basic modules that constitute the application are:
• Document tree construction (takes an HTML file as input, parses it with the help of a parser (as discussed later), identifies the opening tags as nodes of the tree, and constructs the tree of the given web page while maintaining the parent-child relationships);
• Level order child enumeration (tree traversal/parsing by level order);
• Tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction); and
• Change detector and presenter.

Our extractor is very effective for real web data in practice and has linear scalability. The change detector works on two semantic phases: content changes and structural changes. We can further divide this into three views:
1. HTML view
2. Content view
3. Tree view
The HTML view shows and compares the old and new versions of the web page, e.g. of a news portal. It extracts the HTML source code of those pages, and the changed portion is displayed in the source code itself. The content view deals with the changed portions of the contents of the web page versions; it includes text, paragraphs, headers, etc. An R.M.S. value calculator computes the R.M.S. value of each piece of text, paragraph or heading. The tree view traverses the document tree structure and presents the changes by comparing the number of nodes and the array structure of the HTML page: for example, if the numbers of nodes obtained by traversing the old and new versions of the tree differ, a node has been added to or deleted from the page structure. In all cases we need some fact or figure (i.e. a signature) on the basis of which we can distinguish the content of a node of the web page.

In this section, we study the types of problems that arise in detecting new or changed information in web pages, and for each problem we present a practical solution and analyze its efficiency. The problems and solutions are:
1) How can changes in web pages be detected effectively? First, we classify the problem into two broad categories: structural changes and content-based changes. We detect changes using a hypertext document tree encoding of the document. The first module renders an ordered tree structure of the input HTML web page and counts the number of nodes; if the node count differs between versions, the structure of the web page has changed, i.e. a node has been added or deleted. A further case may arise where the older and newer versions of the HTML tree structure have equal numbers of nodes, but the order of the tags has changed or a nested structure has been added. Parsing plays an important role in detecting these types of changes, and we chose our technique after studying different existing techniques based on data structures. The technique we use incorporates level order tree traversal, which has the advantage of traversing the tree level by level.
2) How can changes between different versions of web documents be presented/extracted effectively? We present an efficient and effective algorithm for comparing different versions of web documents. The change extractor based on this algorithm includes three phases: document tree construction, level order child enumeration (tree traversal/parsing by level order), and tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction). Our extractor is very effective for real web data in practice and has linear scalability.
3) How should a change in structure or content be detected and evaluated? We have constructed a specialized set of arrays for HTML pages that captures the relationships among the tree nodes, i.e. whether a node is a child or a parent. By analyzing the contents of the arrays, we can easily determine the difference between the older and newer versions of the web page. There are different cases in an HTML tree structure where the order of the tags plays an important role.
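The level order enumeration that the modules above rely on can be sketched as follows. This is an illustrative rendering in Java, not the paper's implementation; the `Node` class and method names are our own. A breadth-first pass over the tag tree records how many nodes sit on each level, which is the per-level count the change detector compares:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of level order (breadth-first) enumeration of a tag
// tree: one pass yields the node count of every level, without requiring a
// traversal order that descends the whole tree depth-first.
public class LevelOrder {

    static class Node {
        String tag;
        List<Node> children = new ArrayList<>();
        Node(String tag) { this.tag = tag; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Breadth-first traversal that returns the node count of each level.
    static List<Integer> levelCounts(Node root) {
        List<Integer> counts = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            int width = queue.size();       // all nodes on the current level
            counts.add(width);
            for (int i = 0; i < width; i++) {
                Node n = queue.remove();
                queue.addAll(n.children);   // enqueue the next level
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // html > body > table > tr; a second tr appears in the new version.
        Node oldTree = new Node("html").add(new Node("body")
                .add(new Node("table").add(new Node("tr"))));
        Node newTree = new Node("html").add(new Node("body")
                .add(new Node("table").add(new Node("tr")).add(new Node("tr"))));
        System.out.println(levelCounts(oldTree));   // [1, 1, 1, 1]
        System.out.println(levelCounts(newTree));   // [1, 1, 1, 2]
    }
}
```

Comparing the two count lists immediately localizes the added node to one level, so only that level of the tree needs to be examined further.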

3.2 Change Detection

Method for structural change detection: We consider the following sample HTML web page, whose structure is to be examined for changes using the proposed algorithm.

Fig. 2 An example web page

There may be two types of changes in web pages: (1) addition/deletion of a node carrying a tag of the HTML page; (2) modification of a tag, a tag value, or the content. Every node of the tree representing the web page stores the following information:
1) ID: this index stores the unique id representing the node;
2) CHILD: this index stores the information about the node's children;
3) PARENT: this index stores the information about the node's parent;
4) LEVEL: this index stores the level at which the node exists;
5) CONTENT VALUE: this index stores the R.M.S. value of the ASCII values of the characters of the content;
6) TAG NAME: this index stores the tag name.

Fig. 3 HTML Tree structure of web page

Addition/deletion of a node (tag) in the tree: Suppose a new node is introduced into the tree, i.e. a new tag is added to the HTML code, as described below.

Fig 4. Initial structure (before addition)

Fig 5. Final structure (after addition of TR)

The following is the change in the structure:

LEVEL     INITIALLY   LATER
LEVEL 1   1           1
LEVEL 2   1           1
LEVEL 3   1           2
LEVEL 4   1           1

Initially:
LEVEL      = {1 2 3 4}
ID         = {1 2 3 4}
CHILD      = {1 1 1 NULL}
LEVEL_ARRY = {1 1 1 1}

Later:
LEVEL      = {1 2 3 3 4}
ID         = {1 2 3 5 4}
CHILD      = {1 2 1 NULL NULL}
LEVEL_ARRY = {1 1 2 1}
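The comparison of the two LEVEL_ARRY vectors can be sketched as follows (illustrative Java under our own naming, not the paper's code): walk the per-level counts of the old and new versions in parallel and report the first level where they disagree.

```java
// Illustrative sketch: compare the per-level node counts (LEVEL_ARRY) of
// two versions of a page and return the first level at which they differ,
// so that only that level of the tree needs to be re-examined.
public class LevelArrayDiff {

    static int changedLevel(int[] oldArr, int[] newArr) {
        int n = Math.max(oldArr.length, newArr.length);
        for (int i = 0; i < n; i++) {
            int a = i < oldArr.length ? oldArr[i] : 0;  // missing level = 0 nodes
            int b = i < newArr.length ? newArr[i] : 0;
            if (a != b) return i + 1;                   // levels are 1-based
        }
        return -1;                                      // structures agree level by level
    }

    public static void main(String[] args) {
        int[] before = {1, 1, 1, 1};   // initial LEVEL_ARRY
        int[] after  = {1, 1, 2, 1};   // after the TR is added at level 3
        System.out.println(changedLevel(before, after));  // 3
    }
}
```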

4. Results

By using the above techniques, we have been able to detect the structural and content-based changes successfully. Although much research has been conducted in the past on detecting changes in web pages, our technique gives linear performance in comparison to others, since the level order traversal avoids traversing the whole tree: our technique aims at traversing only the changed portion of the tree, thus saving time. We conducted our tests on several practical web pages and were able to extract the text successfully and present the changes between the two versions of a web page, i.e. the old and new versions.

Now, by comparing the two sets, we see that the modification has occurred at LEVEL 3. To find the modification at a given level, we use level order traversal; the algorithm for level order traversal with breadth-first search gives us the location where the change has taken place. The second case is when the modification has been made to the value or content of the text. Example: initially, the structure of the node is as given below.

5. Conclusion and Future Work

Fig 6. Initial content of the arrays

We presented a study of how to model web changes and how to detect them in web pages that are volatile, i.e. that change almost every day. The proposed algorithm extracts changes between different versions of web pages. We have been able to identify structural as well as content-based changes by developing an efficient application in Java, which scores over other techniques on the basis of simplicity and understandability. Since level order traversal is a form of breadth-first traversal, it may inherit some of the drawbacks of breadth-first traversal. First, we found that detecting changes using HTTP metadata can successfully reduce network traffic. Second, existing algorithms for extracting changes using tree edit distance have a high computational cost, which is inappropriate for large-scale search engine development. We proposed a new algorithm that reduces the cost to linear using both tree encoding and level-by-level tree matching. Tracking changes can successfully retrieve a summary, but not the complete content, of a newly created page. A good future direction is to integrate both changed content and newly created content into a unified search index; such an index can have higher coverage of new information on the web. A further question is whether updates on popular pages, which are more likely to be updated by web authors, are of high quality for retrieval. Cho [1] proposed the concept of page quality, which is closely related to the popularity metrics of web pages. In our view, if such a quality metric can be used to evaluate the updates of web pages, then people may be able to develop

Calculation of the R.M.S. value before the change:
Cont_value = RMS('l', 'e', 'f', 't') = RMS(108, 101, 102, 116) = 106.9170239.
Now the structure of the node is:

Fig 7. Final array structure

Calculation of the R.M.S. value after the change:
Cont_value = RMS('r', 'i', 'g', 'h', 't') = RMS(114, 105, 103, 104, 116) = 108.5375511.
For anagrams, and in general for texts in which the same characters appear at different positions, e.g. NAME and MANE, the R.M.S. value is the same. To avoid this confusion, we propose a solution in which we multiply the position of each character by its ASCII value and then apply the same R.M.S. formula.

We take the anagrams NAME and MANE as an example. In the first word the positions are N=0, A=1, M=2, E=3; in the second they are M=0, A=1, N=2, E=3. Since we multiply these positions by the ASCII value of each character, the R.M.S. values will differ. The formula for the R.M.S. value is:

Cont_value = sqrt( (v1^2 + v2^2 + ... + vn^2) / n )

where vi is the (position-weighted) ASCII value of the i-th character and n is the number of characters.
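The two signature computations can be sketched in Java (illustrative code with our own naming; the paper gives only the worked numbers). Note that, following the paper's convention, the position weight starts at 0, so the first character does not contribute to the weighted signature:

```java
// Illustrative sketch of the content signature: the R.M.S. of the ASCII
// codes of the characters, and the position-weighted variant that
// distinguishes anagrams such as NAME and MANE.
public class RmsSignature {

    // Plain R.M.S.: sqrt( (c1^2 + c2^2 + ... + cn^2) / n ).
    static double rms(String text) {
        double sumSquares = 0;
        for (char c : text.toCharArray()) sumSquares += (double) c * c;
        return Math.sqrt(sumSquares / text.length());
    }

    // Position-weighted variant: each code is first multiplied by its
    // position (0-based, as in the paper's N=0, A=1, M=2, E=3 example),
    // so anagrams produce different signatures.
    static double weightedRms(String text) {
        double sumSquares = 0;
        for (int i = 0; i < text.length(); i++) {
            double v = i * (double) text.charAt(i);
            sumSquares += v * v;
        }
        return Math.sqrt(sumSquares / text.length());
    }

    public static void main(String[] args) {
        System.out.println(rms("left"));                          // ~106.917, as above
        System.out.println(rms("NAME") == rms("MANE"));           // true: anagrams collide
        System.out.println(weightedRms("NAME") == weightedRms("MANE")); // false
    }
}
```

The plain signature reproduces the worked values above (106.9170239 for "left", 108.5375511 for "right"), while the weighted signature separates the anagram pair.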


solutions to improve the quality, as well as the freshness, of a web index for retrieval. We leave as a good future direction the question of how web change detection based on change frequency, quality, popularity, etc. can be used in a unified framework for the web index synchronization problem.

References:
[1] Junghoo Cho and Hector Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. Department of Computer Science, Stanford University, CA 94305, December 2, 1999.
[2] Jenny Edwards, Kevin McCurley and John Tomlin. An Adaptive Model for Optimizing Performance of an Incremental Web Crawler. 2000.
[3] Junghoo Cho and Hector Garcia-Molina. Synchronizing a database to improve freshness. Submitted for publication, 1999. http://www-db.stanford.edu/~cho/papers/cho-synch.ps
[4] Junghoo Cho, Hector Garcia-Molina and Lawrence Page. Efficient crawling through URL ordering. In Proceedings of the 7th World-Wide Web Conference, 1998.
[5] Junghoo Cho (University of California, Los Angeles) and Hector Garcia-Molina (Stanford University). Effective Page Refresh Policies for Web Crawlers.
[6] Junghoo Cho (University of California, Los Angeles) and Hector Garcia-Molina (Stanford University). Estimating Frequency of Change.
[7] Arvind Arasu, Junghoo Cho and Hector Garcia-Molina. Searching the Web. Stanford University, 1999.
[8] Daniel Rocco, David Buttler and Ling Liu. Page Digest for Large-Scale Web Services. Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, USA. In Proceedings of the IEEE International Conference on E-Commerce (CEC'03), 2003.
[9] David Buttler, Daniel Rocco and Ling Liu. Efficient Web Change Monitoring with Page Digest. 2001.
[10] Budi Rahardjo and Roland H.C. Yap. Automatic Information Extraction from Web Pages. School of Computing, National University of Singapore, 2001.
[11] Ling Liu, Calton Pu and Wei Tang. WebCQ – Detecting and Delivering Information Changes on the Web. 2000.
[12] Luis Francisco-Revilla, Frank M. Shipman III, Richard Furuta, Unmil Karadkar and Avital Arora. Perception of Content, Structure, and Presentation Changes in Web-based Hypertext. Center for the Study of Digital Libraries and Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, USA, 2001.
[13] Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar and Frank Shipman. Managing Distributed Collections: Evaluating Webpage Changes, Movement, and Replacement. Department of Computer Science and Center for the Study of Digital Libraries, Texas A&M University, College Station, TX 77843-3112. In JCDL'04, June 7–11, 2004, Tucson, Arizona, USA.
[14] Latifur Khan, Lei Wang and Yan Rao. Change Detection of XML Documents Using Signatures. Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688.
[15] Shuohao Zhang, Curtis Dyreson and Richard T. Snodgrass. Schema-Less, Semantics-Based Change Detection for XML Documents. Washington State University, Pullman, Washington, USA. In WISE 2004, LNCS 3306, pp. 279–290, Springer-Verlag, 2004.
[16] Timely Web tool, www.timelyweb.com/index.html
[17] URLy Warning, www.bleepingcomputer.com
[18] A. Ntoulas, J. Cho and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th International World Wide Web Conference, 2004.
[19] D. Fetterly, M. Manasse, M. Najork and J. Wiener. A large-scale study of the evolution of web pages. In Proc. 12th International World Wide Web Conference, 2003.

