Monitoring Web information changes
Sergio Flesca, Filippo Furfaro and Elio Masciari
DEIS, Univ. della Calabria, 87030 Rende, Italy
{flesca,furfaro,masciari}@si.deis.unical.it
In this paper we present a graph-based approach to answer the question stated above. We also present the system CDWeb, which allows users to monitor a web page according to their needs. More specifically, the user can specify which portion of a document he is interested in and what kind of change is relevant to his goal (e.g. one user may be more interested in structural changes in the document, while another may be interested in semantic changes). For example, consider a web document concerning on-line auction sales: a user may want to be alerted if a change occurs in one of the items he wants to buy (e.g. if the quotation of an article has changed or if new offers are available on the site). The problem of finding a minimum cost edit script that transforms a document into its modified version is computationally too expensive (it is NP-complete, as can be proved by reduction from the “exact cover by three-sets” problem [3, 2]), so we need a technique able to detect changes with a reasonable degree of efficiency. We represent the document as a tree, and the user focuses on specific portions of it, i.e. sub-trees. Once a particular sub-tree has been chosen, we store it and define an appropriate trigger that handles change notification. The system periodically loads the document to be monitored and checks whether something has changed in the portion of interest. Since recovering the exact history of changes is a hard problem, we have chosen to obtain less precise information about changes. Instead of searching for the sequence of changes that has occurred, we pay attention to how much the changes have modified the examined document. Indeed, in many application contexts, once we are able to detect that something has changed our task is complete and no more information is needed (e.g. for on-line trading it is important to know whether a stock quote has changed, regardless of the type of change).
Abstract
The growth of the amount of information available on the Internet has led to a question: how can we use this information to improve business, research and so on? Often we need to access some information only if it has changed, but this request can be more specific: we may want to observe structural changes in a document, or simply whether some textual information has been modified. This is very tedious work if no appropriate support exists. In this work we present a system called CDWeb that performs change detection. The system we have implemented allows the user to monitor the whole document or a specific portion of it, depending on the user's request. The user can also specify what kind of change he is more interested in, such as structural or semantic changes. To assess the effectiveness of our algorithm we have also performed a sensitivity analysis.
1 Introduction
The rapid growth of the Internet has produced an enormous amount of information, located everywhere. Not only is the number of static web pages growing rapidly, but the number of dynamically generated web pages is growing exponentially as well. Dynamic pages are usually used to deliver information that changes over time. So web users very often need to keep track of page changes, but want to access these pages only when the information has changed. To fulfil this kind of request, a number of information change monitoring systems have been developed in recent years [5]. These services take charge of accessing web pages and try to identify when and how the page of interest has changed. When dealing with change detection in semistructured data the main question is: can I query changes efficiently? This is a hard question due to the lack of a fixed data structure.
2 Web Changes Monitoring
Our technique is designed to detect changes in small portions of web documents. In order to specify the information that has to be monitored, the user starts by choosing a portion of a Web document (a sub-tree in the document tree) that contains the information he is interested in. This sub-tree is the region that the system tries to identify in the modified document. Then the user defines a set of portions of the selected sub-document whose changes have to be monitored.
Work partially supported by the projects “DataX” and “TelCal” funded by MURST (Italy)
Once the user has properly specified his conditions, the system reloads the selected Web pages at regular intervals (or on user request) and checks for changes in the desired sub-tree. To do so it has to find the sub-tree of interest in the new (possibly changed) document; this is achieved using a graph-based algorithm. The algorithm constructs a complete bipartite graph whose nodes are the nodes of the stored document tree and the nodes of the current document tree. Each edge is labeled with a measure of the similarity between the nodes it connects. This graph is used to calculate the similarity between the stored sub-tree and the sub-trees of the new document, using a technique that is explained in the following. To retrieve the portion of interest we need to test each sub-tree of the new document; the sub-tree of the new document that is most similar to the stored one is taken as the current version of the monitored information.
Figure 1. A DOM Tree
The bipartite graph built using the selected sub-tree is then used to obtain information about changes. In particular, every sub-tree of the monitored sub-tree can be associated with a sub-tree of the current version and vice-versa. A trigger is fired for each piece of information that the user wants to monitor. In the remainder of this section we explain the technique in more detail.
The DOM gives an abstract definition of XML data but does not imply any concrete implementation: each application that uses the DOM model is free to store the document using its own representation. In this work we assume the presence of an alphabet over which the strings of the document are built, of a set of element types that contains the possible structuring markup of the document formatting language, and of a set of attribute names. A document tree is an ordered tree whose nodes are typed by their corresponding markup type and whose leaves carry the actual text strings. Moreover, a set of attribute-value pairs is associated with each node. Given a document tree $T$ whose root is $r$, and a node $n$ of $T$, we denote with $text(n)$ the text associated with $n$. This text is obtained by concatenating the text associated with the leaves of the sub-tree of $T$ rooted in $n$. We also denote with $type(n)$ a list of strings that contains the concatenation of the HTML types of the nodes on the path from $r$ to $n$, and with $att(n)$ the set of HTML attributes of the node $n$. The triple $(text(n), type(n), att(n))$ constitutes the characteristics of the node.
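The node characteristics described above can be illustrated with a minimal Python sketch. It is not the CDWeb implementation: the `Node` class, the helper names `text_of`, `type_of` and `characteristics`, and the small tree at the end are all hypothetical, chosen only to show how the text, the type path and the attribute set of a node could be derived from an ordered tree.

```python
class Node:
    """One node of a document tree (hypothetical minimal representation)."""

    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag              # markup type, e.g. "Td"
        self.attrs = attrs or {}    # attribute-value pairs
        self.text = text            # only leaves carry text
        self.children = []
        self.parent = None

    def add(self, child):
        """Append child as the last child of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def text_of(n):
    """Words of the leaves below n, concatenated in document order."""
    if not n.children:
        return n.text.split()
    return [w for c in n.children for w in text_of(c)]


def type_of(n):
    """Tags on the path from the root down to n, joined with dots."""
    tags = []
    while n is not None:
        tags.append(n.tag)
        n = n.parent
    return ".".join(reversed(tags))


def characteristics(n):
    """The triple (text, type, attribute names) describing node n."""
    return text_of(n), type_of(n), set(n.attrs)


# Hypothetical tree: T -> Tr -> two Td cells, each holding one P leaf.
root = Node("T")
row = root.add(Node("Tr"))
p1 = row.add(Node("Td")).add(Node("P", {"A": "x"}, "This is"))
p2 = row.add(Node("Td")).add(Node("P", text="an example"))
```

Calling `characteristics(p1)` on this tree yields the words of the leaf, the dotted path `T.Tr.Td.P` and the attribute set of that node, which is the kind of triple the similarity measures operate on.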
2.1 Document Model
Several different data models have been proposed to represent Web documents. For instance, the World Wide Web Consortium (W3C) has defined a “generic” model, named Document Object Model (DOM), which defines the basic structure used to represent HTML and XML data. As an example, we show a simple model for an XML document. In the DOM model, XML documents have a logical structure which looks like a “forest” (a set of trees). For instance, the following XML document has the DOM structure reported in Figure 1.
Example 2.1 Consider the document tree in Figure 2. Each letter denotes the HTML tag corresponding to the node (e.g. T refers to Table, Tr to table row, Td to table data, etc.). For the root we have:
and vice-versa. We also weight each attribute type differently, since some attributes are considered less relevant than others. Finally, typedist($r_1$, $r_2$) measures the difference between the complete types of the two elements $r_1$ and $r_2$, where differences are weighted in inverse relationship to their distance from the actual HTML element type of $r_1$ and $r_2$. The complete similarity function between two tree elements is the weighted sum of intersect, attdist and typedist. This weighted sum is then shifted to obtain a value in the interval [-1, 1], where 1 denotes maximum similarity and 0 is taken as the similarity degree with a new node (insertion or deletion). We can use the node similarity function to define a method to detect document changes. We have to identify in the current document the sub-tree that is most similar to the stored one, so we need a procedure to evaluate the similarity between document trees. We first construct a complete bipartite graph starting from the document trees representing the old version of the page (or a portion of it) being observed (call it $T_1$) and the new version of the page (call it $T_2$), in the following way:
Figure 2. A simple document tree
we have two sets of nodes $N_1$ and $N_2$, given respectively by the nodes in $T_1$ and $T_2$;
$text(r)$ = This, is, an, example
we label each (undirected) edge between a node in $N_1$ and a node in $N_2$ with the similarity degree of these two nodes.
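The construction above can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper names the three measures (intersect, attdist, typedist) and says they are combined in a weighted sum shifted to [-1, 1], but the concrete formulas and the weights `W_TEXT`, `W_ATT`, `W_TYPE` used here are hypothetical, and each node is assumed to be a dictionary with keys `text` (list of words), `type` (list of tags from the root to the node) and `att` (attribute-value pairs).

```python
from itertools import product

# Hypothetical weights (they sum to 1); not taken from the paper.
W_TEXT, W_ATT, W_TYPE = 0.5, 0.2, 0.3


def intersect(words1, words2):
    """Assumed text measure: shared words over all words, in [0, 1]."""
    s1, s2 = set(words1), set(words2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0


def attdist(attrs1, attrs2):
    """Fraction of attributes whose value differs between the two nodes."""
    keys = set(attrs1) | set(attrs2)
    if not keys:
        return 0.0
    return sum(attrs1.get(k) != attrs2.get(k) for k in keys) / len(keys)


def typedist(path1, path2):
    """Type-path difference; tags close to the element itself weigh more."""
    rev1, rev2 = path1[::-1], path2[::-1]      # element's own tag first
    diff = total = 0.0
    for i in range(max(len(rev1), len(rev2))):
        weight = 1.0 / (i + 1)
        total += weight
        tag1 = rev1[i] if i < len(rev1) else None
        tag2 = rev2[i] if i < len(rev2) else None
        diff += weight * (tag1 != tag2)
    return diff / total if total else 0.0


def similarity(n1, n2):
    """Weighted sum of the three measures, shifted from [0, 1] to [-1, 1]."""
    score = (W_TEXT * intersect(n1["text"], n2["text"])
             + W_ATT * (1.0 - attdist(n1["att"], n2["att"]))
             + W_TYPE * (1.0 - typedist(n1["type"], n2["type"])))
    return 2.0 * score - 1.0


def bipartite_graph(old_nodes, new_nodes):
    """Complete bipartite graph: one labelled edge per (old, new) pair."""
    return {(i, j): similarity(a, b)
            for (i, a), (j, b) in product(enumerate(old_nodes),
                                          enumerate(new_nodes))}
```

With these assumed formulas, two nodes with identical characteristics get the label 1 and two nodes that share nothing get -1; every other pair falls in between.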
$type(r)$ = T, $att(r)$ = $\emptyset$
Obviously many edges will have a very low similarity, but we are interested in finding the subset of the edges that maximizes the similarity between $T_1$ and $T_2$. Therefore we must introduce the notion of tree mapping. Every change that can occur in a document alters the association between the nodes in the document tree of the original page and the nodes in the current page, resulting in a variety of possible mappings. We start by giving a definition of tree mapping.
And for the node $n$ reached through the path T.Tr.Td.P we have:
$text(n)$ = This, is; $type(n)$ = T.Tr.Td.P; $att(n)$ = A
2.2 Node similarity
Definition 2.1 (Tree Mapping) Given two document trees $T$ and $T'$, with node sets $N$ and $N'$ and roots $r$ and $r'$, a tree mapping $M$ from $T$ to $T'$ is a relation $M \subseteq N \times N'$ such that $(r, r') \in M$.
Given two document trees, their similarity degree is a function of the similarity of their nodes. In order to measure it, we assign to each sub-tree root a signature built from the previously defined characteristics of the node (its text, type and attributes), as follows. Given two sub-trees $T_1$ and $T_2$, rooted respectively in $r_1$ and $r_2$, we define three similarity measures, one for each term of the characteristics of the node: intersect, attdist and typedist. In particular, intersect($r_1$, $r_2$) denotes the hull of $r_2$ w.r.t. $r_1$, and attdist($r_1$, $r_2$) gives a measure of the number of attributes in $r_1$ which have a different value in $r_2$
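Given two document trees $T$ and $T'$, a tree mapping $M$ from $T$ to $T'$ and a node $n$ in $T$, we denote with $M(n)$ the set of nodes of $T'$ associated with $n$ in $M$; given a node $n'$ in $T'$, we denote with $M^{-1}(n')$ the set of nodes of $T$ associated with $n'$ in $M$.
Definition 2.2 (Simple Edit Mapping) Given two document trees $T$ and $T'$, a simple edit mapping from $T$ to $T'$ is a tree mapping $M$ such that every node of $T$ is associated with at most one node of $T'$, and every node of $T'$ with at most one node of $T$.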
The definition implies that the change model we follow is unable to detect very complicated edit operations, such as the split of a tree node or the join of multiple nodes into a new one. This restriction is essentially due to efficiency reasons. It is important to note that for several applications, such as the detection of updates in auction pages, this simple change model suffices. To define the similarity degree of two document trees, we first give the notion of tree similarity with respect to a given simple edit mapping. It is defined as the average, over the nodes, of the similarity coefficient labelling the edge that leaves each node in the mapping, taking zero for a node with no outgoing edge in the mapping. The similarity degree of two document trees is defined as the maximum tree similarity over all possible simple edit mappings. The algorithm for change detection is very simple, since we can refer to the Maximum Matching problem. It takes as input a stored document sub-tree $T'_1$ and the current document $T_2$, and returns as output a sub-tree $T'_2$ of $T_2$ that matches $T'_1$ together with a mapping between $T'_1$ and $T'_2$. It constructs a complete bipartite graph between $T'_1$ and $T_2$, searches for the sub-tree $T'_2$ of $T_2$ whose maximum similarity matching with $T'_1$ is maximum, and returns this sub-tree and this matching. The algorithm has polynomial complexity, and experimental results confirm that, in the case of simple changes, it always finds the changes made to the document.
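The search described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm used in CDWeb: it finds the best one-to-one matching by exhaustive search with memoization, which is exponential in the number of candidate nodes and only suitable for tiny trees, whereas the paper relies on a polynomial maximum matching algorithm. It also does not force the two roots to be matched with each other. The function names `best_matching` and `locate`, and the representation of each candidate sub-tree as a flat list of nodes, are assumptions.

```python
from functools import lru_cache


def best_matching(weights):
    """weights[i][j] is the similarity between stored node i and new node j.

    Returns (tree similarity, mapping), where the tree similarity is the
    average over the stored nodes of the weight of their matched edge
    (zero for unmatched nodes) and mapping is a dict {i: j}.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0

    @lru_cache(maxsize=None)
    def solve(i, used):
        # Best (total weight, pairs) for stored nodes i..n-1, given the
        # bitmask `used` of new nodes that are already matched.
        if i == n:
            return 0.0, ()
        best_score, best_pairs = solve(i + 1, used)   # leave i unmatched
        for j in range(m):
            if not used & (1 << j):
                score, pairs = solve(i + 1, used | (1 << j))
                score += weights[i][j]
                if score > best_score:
                    best_score, best_pairs = score, ((i, j),) + pairs
        return best_score, best_pairs

    total, pairs = solve(0, 0)
    return (total / n if n else 0.0), dict(pairs)


def locate(stored, candidates, sim):
    """Pick the candidate sub-tree most similar to the stored one.

    stored: list of nodes of the stored sub-tree.
    candidates: dict mapping a sub-tree name to its list of nodes.
    sim: node similarity function returning a value in [-1, 1].
    Returns (similarity, name, mapping), or None if there are no candidates.
    """
    best = None
    for name, nodes in candidates.items():
        weights = [[sim(a, b) for b in nodes] for a in stored]
        score, mapping = best_matching(weights)
        if best is None or score > best[0]:
            best = (score, name, mapping)
    return best
```

A trigger condition could then be evaluated on the returned similarity and mapping, for example by firing when the similarity of the located sub-tree drops below 1.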
Figure 3. System Architecture
Change Detection module. This is the main module, since it executes the change detection algorithm explained above to verify changes and alert the user. Its input is a pair of documents: a web document and a previously stored piece of that document. Both are represented as graphs so that the algorithm can be run on them; once the similarity measure for the documents has been obtained, it is passed to the server.
Query Builder. A user may monitor a web page for any type of change, so we allow the user to specify the monitoring condition using the syntax given below. Note that when a user selects a piece of the document being observed he also specifies a reference name for that portion (users are not aware of how the system represents that portion of the document internally). Once the piece to be monitored is selected, the user may set the condition for notification according to the WebTrigger syntax shown in Figure 4.
Change Presentation module. This module contains a graphical user interface that allows the user to browse the Web page he wants to monitor and, when a trigger fires, presents the piece of information being observed and the detected changes side by side. We made this choice to make the changes discovered by the system easy to recognize.
3 Implementation Issues
In this section we describe the system architecture of CDWeb, paying particular attention to the definition of user triggers for monitoring purposes. The change detection module of the system is available at the Web site: http://isi-cnr.deis.unical.it:1080/ masciari/CDWeb/
3.1 System Architecture
The system is implemented in a modular way; its main modules are the Change Detection module, the Change Monitoring server, the Query Builder and the Change Presentation module. The overall structure is shown in Figure 3.
Change Monitoring server. When a Web page has to be monitored, the user stores the piece of the document to be monitored on the change monitoring server; user-defined queries are stored on the server as well. Periodically the server reloads the observed Web page and sends it, together with the previously stored information, to the Change Detection module, which verifies whether any changes have occurred. If relevant changes are detected and some triggers fire, the information is sent to the Change Presentation module.
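The reload-and-check cycle of the Change Monitoring server can be sketched in Python as a simple polling loop. This is a hypothetical outline rather than the CDWeb code: `fetch`, `detect` and `notify` are placeholder callables standing in for page download, the Change Detection module and the Change Presentation module, and the fixed number of `rounds` is an assumption made so that the loop terminates.

```python
import time


def monitor(fetch, detect, notify, interval_s, rounds, sleep=time.sleep):
    """Reload the page `rounds` times, waiting `interval_s` seconds between reloads.

    fetch():      returns the current version of the observed page.
    detect(page): returns a description of the relevant change, or None.
    notify(chg):  delivers the change to the user (e.g. to a presentation UI).
    """
    for k in range(rounds):
        page = fetch()
        change = detect(page)
        if change is not None:      # a trigger fired
            notify(change)
        if k < rounds - 1:          # no need to wait after the last reload
            sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop testable: a test can replace it with a function that returns immediately.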
4 Experimental results
In this section we report test results on the execution time of the change detection algorithm and on the effectiveness of the technique on a collection of tests run on eBay and auckland. All the tests in this section were run on a Pentium III 300 MHz with 128 megabytes of RAM.
CDWeb WebTrigger ::= CREATE WebTrigger WebTrigger name
    ON WebTrigger target
    FOR CHANGES ON WebTrigger target-list
    CHECKING WebTrigger target-zone
    NOTIFY BY WebTrigger method description
    BETWEEN WebTrigger date AND WebTrigger date
    EVERY WebTrigger interval
WebTrigger name ::= text string
WebTrigger method description ::= email address | fax | phone
WebTrigger target ::= url
WebTrigger target-list ::= HTML element | HTML element target-list
WebTrigger target-zone ::= HTML element
WebTrigger date ::= month “-” day “-” year
WebTrigger interval ::= number MINUTES | HOURS | DAYS
Figure 4. WebTrigger syntax
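To make the syntax of Figure 4 concrete, the following Python sketch assembles a trigger statement from its parts. It is an illustration only: the `WebTrigger` class, its field names, the space-separated rendering of the target list and all the example values (trigger name, elements, address, dates) are hypothetical and do not come from CDWeb.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

UNITS = ("MINUTES", "HOURS", "DAYS")


@dataclass
class WebTrigger:
    name: str
    target: str              # URL of the monitored page
    target_list: List[str]   # HTML elements whose changes are of interest
    target_zone: str         # HTML element delimiting the monitored zone
    notify_by: str           # e.g. an email address
    start: date
    end: date
    every: int
    unit: str = "HOURS"

    def to_statement(self) -> str:
        """Render the trigger following the clause order of Figure 4."""
        if self.unit not in UNITS:
            raise ValueError(f"unit must be one of {UNITS}")

        def fmt(d: date) -> str:
            return f"{d.month}-{d.day}-{d.year}"   # month-day-year

        return (
            f"CREATE WebTrigger {self.name} ON {self.target} "
            f"FOR CHANGES ON {' '.join(self.target_list)} "
            f"CHECKING {self.target_zone} "
            f"NOTIFY BY {self.notify_by} "
            f"BETWEEN {fmt(self.start)} AND {fmt(self.end)} "
            f"EVERY {self.every} {self.unit}"
        )


# Hypothetical trigger on one table cell of an auction page.
trigger = WebTrigger("dvd_watch", "http://eBay.co.UK", ["Td"], "Table",
                     "user@example.com", date(2001, 1, 15),
                     date(2001, 2, 15), 30, "MINUTES")
```

Here `trigger.to_statement()` produces a single-line statement starting with `CREATE WebTrigger dvd_watch ON http://eBay.co.UK`, which a query builder could then hand to the monitoring server.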
Figure 6. Results for page monitoring
The first set of tests was performed by applying the algorithm to Web pages that had been changed by a simple mapping. We chose 10 test pages; the results are reported in Figure 5.
been tested by several experiments that confirm the validity of our approach. The core of the CDWeb system is the change detection module, which detects changes in the document tree structure. The document model we have adopted fits both HTML and XML documents well.
URL (the same for all ten test pages): http://eBay.co.UK
Portion Monitored: DVD LAPTOPS GSM MOBILE AERONAUTICS COMMEMORATIVE SPORT FILM ROCK HARD CDs SCIENCE & NATURE
Time (s): 13, 15, 11, 18, 18, 14, 14, 11, 11, 10
References
[1] L. Liu, C. Pu, W. Tang. WebCQ - Detecting and delivering information changes on the web. CIKM’00, Washington, DC, USA.
[2] S. Chawathe, H. Garcia-Molina. Meaningful change detection in structured data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 26-37, Tucson, Arizona, May 1997.
[3] S. Chawathe, S. Abiteboul, J. Widom. Representing and querying changes in semistructured data. In Proc. of the Int. Conf. on Data Engineering, pages 4-13, Orlando, Florida, February 1998.
[4] F. Douglis, T. Ball, Y. Chen, E. Koutsofios. The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web. World Wide Web, 1(1), pages 27-44, January 1998.
[5] NetMind. http://www.netmind.com
[6] TracerLock. http://www.peacefire.org/tracerlock
[7] Webwhacker. http://www.webwhacker.com
[8] K. Topley. Core Swing advanced programming. Prentice Hall, 1999.
[9] F. Douglis, T. Ball, Y. Chen, E. Koutsofios. WebGuide: Querying and Navigating Changes in Web Repositories. In Proc. of 1996 USENIX.
[10] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3), pages 54-66, September 1997.
Figure 5. Time expense for change detection
The second set of tests was run on a collection of 100 auction web pages taken from eBay and auckland, and the percentage of successful change detections was [...]. As a further example, Figure 6 shows the output of the system when multiple pages are monitored at the same time.
5 Conclusion
In this paper we have presented the architecture of a system for change detection on Web pages. The system we have implemented allows triggers to be defined on Web page changes. The assumptions made for the cost model have