Expressive Profile Specification and Its Semantics for a Web Monitoring System*
Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy
Information Technology Laboratory
Computer Science and Engineering, UT Arlington, Texas, USA
{eppili,jacob,sachde,sharma}@cse.uta.edu
Abstract. The World Wide Web has gained a lot of prominence with respect to information retrieval and data delivery. With such prolific growth, a user interested in a specific change has to continuously retrieve (pull) information from the web and analyze it. This wastes resources and, more importantly, places the burden on the user. Pull-based retrieval needs to be replaced with a push-based paradigm for efficiency and for timely notification of relevant information. WebVigiL is an efficient profile-based system to monitor, retrieve, detect and notify specific changes to HTML and XML pages on the web. In this paper, we describe the expressive profile specification language along with its semantics. We also present an efficient implementation of these profiles. Finally, we present the overall architecture of the WebVigiL system and its implementation status.
1 Introduction
Information on the Internet, growing at a rapid rate, is spread over multiple repositories. This has greatly affected the way information is accessed, delivered and disseminated. Users, at present, are interested not only in the new information available on web pages but also in retrieving changes of interest in a timely manner. More specifically, users may be interested only in particular changes (such as keywords, phrases, links etc.).

Push and pull paradigms [1] are traditionally used for monitoring the pages of interest. In the pull paradigm, the user performs an explicit action, in the form of a query or transaction execution, on a periodic basis on the pages of interest. Here, the burden of retrieving the required information is on the user, and changes may be missed when a large number of web sites need to be monitored. In the push paradigm, the system is responsible for accepting user needs and informs the user (or a set of users) when something of interest happens. Although this approach reduces the burden on the user, naive use of the push paradigm results in informing users about changes to web pages irrespective of the user's interest. At present, most systems use a mailing list to send the same compiled changes to all their subscribers.

* This work was supported, in part, by the Office of Naval Research & the SPAWAR System Center–San Diego & by the Rome Laboratory (grant F30602-01-2-05430), and by NSF (grant IIS-0123730).
Hence, an approach is needed that replaces periodic polling and notifies the user of the relevant changes in a timely manner. The emphasis in WebVigiL is on selective change notification, that is, notifying the user about changes to web pages based on a user-specified interest/policy. WebVigiL is a web monitoring system that uses an appropriate combination of the push paradigm and an intelligent pull paradigm, with the help of active capability, to monitor customized changes to HTML and XML pages. WebVigiL intelligently pulls information from the web server using a learning-based algorithm [2], based on the user profile, and propagates/pushes only the relevant information to the end user. In addition, WebVigiL is a scalable system, designed to detect even composite changes for a large number of users. An overview of the paradigm used and the basic approach taken for effective monitoring is given in [3].

This paper concentrates on the expressiveness of change specification, its semantics, and its implementation. For the user to specify notification and monitoring requirements, an expressive change specification language is needed. The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the syntax and semantics of the change specification language, which captures the monitoring requirements of the user and in addition supports inheritance, event-based duration and composite changes. Section 4 gives an overview of the current architecture and status of the system. Section 5 concludes the paper with an emphasis on future work.
2 Related Work
Many research groups have been working on detecting changes to documents. GNU diff [4] detects changes between any two text files. Most of the previous work in change detection has dealt only with flat files [5] and not with structured or unstructured web documents. Several tools have been developed to detect changes between two versions of unstructured HTML documents [6]. Some change-monitoring tools such as ChangeDetection.com [7] have been developed using the push-pull paradigm. But these tools detect changes to the entire page instead of user-specified components, and changes can be tracked only on a limited number of pages.

2.1 Approaches for User Specification
Present-day users are interested in monitoring changes to pages and want to be notified based on their profiles. Hence, an expressive language is necessary to specify user intent on fetching, monitoring and propagating changes. WebCQ [8] detects customized changes between two given HTML pages and provides an expressive language for the user to specify his/her interests. But WebCQ only supports changes between the last two pages of interest. As a result, flexible and expressive compare options are not provided to the user. AT&T Internet Difference Engine [9] views an HTML document as a sequence of sentences and sentence-breaking markups. This approach may be computationally expensive, as each sentence may need to be compared with all sentences in the document. WYSIGOT [10] is a commercial application that can be used to detect changes to HTML pages. It has to be installed on the local
machine, which is not always possible. This system provides an interface for specifying the monitoring requirements for a web page. It can monitor an HTML page and also all the pages that it points to. But the granularity of change detection is at the page level. In [11], the authors allow the user to submit monitoring requests and continuous queries on the XML documents stored in the Xyleme repository. WebVigiL supports a life-span for a change monitoring request, which is akin to a continuous query. Change detection is continuously performed over the life-span. To the best of our knowledge, customized changes, inheritance, different reference selection or correlated specifications cannot be specified in Xyleme.
3 Change Specification Language
The present-day web user's interest has evolved from mere retrieval of information to monitoring the changes on web pages that are of interest. As web pages are distributed over large repositories, the emphasis is on selective and timely propagation of information/changes. Users need to be notified of changes in different ways based on their profiles/policies. In addition, the notification of these changes may have to be sent to different devices that have different storage and communication bandwidths. The language for establishing the user policies should be able to accommodate the requirements of a heterogeneous, distributed, large network-centric environment. Hence, there is a need to define an expressive and extensible specification language wherein the user can specify details such as the web page(s) to be monitored, the type of change (keywords, phrases etc.) and the interval for comparing occurrence of changes. The user should also be able to specify how, when, and where to be notified, taking into consideration quality of service factors such as timeliness and size vs. quality of notification.

WebVigiL provides an expressive language with well-defined semantics for specifying the monitoring requirements of a user pertaining to the web [12]. Each monitoring request is termed a Sentinel. The change specification language developed for this purpose allows the user to create a monitoring request based on his/her requirements. The semantics of this language for WebVigiL have been formalized. The complete syntax of the language is shown in Fig. 1. The following are a few monitoring scenarios that can be represented using this sentinel specification language.

Example 1: Alex wants to monitor http://www.uta.edu/spring04/cse/classes.htm for the keyword “cse5331” in order to decide whether to register for the course cse5331.
The sentinel runs from May 15, 2004 to August 10, 2004 (summer semester), and she wants to be notified as soon as a change is detected. The sentinel (s1) for this scenario is as follows:

Create Sentinel s1
Using http://www.uta.edu/spring04/cse/classes.htm
Monitor keyword (cse5331)
Fetch 1 day
From 05/15/04 To 08/10/04
Notify By email [email protected]
Every best effort
Compare pairwise
ChangeHistory 3
Presentation dual frame

Example 2: Alex wants to monitor the same URL as in Example 1 for regular updates on new courses being added, but is not interested in changes to images. As this sentinel is correlated with sentinel s1, its duration is specified as the interval between the start of s1 and the end of s1. The sentinel (s2) for this scenario is:

Create Sentinel s2
Using s1
Monitor Anychange AND (NOT) images
Presentation only change
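Example 2 names s1 as its target, so s2 only restates the clauses it changes and takes the rest from s1. A minimal Python sketch of that lookup is given here purely as an illustration. The dictionary layout, the key names, and the function `resolve` are hypothetical and are not part of WebVigiL; the sketch assumes that every clause not restated in s2 (including the start and end dates) is taken from s1, and the e-mail address is omitted.

```python
# Hypothetical registry of sentinels keyed by name. Each entry stores only
# the clauses the user wrote explicitly (key names are illustrative).
SENTINELS = {
    "s1": {
        "using": "http://www.uta.edu/spring04/cse/classes.htm",
        "monitor": "keyword (cse5331)",
        "fetch": "1 day",
        "from": "05/15/04",
        "to": "08/10/04",
        "notify_by": "email",
        "every": "best effort",
        "compare": "pairwise",
        "change_history": 3,
        "presentation": "dual frame",
    },
    "s2": {
        "using": "s1",  # target is a previously defined sentinel
        "monitor": "Anychange AND (NOT) images",
        "presentation": "only change",
    },
}


def resolve(name, registry=SENTINELS):
    """Return the effective clauses of a sentinel, following `using` links."""
    spec = registry[name]
    target = spec["using"]
    if target in registry:
        # Target is another sentinel: start from its effective clauses,
        # then let the clauses written in this sentinel override them.
        effective = resolve(target, registry)
        effective.update({k: v for k, v in spec.items() if k != "using"})
        return effective
    # Target is a URL: nothing to inherit, return a copy of the clauses.
    return dict(spec)


if __name__ == "__main__":
    for clause, value in resolve("s2").items():
        print(f"{clause}: {value}")
```

Under these assumptions, resolving s2 yields the URL, fetch interval, and dates of s1 together with the change type and presentation written in s2, while s1 itself is left unchanged.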
Fig. 1. Sentinel Syntax
3.1 Sentinel Name
This clause specifies a name for the user's request. The syntax of the sentinel name is Create Sentinel <sentinel-name>. For every sentinel, the WebVigiL system generates a
unique identifier. In addition, the system also allows the user to specify a sentinel name. The user is required to specify a distinct name for each of his/her sentinels. This name identifies a request uniquely. Further, it allows the user to specify another sentinel in terms of his/her previously defined sentinels.

3.2 Sentinel Target
The syntax of the sentinel target is Using <sentinel-target>. The sentinel-target can be either a URL or a previously defined sentinel Si. If a new sentinel Sn specifies Si as its sentinel target, then Sn inherits all properties of Si, unless the user overrides those properties in the current specification. In Example 1, Alex is interested in monitoring the course web page for the keyword ‘cse5331’. Alex should be able to specify this URL as the target on which the system monitors changes to the keyword cse5331. Later, Alex wants to get updates on new classes being added to the page, as this may affect her decision to register for the course cse5331. She should place another sentinel for the same URL but with different change criteria. As the second case is correlated with the first case, Alex can specify s1 as the sentinel target with a different change type. Sentinels are correlated if they inherit run-time properties such as the start and end time of a sentinel. Otherwise, they merely inherit static properties of the sentinel (e.g., URL, change type, etc.). The language allows the user to specify either the reference web page or a previously placed sentinel as the target.

3.3 Sentinel Type
WebVigiL allows the detection of customized changes in the form of a sentinel type and provides explicit semantics for the user to specify his/her desired type of change. The syntax of the sentinel type is given as: Monitor <sentinel-type>, where sentinel type is sentinel-type= [] [ ]
In Example 1, Alex is interested in ‘cse5331’. Detecting changes to the entire page leads to wasteful computation and, further, sends unwanted information to Alex.
In Example 2, Alex is interested in any change to the class web page but is not interested in changes pertaining to images. WebVigiL handles such requests by introducing change types and operators in its change specification language. The contents of a web page can be any combination of objects such as sets of words, links and images. Users can specify such objects using a change type and apply operators over these objects. The change specification language defines primitive change and composite change for a sentinel type.

Primitive change: This is the detection of a single type of change between two versions of the same page. For keyword change, the user must specify a set of words. An exception list can also be given for any change. For phrase change, a set of phrases is specified. For regular expressions, a valid regular expression is given.

Composite change: This comprises a combination of distinct primitive change(s) specified on the same page, using one of the binary operators AND and OR. The
semantics of composite change formed by the use of an operator can be defined as follows (note that ∧, ∨, and ~ are the Boolean AND, OR, and NOT operators, respectively).

3.4 Change Type
If $V_1$ and $V_2$ are two different versions of the same page, then the change $C_t$ on $V_2$ with reference to $V_1$ is defined as:

$$C_t(V_1, V_2) = \begin{cases} \text{True} & \text{if the change type } t \text{ is detected as an insert in } V_2 \text{ or a delete in } V_1, \\ \text{False} & \text{otherwise.} \end{cases}$$

The sentinel-type is the change type t selected from the set T where T = {any change, links, images, all words except , phrase:, keywords:, table: