Expressive Profile Specification and Its Semantics ... - Semantic Scholar

2 downloads 0 Views 312KB Size Report
At present most of the systems use a mailing list to send the same compiled changes to all ..... For example, for a page containing the list of fiction books, a user.
Expressive Profile Specification and Its Semantics for a Web Monitoring System* Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering, UT Arlington, Texas, USA {eppili,jacob,sachde,sharma}@cse.uta.edu

Abstract. World wide web has gained a lot of prominence with respect to information retrieval and data delivery. With such a prolific growth, a user interested in a specific change has to continuously retrieve/pull information from the web and analyze it. This results in wastage of resources and more importantly the burden is on the user. Pull-based retrieval needs to be replaced with a pushbased paradigm for efficiency and notification of relevant information in a timely manner. WebVigiL is an efficient profile-based system to monitor, retrieve, detect and notify specific changes to HTML and XML pages on the web. In this paper, we describe the expressive profile specification language along with its semantics. We also present an efficient implementation of these profiles. Finally, we present the overall architecture of the WebVigiL system and its implementation status.

1 Introduction Information on the Internet, growing at a rapid rate, is spread over multiple repositories. This has greatly affected the way information is accessed, delivered and disseminated. Users, at present, are not only interested in the new information available on web pages but also in retrieving changes of interest in a timely manner. More specifically, users may only be interested in particular changes (such as keywords, phrases, links etc). Push and Pull paradigms [1] are traditionally used for monitoring the pages of interest. Pull Paradigm is an approach where the user performs an explicit action in the form of a query, transaction execution on a periodic basis on the pages of interest. Here, the burden of retrieving the required information is on the user and may result in changes being missed when a large number of web sites need to be monitored. In the push paradigm, the system is responsible for accepting user needs and informs the user (or a set of users) when something of interest happens. Although this approach reduces the burden on the user, naive use of a push paradigm results in informing users about the changes to web pages irrespective of the user’s interest. At present most of the systems use a mailing list to send the same compiled changes to all its subscribers. *

This work was supported, in part, by the Office of Naval Research & the SPAWAR System Center–San Diego & by the Rome Laboratory (grant F30602-01-2-05430), and by NSF (grant IIS-0123730).

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 420–433, 2004. © Springer-Verlag Berlin Heidelberg 2004

Expressive Profile Specification and Its Semantics for a Web Monitoring System

421

Hence, an approach is needed which replaces periodic polling and notifies the user of the relevant changes in a timely manner. The emphasis in WebVigiL is on selective change notification. This entails notifying the user about the changes to the web pages based on user specified interest/policy. WebVigiL is a web monitoring system, which uses an appropriate combination of push and intelligent pull paradigm with the help of active capability to monitor customized changes to HTML and XML pages. WebVigiL intelligently pulls the information using a learning-based algorithm [2] from the web server based on user profile and propagates/pushes only the relevant information to the end user. In addition, WebVigiL is a scalable system, designed to detect even composite changes for a large number of users. An overview of the paradigm used and the basic approach taken for effective monitoring is discussed in [3]. This paper concentrates on the expressiveness of change specification, its semantics, and its implementation. In order for the user to specify notification and monitoring requirements, an expressive change specification language is needed. The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the syntax and semantics of the change specification language which captures the monitoring requirements of the user and in addition supports inheritance, event-based duration and composite changes. Section 4 gives an overview of the current architecture and status of the system. Section 5 concludes the paper with an emphasis on future work.

2 Related Work Many research groups have been working to address detecting changes to documents. GNU diff [4] detects changes between any two text files. Most of the previous work in change detection has dealt only with flat-files [5] and not structured or unstructured web documents. Several tools have been developed to detect changes between two versions of unstructured HTML documents [6]. Some change–monitoring tools such as ChangeDetection.com [7] have been developed using the push-pull paradigm. But these tools detect changes to the entire page instead of user specified components and the changes can be tracked only on limited pages. 2.1 Approaches for User Specification Present day users are interested in monitoring changes to pages and want to be notified based on his/her profile. Hence, an expressive language is necessary to specify user-intent on fetching, monitoring and propagating changes. WebCQ [8] detects customized changes between two given HTML pages and provides an expressive language for the user to specify his/her interests. But WebCQ only supports changes between the last two pages of interest. As a result, flexible and expressive compare options are not provided to the user. AT&T Internet Difference Engine [9] views a HTML document as a sequence of sentences and sentence-breaking markups. This approach may be expensive computationally as each sentence may need to be compared with all sentences in the document. WYSIGOT [10] is a commercial application that can be used to detect changes to HTML pages. It has to be installed on the local

422

Ajay Eppili et al.

machine, which is not always possible. This system gives an interface to specify the specifications for monitoring a web page. It has the feature to monitor an HTML page and also all the pages that it points to. But the granularity of change detection is at the page level. In [11], the authors allow the user to submit monitoring requests and continuous queries on the XML documents stored in the Xyleme repository. WebVigiL supports a life-span for change monitoring request which is akin to a continuous query. Change detection is continuously performed over the life-span. To the best of our knowledge, customized changes, inheritance, different reference selection or correlated specifications cannot be specified in Xyleme.

3 Change Specification Language The present day web user’s interest has evolved from mere retrieval of information to monitoring the changes on web pages that are of interest. As the web pages are distributed over large repositories, the emphasis is on selective and timely propagation of information/changes. Changes need to be notified to the user in different ways based on their profiles/policies. In addition, the notification of these changes may have to be sent to different devices that have different storage and communication bandwidths. The language for establishing the user policies should be able to accommodate the requirements of a heterogeneous distributed large network-centric environment. Hence, there is a need to define an expressive and extensible specification language wherein the user can specify details such as the web page(s) to be monitored, the type of change (keywords, phrases etc.) and the interval for comparing occurrence of changes. User should also be able to specify how, when, and where to be notified taking into consideration the quality of service factors such as timeliness, size vs. quality of notification. WebVigiL provides an expressive language with well-defined semantics for specifying the monitoring requirements of a user pertaining to the web [12]. Each monitoring request is termed a Sentinel. The change specification language developed for this purpose allows the user to create a monitoring request based on his/her requirements. The semantics of this language for WebVigiL have been formalized. Complete syntax of the language is shown in Fig 1. Following are a few monitoring scenarios that can be represented using the above sentinel specification language. Example 1: Alex wants to monitor http://www.uta.edu/spring04/cse/classes.htm for the keyword “cse5331” to take a decision for registering the course cse5331. The sentinel starts from May 15, 2004 to August 10, 2004 (summer semester) and she wants to be notified as soon as a change is detected. Sentinel (s1) for this scenario is as follows: Create Sentinel s1 Using http://www.uta.edu/ spring04/cse/classes.htm Monitor keyword (cse5331) Fetch 1 day From 05/15/04 To 08/10/04

Expressive Profile Specification and Its Semantics for a Web Monitoring System

423

Notify By email [email protected] Every best effort Compare pairwise ChangeHistory 3 Presentation dual frame Example 2: Alex wants to monitor the same URL as in Example 1 for regular updates on new courses getting added but is not interested in changes to images. As it is correlated with sentinel s1, the duration is specified between the start of s1 and the end of s1. The sentinel (s2) for the above scenario is: Create Sentinel s2 Using s1 Monitor Anychange AND (NOT) images Presentation only change

Fig. 1. Sentinel Syntax

3.1 Sentinel Name This is to specify a name for user’s request. The syntax of sentinel name is Create Sentinel . For every sentinel, the WebVigiL system generates a

424

Ajay Eppili et al.

unique identifier. In addition, the system also allows the user to specify a sentinel name. The user is required to specify a distinct name for all his sentinels. This name identifies a request uniquely. Further it facilitates the user to specify another sentinel in terms of his/her previously defined sentinels. 3.2 Sentinel Target The syntax of sentinel target is Using . The sentinel-target could be either a URL or a previously defined sentinel Si. If the new sentinel Sn, specifies the sentinel target as Si, then Sn inherits all properties of Si, unless the user overrides those properties in the current specification. In Example 1, Alex is interested in monitoring the course web page for the keyword ‘cse5331’. Alex should be able to specify this URL as the target on which the system monitors the changes on the keyword cse5331. Later Alex wants to get updates on the new classes being added to the page, as this may affect her decision for registering for the course cse5331. She should place another sentinel for the same URL but with different change criteria. As the second case is correlated with the first case, Alex can specify s1 as the sentinel target with a different change type. Sentinels are correlated if they inherit run time properties such as start and end time of a sentinel. Otherwise, they merely inherit static properties (e.g., URL, change type, etc. of the sentinel). The language allows the user to specify the reference web page or a previously placed sentinel as the target. 3.3 Sentinel Type WebVigiL allows the detection of customized changes in the form of sentinel type and provides explicit semantics for the user to specify his/her desired type of change. The syntax of sentinel type is given as: Monitor , where sentinel type is sentinel-type= [] [ ] In Example 1, Alex is interested in ‘cse5331’. Detecting changes to the entire page leads to wasteful computations and further sends unwanted information to Alex. In Example 2, Alex is interested in any change to the class web page but is not interested in the changes pertaining to images. WebVigiL handles such requests by introducing change type and operators in its change specification language. The contents of a web page can be any combination of objects such as set of words, links and images. Users can specify such objects using change type and use operators over these objects. Change Specification Language defines Primitive change and Composite change for a sentinel type. Primitive change: It is the detection of a single type of change between two versions of the same page. For keyword change, the user must specify a set of words. An exception list can also be given for any change. For phrase change, a set of phrases is specified. For regular expressions, a valid regular expression is given. Composite change: It comprises of a combination of distinct primitive change(s) specified on the same page, using one of the binary operators AND and OR. The

Expressive Profile Specification and Its Semantics for a Web Monitoring System

425

semantics of composite change formed by the use of an operator can be defined as follows (Note that Λ, V, and ~ are Boolean AND, OR, and NOT operators, respectively). 3.4 Change Type If V1 and V2 are two different versions of the same page, then Change Ct on V2 with reference to V1, is defined as: Ct (V1, V2) = True if the change type t is detected as insert in V2 or delete in V1. False otherwise. The sentinel-type is the change type t selected from the set T where T = {any change, links, images, all words except , phrase:, keywords:, table: , list :, regular expression: }. Based on the form of information that is usually available on web pages change types may be classified as links, images, keywords, phrases, all words, table, list, regular expression and any change based on the form of information. Links: Corresponds to a set of hypertext references. In HTML, links are presentationbased objects represented between the hypertext tag (). Given two versions of a page, if any of the old links are deleted in the new version or new links are inserted, a change is flagged. Images: Corresponds to a set of image references extracted from the image source. In HTML, images are represented by the image source tag (< IMG src=”.”>). The changes detected are similar to the links except that the images are monitored. Keywords: Corresponds to a set of unique words from the page. A change is flagged when any of the keyword (mentioned in the set of words) appears or disappears in a page with respect to the previous version of the same page. Phrase: Corresponds to a set of contiguous words from the page. A change is flagged on the appearance or disappearance of a given phrase in a page with respect to the previous version of the same page. Update to a phrase is also flagged depending on the percentage of words that has been modified in a phrase. If the number of words changed exceeds above a threshold, it is deemed as a delete (or disappearance). Table: Corresponds to the content of the page represented in a tabular format. Though the table is a presentation object, the changes are tracked on the contents of the table. Hence, whenever the table contents are changed, it is flagged as a table change. List: Corresponds to the contents of a page represented in a list format. The list format can be bullets or numbered. Any change detected on the set of words represented in a list format is flagged as a change. Regular expression : Expressed as valid regular expression syntax for querying and extracting specific information from the document data.

426

Ajay Eppili et al.

All words: A page can be divided into a set of words, links and images. Any change to the set of words between two versions of the same page is detected as all words change. All words encompass phrases, keywords and words in the table and list. While considering changes to all words, the presentation objects such as table and list are not considered and only the content in these presentation objects are taken into consideration. Anychange: Anychange encompasses all the above given types of changes. Changes to any of the defined set (i.e., all words, all links and all images) are flagged as anychange. Hence, the granularity is limited to a page for anychange. Any change is the superset of all changes. 3.5 Operators Users may want to detect more than one type of change on a given page or the nonoccurrence of a type of change. To facilitate such detections the change specification language includes unary and binary operators. NOT: A unary operator, which detects the non-occurrence of a change type. For a given change type t on version V2 with reference to version V1 of the same page the semantics of NOT are: (NOT Ct )(V1,V2) = ~Ct (V1,V2) OR: A binary operator representing disjunction of change types. It is denoted by Ct1 OR Ct2 for two primitive changes Ct1 and Ct2 specified on version V2 with reference to version V1 of the same page. A change is detected if either Ct1 is detected or Ct2 is detected. Formally, (Ct1 OR Ct2) (V1,V2) = Ct1(V1,V2) V Ct2(V1,V2), where t1, t2 are the types of changes and t1t2 AND: A binary operator representing conjunction of change types. It is denoted by Ct1 AND Ct2 for two primitive changes Ct1 and Ct2 specified on version V2 with reference to version V1 of the same page. A change is detected when both Ct1 and Ct2 are detected. Formally,(Ct1 AND Ct2) (V1,V2) = Ct1 (V1,V2) ΛCt1(V1,V2), where t1, t2 are types of changes and t1 t2 The unary operator NOT can be used to specify a constituent primitive change in a composite change. For example, for a page containing the list of fiction books, a user can specify a change type as: All words AND NOT phrase {“ Lord of the Rings”}. A change will be flagged only if given two versions of a page, at least some words may change such as insertion of a new book and author etc. but the phrase “Lord of the Rings” has not changed. Hence, the user is interested in monitoring the arrival of new books or removal of old books, only as long as the book “Lord of the Rings” is available.

Expressive Profile Specification and Its Semantics for a Web Monitoring System

427

3.6 Fetch Changes can be detected for a web page only when a new version of the page is fetched. New versions can be fetched based on the freshness of the page. The page properties (or meta-data) of a web page, such as the last modified date for static pages or checksum for dynamic pages define whether a page has been modified or not. The syntax of fetch is Fetch