An Internet Difference Engine and its Applications
Thomas Ball, Fred Douglis
AT&T Bell Laboratories
{tball,douglis}@research.att.com
http://www.research.att.com/orgs/ssr/people/douglis/aide/
Abstract
Finding interesting information on the Internet is difficult. Keeping up-to-date on this information requires a mechanism to determine when it has changed and how it has changed. The AT&T Internet Difference Engine (AIDE) is a system that detects and displays changes to pages on the World Wide Web. A web tracking tool finds when pages have changed. A new tool called HtmlDiff displays those changes by determining the differences between two HTML pages and creating a new HTML page summarizing them. We describe the major components of the Internet Difference Engine, and then show how it has been applied to a variety of Web applications, including collaborative editing environments and automated “What’s New?” reports.
1. Introduction
The AT&T Internet Difference Engine (AIDE) is a system for tracking and viewing changes to information available on the Internet [3]. In particular, AIDE automatically compares and tracks HyperText Markup Language (HTML) pages accessible via the World Wide Web (WWW).
Pages on the WWW change with varying frequency. Some pages, such as weather and stock pages, change on a regular and frequent basis. No tools are needed to determine that these pages have changed, and it is usually clear where to find new or updated information on these pages. Other pages, such as institutional or personal home pages, change infrequently at irregular intervals, and modifications to these pages tend to be scattered and harder to find.
Current tools for dealing with changing WWW pages generally focus on the problem of determining when a page has changed, and bypass the issue of how a page has changed [6,8]. However, it is often the differences between the current and previous versions of a page that users really want to know about, not simply the fact that a page has changed. Other tracking tools rely on the document author to solve the differencing problem, either by providing a “What’s New?” page, or a change summary (such as found in First Floor Software’s Smart Bookmarks: a section of an HTML document that summarizes recent changes and can be downloaded independently of a page [4]).
These solutions do not adequately address the differencing problem. First, and most importantly, many content providers do not explicitly summarize changes to their documents. Second, changes that are determined manually by the document author may not highlight everything of interest to all readers. Third, since documents are written to and read from at different rates, information that is “new” to one person may be “old” to another. While separate chronological “What’s New?” pages tend not to have this problem, pages that are highlighted with icons or that include SmartMarks bulletins have difficulty in distinguishing among changes that occur over multiple days.
In contrast to the above tools, AIDE provides per-user versioning and comparison of pages, as well as tracking and notification. AIDE uses a new tool called HtmlDiff to determine the changes between two pages and create a new page that accurately summarizes the differences. The combination of this automatic differencing capability with tracking and versioning of documents allows AIDE to create “personalized” change reports that reflect both the reader’s viewing history and the changes made to a document by its author.
This paper summarizes the three major components of the AIDE system (differencing, tracking, and versioning). The main focus of the paper is on present and future applications of AIDE to the WWW, ranging from tracking users’ personal hotlists and creating organization-specific change updates to providing support for collaborative authoring environments and mobile web access.
Figure 1. HtmlDiff output of USENIX Association Home Page
2. The Internet Difference Engine
The Internet Difference Engine consists of three components: an HTML comparison tool called HtmlDiff, a tracking tool called w3newer, and a database of HTML pages and user access history. HtmlDiff is the core of the system and is a differentiator in two respects: first, it is the part of the system that computes differences between pages; second, combined with the tracking and versioning components, it distinguishes AIDE from other tracking tools. As explored later in Section 3, there are various ways to combine and parameterize these components to provide different services.
2.1. Differencing
HtmlDiff is a tool that compares two HTML pages to determine the differences between them (if any), and creates a new HTML page that summarizes the content that is shared between the two pages as well as deleted and inserted material. Figures 1 and 2 show two examples of the output of HtmlDiff. HtmlDiff inserts a banner at the beginning of the document that contains an internal link to the first difference. The differences are chained together through internal links to facilitate quick access to the differences; this is useful for large pages that contain few changes. Red arrows point to deletions (deleted text is displayed in “struck-out” font), while green arrows point to insertions (inserted text is displayed in bold italics).
HtmlDiff models HTML as a sequence of sentences and uses well-known heuristics and algorithms to match the sentences of one page against the sentences of a second page [1,5]. Sentences need not be identical to match, and word-level differences are computed for sentences that are very similar. For example, in the USENIX home page, the last sentence on the availability of past proceedings shows that the years for which proceedings are available have changed. HtmlDiff ignores changes to formatting, so that changes to markup that determines page layout or the formatting of text are not flagged as differences.
HtmlDiff detects more than just changes to text. The insertion of an image or the modification of a Universal Resource Locator (URL) in a hypertext reference is also flagged as a change. However, HtmlDiff currently does not support recursive differencing. If a document has not changed but contains a link to a document that has changed (including embedded images), this is not reflected in the merged page. We are currently adding support for such a capability.
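The sentence-matching idea described in this section can be illustrated with a small sketch. It is not HtmlDiff itself: it works on plain text rather than HTML markup, uses Python's `difflib.SequenceMatcher` as a stand-in for the longest-common-subsequence algorithms cited above, and the function names, the naive sentence splitter, and the 0.6 similarity threshold are all assumptions made for illustration.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_diff(old, new):
    """Mark word-level deletions and insertions inside one changed sentence."""
    a, b = old.split(), new.split()
    out = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
            continue
        if i1 < i2:  # words present only in the old sentence
            out.append("<strike>" + " ".join(a[i1:i2]) + "</strike>")
        if j1 < j2:  # words present only in the new sentence
            out.append("<b><i>" + " ".join(b[j1:j2]) + "</i></b>")
    return " ".join(out)


def merge_pages(old_text, new_text, similar=0.6):
    """Return one merged text showing shared, deleted and inserted sentences."""
    old, new = split_sentences(old_text), split_sentences(new_text)
    merged = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(old[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Pair up replaced sentences; similar pairs get a word-level diff.
            for o, n in zip(old[i1:i2], new[j1:j2]):
                ratio = difflib.SequenceMatcher(None, o.split(), n.split()).ratio()
                if ratio >= similar:
                    merged.append(word_diff(o, n))
                else:
                    merged.append("<strike>" + o + "</strike>")
                    merged.append("<b><i>" + n + "</i></b>")
        else:
            # Pure deletions, pure insertions, or unevenly sized replacements.
            merged.extend("<strike>" + s + "</strike>" for s in old[i1:i2])
            merged.extend("<b><i>" + s + "</i></b>" for s in new[j1:j2])
    return " ".join(merged)
```

Calling `merge_pages(old_text, new_text)` on two versions of a page's text yields a single string in which deleted material is struck out and inserted material is in bold italics, mirroring the presentation described above. Handling of markup, images, and hypertext references is deliberately left out of the sketch.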
2.2. Versioning
AIDE uses the Revision Control System (RCS) [7] to maintain a version history for each page tracked by the system. One version history is kept for each page, regardless of how many users have registered an interest in that page. An external relational database tracks the versions of a page that each user has checked in to RCS. Figure 3 shows an HTML interface to RCS that provides the version history of one page. The user may click on any version to see it, or select any two versions and invoke HtmlDiff on them. Versioning can be done on the client or server side, trading off concerns of storage efficiency, processing demands, and information privacy.
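The split between one shared history per page and a per-user record of checked-in versions might be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for both RCS and the external relational database, and the table names, column names, and helper functions are hypothetical.

```python
import sqlite3


def open_db():
    """Create an in-memory store (a stand-in for RCS plus the user database)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE versions (url TEXT, rev INTEGER, body TEXT,
                               PRIMARY KEY (url, rev));
        CREATE TABLE seen (reader TEXT, url TEXT, rev INTEGER,
                           PRIMARY KEY (reader, url, rev));
    """)
    return db


def check_in(db, reader, url, body):
    """Store `body` as a new revision only if it differs from the latest one,
    then record that `reader` checked in that revision. Returns the revision."""
    row = db.execute(
        "SELECT rev, body FROM versions WHERE url = ? ORDER BY rev DESC LIMIT 1",
        (url,)).fetchone()
    if row is not None and row[1] == body:
        rev = row[0]  # unchanged page: reuse the shared revision
    else:
        rev = 1 if row is None else row[0] + 1
        db.execute("INSERT INTO versions VALUES (?, ?, ?)", (url, rev, body))
    db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?, ?)", (reader, url, rev))
    return rev


def last_seen(db, reader, url):
    """Latest revision this reader has checked in, or None if there is none."""
    row = db.execute(
        "SELECT MAX(rev) FROM seen WHERE reader = ? AND url = ?",
        (reader, url)).fetchone()
    return row[0]
```

Two readers who check in the same content share a single stored revision, while `last_seen` still differs per reader, which is the property a personalized change report relies on.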
Figure 2. HtmlDiff output of a professor’s home page, showing a subtle yet important change to the page.
Figure 3. HTML interface to version repository
2.3. Tracking
W3newer, the tracking component of AIDE, determines when pages on the WWW have changed, using a variety of techniques and data available to it. The tool was derived from a previous tool named w3new. Input to w3newer consists of a set of URLs to check (which can be specified in a variety of ways), and optional information that determines the frequency with which specific URLs are polled (by default, URLs are polled each time the program runs if they have not been checked in the past day). W3newer maintains a list of times at which URLs were previously determined to have been modified. The tool generates a summary that lists the modified URLs in reverse chronological order. For each URL, there are three basic options: Remember, which stores the current version of a page in the archive (if the user has not already told w3newer to store it automatically when it changes); Diff, which invokes HtmlDiff on the current version and the last archived version of a page; and History, which retrieves a list of all archived versions of the page. Figure 4 shows an example of the output (without the Remember links).
W3newer can run either as a client or server program. As a client program running on a user’s machine, it has greater access to interesting information, such as a user’s bookmark file and preferences. Each time w3newer runs, it can access the user’s bookmark file to find out which new URLs to poll or which to stop polling. However, having a large community of users individually running their own web tracking tool may generate unacceptable network loads. Running w3newer as part of a centralized server allows it to combine checks for multiple users interested in the same page into one network access, but it makes access to privileged user data more difficult.
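The polling policy described for w3newer (check a URL only if it has not been checked in the past day, remember when it was last found modified, and list modified URLs newest first) can be sketched as below. The source does not say how modification times are obtained, so `fetch_mtime` is a hypothetical callback supplied by the caller (for example, one that reads an HTTP `Last-Modified` header); the state layout and function names are likewise assumptions.

```python
import time

DAY = 24 * 60 * 60  # default polling interval in seconds


def due_for_check(last_checked, now, min_interval=DAY):
    """A URL is due if it was never checked or the interval has elapsed."""
    return last_checked is None or now - last_checked >= min_interval


def poll(urls, state, fetch_mtime, now=None):
    """Update `state` (url -> {"checked": t, "modified": t}) for URLs that are due.

    `fetch_mtime(url)` is a caller-supplied function returning the page's
    modification timestamp, or None if it cannot be determined.
    """
    now = time.time() if now is None else now
    for url in urls:
        entry = state.setdefault(url, {"checked": None, "modified": None})
        if not due_for_check(entry["checked"], now):
            continue  # checked recently: skip the network access
        mtime = fetch_mtime(url)
        entry["checked"] = now
        if mtime is not None and (entry["modified"] is None
                                  or mtime > entry["modified"]):
            entry["modified"] = mtime
    return state


def report(state):
    """URLs with a known modification time, most recently modified first."""
    known = [(e["modified"], u) for u, e in state.items()
             if e["modified"] is not None]
    return [u for _, u in sorted(known, reverse=True)]
```

Because the fetch function is injected, the same loop could run on a client against a user's bookmark list or on a server against the union of many users' lists, where one check per URL serves every interested user.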
3. Applications
In this section, we discuss how we have used AIDE within AT&T (Sections 3.1-3.5) and other potential applications for the engine (Sections 3.6-3.7).
3.1. Personalized Bookmark Tracking
The most general use of AIDE is as a tool for tracking the URLs in a user’s hotlist (or bookmark file). We have been using AIDE extensively to track our hotlists. In this case, we run w3newer locally so that it has access to our private browser hotlists, as discussed above. While some of the detected differences are rather dull (e.g., correction of spelling, movement of text), the information gained from the interesting cases more than makes up for the noise. In one instance, one of us received notification that a collaborator’s home page at the University of Wisconsin changed - the individual had received tenure and had changed his title on his home page (see Figure 2).
3.2. “What’s New?” Page
We have used AIDE to regularly and automatically generate “What’s New?” reports for specific communities inside AT&T. For example, AIDE generates a change report daily for pages in Bell Laboratories Research (see Figure 4). Whenever a page is detected as modified, the new version of the page is automatically checked into the version repository. These reports are “read-only”: visitors to the page are not allowed to check in new versions of the pages. For each page tracked, only two options are provided: Diff and History. Such “What’s New?” reports are not personalized for each user, although such a capability is possible. Users may access the History link to see older changes.
3.3. Collaborative Editing
The WikiWikiWeb [2] is an example of a collaborative editing environment on the WWW. It is modeled on the idea of blackboards - people may browse the blackboards, change them and add comments to them, or create a new blackboard on a new topic. Certain blackboards may be read or write protected to limit access and modification. Because they are closed systems (i.e., access to all blackboards takes place through CGI), these systems have the ability to track versions of documents, as well as the times when documents are accessed and modified.
A clone of the WikiWikiWeb called WebWeaver has been created inside AT&T to support discussion on a variety of topics. Both systems maintain a list of blackboards that have recently changed to point users to new information. For blackboards that are updated in an “append-only” manner, it is relatively easy to find changes, although when a blackboard grows large it is difficult to remember what one last read. Changes to other pages are not so nicely grouped.
The HtmlDiff tool was incorporated into WebWeaver to provide the capability to see recent changes on any of the blackboards. The internal hypertext links inserted by HtmlDiff direct the user immediately to the changes. WebWeaver already stores its own version history for the documents it maintains, obviating the need for the versioning component of AIDE. However, we are investigating how to incorporate the tracking component so that WebWeaver provides “personalized” differences in a later release. Neither the WikiWikiWeb nor WebWeaver requires users to identify themselves upon entering the system. In order to create personalized differences, a method for identification is needed.
3.4. Document Provider Page Log
AIDE can also be used by document providers to give readers access to their page history, if so desired. A “Page Log” link at the bottom of a page gives access to previous versions of the page (as with the table in Figure 3), while a “Recent Updates” link invokes HtmlDiff on the current page and the latest archived page to show recent modifications to the page. Both of us use this capability to allow users to see what has changed recently on our home pages.
3.5. Coordinating Distributed Work
Because the WWW is a very effective and easy means of sharing documents, it is a natural mechanism for facilitating collaboration. Consider a software development project that is geographically distributed across several locations. The WWW can be used to collect requirements, process descriptions, standards, meeting notes, code, documentation, bug reports, and testing and performance results so that everyone involved with the project has a consistent and up-to-date view of the project, no matter where they are located. Documents transparently reside at different locations, so that their “owners” can easily update them. In such a project, many of the documents will frequently change and different members of the project will access different sets of documents at different rates, depending on their job. As documents change, each member will want to know what’s new in their world, without having to waste time browsing documents. We have used AIDE to provide such support for software development organizations within AT&T.
Figure 4. Example of w3newer output for a “What’s New?” page for Bell Laboratories
3.6. Web Time Travel
We are currently exploring how to use AIDE to construct a “time travel” filter for the WWW. The idea is to allow users to browse parts of the WWW as they existed at some time in the past, and to see the differences between the web-of-the-past and the web-of-the-present. Through a form, the user would specify a time in the past that they would like to visit. Browser requests for HTML pages would pass through the time travel filter, which would query AIDE to determine if there are different versions of the requested page for the specified times. If so, HtmlDiff would be invoked to generate a merged page that would be the output of the filter. Furthermore, links in a document would be filtered and annotated to show whether or not they point to a page for which there are differences.
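The version lookup at the heart of such a filter could look roughly like the sketch below. Since the filter is described only as work in progress, everything here is an assumption: the version history is modeled as a list of `(timestamp, body)` pairs sorted by time, and the function names are hypothetical.

```python
import bisect


def version_as_of(history, when):
    """Return the body of the latest version archived at or before `when`.

    `history` is a list of (timestamp, body) pairs sorted by timestamp.
    Returns None if no version existed yet at that time.
    """
    times = [t for t, _ in history]
    i = bisect.bisect_right(times, when)
    return history[i - 1][1] if i else None


def time_travel(history, past, present):
    """Return (old_body, new_body) if the page differs between the two times,
    or None if either version is missing or nothing changed."""
    old = version_as_of(history, past)
    new = version_as_of(history, present)
    if old is None or new is None or old == new:
        return None
    return old, new
```

When `time_travel` returns a pair, the two bodies would be handed to the differencing tool to build the merged page; a `None` result means the requested page can be passed through unchanged.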
3.7. Mobile Web Access
Access to the WWW from mobile computers can benefit from prefetching pages before the computer disconnects from the network. This allows users to browse pages locally without having access to the WWW. One drawback is that local copies of pages become stale over time and need to be updated. Rather than requiring users to retrieve the current pages again, we envision using AIDE (at a central server site) to determine which pages have changed. When a mobile user checks back in, AIDE provides feedback about changes to the pages that the user copied, including information about the size of the documents and changes so that the user will have an idea of the penalty for fetching a fresh copy. HtmlDiff could also supply a page summarizing just the differences, rather than the merged page format, which includes common content from both versions. (This use of AIDE also applies to users whose Internet access is slow relative to the AIDE server, such as low-speed modem connections, not just disconnected operation.)
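As a rough illustration of the feedback envisioned here, the sketch below reports, for each page the user copied, the size of a fresh copy and the size of a differences-only summary. It is an assumption-laden stand-in: pages are plain strings held in dictionaries, `difflib.unified_diff` with no context lines substitutes for a differences-only HtmlDiff page, and the function name is hypothetical.

```python
import difflib


def change_report(cached, current):
    """List (url, full_size, diff_size) for pages that changed since caching.

    `cached` maps url -> body the user copied; `current` maps url -> body
    now held at the server. Sizes are in bytes (UTF-8 encoded).
    """
    report = []
    for url, old in sorted(cached.items()):
        new = current.get(url)
        if new is None or new == old:
            continue  # page unavailable or unchanged: nothing to fetch
        delta = difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            n=0)  # no context lines: differences only
        diff_size = sum(len(line.encode("utf-8")) for line in delta)
        report.append((url, len(new.encode("utf-8")), diff_size))
    return report
```

Comparing the two sizes in each tuple gives the user the kind of cost estimate described above before deciding whether to fetch a fresh copy or only the summary of changes.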
4. Conclusions
There is little “meta” information available on the Internet about changes to documents. As a result, users may miss important changes to information they care about. Even when “What’s New?” pages are available, they are inadequate because they do not address the real need of users, which is to see a “personalized” change report. We developed the AT&T Internet Difference Engine to address these issues, providing a platform that combines tracking and notification of changes to pages on the WWW with versioning and differencing of HTML. AIDE has been used in a number of applications, including personalized and generic views of what is new throughout the WWW and collaborative environments. We are exploring how to use AIDE for specialized applications such as “time travel” and WWW access over slow networks.
Information about AIDE can be found at the URL given at the beginning of the paper.
References
[1] Church, K. Char_align: A program for aligning parallel texts at the character level. In Association for Computational Linguistics, pages 1-8, 1993.
[2] Cunningham, W. WikiWikiWeb. http://c2.com/cgi-bin/wiki
[3] Douglis, F., Ball, T. Tracking and viewing modifications on the web. Proceedings of Usenix 1996 Technical Conference, January 1996.
[4] FirstFloor Software. http://www.firstfloor.com/
[5] Hirschberg, D. S. Algorithms for the longest common subsequence problem. Journal of the ACM, Vol. 24, No. 4, pages 664-675, October 1977.
[6] Specter, Inc. Webwatch. http://www.specter.com/users/janos/webwatch/index.html
[7] Tichy, W. RCS: A system for version control. Software Practice and Experience, Vol. 15, No. 7, pages 637-654, July 1985.
[8] Url-minder. http://www.netmind.com/URL-minder/URL-minder.html