Proceedings of “Practical Applications of Knowledge Management” PAKeM’99, The Practical Applications Company, London, 1999.
WebMaster: Knowledge-based Verification of Web-pages

Frank van Harmelen (AIdministrator & Vrije Universiteit Amsterdam, [email protected])
Jos van der Meer (AIdministrator, [email protected])

Abstract. Maintaining the contents of Web sites is an open and urgent problem on the current World Wide Web as well as on company intranets. Although many current tools deal with problems such as broken links and missing images, very few solutions exist for maintaining the contents of Web sites and intranets. We present a knowledge-based approach to the verification of Web-page contents. The user exploits semantic markup in Web-pages to formulate rules and constraints that must hold on the information in a site. An inference engine subsequently uses these rules to categorise Web-pages in an ontology of pages, while the constraints are used to define categories of pages which contain errors. We have constructed WebMaster, a software tool for knowledge-based verification of Web-pages. WebMaster allows the user to define rules and constraints in a graphical format, and is then able to use these rules to detect outdated, inconsistent and incomplete information in Web-pages. In this paper, we describe the various options for semantic markup on the Web, we define a precise logical and graphical format for rules and constraints, and we report on our practical experiences with WebMaster.
Acknowledgements The work reported in this paper has only been possible with the contributions from all current and past members of the WebMaster team at AIdministrator: Jan Bakker, Chris Fluit, Herko ter Horst, Walter van Iterson, Arjohn Kampman and Gert-Jan van de Streek.
Part I: “The business issue”

1 INTRODUCTION

Motivation. Maintaining the contents of Web-pages is an obvious open problem. Anybody who has used the WWW has experienced the amount of outdated, missing and inconsistent information on many Web sites, even on those sites that are of crucial importance to individuals, companies or organisations. This holds equally strongly for company-internal intranets. In this paper we describe a software tool (provisionally named WebMaster) which supports an important aspect of Web maintenance, namely verification: the location of errors in Web-page contents. The functionality of WebMaster is in sharp contrast with most existing Web-site maintenance tools. These existing tools deal with problems such as broken links, missing images, incorrect HTML, etc., but unlike WebMaster, they do not analyse the semantic contents of the site^1.

^1 For brevity, in the following we will speak of Web sites where we mean both Internet sites and intranets.

There are a number of reasons why maintaining high quality of the contents of a Web site is difficult:

– Size: Even sites of modest size quickly run into hundreds of pages. Larger sites (and in particular intranet sites as opposed to Internet sites) easily reach into many thousands of pages. With sites of such size, manual maintenance (and even error location) is infeasible, and computer support is required.
– Dynamics: Web sites are used as information sources on current affairs, commercial activities, front-line research results, the day-to-day running of organisations, etc. As a result, the information in many Web sites changes frequently. Again, manual maintenance and verification will not be able to catch up with the updates that are taking place on the site.

– Multiple providers: In realistic Web sites, information comes from multiple providers, who are often distributed both within an organisation and geographically. This makes it easy for information in different parts of the site to get “out of sync”.

– Lack of structure: A crucially important difference between modern Web sites and more traditional information repositories such as databases is that the data typically found on Web sites is only weakly structured. A typical Web-page does not contain tables of highly regular data, but has a much more narrative flow and a much less rigid structure.

The first three of these reasons indicate the need for computer support for verification of Web-site contents. However, the weak structure of Web-site contents (the fourth reason) makes such computer support hard to provide. For databases, methods have been developed to maintain the quality of high volumes of fast-changing information from multiple providers, but the key to these methods is the very strict structure that a database imposes on its contents. These methods are not applicable to weakly structured data. In this paper we present an approach which does apply to weakly structured data.

Aims. The errors in the contents of Web sites that are caused by the above problems can be divided into three categories:

– Outdated information: Much information on Web sites is time-dependent, and becomes outdated as time passes. Obvious (and often encountered) examples of such information are announcements of events that have already happened, or “what’s new” lists mentioning events that are no longer in the recent past.

– Missing information: This category concerns information that is supposed to be present but is not. An example of this error that we will use in this paper is a company’s intranet which might require every employee to provide some CV information. Missing CVs would then fall into this category. This might take the form of a broken link to a non-existent CV-file, but not necessarily so, e.g. when no CV-link is given at all, or when the CV-information is usually included in the employee’s homepage, without a separate link.

– Inconsistent information: The potential for inconsistent information arises when information is stated redundantly in multiple places in a Web site. This is very often unavoidable to make a Web site easy to read and navigate. Multiple copies of the same information quickly give rise to inconsistencies when one location is updated and the other is not. Such redundant multiple copies of information are hard to spot when they are stated in different forms. An example that we will use in this paper is a company’s intranet, which lists for each employee the projects they are assigned to, and (redundantly) in another location for each project the employees of that project. For N employees and M projects, we must check on the order of N × M relations, distributed over N + M locations, in order to guarantee consistency between employee information and project information.

WebMaster aims at locating errors of all three of the above types.
A category of problems that is missing from this list concerns incorrect information: information that does not correctly reflect the state of the world described by the Web site. Although many errors in the current Web fall into this category, we have deliberately omitted it from the list of error-categories that WebMaster can locate. The reason for this is as follows. Each of the three above error-categories can be identified on the basis of knowledge about the Web site alone: inconsistent information can be detected by comparing different locations within the same site; missing information can be detected on the basis of rules stating which types of information must occur within a site; and outdated information can be identified by comparing temporal statements in the Web site with the external time. Because of this, each of these categories can indeed be identified effectively, as we will show in the remainder of this paper. For “incorrect information”, however, it would be necessary to compare the contents of a Web site with the actual state of affairs in the external world described by the site. This would require a full “world-model” of the world described by the site, while the other three categories only require reasoning about a model of the Web-site contents, and not of the external world^2.

^2 Strictly speaking, the current date and time as required for identifying outdated information are part of a “world-model”, but this information is so simple that it poses no problems.

Summarising, the aim of WebMaster is to support the maintenance of the contents of Web sites by verifying the contents of sites for outdated, missing and inconsistent information.
Part II: The Approach Taken

2 APPROACH

In this section we will discuss different approaches that could be taken to achieve the aims outlined in the previous section. We will argue why we have chosen one of these approaches as the basis for WebMaster.

Guarded Updates. A rather direct approach to avoiding inconsistent and missing information is to ensure that all interactions of information providers with the Web site are strictly regulated. No free-format editing of information is allowed. Instead, adding and removing information is only allowed using special-purpose scripts or forms which ensure that strict rules are followed to guarantee consistency and completeness of information. For example, such special-purpose editing tools may insist that CV information is always added whenever new employee information is entered, thus ensuring completeness. Similarly, whenever employees are assigned to a new project, the editing software may insist that information is updated both at the employee location and at the project location, thus ensuring consistency. There are, however, a number of disadvantages to this approach. Firstly, it severely limits the freedom of information providers: they are only allowed to make modifications within the limits of the editing software. Secondly, this approach tends to lead to a rather ad hoc collection of very Web-site-specific software, which is itself very difficult to maintain when the demands on the contents of the Web site change.

Generated Sites. A second approach (the one which is currently the most often used in practice) is to store all information in a database, and then generate the pages of the Web site from the contents of the database (either off-line, or on-line in direct response to a reader’s query). The obvious advantage of this approach is that it reduces Web-site maintenance to database maintenance, for which stable and accepted solutions exist. The equally obvious disadvantage is that this approach only works well for highly structured (or at least: structurable) information. Product catalogues or volumes of technical data are typical examples. Informal estimates from commercial Web-site builders are that at most 20% of the information on a typical Web site is sufficiently structured for this approach to apply. When applied to less structured information, the only way to squeeze such information into a strict database schema is to make database fields that contain large amounts of “free text”, for which no further structure is indicated. For maintenance purposes, these free-text fields will behave as black boxes that do not lend themselves to analysis for consistency, completeness, etc.

Knowledge-based Verification. Given the disadvantages of the previous two approaches, we have developed a third approach to maintaining the quality of information on Web sites. Whereas the previous two approaches aim at preventing inconsistent and incomplete information, the verification approach aims to detect such information after it has appeared on the Web site. This is done on the basis of rules that must hold for the information found in the Web site. The steps in this approach are shown in figure 1. Besides providing Web-site contents, an information provider also formulates rules that define the properties that should hold on this information. Such rules (also called integrity constraints) would for instance express that announcements should always mention dates later than today, that employees should list CV information, or that employee-lists from projects should correspond with project-lists from employees. As indicated in fig. 1, an inference engine then applies these integrity constraints to identify the places in the Web site which violate these constraints [19].
[Fig. 1. Steps involved in the knowledge-based verification approach. An information provider uses a contents editor to produce the Web-site contents and a rules editor to formulate rules; the inference engine applies the rules to the contents and reports the violated constraints.]
This approach is indeed “knowledge-based”: the information providers use their domain-specific knowledge to express which constraints should be imposed on the Web-site contents. As usual in knowledge-based approaches, a strong point is that these integrity constraints can be highly domain-specific since they are provided by domain experts, such as the information providers themselves.
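As a rough illustration of the flow of fig. 1, the inference engine can be thought of as a loop that applies every constraint to every page and collects the violations. The sketch below is only meant to make this loop concrete; the page representation, the Constraint type and the example constraint are assumptions made for this illustration and are not part of WebMaster.

    # Minimal sketch of the verification loop of fig. 1 (illustration only).
    from typing import Callable, Dict, List, Tuple

    Page = str                                  # assumed: a page's marked-up contents
    Constraint = Callable[[Page], bool]         # True iff the page satisfies the constraint

    def verify(pages: Dict[str, Page],
               constraints: Dict[str, Constraint]) -> List[Tuple[str, str]]:
        """Apply every constraint to every page and report (page, constraint) violations."""
        violations = []
        for url, page in pages.items():
            for name, satisfied in constraints.items():
                if not satisfied(page):
                    violations.append((url, name))
        return violations

    # Toy constraint: every page must mention a CV somewhere (illustrative only).
    report = verify({"/jos.html": "<HOMEPAGE>...</HOMEPAGE>"},
                    {"has-CV": lambda page: "<CV>" in page})
    print(report)    # [('/jos.html', 'has-CV')]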
3 SEMANTIC MARKUP

In order to express integrity constraints on the contents of Web pages, an information provider must be able to refer to the semantic contents of such Web pages in a machine-accessible way. As has been pointed out by many authors [11, 4, 10], HTML pages as currently encountered on the Web are unsuitable for this purpose. At best, HTML markup is used to indicate the structure of a document (<H1>, <OL>, etc.); at worst it is used to prescribe the lay-out of a document (<B>, <FONT>, <BR>, etc.); but in neither case is HTML used to describe the semantic contents of the document. This lack of semantic markup is now widely recognised as a major barrier to the development of more intelligent document processing on the Web. In this section we will discuss two
ways in which such semantic markup can be added to Web-pages using W3C-standard technology.

XML. One of the results of a general push towards more semantic structure on the Web has been the development of the XML markup language^3. XML allows Web-page creators to use their own set of markup tags. These tags can be chosen to reflect the domain-specific semantics of the information, rather than merely its lay-out. Fig. 3 shows the same piece of information in HTML markup (fig. 3a) and in semantically well-chosen XML (fig. 3b). From the example it is clear that in XML we can now recognise pieces of information such as a person’s name or telephone number. In essence, XML allows us to structure Web-pages as labelled trees, where the labels can be chosen by the information provider to reflect as much of the document’s semantics as is required. The labelled tree for fig. 3b is shown in figure 2.

[Fig. 2. XML markup as a labelled tree: a PERSON node with children NAME, PROJECT and LOCATION, where LOCATION in turn has children TEL and ROOM.]
[Fig. 3. HTML and XML markup of the same information: “F. van Harmelen works for project WebMaster and can be reached at tel. 47731 or in room T3.57”. Fig. 3a shows the text with HTML lay-out markup only; fig. 3b marks it up with the semantic XML tags PERSON, NAME, PROJECT, LOCATION, TEL and ROOM of fig. 2.]
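To make the XML variant concrete, the following sketch shows one way the information of fig. 3b could be written and read as a labelled tree. The tag set follows fig. 2, but the exact markup is illustrative rather than a literal reproduction of the figure, and the use of Python’s standard ElementTree module is purely for illustration; it is not part of WebMaster.

    # Illustrative only: the information of fig. 3b as XML, viewed as a labelled tree.
    # Tag set follows fig. 2; the exact markup in the figure may differ.
    import xml.etree.ElementTree as ET

    page = ET.fromstring("""
    <PERSON>
      <NAME>F. van Harmelen</NAME> works for project <PROJECT>WebMaster</PROJECT>
      and can be reached at
      <LOCATION>tel. <TEL>47731</TEL> or in room <ROOM>T3.57</ROOM></LOCATION>
    </PERSON>
    """)

    print(page.tag)                          # PERSON (the root of the labelled tree)
    print([child.tag for child in page])     # ['NAME', 'PROJECT', 'LOCATION']
    print(page.find("LOCATION/TEL").text)    # 47731: the telephone number is now recognisable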
Even though XML has only recently been established as an official W3C standard [4], and parts of its definition are not yet complete (e.g. style-sheets [6] and its link-model [17, 18]), it has already gained significant momentum: a number of books on XML have already appeared, it will be supported in the major browsers, and it has been endorsed by hardware and software producers and by traditional publishers. Software support (both academic and commercial) is rapidly growing^4. Although currently not yet widely used, it seems likely that, besides HTML, XML will become a major markup language on the Web.

^4 See http://www.oasis-open.org/cover/sgml-xml.html for an excellent set of pointers to XML resources.

HTML. Notwithstanding the quick advent of XML on the Web, knowledge-based verification would have very limited applicability if it were to rely solely on the success of XML. Rather to our surprise, it turns out to be entirely possible to provide semantic markup in HTML pages as well, in two different ways. Firstly, the little-known <SPAN> tag in HTML 4.0 is meant to indicate a possibly nested block structure within an HTML page. Attributes of such tags can be used to indicate the semantic significance of such a block (see fig. 4a). A second solution for semantic markup in HTML exploits the explicit possibility in HTML 4.0 to introduce non-standard tags, which must be ignored by HTML browsers, but which can be exploited by software such as WebMaster. Using this, we obtain the solution in fig. 4b, which is an exact merge of the HTML layout tags of fig. 3a and the semantic XML tags of fig. 3b.

[Fig. 4. Semantic markup in HTML (with semantic tags underlined): the text of fig. 3 marked up (a) with nested <SPAN> tags whose attributes carry the semantic labels, and (b) with the non-standard semantic tags of fig. 3b merged into the HTML lay-out markup of fig. 3a.]

Summarising, we can say that markup to indicate the semantics of Web-page contents is a necessary requirement for expressing rules and constraints on these contents. Such semantic markup can be expressed in XML (which has been designed with this specific purpose in mind). More surprisingly, the same semantic markup can also be expressed in standard HTML. After having shown how to express semantic markup, the question remains how such semantic markup should be obtained. We will sidestep this important question in this paper, but instead refer to other projects which have investigated this question [7, 5, 1].
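The following sketch illustrates the second HTML option: semantic tags embedded in an HTML page are ignored by browsers but can be recovered by other software. It uses Python’s standard html.parser purely for illustration (it is not WebMaster’s parser), and the embedded markup only follows the style of fig. 4b rather than reproducing the figure literally.

    # Illustration only: recovering embedded semantic tags from an HTML page.
    from html.parser import HTMLParser

    html_page = """<HTML><BODY>
    <B><NAME>F. van Harmelen</NAME></B> works for project <PROJECT>WebMaster</PROJECT>
    and can be reached at tel. <TEL>47731</TEL> or in room <ROOM>T3.57</ROOM>
    </BODY></HTML>"""

    SEMANTIC_TAGS = {"name", "project", "tel", "room"}   # assumed tag set, cf. fig. 2

    class SemanticTagCollector(HTMLParser):
        """Collects the text content of the (non-standard) semantic tags."""
        def __init__(self):
            super().__init__()
            self.current = None
            self.found = []                              # (tag, text) pairs
        def handle_starttag(self, tag, attrs):
            if tag in SEMANTIC_TAGS:                     # tag names arrive lower-cased
                self.current = tag
        def handle_data(self, data):
            if self.current:
                self.found.append((self.current, data.strip()))
                self.current = None

    collector = SemanticTagCollector()
    collector.feed(html_page)
    print(collector.found)
    # [('name', 'F. van Harmelen'), ('project', 'WebMaster'), ('tel', '47731'), ('room', 'T3.57')]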
4 ONTOLOGIES: TYPES AND CONSTRAINTS

Now that we know how to express semantic markup in Web-pages, the next step in the knowledge-based verification of Web-page contents is to express rules and constraints on these contents. These rules and constraints capture the user’s knowledge about the required contents of these pages, and will be used by an inference engine to determine potential errors (constraint violations) in the site (see fig. 1). In this section, we will describe the formalism we have developed for expressing rules (to categorise pages into types) and constraints (to determine potential errors in pages).

4.1 Types: describing an ontology of Web-pages

As the first step towards identifying errors in a Web site, it is useful to divide Web-pages into categories, where pages within a given category share certain properties. By organising such categories in a hierarchy of subcategories, we get a type-hierarchy (or: ontology). An example is shown in figure 5. Such ontologies are well understood and often used as modelling devices in fields such as Knowledge Engineering and Software Engineering.
[Fig. 5. An example page-ontology. The universal type “all pages” has the subtypes “outdated pages” (a constraint type defined by rule r3), “CV pages” and “homepages” (defined by rule r1); “homepages” in turn has the subtype “before 1998” (rule r2) and the constraint type “missing CVs” (rule r4).]
The top of WebMaster’s ontologies is always the type of “all pages” in the site (i.e. the universal type). An ontology must be a tree (i.e. no multiple supertypes are allowed). To allow maximal flexibility in ontological modelling, we do not require subtypes to be either exhaustive or exclusive. In other words: sibling types may overlap (in fig. 5, a particular page might be both a home-page and an outdated page), and some elements of the supertype may not belong to any of the subtypes (i.e. in fig. 5 pages may exist that are neither outdated pages, nor CV pages, nor home-pages).

Similar to type-systems based on concept-logics [3, 16], types in WebMaster are defined intensionally, by stating which properties a Web-page must have in order to belong to a certain type. These intensional type-definitions are stated as rules. For example, rule r1 in fig. 5 defines the type of all homepages. The format of these rules will be discussed in section 5. An inference engine (see section 7) uses these intensional definitions to decide type-membership for pages. This is in sharp contrast with other approaches: frame-systems and ontology-tools such as [2] only allow extensional definitions of ontological categories, in which each individual must be explicitly assigned to a particular type. As we will describe now, an intensional type-system is at the heart of the knowledge-based verification approach, since the combination of intensional type-definitions plus inference engine can be used to locate errors in a Web-site^5.

^5 WebMaster also allows extensional definition of types, which is sometimes useful for very small types. The type containing all root-pages of a site would be an example of a type which is more easily defined by extensional enumeration.

4.2 Constraint-types: describing categories of errors

Constraint types are special types meant to indicate error-categories. Whereas normal types group together all pages that share a given property, constraint types group together all pages that fail to satisfy a given property, where this property is again specified in the rule defining the constraint type. For example, in fig. 5 the type “missing CVs” consists of all homepages (i.e. pages satisfying rule r1) which fail to satisfy rule r4. In fig. 5, constraint types are indicated by rectangles and normal types by rounded boxes. Typically, normal types are used to group pages into meaningful categories, while constraint types are used to collect pages that contain a particular type of error. Both normal types and constraint types can be further divided into subtypes. This is shown in fig. 5 for normal types, but it is often also useful for constraint types, as it allows error-types to be subdivided into gradually more refined and smaller types of errors.
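The distinction between a normal type and a constraint type can be made concrete with a small sketch. The rule functions below are simplified stand-ins for the rules discussed in the next section (the <CVLINK> tag in particular is invented for this illustration and does not occur in the paper’s examples):

    # Sketch only: an intensional type is a predicate over pages; a constraint type
    # collects the pages of its supertype that FAIL the required property.
    import xml.etree.ElementTree as ET

    def is_homepage(page):                       # simplified stand-in for rule r1
        return page.find(".//HOMEPAGE") is not None

    def has_cv_link(page):                       # simplified stand-in for rule r4
        return page.find(".//CVLINK") is not None

    def members(pages, type_rule):
        """Normal type: all pages satisfying the defining rule."""
        return [p for p in pages if type_rule(p)]

    def constraint_members(pages, supertype_rule, required_rule):
        """Constraint type: pages of the supertype that fail the required property."""
        return [p for p in pages if supertype_rule(p) and not required_rule(p)]

    pages = [ET.fromstring("<PAGE><HOMEPAGE/></PAGE>"),
             ET.fromstring("<PAGE><HOMEPAGE/><CVLINK/></PAGE>")]
    print(len(members(pages, is_homepage)))                          # 2 homepages
    print(len(constraint_members(pages, is_homepage, has_cv_link)))  # 1 page with a missing CV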
5 RULES

As mentioned above, both types and constraint-types are defined intensionally by rules that express which properties must hold (or fail to hold) on a page. In this section we will describe the formalism that is used in WebMaster for expressing such rules. As already mentioned earlier, these rules will refer to the semantic markup of the pages that must be categorised. More precisely, the rules will be phrased in terms of the labelled tree structure of these pages (fig. 2). A trade-off must be struck between the expressiveness of these rules (to allow the information providers to express powerful constraints) and the efficiency with which the rules can be tested on specific Web-pages by the inference engine. Following a suggestion in [19], we have chosen the following general logical form for our rules and constraints:

    ∀x ∀y : ⋀_i P_i(x_k, y_l) → ∃z : ⋀_j Q_j(x_k, z_m)    (1)

or equivalently

    ∀x ∀y : ¬( ⋀_i P_i(x_k, y_l) ∧ ¬∃z : ⋀_j Q_j(x_k, z_m) )    (2)
where x, y and z are sets of variables, and each of the P_i and Q_j is a binary predicate. The variables may be quantified over a given type T, for which we will use the notation x_k ∈ T (and similarly for existentially quantified variables). The binary predicates P_i and Q_j can express one of the following types of relations:

– Arbitrary nesting of tags: The predicate descendant(<TAG>x</TAG>, y) is true iff the tagged structure <TAG>x</TAG> occurs somewhere within y. For example, if we take for y the XML text of fig. 3b, then descendant(<TEL>47731</TEL>, y) is true.

– Direct nesting of tags: The predicate child(<TAG>x</TAG>, y) is true iff the tagged structure <TAG>x</TAG> is one of the direct children of y. If we again take for y the text of fig. 3b, then child(<NAME>F. van Harmelen</NAME>, y) is true, but child(<TEL>47731</TEL>, y) is not.

– Simple binary operations: We will also need simple binary tests on tag contents or on entire pages. We will use the following in the remainder of this paper:
  – string tests on tag contents, such as string equality, substring, initial-substring, etc.;
  – comparisons on ordered types, such as integers, clock- and calendar-times, etc.;
  – tests on links between pages, such as direct and indirect links between pages.

This class of formulae is less expressive than full first-order logic over the predicates P_i and Q_j (because of the limited nesting of the quantifiers), but is more expressive than Horn logic (because of the existential quantifier in the right-hand side of the implication). As examples of such rules, we will now give some of the rule definitions required for the ontology of fig. 5.

Rule r1: Homepages. As the simplest example possible, let us assume that homepages can be identified simply because they contain the tag <HOMEPAGE> somewhere inside:
    ∀x ∈ all-pages : ⊤ → ∃z : descendant(<HOMEPAGE>z</HOMEPAGE>, x)
This rule simply demands the presence of the <HOMEPAGE>-tag anywhere in the page. According to fig. 5, any page fulfilling this demand will be a member of the type home-pages. In terms of the general schema above, the sets x and z consist of just a single variable, the sets y and P_i are empty (so the left-hand side of the implication is trivially true, indicated by ⊤), and only one Q_j predicate is used, namely descendant(·, ·). Clearly, this example rule is so simple that it could still have been performed by a plain text-search engine (simply searching for the string “<HOMEPAGE>”). The second example already goes beyond the capabilities of a text-based search engine.

Rule r2: Homepages before 1998. Any homepage which was last modified before 1 Jan 1998 can be identified easily, assuming that homepages contain a <MODIFIED>-tag mentioning the last modification date of the page:
    ∀x ∈ home-pages : ⊤ → ∃d : descendant(<MODIFIED>d</MODIFIED>, x) ∧ d < 01-01-1998
Since the test d < 01-01-1998 should only be applied to dates that appear in the context of a <MODIFIED>-tag, this simple rule already goes beyond the capabilities of a text-based search engine. Again, this is a rule with an empty left-hand side. The following rule is our first example of a complete rule.

Rule r3: Outdated pages. Obviously, dates mentioned in announcements must be in the future, or more precisely: if any page contains anywhere an <ANNOUNCE>-tag, and if that <ANNOUNCE>-tag directly contains a <DATE>-tag, then that date must be in the future:
    ∀x ∈ all-pages ∀d ∀a : descendant(<ANNOUNCE>a</ANNOUNCE>, x) ∧ child(<DATE>d</DATE>, a) → d > today
Since this definition concerns a constraint-type in figure 5, any page that fails to satisfy the above demand on announcement-dates belongs to the constraint-type of outdated-pages. The above rule allows <ANNOUNCE>-tags that do not contain a <DATE>, in which case no constraint is applied, the rule succeeds trivially, and therefore the page does not belong to the constraint type. Alternatively, we might enforce that <ANNOUNCE>-tags contain <DATE>-tags, as well as requiring these dates to be in the future. This can easily be obtained by moving the conditions on the presence of the <DATE>-tags from the left- to the right-hand side of the rule:
    ∀x ∈ all-pages ∀a : descendant(<ANNOUNCE>a</ANNOUNCE>, x) → ∃d : child(<DATE>d</DATE>, a) ∧ d > today
Both of these rules are reasonable, and it is up to the information provider to decide which rule is applicable to their Web site. In any case, our notation allows us to express either version equally well.
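To make the two readings of rule r3 concrete, the following sketch tests a parsed page against both variants. It is an illustration only: the <DATE> contents are assumed to be in ISO format, today’s date is taken from the system clock, and the function names are invented for this example rather than taken from WebMaster. Note how the descendant predicate corresponds to searching the whole subtree and the child predicate to inspecting direct children only.

    # Illustration only: the two readings of rule r3 as checks on a parsed XML page.
    import xml.etree.ElementTree as ET
    from datetime import date

    def violates_r3_first(page: ET.Element) -> bool:
        """First reading: some <ANNOUNCE> has a direct <DATE> child that is not in the future."""
        for announce in page.iter("ANNOUNCE"):           # descendant: anywhere in the page
            for d in announce.findall("DATE"):           # child: direct children only
                if date.fromisoformat(d.text) <= date.today():
                    return True
        return False

    def violates_r3_second(page: ET.Element) -> bool:
        """Second reading: some <ANNOUNCE> has no direct <DATE> child with a future date."""
        for announce in page.iter("ANNOUNCE"):
            future = [d for d in announce.findall("DATE")
                      if date.fromisoformat(d.text) > date.today()]
            if not future:
                return True
        return False

    page = ET.fromstring("<PAGE><ANNOUNCE>workshop <DATE>1999-01-01</DATE></ANNOUNCE></PAGE>")
    print(violates_r3_first(page), violates_r3_second(page))   # True True (the date has passed)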
Rule r4: Missing CV-page. As discussed earlier, we might require every homepage to link to the CV-page of the corresponding employee. The required situation is depicted in figure 6. This property of home-pages can be enforced by the following constraint rule:
    ∀h ∈ home-pages ∀n ∀p : descendant(<PERSON>p</PERSON>, h) ∧ child(<NAME>n</NAME>, p) →
        ∃c ∈ CV-pages ∃p′ ∃n′ : descendant(<PERSON>p′</PERSON>, c) ∧ child(<NAME>n′</NAME>, p′) ∧ n = n′ ∧ links-to(h, c)
This rule states that if a homepage contains anywhere within it a <NAME>-tag occurring directly inside a <PERSON>-tag (i.e. the left-hand side of the rule), then (i) there should exist a CV-page^6 which (ii) should contain a <NAME>-tag directly inside a <PERSON>-tag, (iii) the names appearing in both pages should be equal, and (iv) there should be a link from the home-page to the CV-page. Again, this rule defines a constraint type in fig. 5, so any homepage violating this constraint will belong to the constraint-type missing-CVs.
[Fig. 6. Required situation for rule r4: a home page containing <PERSON> ... <NAME>n</NAME> ... </PERSON> must be related by links-to to a CV page containing <PERSON> ... <NAME>n′</NAME> ... </PERSON>, with n = n′.]
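A cross-page constraint such as rule r4 requires looking at several pages and at the links between them. The sketch below illustrates how such a check might be phrased over a set of parsed pages; the dictionaries, the explicit links-to relation and the helper names are assumptions made for this illustration, not WebMaster’s actual data structures.

    # Illustration only: a simplified check of rule r4 over a small site.
    import xml.etree.ElementTree as ET

    def names_in(page):
        """All <NAME> texts occurring as direct children of a <PERSON> element."""
        return {n.text for p in page.iter("PERSON") for n in p.findall("NAME")}

    def missing_cvs(home_pages, cv_pages, links_to):
        """Identifiers of home pages violating rule r4: some name on the home page has
        no linked CV-page mentioning the same name."""
        violations = []
        for h_id, home in home_pages.items():
            for name in names_in(home):
                linked = [cv_pages[c_id] for c_id in cv_pages if (h_id, c_id) in links_to]
                if not any(name in names_in(cv) for cv in linked):
                    violations.append(h_id)
                    break
        return violations

    home = ET.fromstring("<HOMEPAGE><PERSON><NAME>F. van Harmelen</NAME></PERSON></HOMEPAGE>")
    cv   = ET.fromstring("<CV><PERSON><NAME>F. van Harmelen</NAME></PERSON></CV>")
    print(missing_cvs({"home1": home}, {"cv1": cv}, links_to={("home1", "cv1")}))   # []
    print(missing_cvs({"home1": home}, {"cv1": cv}, links_to=set()))                # ['home1']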
Rule r5: Link back to root. A final example shows that our rules and constraints can also be used to check the connectivity in a site. Consider the following example:
    ∀p ∈ all-pages : ⊤ → links-to(p, root)
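Rule r5 constrains the links-to relation rather than the tag structure of individual pages. The sketch below shows how a “link back to root” style connectivity check might be carried out once the links-to relation has been extracted; the graph representation is an assumption made for this illustration, and the check covers indirect as well as direct links.

    # Illustration only: pages from which the root page cannot be reached via links-to.
    from collections import deque

    def cannot_reach_root(pages, links_to, root):
        """Return pages with no direct or indirect links-to path back to 'root'."""
        incoming = {}                                 # reverse adjacency: dst -> {src, ...}
        for src, dst in links_to:
            incoming.setdefault(dst, set()).add(src)
        can_reach, queue = {root}, deque([root])
        while queue:                                  # reverse reachability from the root
            node = queue.popleft()
            for pred in incoming.get(node, ()):
                if pred not in can_reach:
                    can_reach.add(pred)
                    queue.append(pred)
        return [p for p in pages if p not in can_reach]

    pages = ["root", "a", "b"]
    links = {("root", "a"), ("a", "root"), ("root", "b")}      # b never links back
    print(cannot_reach_root(pages, links, "root"))             # ['b']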