International Journal of Cooperative Information Systems
© World Scientific Publishing Company
Information Services for the Web: Building and Maintaining Domain Models

Avigdor Gal
Department of MSIS, Rutgers University, 94 Rockafellar Road, Piscataway, NJ 08854-8054, USA
Scott Kerr
Object Technology International Inc., 2670 Queensview Drive, Ottawa, ONT K2B 8K1, Canada

and

John Mylopoulos
Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, ONT M5S 3H5, Canada
The World Wide Web serves as a leading vehicle for information dissemination by offering information services, such as product information, group interactions, or sales transactions. Three major factors affect the performance and reliability of information services for the Web, namely the distribution of information which has resulted from the globalization of information systems, the heterogeneity of information sources, and the sources' instability caused by autonomous evolution. This paper focuses on integrating existing information sources, available via the Web, in the delivery of information services. The primary objective of the paper is to provide mechanisms for structuring and maintaining domain models for Web applications. These mechanisms are based on conceptual modeling techniques, where concepts are defined and refined within a metadata repository through the use of instantiation, specialization and attribution. Also, active database techniques are exploited to provide robust mechanisms for maintaining a consistent domain model in a rapidly evolving environment, such as the Web. Therefore, the main contribution of the paper lies in the provision of an architecture for semi-automatic generation and maintenance of user-oriented, semantic-based domain models that describe distributed heterogeneous information sources.

Keywords: Information services, WWW, Conceptual modeling, Active databases
1. Introduction
The World Wide Web (hereafter Web) serves as a leading vehicle for information dissemination16 by offering information services, such as product information, group interactions, or sales transactions. Three major factors affect the performance and reliability of information services for the Web, namely the distribution
of information, the heterogeneity of information sources, and the sources' instability. The distribution of information stemmed from the globalization of information systems and is hampered by unreliable network technologies.a The heterogeneity of information sources results in a host of technical issues, ranging from syntactic interoperability among different legacy systems to semantic issues that deal with the meaning of the data being integrated. Finally, the sources' instability is caused by the autonomous evolution of information systems as well as autonomous decisions regarding availability, as reflected in the independent removal of Web resources and the independent decisions regarding a server's availability. In particular, it has been estimated that the average lifetime of a URL is just 44 days.14

a For example, a Computerworld survey of 103 network managers from June 1998 revealed that corporate networks alone suffered an average of 14 hours of downtime during a 12-month period.

This paper focuses on integrating existing information sources, available via the Web, in the delivery of information services. The information sources may include databases, formatted or plain ASCII files, and other computer-based data such as Java applets, embedded CGI scripts, stream media, 3D graphics, and clickable maps. These sources may be multiple, distributed and heterogeneous. They may also contain legacy data, in that their original designers are long gone and their semantics are only partially understood.

The primary objective of the paper is to provide mechanisms for structuring and maintaining domain models for Web applications. The idea of using a domain model as a means towards integrating several heterogeneous databases is well known and well accepted. The new challenge that arises for cooperative information systems in general, and the Web in particular, is that this domain model needs to be built bottom-up through analysis of the information sources. Moreover, it needs to evolve continuously as new information sources become available and relevant to the information service being delivered, while existing information sources are being rapidly modified. Researchers and practitioners alike are coming to realize that there can be no solution to the delivery of information services unless one tackles these problems head-on (see, for example, the idea of subject-based search engines17). Therefore, we suggest a mechanism that is based on conceptual modeling techniques, where concepts are defined and refined within a metadata repository through the use of instantiation, specialization and attribution. Also, active database techniques are exploited to provide robust mechanisms for maintaining a consistent domain model in a rapidly evolving environment, such as the Web. The main contribution of the paper lies in the provision of an architecture for semi-automatic generation and maintenance of user-oriented, semantic-based domain models that describe distributed heterogeneous information sources.

Several projects in this area, e.g., WebSQL19 and Microcosm,20 considered using the structure of the Web for generating information services. WebSQL is an SQL-like query language which allows the use of Web concepts (e.g., hyperlinks and URLs) as part
of its syntax. Microcosm is a link manager that allows the user to add links among Web artifacts without modifying the Web artifacts themselves. Neither model provides the essential mechanisms to overcome the gap between structural (syntactic) interoperability and semantic interoperability. The latter cannot be attained by embedding Web concepts in a query language or by extending the flexibility of structural mechanisms. Rather, it should consist of a domain model that reflects, in a transparent fashion, modifications to Web sources. In addition, link managers lack the active capabilities for semi-automatically maintaining a consistent view of the underlying Web resource.

Some attempts in related areas to provide "semantic coating" to flat files1 bypass the maintenance problem by generating a virtual database schema that is materialized only at retrieval time. However, the need for designer intervention and the shaky availability of Web resources cannot guarantee the real-time processing required to provide the semantic model.

Several research projects and commercial products, e.g., Ptolomaeus24 and HotSuite,12 provide the capability of mapping a Web site and generating a Web site map upon request. However, such tools do not support semantic capabilities and therefore lack the domain modeling aspect of the proposed architecture. Also, once constructed, the site map remains static unless a re-mapping is sought by the user, while the proposed model provides a reactive propagation of Web modifications to the semantic level. Moreover, the proposed maintenance tool reduces the occurrence of errors common to the Web environment, e.g., attempting to browse an inaccessible page, without requiring control over the Web resources.

WAG5,25 allows the user to query the Web, rather than browse it, and therefore attempts to provide a transparent layer for Web-based information services. Nevertheless, WAG uses a database (on the client's side) which is far less powerful, semantic-wise, than a repository (which is used in this work). Also, WAG is passive and does not provide any mechanism for propagating Web modifications to the semantic model; it therefore falls into the same category as site-map generators.

Hyper-G, along with its authoring tool Hyperwave,18 is a framework for collaborative development of Web applications. It allows a cleaner design and implementation of Web applications through the use of a database that maintains consistency among various Web resources. A similar approach, although more database-oriented, was taken in the Strudel research project.7 Strudel supports declarative specification of a Web site's content and structure and automatically generates a browsable Web site from a specification. However, in order for either Hyper-G or Strudel to serve successfully, some stringent requirements need to be fulfilled. These requirements include the collaboration of the participating applications and, in the case of Hyper-G, its use for all system components. Obviously, such a system is highly limited in scope under the Web's current state of affairs. In contrast, the proposed architecture allows for the Web's prominent assumption (i.e., sources' independence) while supporting consistency to the degree possible. Therefore, the proposed architecture is not limited to a predefined subset of the Web resources,
and allows flexible extension and retraction of its scope.

As a concrete case study, we consider the design of semantic structures for organizational Web sites. Web resources made available through such Web sites have more structure in their design than personal Web pages, since they share common subject domains; this added structure can be exploited in creating domain models. On the other hand, the heterogeneous nature of organizational Web sites calls for an architecture that is capable of resolving heterogeneity issues. Specifically, we introduce an architecture for a university Web site: we examine the public Web information that a university provides, and design an information service that offers a reorganization of that information on a semantic basis.

The rest of the paper is organized as follows. The proposed architecture is described in the following three sections: the information model is presented in Section 2, followed by the querying facility in Section 3 and the maintenance mechanism in Section 4. Implementation issues are described in detail in Section 5. Section 6 summarizes the contributions of the paper and offers directions for future work.
2. The information model

There are three types of information stored in the repository: Web artifacts, domain concepts, and change propagation rules for maintaining semantic consistency as the site changes. In this section we elaborate on the first two types; a discussion of the third type is deferred to Section 4.

Web artifacts are purely syntactic entities found in HTML (Hypertext Markup Language) or XML (eXtensible Markup Language)27 files, and in HTTP (Hypertext Transfer Protocol) or FTP (File Transfer Protocol) requests. The repository representations of these artifacts are automatically generated and regenerated by a Web extraction tool that is supplied by a third party and integrated into our architecture. Such tools are generally syntactically based, i.e., they do not reflect the subject matter (domain) the Web documents are concerned with.b

b While XML has the capability of tagging information based on a specific application's ontology, these tags are not readily visible to the user.

HTML files serve as the Web's main resource. Other Web resources include Java applets, CGI scripts, stream media, graphics files, and clickable maps. While Java applets and stream media feed a local application with pre-processed input and are therefore considered a black box on the client's end,c HTML/XML files are semi-structured data files, whose contents can be analyzed and utilized by the client. In what follows, we shall refer to Web artifacts whose content can be analyzed as documents.

c Sophisticated software engineering tools3 may be applied to understand the capabilities of Java applets and present them to the user. However, this requires that the Java code itself be available to the client, which is rarely the case.
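The document/black-box distinction can be summarized in code. The following minimal Java sketch captures the classification applied to incoming Web artifacts; all class and method names here are our own illustration, not part of the architecture's actual implementation.

// Minimal sketch of the artifact classification described above.
// All class and method names are illustrative assumptions.
public abstract class WebArtifact {
    protected final String url;

    protected WebArtifact(String url) {
        this.url = url;
    }

    /** True if the artifact's content can be parsed on the client side. */
    public abstract boolean isDocument();
}

/** HTML/XML files: semi-structured and analyzable, hence "documents". */
class MarkupDocument extends WebArtifact {
    MarkupDocument(String url) { super(url); }
    public boolean isDocument() { return true; }
}

/** Java applets, stream media, etc.: black boxes on the client's end. */
class OpaqueArtifact extends WebArtifact {
    OpaqueArtifact(String url) { super(url); }
    public boolean isDocument() { return false; }
}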
Graphics provide an alternative approach towards data representation, which is easily embedded within this framework. However, graphics characteristics (e.g., captured objects and colors) are more difficult to extract than HTML characteristics, and we therefore restrict the treatment of graphics files in this work. The embodiment of sophisticated tools to analyze graphics files would enable metadata extraction from these Web artifacts as well. In particular, clickable maps, which combine graphics and CGI scripts, can be analyzed further by using the form list fields. XML files enhance the capabilities of Web-based application management systems, since the pre-defined tags serve as an extension to the application ontology and become useful in grouping parts of XML files under a single semantic concept. Hyperlinks containing FTP requests are treated similarly to those containing HTTP requests. However, due to the semantics-free organization of FTP sites, an analysis of such a site yields little semantic information and is therefore reduced to the minimum possible.

Domain concepts reflect the ontology of the application's domain. A concept is possibly associated with other concepts through domain attributes that reflect the inter-relationships among domain concepts. For instance, the concept "University of Toronto" is associated with the concept "UofT Admissions" through the former concept's attribute admissions. Also, a domain concept has one or more associated Web artifacts that are bound to the concept by an attribute webArtifacts. Domain concepts are initially generated by examining the hyperlink graph topology and contents of each Web site. Each artifact in a site initially generates a corresponding domain concept, whose name is taken from the Web document title. The hyperlinks (and tagging, in the case of XML) in the document generate the attributes of the concept, which are labeled links to other concepts.d A code sketch of this bootstrap step is given below.

d As an example of sophisticated tools for identifying domain concepts, the reader is referred to the Intelligent WebWare project at the University of Washington. The project investigates methods, inspired by the fields of Artificial Intelligence and Information Retrieval, for making the Web easier to navigate. In particular, methods of clustering were suggested (e.g., 28) as an alternative way of presenting search results.

Domain concepts relate to the semantic content provided by several Web sites. The concepts vary independently among sites with the concerns and practices of each Web site manager. Therefore, even within the domain of a single organization (e.g., a university or a corporate intranet), integrating multiple Web sites into a common domain model is a challenge. This problem will most likely become even more acute as more and more XML files replace HTML files. With XML files, the concepts that are available through the additional tagging should be scrutinised to ensure semantic compatibility among Web sites. The domain expert (the designer or someone working with her) can reconcile emphasis, paradigm differences, and other impedance mismatches in a politically expedient or best-fit manner. As detailed herein, having a conceptual modeling tool (e.g., a Telos repository21) to support the conceptual model assists the designer, in that semantic consistency checks and related data operations can be done at a sophisticated level of which the designer need not be aware.
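The following Java sketch illustrates the bootstrap step just described: one concept is generated per analyzed document, named after its title, and each hyperlink label becomes a concept attribute. ParsedPage, Hyperlink, and Concept are hypothetical helper types of our own, not types from the paper's implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the bootstrap step: one concept per document, named after
// its title; hyperlink labels become attributes linking to other concepts.
public class ConceptBootstrapper {
    public record Hyperlink(String label, String targetUrl) {}
    public record ParsedPage(String url, String title, List<Hyperlink> links) {}

    public static class Concept {
        final String name;
        final String webArtifact; // URL of the bound Web artifact
        final Map<String, Concept> attributes = new HashMap<>(); // label -> concept

        Concept(String name, String webArtifact) {
            this.name = name;
            this.webArtifact = webArtifact;
        }
    }

    public static Map<String, Concept> bootstrap(List<ParsedPage> pages) {
        Map<String, Concept> concepts = new HashMap<>();
        // Pass 1: each document yields a concept named after its title.
        for (ParsedPage p : pages) {
            concepts.put(p.url(), new Concept(p.title(), p.url()));
        }
        // Pass 2: each hyperlink yields a labeled attribute to the target's concept.
        for (ParsedPage p : pages) {
            Concept source = concepts.get(p.url());
            for (Hyperlink l : p.links()) {
                Concept target = concepts.get(l.targetUrl());
                if (target != null) { // ignore links that leave the analyzed site
                    source.attributes.put(l.label(), target);
                }
            }
        }
        return concepts;
    }
}

In practice the initial assignment is then refined by the domain expert, as described above, since titles and link labels vary in quality across sites.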
Fig. 1. Domain concepts and Web artifacts
In conceptual modeling languages, instantiation is used to separate abstract concepts (e.g., "University") from concrete instances (e.g., "University of Toronto"). Figure 1 demonstrates this, using an analysis of the University of Toronto home page (http://www.toronto.edu/). The abstract concepts and artifacts are Telos metaclasses; the concrete ones are Telos simple classes. "Instance of" arrows in the diagram indicate instantiation between these levels. Orthogonal to this, we can specialize abstract concepts, e.g., "Web Document" to "Home Page" (not shown in the diagram). These concepts can then be instantiated, e.g., "University of Toronto Home Page." In our application schema we associate concepts ("University") with Web documents (WebDocument). This association exists at both the abstract and concrete levels. Therefore, the attribute webArtifacts has the source "University" and the target WebDocument. At the concrete level, an instance of University ("University of Toronto") has an instance of the metaclass WebDocument ("U of T Home Page") as its attribute value for webArtifacts.

A major drawback in using a repository to define a common schema for several integrated systems is the complexity of manually building meta-schemata and populating them. To alleviate this problem, we have developed the ConceptEditor (see Section 5.2), which along with Telos facilitates the automatic creation of the meta-schemata and their subsequent population. Telos objects (metaclasses, simple classes and tokens) and attributes are maintained by Java methods for constructing, deleting, querying, and modifying the objects as if the repository were embedded within the Java application (a toy illustration is given after the list below). Telos objects can also be defined in the traditional file format of the Telos language. As described in Section 3.1, the use of the ConceptEditor changes the role of the designer from an initiator to a consultant, similarly to the suggested role of librarians in designing Web-based ontologies.17

Depending on the specific Web extraction tool, the Telos metamodel classes may include the following items:

- Web artifacts (e.g., HTML pages).
- URLs: uniform resource locators pointing to artifacts.
- HTML/XML titles.
- Web artifact "last modified" dates.
- Web document anchors: HTML tags containing pointers (e.g., to embedded images).
- Hyperlinks in Web documents (consisting of URLs and their labels).
- Hyperlink labels.
- File names (as part of URLs).
- File formats/MIME types (e.g., plain text, HTML, PostScript, application).
- Web sites (corresponding to the "host" portion of a URL).
- "Name" anchors within documents (e.g.,
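To make the embedded-repository idea concrete, the following toy Java sketch mimics the style of manipulation described above: defining metaclasses, instantiating them, and binding instances through the webArtifacts attribute, as in Figure 1. It is a minimal in-memory stand-in of our own; none of these class or method names come from the actual ConceptEditor or Telos API.

import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for repository-backed Telos objects; illustrative only.
public class ConceptEditorSketch {
    static class TelosObject {
        final String name;
        final TelosObject instanceOf; // null for metaclasses
        final Map<String, TelosObject> attributes = new HashMap<>();

        TelosObject(String name, TelosObject instanceOf) {
            this.name = name;
            this.instanceOf = instanceOf;
        }
    }

    public static void main(String[] args) {
        // Abstract level: metaclasses, with webArtifacts from University to WebDocument.
        TelosObject university = new TelosObject("University", null);
        TelosObject webDocument = new TelosObject("WebDocument", null);
        university.attributes.put("webArtifacts", webDocument);

        // Concrete level: instances, bound through the same attribute label.
        TelosObject uoft = new TelosObject("University of Toronto", university);
        TelosObject homePage = new TelosObject("U of T Home Page", webDocument);
        uoft.attributes.put("webArtifacts", homePage);

        System.out.println(uoft.name + " --webArtifacts--> "
                + uoft.attributes.get("webArtifacts").name);
    }
}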