Migrating data-intensive Web Sites into the Semantic Web

Ljiljana Stojanovic
FZI Research Center for Information Technologies, University of Karlsruhe, 76131 Karlsruhe, Germany

Nenad Stojanovic
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

Raphael Volz
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

ABSTRACT

The Semantic Web is intended to enable machine processability of web content and seems to be a solution for many drawbacks of the current Web. It is based on metadata that describe the formal semantics of Web contents. We present a novel, integrated and automated approach for migrating data-intensive Web applications into the Semantic Web. This approach can be applied to a broad range of today’s business Web sites.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Tools and techniques; H.2.8 [Database application]

General Terms
Design, Management

Keywords
Semantic Web, Ontology, Data-intensive Web Sites

1. INTRODUCTION
The Semantic Web is one of today’s hot keywords. It is about bringing “[...] structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.” [SW00]. To enable this, web sites are enhanced with metadata that provide formal semantics for Web content. The key technology involved here is ontologies. Ontologies provide shared domain models that are understandable to both human beings and machines by providing a shared conceptualization of a specific domain. Using ontologies, content is made suitable for machine consumption, as opposed to the content found on the web today, which is intended for human consumption only.

This paper is about moving existing data-intensive web sites into the Semantic Web. In today’s Web, business applications have moved away from static, fixed web pages to pages that are dynamically generated at the time of user requests. Web sites of this kind are known as data-intensive web sites [Fra99] and are typically realized using relational databases. One of the most common applications for data-intensive web sites is e-commerce, for example online shopping sites. Data-intensive web sites have numerous benefits, e.g. simplified maintenance of the web design (due to the complete separation of data and layout) and the automated updating of web content. Unfortunately, this approach carries limitations as well [Atz98]. The most prominent [Com01] problems are probably:

- Data-intensive web sites form an invisible web. Search engine crawlers usually do not read dynamically generated URLs, so these pages are not included in search engine indexes. Consequently they are invisible.
- The content of database-driven web sites is not machine-understandable: information presented using HTML is intended for user consumption only. Consequently these sites are not a part of the Semantic Web [SW00].

In this paper we focus on the second problem and give an approach for an (automated) migration of data-intensive web sites into the Semantic Web. To facilitate this, our approach maps relational database schemas into ontologies that can form the conceptual backbone for metadata annotations, which are created automatically from the database instances. The benefits of the proposed approach are manifold: The process of providing metadata, called semantic annotation [Ha01], is automated and thus inexpensive and fast. Consequently the content of dynamic web pages is machine-understandable and therefore visible to specialized search engines. The communication of content information across information systems is simplified due to the metadata representation.

The paper is organized as follows: Section 2 introduces the architecture of the Semantic Web and presents the related technologies. Section 3 presents the data models involved in the mapping approach, thereby clarifying differences and similarities. Section 4 explains the overall migration architecture and details the mapping process. Section 5 identifies related work. We conclude by summarizing our contribution and presenting further research challenges.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SAC 2002, Madrid, Spain © 2002 ACM 1-58113-445-2/02/03...$5.00

2. THE SEMANTIC WEB
The term “Semantic Web” encompasses efforts to build a new WWW architecture that enhances content with formal semantics. This will enable automated agents to reason about Web content and carry out more intelligent tasks on behalf of the user. “Expressing meaning” is the main task of the Semantic Web. In order to achieve this objective, several layers of representational structures are needed. Figure 1 presents the layers of the Semantic Web:

- the XML layer is used as a syntax layer
- the RDF layer represents the data layer
- the ontology layer, based on a formal common agreement, specifies the meaning of the data
- the logic layer provides rules that enable intelligent reasoning
- the proof layer supports the exchange of “proofs” in inter-agent communication

Figure 1. Semantic Web Layers (inspired by [SW00])

Current research is mainly focused on the first three layers, which are also the focus of this paper. Important technologies for developing the Semantic Web are already in place: the eXtensible Markup Language (XML) and the Resource Description Framework (RDF), which provide the first two layers.

2.1 The XML syntax layer
XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean. Tag names per se do not provide semantics. The Semantic Web utilizes XML for syntax purposes only.

2.2 The RDF data layer
The Resource Description Framework [RDF] is an infrastructure that enables encoding, exchange and reuse of structured metadata. Principally, information is stored in the form of RDF statements, which represent data in a uniform way (subject, predicate, object) and facilitate machine understandability. The abstract RDF data model is a directed labeled graph. This abstract model is serialization independent, though the standard serialization uses XML. Due to the generality of the data model, RDF offers modeling primitives that can be extended according to the needs at hand. RDF Schema [W3CRDFS] provides basic class hierarchies and relations between classes and objects. In general, the modeling primitives of RDF Schema lack formal semantics. This makes the interpretation of how to use them properly an error-prone process. A more detailed description of RDF Schema is given in section 3.3.
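To make the statement model concrete, the following minimal Python sketch stores a few statements as (subject, predicate, object) tuples and indexes them as a directed labeled graph. The resource names, the helper functions `build_graph` and `objects`, and the omission of the distinction between resources and literals are illustrative assumptions, not part of the RDF specification or of our implementation.

```python
from collections import defaultdict

# Hypothetical statements; every name below is illustrative only.
triples = [
    ("raphael", "rdf:type", "PhDStudent"),
    ("raphael", "familyname", "volz"),
    ("raphael", "schoolID", "AIFB"),
    ("AIFB", "rdf:type", "School"),
]


def build_graph(statements):
    """Index statements as subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Return all objects reached from subject via edges labeled predicate."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


graph = build_graph(triples)
print(objects(graph, "raphael", "schoolID"))
```

Each tuple corresponds to one labeled edge, so the choice of serialization (XML or otherwise) does not affect the graph that is built.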

2.3 Specifying meaning - the ontology layer
The third basic component of the Semantic Web is ontologies. In Artificial Intelligence and Web research the term ontology describes a formal, shared conceptualization of a particular domain of interest [Gru93]. Ontologies are well suited to describe heterogeneous, distributed and semi-structured information sources that can be found on the Web. By defining shared and common domain theories, ontologies help both people and machines to communicate concisely, supporting the exchange of semantics and not only of syntax. It is therefore important that any semantics on the Web are based on an explicitly specified ontology. In this way consumer and producer agents (which are assumed for the Semantic Web) can reach a shared understanding by exchanging ontologies that provide the vocabulary needed for communication. Concepts, which are typically organized in a concept hierarchy, are a basic building block of ontologies. An ontology is constituted by the following entities:

- Two disjoint finite sets $Χ$ and $Ρ$, whose elements are called concepts and relations, respectively
- A concept hierarchy $H^{Χ}$: a directed relation $H^{Χ} \subseteq Χ \times Χ$, which is called concept hierarchy or taxonomy. $H^{Χ}(Χ_1, Χ_2)$ specifies that $Χ_1$ is a subconcept of $Χ_2$
- A function $rel: Ρ \to Χ \times Χ$ that relates concepts non-taxonomically. Two auxiliary functions $dom$ and $range: Ρ \to Χ$ give the domain and the range of a relation, respectively
- A set of ontological axioms $A^{O}$, expressed in an appropriate logical language, e.g. first-order logic

Several representation languages are currently proposed. One recent proposal extending RDF and RDF Schema is the DAML+OIL language [DAML+OIL], which unifies the epistemologically rich modeling primitives of frames with the formal semantics and efficient reasoning support of description logics, and is mapped to the standard Web metadata language proposals. Formal semantics for ontologies is a sine qua non. In our implementation we use Frame Logic [FL] and its concrete implementation in the SilRI inference engine [SILRI] to provide this for the above definitions. Axioms are correspondingly given in Frame Logic. See section 3.2 for a more detailed description of Frame Logic. Further, a knowledge base consisting of instances of the concepts and relations between these instances can be provided. A more comprehensive explanation of the ontology structure introduced above and the definition of an associated knowledge base structure is given in [Maed01a]. An ontology is stored using an extension of the standard Web metadata language proposals (RDF Schema).
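The ontology structure defined above can be sketched as a small Python data structure. This is a minimal illustration under stated assumptions: the class name `Ontology`, its field and method names, and the example concepts are hypothetical; the axiom set is omitted; and `is_subconcept` follows the hierarchy transitively, which is an interpretation that the definition itself does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set                                 # the set of concepts
    relations: set                                # the set of relations
    hierarchy: set = field(default_factory=set)   # pairs (subconcept, superconcept)
    rel: dict = field(default_factory=dict)       # relation -> (domain, range)

    def validate(self):
        """Raise ValueError if a structural condition of the definition fails."""
        if self.concepts & self.relations:
            raise ValueError("concepts and relations must be disjoint")
        for sub, sup in self.hierarchy:
            if sub not in self.concepts or sup not in self.concepts:
                raise ValueError(f"hierarchy pair ({sub}, {sup}) uses an unknown concept")
        if set(self.rel) != self.relations:
            raise ValueError("rel must be defined for exactly the declared relations")
        for name, (domain, rng) in self.rel.items():
            if domain not in self.concepts or rng not in self.concepts:
                raise ValueError(f"relation {name} refers to an unknown concept")

    def dom(self, relation):
        """Domain concept of a relation."""
        return self.rel[relation][0]

    def rng(self, relation):
        """Range concept of a relation."""
        return self.rel[relation][1]

    def is_subconcept(self, c1, c2):
        """True if c2 is reachable from c1 along hierarchy pairs (assumed transitive)."""
        seen, frontier = set(), [c1]
        while frontier:
            current = frontier.pop()
            for sub, sup in self.hierarchy:
                if sub == current and sup not in seen:
                    if sup == c2:
                        return True
                    seen.add(sup)
                    frontier.append(sup)
        return False


onto = Ontology(
    concepts={"Object", "Student", "PhDStudent", "School"},
    relations={"schoolID"},
    hierarchy={("Student", "Object"), ("PhDStudent", "Student"), ("School", "Object")},
    rel={"schoolID": ("Student", "School")},
)
onto.validate()
print(onto.is_subconcept("PhDStudent", "Object"))
```

Keeping `dom` and `rng` as projections of `rel` mirrors the definition, where both auxiliary functions are derived from the single function that assigns a pair of concepts to each relation.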

2.4 The logic and proof layers
Although they are not in the scope of this paper, we would like to mention these layers briefly. The logic layer consists of rules that enable inferences, e.g. to choose courses of action and answer questions. The proof layer is required to provide explanations about the answers given by automated agents that consume the provided information. Naturally, users might want to check the results deduced by their agents. This requires translating an agent's internal reasoning mechanisms into a unifying proof representation language.

3. A CLOSER LOOK ON DATA MODELS
This section provides a sound formal basis for the data models involved in the mapping task and clarifies whether we can expect information preservation.

3.1 The relational model
The underlying model of relational databases is the relational model. We extend the usual formal definition of the relational model ([AH], [LL]) with some constructs typically found in SQL-DDLs. The model is constituted by:

- A finite set $R$ called relations
- A finite set $A$ called attributes
- A function $att: R \to 2^{A}$ that defines the attributes contained in a specific relation $r_i \in R$
- A function $key: R \to 2^{A}$ that defines which attributes are primary keys in a given relation; thus $key(r_i) \subseteq att(r_i)$ must hold
- A set $T$ of atomic data types
- A function $type: A \to T$ that gives the type of each attribute

The reader may note that SQL-DDLs are typically more expressive. It is possible to specify further constraints (like DEFAULT, NOT NULL, UNIQUE etc.). We consider some of these constraints: we convert them into corresponding Frame Logic rules and try to preserve all semantics specified in the original table definitions. Due to the static nature of ontologies, no dynamic aspects of SQL-DDLs can be converted; thus triggers, referential actions (like ON UPDATE etc.) and assertions cannot be mapped. For each constraint we have a function of the same name, $R \to 2^{A}$, that specifies to which attributes it is applied.

In SQL-DDLs it is also possible to specify referential integrity constraints, which create foreign keys. This information is especially useful for the mapping process as it indicates ontological relations. The reader may note that referential integrity constraints enforce that inclusion dependencies [Dat81] are valid at all times. Therefore our relational model also contains:

- A set of inclusion dependencies $I$, where each element has the form $((r_1, A_1), (r_2, A_2))$ with $r_1, r_2 \in R$, $A_1 = \{a_{11}, a_{12}, a_{13}, \ldots\}$ and $A_2 = \{a_{21}, a_{22}, a_{23}, \ldots\}$, $A_1 \subseteq att(r_1)$ and $A_2 \subseteq att(r_2)$, $|A_1| = |A_2|$ and $type(a_{1i}) = type(a_{2i})$

Ic denotes the transitive closure of $I$.
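The conditions of this model can be checked mechanically. The following Python sketch is one possible encoding, given under assumptions: the class `RelationalSchema`, the method `violations`, the field name `attr_type` (standing for the function type), the table and attribute names, and the SQL type names are all hypothetical. Attribute lists in inclusion dependencies are stored as tuples so that the i-th attributes on both sides can be paired.

```python
from dataclasses import dataclass, field


@dataclass
class RelationalSchema:
    relations: set      # R
    attributes: set     # A
    att: dict           # relation -> set of its attributes
    key: dict           # relation -> set of its primary-key attributes
    attr_type: dict     # attribute -> atomic data type (the function "type")
    inclusions: list = field(default_factory=list)  # ((r1, (a11, ...)), (r2, (a21, ...)))

    def violations(self):
        """Return a list of human-readable violations of the model's conditions."""
        problems = []
        for r in sorted(self.relations):
            attrs = self.att.get(r, set())
            if not attrs <= self.attributes:
                problems.append(f"att({r}) uses unknown attributes")
            if not self.key.get(r, set()) <= attrs:
                problems.append(f"key({r}) is not a subset of att({r})")
        for (r1, a1), (r2, a2) in self.inclusions:
            in_r1 = set(a1) <= self.att.get(r1, set())
            in_r2 = set(a2) <= self.att.get(r2, set())
            if not (in_r1 and in_r2):
                problems.append(f"inclusion {r1} -> {r2} uses foreign attributes")
            elif len(a1) != len(a2):
                problems.append(f"inclusion {r1} -> {r2} has different arity")
            elif any(self.attr_type.get(x) != self.attr_type.get(y) for x, y in zip(a1, a2)):
                problems.append(f"inclusion {r1} -> {r2} has mismatching types")
        return problems


schema = RelationalSchema(
    relations={"Student", "School"},
    attributes={"Student.id", "Student.familyname", "Student.schoolID", "School.id"},
    att={
        "Student": {"Student.id", "Student.familyname", "Student.schoolID"},
        "School": {"School.id"},
    },
    key={"Student": {"Student.id"}, "School": {"School.id"}},
    attr_type={
        "Student.id": "INTEGER",
        "Student.familyname": "VARCHAR",
        "Student.schoolID": "INTEGER",
        "School.id": "INTEGER",
    },
    # Foreign key Student.schoolID -> School.id expressed as an inclusion dependency.
    inclusions=[(("Student", ("Student.schoolID",)), ("School", ("School.id",)))],
)
print(schema.violations())
```

Attribute names are qualified with their relation name here because the model uses one global attribute set; this is a convenience of the sketch, not a requirement of the definition.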

3.2 Frame Logics
This is the target data model for the mapping approach. The resulting ontology is written in Frame Logic. Frame Logic [FL] was developed to combine the rich data-modelling primitives of object-oriented databases with logical languages as developed for deductive databases. Due to lack of space we can only briefly mention those properties that are important for our mapping procedure; the interested reader can refer to [FL] for a complete description. Frame Logic provides object identity, complex objects, inheritance, polymorphic types, methods and encapsulation, and integrates these features into a logic-based framework. The syntax of Frame Logic is higher-order, which, among other things, allows an integrated view on data and schema. Both can be manipulated and defined using the same declarative language. Frame Logic does not specify basic types; everything is an object. No distinction is made between attributes and associations: all relationships between objects are modelled by method applications, e.g. applying the method “schoolID” to the object “raphael” yields the object “AIFB”. N-ary relationships can be modelled using method parameters; we do not use this feature as it cannot be expressed in RDF. Figure 2 shows a small example ontology written in Frame Logic. In our mapping approach only the shown features are used.

Student::Object.
PhDStudent::Student.   // PhDStudent is a subclass of Student
raphael : PhDStudent.  // raphael is an instance of PhDStudent
School::Object.
AIFB : School.
raphael[familyname ->> volz; schoolID ->> AIFB].
// applying method "schoolID" to raphael yields the object "AIFB"
ljilja : PhDStudent[familyname ->> stojanovic; schoolID ->> FZI].
FZI : School.
nenad : PhDStudent[familyname ->> stojanovic; schoolID ->> AIFB].
// A sample rule specifying that "schoolID" is the inverse of "studentID"
FORALL X, Y X:Student[schoolID ->> Y] <-> Y:School[studentID ->> X].
// A sample query, asking for all PhDStudents of AIFB (the result is: raphael, nenad)
FORALL X <- X:PhDStudent[schoolID ->> "AIFB"].

Figure 2. A sample ontology in F-Logic ([FLTUT], adopted)
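For readers unfamiliar with Frame Logic, the facts, the inverse rule and the query of Figure 2 can be re-enacted in plain Python. This is only an illustrative sketch, not how an F-Logic inference engine such as SilRI evaluates programs: the dictionaries and the helpers `is_a`, `apply_inverse_rule` and `phd_students_of` are hypothetical, and the rule is simplified so that it only derives the missing method applications while class memberships are taken from the stated facts.

```python
# Class hierarchy (subclass -> superclass) and instance memberships from Figure 2.
subclass_of = {"Student": "Object", "PhDStudent": "Student", "School": "Object"}
instance_of = {
    "raphael": "PhDStudent",
    "ljilja": "PhDStudent",
    "nenad": "PhDStudent",
    "AIFB": "School",
    "FZI": "School",
}

# Method applications as (object, method, value) triples.
facts = {
    ("raphael", "familyname", "volz"),
    ("raphael", "schoolID", "AIFB"),
    ("ljilja", "familyname", "stojanovic"),
    ("ljilja", "schoolID", "FZI"),
    ("nenad", "familyname", "stojanovic"),
    ("nenad", "schoolID", "AIFB"),
}


def is_a(obj, concept):
    """True if obj is an instance of concept or of one of its subclasses."""
    current = instance_of.get(obj)
    while current is not None:
        if current == concept:
            return True
        current = subclass_of.get(current)
    return False


def apply_inverse_rule(known):
    """Add the facts implied by reading schoolID and studentID as inverses."""
    derived = set(known)
    for subject, method, value in known:
        if method == "schoolID" and is_a(subject, "Student"):
            derived.add((value, "studentID", subject))
        if method == "studentID" and is_a(subject, "School"):
            derived.add((value, "schoolID", subject))
    return derived


def phd_students_of(school, known):
    """Answer the sample query: all PhDStudents whose schoolID is the given school."""
    return sorted(
        s for s, m, v in known
        if m == "schoolID" and v == school and is_a(s, "PhDStudent")
    )


closed = apply_inverse_rule(facts)
print(phd_students_of("AIFB", closed))
```

The query returns nenad and raphael, which agrees with the result stated in the figure; the derived set additionally contains the studentID facts for AIFB and FZI.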

3.3 RDF Schema
Basically, the RDF Schema specification [W3CRDFS] provides an object-oriented modelling language with some unique features:

- No distinction is made between attributes and associations; both are called properties.
- Only binary relationships can be modelled with properties; n-ary relationships require auxiliary classes.
- The type system contains only two basic types (rdfs:Literal and rdfs:Resource). Classes and properties are of type Resource.
- Unlike mainstream OO models, not only class hierarchies but also property hierarchies can be specified in RDF Schema.

We do not use property hierarchies in our mapping approach and follow [SILRI] to translate between RDF Schema documents and Frame Logic ontologies. Figure 3 shows an example of this mapping. We use only a subset of the specification because it is highly ambiguous and especially hard to formalize (cf. [RDFSRev]).
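As a rough illustration of the subset of RDF Schema used here (classes, a class hierarchy and binary properties with domain and range), the following Python sketch emits schema statements as triples. It is not the translation of [SILRI]: the function `to_rdfs_triples` and the example names are hypothetical, and only the vocabulary terms rdf:type, rdfs:Class, rdfs:subClassOf, rdf:Property, rdfs:domain and rdfs:range are taken from the RDF and RDF Schema specifications.

```python
def to_rdfs_triples(classes, subclass_of, properties):
    """Emit RDF Schema statements as (subject, predicate, object) triples.

    classes:     iterable of class names
    subclass_of: dict mapping a subclass to its superclass
    properties:  dict mapping a property name to (domain class, range class)
    """
    triples = []
    for name in sorted(classes):
        triples.append((name, "rdf:type", "rdfs:Class"))
    for sub, sup in sorted(subclass_of.items()):
        triples.append((sub, "rdfs:subClassOf", sup))
    for name, (domain, rng) in sorted(properties.items()):
        # Attributes and associations are both expressed as properties.
        triples.append((name, "rdf:type", "rdf:Property"))
        triples.append((name, "rdfs:domain", domain))
        triples.append((name, "rdfs:range", rng))
    return triples


schema_triples = to_rdfs_triples(
    classes={"Student", "PhDStudent", "School"},
    subclass_of={"PhDStudent": "Student"},
    properties={"schoolID": ("Student", "School")},
)
for triple in schema_triples:
    print(triple)
```

Because every property is binary, an n-ary relationship would need an auxiliary class with one property per participant, which this sketch does not attempt.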

3.4 Preserving information
Representation in RDF Schema: