The State of the Art on Representation and Query ... - CiteSeerX

16 downloads 1909 Views 829KB Size Report
Aug 10, 2000 - {name: "Frank", tel: 47731, email: "[email protected]"}. When we .... href, which is replaced by the value of the href attribute of the same anchor.
On-To-Knowledge: Content-Driven Knowledge Management Tools through Evolving Ontologies

The State of the Art on Representation and Query Languages for Semistructured Data Jeen Broekstra Christiaan Fluit Frank van Harmelen jeen.broekstra, christiaan.fluit, frank.van.harmelen  @aidministrator.nl

Summary. During the past few years, research into representation of semistructured (meta)data has taken an enormous flight. This document presents an overview of a couple of the most promising approaches in this field, such as XML and RDF. We also investigate the current research into query languages tailored to semistructured data , by first examining what properties such a language should have and then reviewing the most prominent existing approaches, such as XSLT, XQL, XML-QL, Lorel and the proprietary rule format used by aidministrator to describe constraints on the structure and contents of tree-structured documents, and we take a look at the efforts undertaken to specify a query language specifically for RDF.

Document Id. Project Issuer Date Status Distribution



deliverable 8 EU-IST On-To-Knowledge IST-1999-10132 aidministrator nederland b.v. 10th August 2000 Final Public

Copyright c 2000 aidministrator nederland b.v.

On-To-Knowledge Consortium This document is part of a research project partially funded by the IST Programme of the Commission of the European Communities as project number IST-1999-10132. The partners in this project are: Vrije Universiteit Amsterdam VUA (coordinator, NL), University of Karlsruhe (Germany), Schweizerische Lebensversicherungs- und Rentenanstalt/Swiss Life (Switzerland), British Telecommunications plc (UK), CognIT a.s. (Norway), EnerSearch AB (Sweden), AIdministrator Nederland BV (NL). Vrije Universiteit Amsterdam (VUA) Division of Mathematics and Informatics W&I De Boelelaan 1081a 1081 HV Amsterdam The Netherlands Tel: +31 20 4447718, Fax: +31 20 872 27 22 Contactperson: Dieter Fensel E-mail: [email protected]

University of Karlsruhe Institute AIFB Kaiserstr. 12 D-76128 Karlsruhe Germany Tel: +49 721 608392, Fax: +49 721 693717 Contactperson: R. Studer E-mail: [email protected]

Schweizerische Lebensversicherungsund Rentenanstalt / Swiss Life Swiss Life Information Systems Research Group General Guisan-Quai 40 8022 Zurich ¨ Switzerland Tel: +41 1 284 4061, Fax: +41 1 284 6913 Contactperson: Ulrich Reimer E-mail: [email protected]

British Telecommunications plc BT Adastral Park Martlesham Heath IP5 3RE Ipswich United Kingdom Tel: +44 1473 605536 Fax: +44 1473 642459 Contactperson: John Davies E-mail: [email protected]

CognIT a.s Busterudgt 1. N-1754 Halden Norway Tel: +47 69 1770 44, Fax: +47 669 006 12 Contactperson: Bernt. A.Bremdal E-mail: [email protected]

EnerSearch AB Carl Gustafsv¨ag 1 SE 205 09 Malmo¨ Sweden Tel: +46 40 25 58 25; Fax: +46 40 611 51 84 Contactperson: Hans Ottosson, Fredrik Ygge E-mail: hans.ottosson,[email protected]

AIdministrator Nederland BV Julianaplein 14B 3817 CS Amersfoort The Netherlands Tel: +31 33 4659987, Fax: +31 33 4659987 Contactperson: Jos van der Meer E-mail: [email protected]

aidministrator nederland b.v. julianaplein 14b 3817 cs amersfoort the netherlands tel. (+31)(0)33 4659987 fax. (+31)(0)33 4659987 http://www.aidministrator.nl

Acknowledgements The research reported here was carried out in the course of the On-To-Knowledge project. This project is partially funded by the IST Programme of the Commission of the European Communities as project number IST-1999-10132. The partners in this project are: Vrije Universiteit Amsterdam VUA (coordinator, NL), University of Karlsruhe (Germany), Swiss Life (Switzerland), BT plc (UK), CognIT a.s. (Norway), EnerSearch AB (Sweden), AIdministrator Nederland BV (NL). The authors of this report wish to thank Arjohn Kampman, Herko ter Horst and Stefan Decker for their most helpful feedback and comments on earlier versions of this report.

Contents 1 Representation languages 1.1 Introduction . . . . . . . . . . . . . . . . 1.2 HTML . . . . . . . . . . . . . . . . . . . 1.3 Ontobroker . . . . . . . . . . . . . . . . . 1.3.1 Remarks and observations . . . . 1.4 SHOE . . . . . . . . . . . . . . . . . . . . 1.4.1 Differences with Ontobroker . . 1.4.2 Remarks and observations . . . . 1.5 XML . . . . . . . . . . . . . . . . . . . . 1.5.1 XML Schema . . . . . . . . . . . 1.6 RDF . . . . . . . . . . . . . . . . . . . . . 1.6.1 RDF Schema . . . . . . . . . . . . 1.6.2 Remarks and observations . . . . 1.7 A comparison of approaches . . . . . . . 1.7.1 Supported by Web technology . 1.7.2 Avoiding duplication of data . . 1.7.3 Separation of data and meta data 1.7.4 Data models . . . . . . . . . . . . 1.7.5 Ontologies . . . . . . . . . . . . . 1.7.6 Conclusions . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

3 3 4 4 6 8 10 10 11 12 13 14 16 16 16 16 17 18 18 18

2 Query languages 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 2.2 General properties of QLs for semistructured data 2.2.1 Path expressions . . . . . . . . . . . . . . . 2.3 Why not just SQL? . . . . . . . . . . . . . . . . . . 2.4 Querying XML . . . . . . . . . . . . . . . . . . . . 2.4.1 XSL . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 XQL . . . . . . . . . . . . . . . . . . . . . . 2.4.3 XML-QL . . . . . . . . . . . . . . . . . . . . 2.4.4 Conclusion . . . . . . . . . . . . . . . . . . 2.5 Lore . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 OEM . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Lorel . . . . . . . . . . . . . . . . . . . . . . 2.5.3 Architecture . . . . . . . . . . . . . . . . . . 2.5.4 Remarks and observations . . . . . . . . . . 2.6 aidministrator’s visual rule format . . . . . . . . . 2.6.1 The logical structure . . . . . . . . . . . . . 2.6.2 The visual structure . . . . . . . . . . . . . 2.6.3 Comparison with requirements . . . . . . . 2.7 Querying RDF . . . . . . . . . . . . . . . . . . . . . 2.7.1 The SQL/XQL style approach . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

21 21 21 22 22 23 23 25 26 27 27 27 28 30 30 31 31 32 33 34 35

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

1

2.8

2.7.2 The declarative approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7.3 RQL in depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A comparison of approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 Conclusions

36 39 41 43

Chapter 1

Representation languages 1.1 Introduction The lack of machine-accessible semantics is a major barrier to the development of more intelligent document processing on the Web. Current HTML markup is used only to indicate the structure and layout of documents, but not the semantics of the information presented. The On-To-Knowledge project aims to use the power of ontologies to facilitate navigation and querying of semistructured data . However, while ontologies provide essential background information on a domain, factual instances of concepts still need to be recognized in the available data. In other words: a representation language for semistructured data is required. What is semistructured data ? Semistructured data can be explained as ”schemaless” or ”self-describing”, indicating that there is no seperate description of the type or strucutre of data. This is in contrast with structured approaches (such as, e.g. databases), where usually the structure of the data is described first in a seperate schema, and then instances of this schema are created. In semistructured data we directly describe the data using a simple syntax. Usually, this means semistructured data consists of simple label-value pairs (cf. [Abiteboul et al., 1999]), for example: {name: "Frank", tel: 47731, email: "[email protected]"} When we allow the values themselves to be structures, we get a graph data model with a tree-like structure: {name: {first: "Frank", last: "van Harmelen"}, tel: 47731, email: "[email protected]" } No uniqueness constraint is placed on the labels (as one might expect in more conventional approaches), allowing multiple label-value pairs with the same label. One of the main strenghts of semistructured data is the ability to accomodate variations in structure. These variations typically consist of missing data, duplicated fields, or minor changes in representation. Structure of this chapter Several methods have been developed (or are currently under development) to enable semantic annotation of Web documents. Of these, we will discuss two approaches from the AI community (Ontobroker, section 1.3, and SHOE, section 1.4), and the suggested approaches by the W3C, 3

CHAPTER 1. REPRESENTATION LANGUAGES

namely HTML (section 1.2), XML (section 1.5) and RDF (section 1.6). The approaches by the W3C combine Web-based technology with suggested methodology from the Databases community (such as, for example, schema definitions).

1.2 HTML This section is an excerpt from [van Harmelen and Fensel, 1999]. META tags Historically, the first attempt at representing semantic aspects inside Web documents are the HTML META-tags. Their intended use is limited to stating global properties that apply to the entire document, for example: This expresses that the author of the entire document is Frank. In its original form, this mechanism is rather inflexible and only allows assertion of global properties that hold for the entire document. SPAN and DIV elements According to the HTML 4.0 specification [Raggett et al., 1999] the SPAN element ”is a generic container of any text element offering a generic mechanism for adding structure to documents”. Using the standard CLASS attribute one can write semantic annotations such as: This page is written by Frank van Harmelen. His tel.nr. is 47731, room nr. T3.57 Although intended for specifying layout (enabling simple generic stylesheet mechanisms), the HTML 4.0 specification already suggests the use of the SPAN element to express semantic structure of a document, so this use of the SPAN tag should not be considered as inappropriate. The DIV element works practically identically to the SPAN element. In fact, the only difference between the two is that SPAN is an inline element whereas DIV is a block element. This is a parser-level distinction that is not really important for our discussion. A detailed explanation of the issue can be found in [Raggett et al., 1999].

1.3 Ontobroker Ontobroker [Fensel et al., 1998, Fensel et al., 1999] is a project from the University of Karlsruhe dedicated to building a set of languages and tools for enabling and enhancing query and inference services on the web. One of these languages provides the ability to semantically annotate ontological information present in existing webpages. This language will be the main topic of this section. At http://ontobroker.aifb.uni-karlsruhe.de/demos.html some example Ontobroker query services can be found. 4

10th August 2000

deliverable 8

The State of the Art on Representation and Query Languages for Semistructured Data

Ontologies play a major role in the Ontobroker project. Within a certain community, referred to as an ontogroup, an ontology is used to model a shared view on the relevant entities in that domain. Such an ontology consists of a hierarchy of classes, attributes, relations and logical axioms describing the domain, and is described in Frame Logic [Kifer et al., 1995]. This ontology is essentially defined centrally, i.e. all people within the community need to use this ontology in order to participate; there is no mechanism for making personal refinements or views. This is a straightforward example of an Ontobroker ontology, describing two classes and their typed attributes: Object[]. Person::Object. Researcher::Person. Person[ name =>> STRING]. Researcher[ affiliation =>> Organization; cooperatesWith =>> Researcher]. Organization[ ... ]. ... Such an ontology would probably be stored at the server providing the query and inference services. In order to provide instances for the ontology classes, attributes and relations, an annotation language is offered. The two design criteria for the design of this language were: 1. It should integrate smoothly in HTML, the current de facto standard for web pages. 2. It should reuse existing information in these webpages as much as possible, thus preventing duplication of information (as needed in some other approaches like SHOE and RDF). The annotation language is called HTML  and extends the HTML language with an extra onto attribute for the HTML anchors (A elements). For declaring the Researcher instance “Frank van Harmelen”, all one has to do is putting a statement like somewhere in any document. This asserts the existance of an instance of class Researcher with object identifier http://www.cs.vu.nl/˜frankh/. The name attribute of this Researcher instance can then be declared as follows: The value of an attribute can be a literal such as a STRING or another object identifier. We provide no examples of defining and instantiating relations, since we dit not encounter any specifications nor examples of how to do that. This way of annotating documents satisfies the first design criterion (smooth integration in HTML), since all onto attributes are safely ignored by the browser. In order to fulfil the second criterion (prevention of duplication of information), macros are available that allow for shorter and better maintainable annotations. For example, it is often the case that an object identifier in a statement is equal to the URL of the page containing the statement, e.g. a statement in Frank’s homepage mentioning that “Frank van Harmelen” is a Researcher with the URL of that homepage as an object identifier. Using the page macro, this can be abbreviated to: deliverable 8

10th August 2000

5

CHAPTER 1. REPRESENTATION LANGUAGES

When the annotations are extracted, all page macros will be replaced by the URL of the page. Other available macros are: 

tag and name, which are both replaced by the URL of the page appended with a hash and the value of the name attribute of the same anchor. 

href, which is replaced by the value of the href attribute of the same anchor. 

body, which is replaced by the text content of the anchor, i.e. the text inbetween
and .

These macros can be used when declaring instances as well as attribute values, e.g. as in Hi, my name is Frank van Harmelen Although the ontology might specify that an attribute has either a literal or an object identifier as its value, both will be accepted by annotation processors, since in reality it may happen quite often that the object identifier of a real world object is not yet established or known. Consider for example a Publication class with an author attribute of type Person: it may be that there is no object representing the author, so that a string value is the best you can do. Another problem with object identification is the fact that URLs are typically used as identifiers. Although it makes sense in a web context to use them for this purpose, the problem is that URLs are often not unique: the same webpage may be reachable through several different URLs, e.g. http://www.somehost.com/ and http://www.somehost.com/index.html. This is a major issue when using macros instead of explicit object identifiers. There is no obvious solution for this problem, since the rules for equality of URLs may differ from host to host. This problem is acknowledged but ignored by the Ontobroker team. The Ontobroker tool set consists of a number of various tools. A crawler has been made that gathers annotated pages from the Internet, extracts the annotations and stores them in a database. A search and inference engine is able to execute F-logic queries on this database, in terms of ontological entities. This engine uses knowledge from the ontology to enhance the query and produce more results, e.g. by using subclass relationships, transitivity of relations, etc. The query language also allows meta-level queries such as “give me all available attributes of the Researcher class”. Two query front-ends are available, one allowing expert users to directly input this F-logic query, the other one assisting naive users with formulating such a query using forms and a graphical representation of the class hierarchy. The latter interface makes the query engine accessible for people who are not familiar with F-logic as well as for people who have no knowledge about the ontology in advance. A simple example F-logic query on the ontology discussed earlier is: FORALL FH, R, NAME >’’Frank van Harmelen’’] and R:Researcher[name->>NAME;cooperatesWith->>FH]. which retrieves the object identifiers and names of all Researchers cooperating with “Frank van Harmelen”.

1.3.1 Remarks and observations An advantage of HTML  is clearly the fact that it integrates into existing webpages, preventing the need for a seperate metadata storage with its associated problems like information duplication and – consequently – maintenance. However, it remains to be seen whether a separate storage really is a problem within the Ontoknowledge project, especially when the metadata is generated automatically. Additionally, it might not always be possible to adapt the original source, e.g. when the owners do not give 6

10th August 2000

deliverable 8

The State of the Art on Representation and Query Languages for Semistructured Data

permission for it or when another document format is used such as PDF or Word’s proprietary format, ruling out the approach offered by HTML  . Furthermore, the mechanisms offered by HTML  for reusing existing information are rather limited. Two kinds of problems may occur. First, a fragment of a webpage might mention a lot of information about a single object at once, e.g.: D. Fensel, S. Decker, M. Erdmann, R. Studer. Ontobroker: The Very High Idea. In: Proceedings of the 11th International Flairs Conference (FLAIRS-98), Sanibal Island, Florida, May 1998. This information could be used for introducing a Publication instance and immediately filling in its attributes. In approaches like XML this information would be very straightforward to model, e.g. by a PUBLICATION element with AUTHOR child elements, etc. However, due to the lacking capability of nesting ontological statements (anchors are not allowed to nest) the object identifiers have to be repeated over and over again. Using lots of macros, the best one can do would probably be (omitting layout markup): D. Fensel, S. Decker, M. Erdmann, R. Studer.

Suggest Documents